Friday 21 June 2013

Suricata CUDA engine re-designed


Suricata's current dev master includes a re-designed CUDA architecture.  There is some history to Suricata's relationship with CUDA, which I will address in a different post.  Let's talk about the current CUDA support.

Suricata performs various CPU-intensive tasks, one of them being the pattern matching phase.  The engine has several pattern matching phases - packet, stream, application layer buffers, etc. - the heaviest of them all being the packet and stream ones.  With this version we have offloaded the packet mpm phase to the GPU, using the Aho-Corasick algorithm (more on the technical aspects in the coming posts).
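
To give a rough idea of what running Aho-Corasick on the GPU looks like, here is a minimal sketch.  This is not Suricata's actual kernel - the state table layout, names and result handling are purely illustrative - but it shows the basic idea: the patterns are pre-compiled into a flat state-transition table, and each GPU thread walks one packet payload through it.

    /*
     * Illustrative sketch only, not Suricata code.  Each thread walks one
     * payload through a flattened Aho-Corasick DFA (goto + failure
     * transitions already merged) and flags the packet on a match.
     */
    #include <stdint.h>

    #define ALPHABET_SIZE 256

    __global__ void AhoCorasickSearch(const uint8_t *payloads,       /* packed payload buffer */
                                      const uint32_t *offsets,       /* per-packet start offset */
                                      const uint16_t *lengths,       /* per-packet payload length */
                                      const uint32_t *state_table,   /* [state][byte] -> next state */
                                      const uint8_t *state_is_final, /* 1 if state marks a match */
                                      uint8_t *results,              /* per-packet match flag */
                                      uint32_t num_packets)
    {
        uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= num_packets)
            return;

        const uint8_t *buf = payloads + offsets[idx];
        uint32_t state = 0;
        uint8_t matched = 0;

        for (uint16_t i = 0; i < lengths[idx]; i++) {
            state = state_table[state * ALPHABET_SIZE + buf[i]];
            if (state_is_final[state])
                matched = 1;  /* the real engine records which pattern matched and where */
        }

        results[idx] = matched;
    }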

These are the first and last commits in the sequence of commits that introduce the feature.  A git log should reveal the other related commits -

            commit 0e0218f089f87c82041bb56b497bd88a41d31040
            Author: Anoop Saldanha <anoopsaldanha@gmail.com>
            Date:   Thu Jun 20 23:26:23 2013 +0530

                Minor cosmetic changes to the cuda code.


            commit 276716bdd39f325209e03f94b855f0ed14b3b12a
            Author: Anoop Saldanha <poonaatsoc@gmail.com>
            Date:   Wed Aug 1 14:22:49 2012 +0530

                update cuda API wrappers ...



Currently the only supported interface/runmode is pcap-file/autofp.  Do note that we don't support live rule swap when CUDA is being used.

* Configuring Suricata to use CUDA

To enable and use the CUDA-accelerated pattern matcher, configure Suricata with --enable-cuda.
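
For example, a typical build would look something like this (assuming the CUDA toolkit is already installed; the exact configure options and install prefix may differ on your system) -

    ./configure --enable-cuda
    make
    sudo make install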

Next, the new mpm that uses the GPU is called "ac-cuda".  You will have to modify the "mpm-algo" parameter in the conf to use it -
mpm-algo: ac-cuda

Other customizable options in the conf file -
cuda:
  mpm:
    data-buffer-size-min-limit: 0
    data-buffer-size-max-limit: 1500
    cudabuffer-buffer-size: 500mb
    gpu-transfer-size: 50mb
    batching-timeout: 2000
    device-id: 0
    cuda-streams: 2

The conf file has a brief explanation for each parameter.  Let me explain them again -
  • gpu-transfer-size: Probably the most important parameter at this point, since a lot of you won't be able to run CUDA with the default value of 50mb (depending on the card you are using).  This parameter configures the max size of the buffer (the one holding all the packets) that is sent to the GPU for processing.  Based on the available memory on your card, you will have to play with this value until you get CUDA up and running on your system (an example conf is shown a little further below).
  • data-buffer-size-min-limit: The minimum payload size to be sent to the GPU.  For example, if we have it set to 10, all payloads < 10 bytes would be run on the CPU, and those >= 10 on the GPU.  A value of 0 sets no limit.
  • data-buffer-size-max-limit: The maximum payload size to be sent to the GPU.  Works like the previous parameter, but as an upper bound - payloads larger than this are run on the CPU.
  • batching-timeout: This parameter sets the timer for batching packets before they are sent to the GPU.  Do note this parameter is in microseconds.  For example, with a value of 10000 (10ms), the CPU would buffer packets for up to 10ms before sending them to the GPU.  You will have to play around with this value to find the performance sweet spot.
  • device-id: The CUDA device to use.  If you have multiple cards you can specify which one to use here.  You can run suricata --list-cuda-cards to list the CUDA cards on your system.
  • cuda-streams: Unused
  • cudabuffer-buffer-size: Internally we use a circular buffer to batch packets.  This parameter specifies the size of this circular buffer.  The buffer allocated using this parameter is page-locked.
You'll also have to increase the "max-pending-packets" parameter in the conf.  I have mine set to 65000.
max-pending-packets: 65000
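
As an example, a conf adjusted along these lines might look like the following.  The values are illustrative only (here gpu-transfer-size has been lowered for a card with less memory, and batching-timeout raised), so experiment with what works on your setup -

cuda:
  mpm:
    data-buffer-size-min-limit: 0
    data-buffer-size-max-limit: 1500
    cudabuffer-buffer-size: 500mb
    gpu-transfer-size: 25mb
    batching-timeout: 10000
    device-id: 0
    cuda-streams: 2

max-pending-packets: 65000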

Also do note that we only support pcap-file/autofp at this moment.

Looks like we are done.  Time to take it for a test drive.
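
For instance, a run against a pcap would look something like this (the conf and pcap paths here are just placeholders) -

    suricata -c /etc/suricata/suricata.yaml -r /path/to/test.pcap --runmode autofp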

* Performance

* Card used - GTX 480 - 15 multiprocessors, with 448 cores and CUDA compute capability 2.0.
* Host PC: AMD 620 quad core at 2.8GHz, with 6GB of RAM.
* Ruleset being tested - etpro without decoder rules.
* max-pending-packets: 65000
* batching-timeout: 10000

I have managed to run it against some pcaps, and the GPU has been as fast as or faster than the CPU on all occasions.  The alerts are intact on all runs.  Here are the results -

* time in seconds
Pcap_Name - CPU    :  GPU    : % Increase in Performance
Pcap      - 12.5   :  9.4    : 24%
Pcap      - 18.2   :  14.2   : 22%
Pcap      - 11     :  8.4    : 23%
Pcap      - 7.4    :  5.7    : 23%
Pcap      - 12.8   :  9.9    : 22%
Pcap      - 5.2    :  3.9    : 25%
Pcap      - 18.7   :  14.0   : 25%
Pcap      - 28.4   :  20     : 29%
Pcap      - 13.3   :  10.2   : 15%
Pcap      - 25.9   :  18.2   : 30%
Pcap      - 27.9   :  20.1   : 28%
Pcap      - 29.5   :  21.2   : 28%
Pcap      - 29.7   :  21.5   : 27%
Pcap      - 17.3   :  12.9   : 25%
Pcap      - 23.3   :  18.0   : 23%
Pcap      - 5.8    :  5.45   : 6%
Pcap      - 83     :  72     : 13%
Pcap      - 10440  :  9575   : 8%
Pcap      - 7445   :  7172   : 3.7%
Pcap      - 340    :  271    : 20%
Pcap      - 604    :  603    : -
Pcap      - 1480   :  1452   : 1.8%
Pcap      - 16.1   :  16.1   : 0%
Pcap      - 12.9   :  12.5   : 3%
Pcap      - 6.7    :  6.3    : 3%
Pcap      - 6.7    :  5.9    : 12%
Pcap      - 8.1    :  7.3    : 11%
Pcap      - 3.7    :  3.3    : 11%
Pcap      - 9.3    :  8.85   : 5%
Pcap      - 27.5   :  27.5   : 0%
Pcap      - 16.1   :  17.4   : 9%


Please note the code is experimental, and we would love to hear your feedback on the performance, the card you are running, the conf settings used and alert accuracy.


* Future Work
  • Use other features provided by CUDA, the immediate ones being streams and texture memory.
  • Provide live mode support.
  • Explore the possibility of sending other buffers to the gpu.
  • Explore other cpu intensive tasks that can be offloaded to the gpu.

In the upcoming posts we will discuss Suricata's CUDA history and the code development that took place for each CUDA version we implemented.  We will also discuss the technical aspects behind the current CUDA code.

Friday 17 May 2013

Suricata transaction engine re-designed - Increased performance, better accuracy.


For quite some time we have wanted to improve the stateful detection engine inside Suricata.  The previous detection engine, although it worked fine in a way, had its issues, some of them being -

  • Repeated inspection of the same app state, transactions included; as a side effect the pattern matching engine carried out redundant runs on already matched buffers.
  • False positives (FPs) resulting from re-inspection of state.
  • False negatives (FNs).
  • FPs from cross-transaction matching.  For example, the sig below would FP with the old engine in the following scenario: neither request contains both "one" and "two" in its URI, but the old engine could match "one" against the first request and "two" against the second.  Assume we have a flow with 2 requests -
        GET /one.html HTTP/1.1
        GET /two.html HTTP/1.1

        alert http any any -> any any (content:"one"; http_uri; content:"two"; http_uri; sid:1;)


We have now re-designed the way we carry out inspection (master branch), and all the issues mentioned above have disappeared, along with noticeably improved performance.

The corresponding commits being -

commit b0f014124dbf44829ba04ed9d090ff268f7cb0ae
Author: Anoop Saldanha <anoopsaldanha@gmail.com>
Date:   Fri May 3 20:34:58 2013 +0530

    Transaction engine redesigned.....

commit e71de3f98f713fd4fe6cbccf42c51e59b0fca848
Author: Anoop Saldanha <anoopsaldanha@gmail.com>
Date:   Fri May 3 10:03:48 2013 +0530

    Track transaction progress separately......

commit 6ebd360c225ccffab0ec65099e4f0b4882945b25
Author: Anoop Saldanha <anoopsaldanha@gmail.com>
Date:   Fri Apr 12 13:18:17 2013 +0530

    hsbd mpm and packet mpm share same mpm ctx id.....

Here are the stats from a private pcap containing 12826 http requests, which alerts 4033 times with the old engine.

1. The number of times the pattern matching engine was called on buffers

----------------------------------------
mpm - old engine : new engine
----------------------------------------
uri - 179k : 13k
http client body - 222 : 173
http header toserver - 179k : 13k
http header toclient - 174k : 117k
http method - 143k : 10.5k
http cookie - 10.5k : 6.5k
http raw uri - 143k : 10.5k

As you can see, the number of pattern matching runs has been drastically reduced.

2. Alert accuracy has improved, with FPs and FNs disappearing.


3. Performance

-Default yaml-

Old engine - 40.5 seconds
New engine - 33.5 seconds
Performance increase  - 17.28%

I then modified HOME_NET and EXTERNAL_NET to "any", to increase the number of flows inspected by the engine, and also to increase the number of alerts.  These are the numbers obtained -

-Modified yaml-

Old engine - 70.5 seconds
New engine - 51.5 seconds
Performance increase - 27%

Suricata users with http-heavy traffic and a fairly http-heavy ruleset should see this increase in performance as well.


Please do note the code update requires some rigorous testing, so keep an eye out for missed alerts, segvs and other bugs.  Any form of testing for alerts, stability and performance is appreciated.

* Future Work

Work continues on further improving the detection engine, for better performance and a lot of cool new features.  Keep an eye on our mailing list.