Friday, 21 June 2013

Suricata cuda engine re-designed


Suricata's current dev master includes a re-designed cuda architecture.  There's some history on Suricata's relationship with cuda which I will address in a different post.  Let's talk about the current cuda support.

Suricata performs various cpu intensive tasks, one of them being pattern matching.  The engine has various pattern matching phases - packet, stream, application layer buffers, etc.  The heaviest of them all are the packet and stream phases.  With this version we have offloaded the packet mpm (multi-pattern matcher) phase to the gpu, where we use the aho-corasick algorithm (more on the technical aspects in the coming posts).
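For readers who haven't met aho-corasick before, here is a minimal cpu-side sketch of the idea in Python: build one automaton from all the patterns, then scan each payload once.  This is only an illustration, not suricata's gpu implementation; the function names build_automaton and search, and the dict-based state tables, are made up for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto/fail/output tables for a list of byte patterns."""
    goto = [{}]     # goto[state][byte] -> next state
    fail = [0]      # failure link for each state
    output = [[]]   # ids of the patterns that end at each state

    # Phase 1: insert every pattern into a trie.
    for pid, pattern in enumerate(patterns):
        state = 0
        for byte in pattern:
            if byte not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][byte] = len(goto) - 1
            state = goto[state][byte]
        output[state].append(pid)

    # Phase 2: breadth-first pass that fills in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for byte, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and byte not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(byte, 0)
            # Inherit matches that end at the failure state.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(payload, automaton):
    """Return (end_offset, pattern_id) for every match in payload."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for offset, byte in enumerate(payload):
        while state and byte not in goto[state]:
            state = fail[state]
        state = goto[state].get(byte, 0)
        for pid in output[state]:
            matches.append((offset, pid))
    return matches


if __name__ == "__main__":
    automaton = build_automaton([b"he", b"she", b"his", b"hers"])
    print(search(b"ushers", automaton))
```

The point to notice is that the payload is walked exactly once no matter how many patterns are loaded, which is what makes the algorithm attractive for running many packets in parallel.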

These are the last and first commits, in that order, of the sequence of commits that introduces the feature.  A git log should reveal the other related commits -

            commit 0e0218f089f87c82041bb56b497bd88a41d31040
            Author: Anoop Saldanha <anoopsaldanha@gmail.com>
            Date:   Thu Jun 20 23:26:23 2013 +0530

                Minor cosmetic changes to the cuda code.


            commit 276716bdd39f325209e03f94b855f0ed14b3b12a
            Author: Anoop Saldanha <poonaatsoc@gmail.com>
            Date:   Wed Aug 1 14:22:49 2012 +0530

                update cuda API wrappers ...



Currently the only supported interface/runmode is pcap-file/autofp.  Do note that we don't support live rule swap when cuda is being used.

* Configuring suricata to use Cuda

To build suricata with the cuda accelerated pattern matcher, pass --enable-cuda to configure.

The new mpm that uses the gpu is called "ac-cuda".  To use it, set the "mpm-algo" parameter in the conf -
"mpm-algo: ac-cuda"

Other customizable options in the conf file -
cuda:
  mpm:
    data-buffer-size-min-limit: 0
    data-buffer-size-max-limit: 1500
    cudabuffer-buffer-size: 500mb
    gpu-transfer-size: 50mb
    batching-timeout: 2000
    device-id: 0
    cuda-streams: 2

The conf file has a brief explanation for each parameter.  Here they are again, in more detail -
  • gpu-transfer-size: Probably the most important parameter at this point, since many of you won't be able to run cuda with the default value of 50mb (depending on the card you are using).  This parameter sets the maximum size of the buffer (the one holding all the batched packets) that is sent to the gpu for processing.  Based on the memory available on your card, you will have to experiment with this value until you get cuda up and running on your system.
  • data-buffer-size-min-limit: The minimum payload size to be sent to the gpu.  For example, if it is set to 10, all payloads < 10 are run on the cpu, and those >= 10 on the gpu.  A value of 0 sets no limit.
  • data-buffer-size-max-limit: The maximum payload size to be sent to the gpu.  Similar to the previous one, but as an upper bound.
  • batching-timeout: Sets the timer for batching packets before they are sent to the gpu.  Do note this parameter is in microseconds.  For example, with a value of 10000 (10ms), the cpu buffers packets for 10ms before sending them to the gpu.  You will have to play around with this value to find the performance sweet spot.
  • device-id: The device to use.  If you have multiple devices you can specify the one to use here.  You can use suricata --list-cuda-cards to list the configured cards on your system.
  • cuda-streams: Currently unused.
  • cudabuffer-buffer-size: Internally we use a circular buffer to batch packets.  This parameter specifies the size of this circular buffer.  The buffer allocated using this parameter is page-locked.
You will also have to increase the "max-pending-packets" parameter in the conf.  I have mine set to 65000 -
max-pending-packets: 65000
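To make the size limits and the batching parameters a bit more concrete, here is a rough Python sketch of how they could interact.  It is not the engine's code.  The names CudaMpmConf, goes_to_gpu and Batcher are invented for this example, and three things are assumptions on my part here: that 0 also means "no limit" for the max limit, that payloads above the max limit fall back to the cpu, and that a batch is handed over either when it would outgrow gpu-transfer-size or when batching-timeout has elapsed.

```python
from dataclasses import dataclass


@dataclass
class CudaMpmConf:
    # Field names mirror the conf keys; the class itself is hypothetical.
    data_buffer_size_min_limit: int = 0          # 0 = no limit
    data_buffer_size_max_limit: int = 1500       # 0 = no limit (assumed)
    gpu_transfer_size: int = 50 * 1024 * 1024    # "50mb", assuming 1mb = 1024 * 1024 bytes
    batching_timeout: int = 2000                 # microseconds


def goes_to_gpu(payload_len, conf):
    """Decide whether a payload of this size is matched on the gpu."""
    lo = conf.data_buffer_size_min_limit
    hi = conf.data_buffer_size_max_limit
    if lo and payload_len < lo:
        return False    # too small, run on the cpu
    if hi and payload_len > hi:
        return False    # too large, run on the cpu (assumed)
    return True


class Batcher:
    """Collects gpu-bound payloads until a flush condition is met."""

    def __init__(self, conf):
        self.conf = conf
        self.payloads = []
        self.size = 0
        self.started_at = None   # time (us) the current batch was opened

    def add(self, payload, now_us):
        """Queue a payload; return the previous batch if it had to be sent first."""
        flushed = None
        if self.payloads and (
            self.size + len(payload) > self.conf.gpu_transfer_size
            or now_us - self.started_at >= self.conf.batching_timeout
        ):
            flushed = self.flush()
        if not self.payloads:
            self.started_at = now_us
        self.payloads.append(payload)
        self.size += len(payload)
        return flushed

    def flush(self):
        """Hand over the current batch and start an empty one."""
        batch, self.payloads, self.size = self.payloads, [], 0
        self.started_at = None
        return batch
```

A real implementation would flush on a timer rather than only when the next packet arrives; the sketch checks the timeout inside add() just to keep it short.  It should still show why a larger batching-timeout means fewer, bigger transfers to the gpu, at the cost of packets waiting longer on the cpu side.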

Also do note that we only support pcap-file/autofp at this moment.

Looks like we are done.  Time to take it for a drive.

* Performance

* Card used - GTX 480 - 15 multiprocessors, with 448 cores and cuda compute capability 2.0.
* Host PC: AMD 620 quad core at 2.8 GHz, with 6 GB of RAM.
* Ruleset being tested - etpro without decoder rules.
* max-pending-packets: 65000
* batching-timeout: 10000

I have managed to run it against some pcaps, and the gpu has been faster than or as fast as the cpu on all but one of them (the last run listed below, where the gpu took 17.4s against the cpu's 16.1s).  The alerts are intact on all runs.  Here are the results -

* time in seconds
Pcap_Name - CPU    : GPU    : % Increase in Performance
Pcap      - 12.5   : 9.4    : 24%
Pcap      - 18.2   : 14.2   : 22%
Pcap      - 11     : 8.4    : 23%
Pcap      - 7.4    : 5.7    : 23%
Pcap      - 12.8   : 9.9    : 22%
Pcap      - 5.2    : 3.9    : 25%
Pcap      - 18.7   : 14.0   : 25%
Pcap      - 28.4   : 20     : 29%
Pcap      - 13.3   : 10.2   : 15%
Pcap      - 25.9   : 18.2   : 30%
Pcap      - 27.9   : 20.1   : 28%
Pcap      - 29.5   : 21.2   : 28%
Pcap      - 29.7   : 21.5   : 27%
Pcap      - 17.3   : 12.9   : 25%
Pcap      - 23.3   : 18.0   : 23%
Pcap      - 5.8    : 5.45   : 6%
Pcap      - 83     : 72     : 13%
Pcap      - 10440  : 9575   : 8%
Pcap      - 7445   : 7172   : 3.7%
Pcap      - 340    : 271    : 20%
Pcap      - 604    : 603    : -
Pcap      - 1480   : 1452   : 1.8%
Pcap      - 16.1   : 16.1   : 0%
Pcap      - 12.9   : 12.5   : 3%
Pcap      - 6.7    : 6.3    : 3%
Pcap      - 6.7    : 5.9    : 12%
Pcap      - 8.1    : 7.3    : 11%
Pcap      - 3.7    : 3.3    : 11%
Pcap      - 9.3    : 8.85   : 5%
Pcap      - 27.5   : 27.5   : 0%
Pcap      - 16.1   : 17.4   : 9%
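The last column appears to be the reduction in run time relative to the cpu run, i.e. (CPU - GPU) / CPU.  That reading is an inference from the numbers rather than something stated above: it reproduces most rows, though a few (for example 13.3 vs 10.2) don't match it exactly.  A small Python helper, with a made-up name, for computing the same figure from your own runs:

```python
def percent_improvement(cpu_seconds, gpu_seconds):
    """Reduction in run time as a percentage of the cpu time.

    A negative result means the gpu run was slower than the cpu run.
    """
    return (cpu_seconds - gpu_seconds) / cpu_seconds * 100.0


if __name__ == "__main__":
    # A few (cpu, gpu) timings taken from the table above.
    for cpu, gpu in [(12.5, 9.4), (18.2, 14.2), (83, 72), (16.1, 17.4)]:
        print(f"{cpu} -> {gpu}: {percent_improvement(cpu, gpu):.1f}%")
```

If you send in your own numbers, quoting both raw times along with the percentage makes them easier to compare.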


Please note the code is experimental, and we would love to hear your feedback on the performance, the card you are running, the conf settings used and alert accuracy.


* Future Work
  • Use other features provided by cuda, the immediate ones being streams and texture memory.
  • Provide live mode support.
  • Explore the possibility of sending other buffers to the gpu.
  • Explore other cpu intensive tasks that can be offloaded to the gpu.

In the upcoming posts we will discuss suricata's cuda history and the code development that took place for each cuda version we implemented.  We will also discuss the technical aspects behind the current cuda code.

5 comments:

  1. Tips for OpenCL developers in general + anyone who is planning to port Suricata CUDA to use OpenCL -
    http://www.poona.me/2014/01/passing-opencl-clmem-device-address.html

  2. Hello,
    thanks for the post.
    What about CUDA support for live mode? Did you finish this work?

    Replies
    1. @Artem Yes, the support was added, and I did conduct some test runs. No performance numbers though.

      We have had reports from users trying out cuda in live mode though, and according to them it did perform better than the cpu.

    2. @poona, good to hear that.
      I'll be waiting for the post with performance numbers.
      It will be very interesting to see how CUDA helps with processing of network data stream in real time.

    3. @Artem I won't be coming out with performance numbers anytime soon, but we have requested one such user to come out with a post (either here or his own) describing his perf numbers.
