Friday 21 June 2013

Suricata cuda engine re-designed


Suricata's current dev master includes a re-designed cuda architecture.  There's some history on Suricata's relationship with cuda which I will address in a different post.  Let's talk about the current cuda support.

Suricata performs various cpu intensive tasks, one of them being the pattern matching phase.  The engine has various pattern matching phases - packet, stream, application layer buffers, etc.  The heaviest of them all would be the packet and the stream ones.  With this version we have offloaded the packet mpm phase, running the aho-corasick algorithm on the gpu (more on the technical aspects in the coming posts).

These are the first and last commits in the sequence of commits that introduce the feature.  A git log should reveal the other related commits -

            commit 0e0218f089f87c82041bb56b497bd88a41d31040
            Author: Anoop Saldanha <anoopsaldanha@gmail.com>
            Date:   Thu Jun 20 23:26:23 2013 +0530

                Minor cosmetic changes to the cuda code.


            commit 276716bdd39f325209e03f94b855f0ed14b3b12a
            Author: Anoop Saldanha <poonaatsoc@gmail.com>
            Date:   Wed Aug 1 14:22:49 2012 +0530

                update cuda API wrappers ...



Currently the only supported interface/runmode is pcap-file/autofp.  Do note that we don't support live rule swap when cuda's being used.

* Configuring suricata to use Cuda

To enable and use the cuda accelerated pattern matcher, configure with --enable-cuda.
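For example, building from a git checkout would look something like this (apart from --enable-cuda these are just the standard autotools steps; adjust paths and any extra options to your own setup) -

            ./autogen.sh
            ./configure --enable-cuda
            make
            sudo make install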

Next, the new mpm that uses the gpu is called "ac-cuda".  You will have to modify the "mpm-algo" parameter in the conf to use it -
"mpm-algo: ac-cuda"

Other customizable options in the conf file -
cuda:
  mpm:
    data-buffer-size-min-limit: 0
    data-buffer-size-max-limit: 1500
    cudabuffer-buffer-size: 500mb
    gpu-transfer-size: 50mb
    batching-timeout: 2000
    device-id: 0
    cuda-streams: 2

The conf file has a brief explanation for each parameter.  Let me explain them again -
  • gpu-transfer-size: Probably the most important parameter at this point, since a lot of you won't be able to run cuda with the default value of 50mb (depending on the card you are using).  This parameter basically configures the max size of the buffer (the one holding all the packets) that is sent to the gpu for processing.  Based on the available memory on your card you will have to play with this value until you get cuda up and running on your system (see the example conf further down).
  • data-buffer-size-min-limit: The minimum payload size to be sent to the gpu.  For example, if we have it set to 10, all payloads < 10 would be run on the cpu, >= 10 on the gpu.  A value of 0 sets no limit.
  • data-buffer-size-max-limit: The maximum payload size to be sent to the gpu.  Works like the previous parameter, but as an upper limit.
  • batching-timeout: This parameter sets the timer for batching packets before they are sent to the gpu.  Do note this parameter is in microseconds.  For example, if you have a value of 10000 (10ms) set, the cpu would buffer packets for the next 10ms before sending them to the gpu.  You will have to play around with this value to find the performance sweet spot.
  • device-id: The device to use.  If you have multiple devices you can specify the device to use here.  You can use suricata --list-cuda-cards to list the configured cards on your system.
  • cuda-streams: Unused
  • cudabuffer-buffer-size: Internally we use a circular buffer to batch packets.  This parameter specifies the size of this circular buffer.  The buffer allocated using this parameter is page-locked.
You'll also have to increase the "max-pending-packets" parameter in the conf.  I have mine set to 65000 -
max-pending-packets: 65000

Also do note that we only support pcap-file/autofp at this moment.
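To put it all together, here is a sketch of the relevant conf bits for a card with limited memory (the gpu-transfer-size of 25mb below is purely illustrative; tune it, along with batching-timeout, for your own card and traffic) -

max-pending-packets: 65000

mpm-algo: ac-cuda

cuda:
  mpm:
    data-buffer-size-min-limit: 0
    data-buffer-size-max-limit: 1500
    cudabuffer-buffer-size: 500mb
    gpu-transfer-size: 25mb
    batching-timeout: 10000
    device-id: 0
    cuda-streams: 2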

Looks like we are done.  Time to take it for a drive.

* Performance

* Card used - GTX 480 - 15 multiprocessors, with 448 cores and Cuda Compute Capability 2.0.
* Host PC: AMD 620 quad core at 2.8GHz, with 6GB of RAM.
* Ruleset being tested - etpro without decoder rules.
* max-pending-packets: 65000
* batching-timeout: 10000

I have managed to run it against some pcaps and the gpu has been faster than or about as fast as the cpu on almost every occasion.  The alerts are intact on all runs.  Here are the results -

* time in seconds
Pcap_Name - CPU     :   GPU      : % Increase in Performance
Pcap            - 12.5     :   9.4        : 24%
Pcap            - 18.2     :   14.2      : 22%
Pcap            - 11        :   8.4        : 23%
Pcap            - 7.4       :   5.7        : 23%
Pcap            - 12.8     :   9.9        : 22%
Pcap            - 5.2       :   3.9        : 25%
Pcap            - 18.7     :   14.0      : 25%
Pcap            - 28.4     :   20         : 29%
Pcap            - 13.3     :   10.2      : 15%
Pcap            - 25.9     :   18.2      : 30%
Pcap            - 27.9     :   20.1      : 28%
Pcap            - 29.5     :   21.2      : 28%
Pcap            - 29.7     :   21.5      : 27%
Pcap            - 17.3     :   12.9      : 25%
Pcap            - 23.3     :   18.0      : 23%
Pcap            -  5.8      :   5.45      :  6%
Pcap            -  83       :   72         : 13%
Pcap            - 10440  :   9575     : 8%
Pcap            - 7445    :   7172     : 3.7%
Pcap            - 340      :   271       : 20%
Pcap            - 604      :   603       : -
Pcap            - 1480    :   1452     : 1.8%
Pcap            - 16.1     :   16.1      : 0%
Pcap            - 12.9     :   12.5      : 3%
Pcap            - 6.7       :   6.3        : 3%
Pcap            - 6.7       :   5.9        : 12%
Pcap            - 8.1       :   7.3        : 11%
Pcap            - 3.7       :   3.3        : 11%
Pcap            - 9.3       :   8.85      : 5%
Pcap            - 27.5     :   27.5      : 0%
Pcap            - 16.1     :   17.4      : -9%


Please note the code is experimental, and we would love to hear your feedback on the performance, the card you are running, the conf settings used and alert accuracy.


* Future Work
  • Use other features provided by cuda, the immediate ones being streams and texture memory.
  • Provide live mode support.
  • Explore the possibility of sending other buffers to the gpu.
  • Explore other cpu intensive tasks that can be offloaded to the gpu.

In the upcoming posts we will discuss suricata's cuda history and the code development that took place for each cuda version we implemented.  We will also discuss the technical aspects behind the current cuda code.

14 comments:

  1. Tips for OpenCL developers in general + anyone who is planning to port Suricata CUDA to use OpenCL -
    http://www.poona.me/2014/01/passing-opencl-clmem-device-address.html

  2. Hello,
    thanks for the post.
    What about CUDA support for live mode? Did you finish this work?

    Replies
    1. @Artem Yes, the support was added, and I did conduct some test runs. No performance numbers though.

      We have had reports though from users trying out cuda in live mode, and based on those it did perform better than the cpu.

    2. @poona, good to hear that.
      I'll be waiting for the post with performance numbers.
      It will be very interesting to see how CUDA helps with processing of network data stream in real time.

    3. @Artem I won't be coming out with performance numbers anytime soon, but we have requested one such user to come out with a post (either here or on his own blog) describing his perf numbers.

  3. Nice post poona, thank you! I built a configuration to benchmark suricata 2.0.6 with GPU support. The system contains a dual E5-2620 Xeon and a GTX980 GPU. I use Ubuntu 14.04 with Cuda toolkit 6.5 (the one for the GTX980).

    I did some benchmarking, but using only the CPU seems to be faster than using the GPU support. Did I miss something? Is there maybe a problem with the latest Cuda toolkit?

    Replies
    1. Hi, I'm trying to configure Suricata 2.0.6 with Cuda support for a GTX970 and Cuda toolkit 6.5. I'm running into a problem where I get a bunch of undefined references at link time. I was wondering if you had the same problem, and if so how you fixed it.

    2. This comment has been removed by the author.

    3. Hi Blake, I changed some lines in the file "src/Makefile". I changed the reference of sm_10 to sm_11, because sm_10 is deprecated in Cuda toolkit 6.5. Besides that I added sm_52 (please note this is done in several places). If you still have some problems I can send you my Makefile.

  4. The gpu code hasn't been updated in a very long time, so it might be a bit behind on the architecture front, both from the suricata cuda internals perspective and on the cuda toolkit front, and that can affect perf as well.

    Having said that, you can tune the current configuration by playing around with these parameters -

    max-pending-packets needs to be increased to a really high value.
    Back then the max value was 65k, but now I think you can go up much
    further.

    The batching-timeout parameter makes a lot of difference. It tells
    how long the cpu section of the code needs to batch packets together
    before sending them over.

    Replies
    1. Adding the sm_52 target to src/Makefile makes suricata with the GPU perform on par with running without the GPU.

      I did some benchmarking on the batching-timeout parameter, but afterwards I found some problems that made the benchmark unreliable. To be continued.

  5. Thank you for your quick reply.

    I already played around a bit with the parameters you describe, but I will do some further testing. Maybe an older version of the CUDA toolkit will do the job. What version did you use?

  6. Hi,
    I want to process packets in batches on the gpu. Do you have any sample code that can explain how to offload packets to the gpu and how to process them on the gpu? Any help is appreciated.

  7. You can go through suricata's codebase for this - util-mpm-ac.c
