Friday 21 June 2013

Suricata cuda engine re-designed


Suricata's current dev master includes a re-designed cuda architecture.  There's some history on Suricata's relationship with cuda which I will address in a different post.  Let's talk about the current cuda support.

Suricata performs various cpu intensive tasks, one of them being the pattern matching phase.  The engine has various pattern matching phases - packet, stream, application layer buffers, etc.  The heaviest of them all would be the packet and the stream ones.  With this version we have offloaded the packet mpm phase, running the aho-corasick algorithm on the gpu (more on the technical aspects in the coming posts).

These are the first and last commits in the sequence of commits that introduce the feature.  A git log should reveal the other related commits -

            commit 0e0218f089f87c82041bb56b497bd88a41d31040
            Author: Anoop Saldanha <anoopsaldanha@gmail.com>
            Date:   Thu Jun 20 23:26:23 2013 +0530

                Minor cosmetic changes to the cuda code.


            commit 276716bdd39f325209e03f94b855f0ed14b3b12a
            Author: Anoop Saldanha <poonaatsoc@gmail.com>
            Date:   Wed Aug 1 14:22:49 2012 +0530

                update cuda API wrappers ...



Currently the only supported interface/runmode is pcap-file/autofp.  Do note that we don't support live rule swap when cuda's being used.

* Configuring suricata to use Cuda

To enable and use the cuda accelerated pattern matcher, configure with --enable-cuda.
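For example, building from a git checkout would look something like this (apart from --enable-cuda these are just the standard autotools steps; adjust paths and any extra options to your own setup) -

            ./autogen.sh
            ./configure --enable-cuda
            make
            sudo make install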

Next, the new mpm that uses the gpu is called "ac-cuda".  You will have to modify the "mpm-algo" parameter in the conf to use it -
"mpm-algo: ac-cuda"

Other customizable options in the conf file -
cuda:
  mpm:
    data-buffer-size-min-limit: 0
    data-buffer-size-max-limit: 1500
    cudabuffer-buffer-size: 500mb
    gpu-transfer-size: 50mb
    batching-timeout: 2000
    device-id: 0
    cuda-streams: 2

The conf file has a brief explanation for each parameter.  Let me explain them again -
  • gpu-transfer-size: Probably the most important parameter at this point, since a lot of you won't be able to run cuda with the default value of 50mb (depending on the card you are using).  This parameter basically configures the max size of the buffer (the one holding all the packets) that is sent to the gpu for processing.  Based on the available memory on your card you will have to play with this value until you get cuda up and running on your system (see the example conf further down).
  • data-buffer-size-min-limit: The minimum payload size to be sent to the gpu.  For example, if we have it set to 10, all payloads < 10 would be run on the cpu, >= 10 on the gpu.  A value of 0 sets no limit.
  • data-buffer-size-max-limit: The maximum payload size to be sent to the gpu.  Works like the previous parameter, but as an upper limit.
  • batching-timeout: This parameter sets the timer for batching packets before they are sent to the gpu.  Do note this parameter is in microseconds.  For example, if you have a value of 10000 (10ms) set, the cpu would buffer packets for the next 10ms before sending them to the gpu.  You will have to play around with this value to find the performance sweet spot.
  • device-id: The device to use.  If you have multiple devices you can specify the device to use here.  You can use suricata --list-cuda-cards to list the configured cards on your system.
  • cuda-streams: Unused
  • cudabuffer-buffer-size: Internally we use a circular buffer to batch packets.  This parameter specifies the size of this circular buffer.  The buffer allocated using this parameter is page-locked.
You'll also have to increase the "max-pending-packets" parameter in the conf.  I have mine set to 65000 -
max-pending-packets: 65000

Also do note that we only support pcap-file/autofp at this moment.
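To put it all together, here is a sketch of the relevant conf bits for a card with limited memory (the gpu-transfer-size of 25mb below is purely illustrative; tune it, along with batching-timeout, for your own card and traffic) -

max-pending-packets: 65000

mpm-algo: ac-cuda

cuda:
  mpm:
    data-buffer-size-min-limit: 0
    data-buffer-size-max-limit: 1500
    cudabuffer-buffer-size: 500mb
    gpu-transfer-size: 25mb
    batching-timeout: 10000
    device-id: 0
    cuda-streams: 2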

Looks like we are done.  Time to take it for a drive.

* Performance

* Card used - GTX 480 - 15 multiprocessors, with 448 cores and Cuda Compute Capability 2.0.
* Host PC: AMD 620 quad core at 2.8GHz, with 6GB of RAM.
* Ruleset being tested - etpro without decoder rules.
* max-pending-packets: 65000
* batching-timeout: 10000

I have managed to run it against some pcaps and the gpu has been faster than or about as fast as the cpu on almost every occasion.  The alerts are intact on all runs.  Here are the results -

* time in seconds
Pcap_Name - CPU     :   GPU      : % Increase in Performance
Pcap            - 12.5     :   9.4        : 24%
Pcap            - 18.2     :   14.2      : 22%
Pcap            - 11        :   8.4        : 23%
Pcap            - 7.4       :   5.7        : 23%
Pcap            - 12.8     :   9.9        : 22%
Pcap            - 5.2       :   3.9        : 25%
Pcap            - 18.7     :   14.0      : 25%
Pcap            - 28.4     :   20         : 29%
Pcap            - 13.3     :   10.2      : 15%
Pcap            - 25.9     :   18.2      : 30%
Pcap            - 27.9     :   20.1      : 28%
Pcap            - 29.5     :   21.2      : 28%
Pcap            - 29.7     :   21.5      : 27%
Pcap            - 17.3     :   12.9      : 25%
Pcap            - 23.3     :   18.0      : 23%
Pcap            -  5.8      :   5.45      :  6%
Pcap            -  83       :   72         : 13%
Pcap            - 10440  :   9575     : 8%
Pcap            - 7445    :   7172     : 3.7%
Pcap            - 340      :   271       : 20%
Pcap            - 604      :   603       : -
Pcap            - 1480    :   1452     : 1.8%
Pcap            - 16.1     :   16.1      : 0%
Pcap            - 12.9     :   12.5      : 3%
Pcap            - 6.7       :   6.3        : 3%
Pcap            - 6.7       :   5.9        : 12%
Pcap            - 8.1       :   7.3        : 11%
Pcap            - 3.7       :   3.3        : 11%
Pcap            - 9.3       :   8.85      : 5%
Pcap            - 27.5     :   27.5      : 0%
Pcap            - 16.1     :   17.4      : -9%


Please note the code is experimental, and we would love to hear your feedback on the performance, the card you are running, the conf settings used and alert accuracy.


* Future Work
  • Use other features provided by cuda, the immediate ones being streams and texture memory.
  • Provide live mode support.
  • Explore the possibility of sending other buffers to the gpu.
  • Explore other cpu intensive tasks that can be offloaded to the gpu.

In the upcoming posts we will discuss suricata's cuda history and the code development that took place for each cuda version we implemented.  We will also discuss the technical aspects behind the current cuda code.

14 comments:

  1. Tips for OpenCL developers in general + anyone who is planning to port Suricata CUDA to use OpenCL -
    http://www.poona.me/2014/01/passing-opencl-clmem-device-address.html

  2. Hello,
    thanks for the post.
    What about CUDA support for live mode? Did you finish this work?

    Replies
    1. @Artem Yes, the support was added, and I did conduct some test runs. No performance numbers though.

      We have had reports though from users trying out cuda in live mode, and based on those it did perform better than the cpu.

    2. @poona, good to hear that.
      I'll be waiting for the post with performance numbers.
      It will be very interesting to see how CUDA helps with processing of network data stream in real time.

    3. @Artem I won't be coming out with performance numbers anytime soon, but we have requested one such user to come out with a post (either here or on his own blog) describing his perf numbers.

  3. Nice post poona, thank you! I built a configuration to benchmark suricata 2.0.6 with GPU support. The system contains a dual E5-2620 Xeon and a GTX980 GPU. I use Ubuntu 14.04 with Cuda toolkit 6.5 (the one for the GTX980).

    I did some benchmarking, but using only the CPU seems to be faster than using the GPU support. Did I miss something? Is there maybe a problem with the latest Cuda toolkit?

    Replies
    1. Hi, I'm trying to configure Suricata 2.0.6 with Cuda support for a GTX970 and Cuda toolkit 6.5. I'm running into a problem where I get a bunch of undefined references at link time. I was wondering if you had the same problem, and if so how you fixed it.

    2. This comment has been removed by the author.

    3. Hi Blake, I changed some lines in the file "src/Makefile". I changed the reference of sm_10 to sm_11, because sm_10 is deprecated in Cuda toolkit 6.5. Besides that I added sm_52 (please note this is done in several places). If you still have some problems I can send you my Makefile.

  4. The gpu code hasn't been updated in a very long time, so it might be a bit behind on the architecture front, both from the suricata cuda internals perspective and on the cuda toolkit front, and that can affect perf as well.

    Having said that, you can tune the current configuration by playing around with these parameters -

    max-pending-packets needs to be increased to a really high value.
    Back then the max value was 65k, but now I think you can go up much
    further.

    The batching-timeout parameter makes a lot of difference. It tells
    how long the cpu section of the code needs to batch packets together
    before sending them over.

    Replies
    1. Adding the sm_52 target to src/Makefile makes suricata with the GPU perform on par with running without the GPU.

      I did some benchmarking on the batching-timeout parameter, but afterwards I found some problems that made the benchmark unreliable. To be continued.

  5. Thank you for your quick reply.

    I already played around a bit with the parameters you describe, but I will do some further testing. Maybe an older version of the CUDA toolkit will do the job. What version did you use?

  6. Hi,
    I want to process packets in batches on the gpu. Do you have any sample code that can explain how to offload packets to the gpu and how to process them on the gpu? Any help is appreciated.

  7. You can go through suricata's codebase for this - util-mpm-ac.c
