Thursday, 23 January 2014

Passing a OpenCL cl_mem device address from host to the device, but not as a kernel argument. Pointers for Suricata OpenCL porters.


This post is not specific to Suricata, but rather a generic one, that can help most devs who write OpenCL code + the ones who want to implement OpenCL support inside suricata.  Have been seeing quite a few attempts on porting suricata's CUDA support to use OpenCL.  Before we experimented with CUDA, we had given OpenCL a shot back in the early OpenCL days, when the drivers were in it's infancy and had a ton of bugs, and we, a ton of segvs, leaving us with no clue as to where the bug was - the driver or the code.  The driver might be a lot stabler today, of course.

Either ways, supporting OpenCL in suricata  should be a pretty straightforward task, but there's one issue that needs to be kept in mind while carrying out this port.  Something most folks who contacted me during their port, got stuck at.  And also a question a lot of OpenCL devs have on passing a memory object as a part of a byte stream, structure and not as a kernel argument.

Let's get to the topic at hand.  I will use the example of suricata to explain the issue.

What's the issue?

Suricata buffers a payload and along with the payload, specifies a gpu memory address(cl_mem) that points to the pattern matching state table that the corresponding payload should be matched against.  With CUDA the memory address we are buffering is of type "CUdeviceptr", that is allocated using the call cuMemAlloc().  The value stored inside CUdeviceptr is basically an address from the gpu address space(not a handle).  You can test this by writing a simple program like the one I have below for OpenCL.  You can also check this article that confirms the program's findings.

With OpenCL, cl_mem is defined to be a handle against an address in the gpu address space.  I would have expected Nvidia'a OpenCL implementation to show a behaviour that was similar to it's cuda library, i.e. the handle being nothing but an address in the gpu address space, but it isn't the case(probably has something to do do with the size of cl_mem?).  We can't directly pass the cl_mem handle value as the device address.  We will need to extract the device address out for a particular cl_mem handle, and pass this retrieved value instead.

Here is a sample program -

==get_address.cu==

__kernel void get_address(__global ulong *c)
{
    *c = (ulong)c;
}

==get_address.c==

unsigned long get_address(cl_kernel kernel_address,
                                            cl_command_queue command_queue,
                                            cl_mem dst_mem)
{
    unsigned long result_address = 0;

    BUG_ON(clSetKernelArg(kernel_address, 0,
                                             sizeof(dst_mem), &dst_mem) < 0);

    BUG_ON(clEnqueueNDRangeKernel(command_queue,
                                                                kernel_address,
                                                                1,
                                                                NULL,
                                                                &global_work_size,
                                                                &local_work_size,
                                                                0, NULL,
                                                                 NULL) < 0);
    BUG_ON(clEnqueueReadBuffer(command_queue,
                                                        dst_mem,
                                                        CL_TRUE,
                                                        0,
                                                        sizeof(result_address),
                                                        &result_address,
                                                        0, NULL,
                                                         NULL) < 0);
    return result_address;
}

* Untested code.  Code written keeping in mind a 64 bit hardware on the gpu and the cpu.

Using the above get_address() function should get you the gpu address for a cl_mem instance, and the returned value is what should be passed to the gpu as the address, in place of CUDA's CUdeviceptr.  It's sort of a hack, but it should work.

Another question that pops up in my head is, would the driver change the memory allocated against a handle?  Any AMD/Nvidia driver folks can answer this?

Any alternate solutions(apart from passing all of it as kernel arguments :) ) welcome.


Monday, 13 January 2014

Suricata app layer changes. New keyword - app-layer-protocol introduced

Suricata current master has undergone some major rewrite on its app layer code. This includes app layer protocol detection and the app layer parsing phase. While doing this, it has also introduced a new keyword - "app-layer-protocol". There are certain other changes on how we can now specify an app layer protocol in a rule and how it interacts with an "ipproto(both rule specified or through ipproto keyword)".

App layer rewite

Let me start by introducing the app layer rewrite. The old app layer code had the protocol detection and parsing phase all jumbled up. There was no proper separation between the two. Also the internal app layer protocol registration didn't have a hierarchy based on ip protocol, which meant one couldn't register anything against an app layer protocol without modifying the protocol name by appending the ipproto to it.

For example, to specify a signature to match on the udp and tcp variant of dns protocol respectively, one would have to write -

    alert dnsudp ......  /* to match on dns udp */
    alert dnstcp .....   /* to match on dns tcp */

This is now replaced by the cleaner -

     alert dns (ipproto:udp; ); OR 
     alert udp (app-layer-protocol:dns;) OR 
     alert ip (app-layer-protocol:dns; ipproto:udp;)  

New keyword: "app-layer-protocol"

This feature came up with the need to match on negated protocols(feature #727).   An an example we want to match on the string "foo" on all app layer streams which are not http -

alert tcp any any -> any any (app-layer-protocol:!"http"; content:"foo"; sid:1;) 

Interaction between app-layer-protocol, ipproto and the protocol specified using alert <protocol>

Let's work with some examples.

- Match on dns protocol against all ip-protocols.

  alert ip (app-layer-protocol:dns;) 
  alert dns ()

- Match on udp version of dns protocol.

  alert udp (app-layer-protocol:dns;) 
  alert dns (ipproto:udp;) 
  alert ip (app-layer-protocol:dns; ipproto:udp;)

- Match on tcp and udp version of dns protocol.

  alert udp (app-layer-protocol:dns; ipproto:tcp; ) /* XXX Nooooooo...... */
  The above is not allowed.  ipproto keyword can be used only with alert ip.

  alert dns (ipproto:tcp; ipproto:udp;)
  alert ip (app-layer-protocol:dns; ipproto:udp; ipproto:tcp;)

What do all these changes mean for the engine

- We have a neater app layer phase. There's a clear separation between the app layer protocol module and app layer parser module.

- Ipproto based hierarchy in any registration related to an app layer protocol. From a rule_write/user perspective, this removes the need to tag an ipproto along with the app_protocol name. For example, we no longer have dnstcp and dnsudp.

- Introduction of the new "app-layer-protocol" keyword allows for richer specification when used along with other keywords like "ipproto".

- Conf yaml changes

  Previously,
  dnstcp:
      enabled: yes
      detection-ports:
          tcp: toserver: 53
  dnsudp:
      enabled: yes
      detection-ports:
          udp: toserver: 53

  Now,
  dns:
      tcp:
          enabled: yes
          detection-ports:
              toserver: 53
      udp:
          enabled: yes
          detection-ports:
              toserver: 53

Friday, 21 June 2013

Suricata cuda engine re-designed


Suricata's current dev master includes a re-designed cuda architecture.  There's some history on Suricata's relationship with cuda which I will address in a different post.  Let's talk about the current cuda support.

Suricata performs various cpu intensive talks.  One of those being the pattern matching phase.  The engine has various pattern matching phases - packet, stream, application layer buffers, etc..  The heaviest of them all would be the packet and the stream ones.  With our version we have offloaded the packet mpm phase.  We use the aho-corasick algorithm on the gpu(more on the technical aspects in the coming posts).

These are the first and last commits in sequence of commits that introduces the feature.  A git log should reveal other related commits -

            commit 0e0218f089f87c82041bb56b497bd88a41d31040
            Author: Anoop Saldanha <anoopsaldanha@gmail.com>
            Date:   Thu Jun 20 23:26:23 2013 +0530

                Minor cosmetic changes to the cuda code.


            commit 276716bdd39f325209e03f94b855f0ed14b3b12a
            Author: Anoop Saldanha <poonaatsoc@gmail.com>
            Date:   Wed Aug 1 14:22:49 2012 +0530

                update cuda API wrappers ...



Currently the only supported interface/runmode is pcap-file/autofp.  Do note that we don't support live rule swap when cuda's being used.

* Configuring suricata to use Cuda

To enable and use the cuda accelerated pattern matcher, configure with --enable-cuda.

Next the new mpm that uses the gpu is called "ac-cuda".  You will have to modify the "mpm-algo" parameter in the conf to use it -
"mpm-algo: ac-cuda"

Other customizable options in the conf file -
cuda:
  mpm:
    data-buffer-size-min-limit: 0
    data-buffer-size-max-limit: 1500
    cudabuffer-buffer-size: 500mb
    gpu-transfer-size: 50mb
    batching-timeout: 2000
    device-id: 0
    cuda-streams: 2

The conf file has a brief explanation for each parameter.  Let me explain them again
  • gpu-transfer-size: Probably the most important parameter at this point, since a lot of you all won't be able to run cuda with the default value of 50mb(depending on the card you are using).  This parameter basically configures the max size of the buffer(one holding all the packets) that would be sent to the gpu for processing.  Based on the available memory on your card you will have to play with this value until you get cuda up and running on your system.
  • data-buffer-size-min-limit: The minimum payload size to be sent to the gpu.  For example, if we have it set to 10, all payloads < 10 would be run on the cpu, >= 10 on the gpu.  A value of 0 sets no limit.
  • data-buffer-size-max-limit: The maximum payload size to be sent to the gpu.  Similar to the previous one
  • batching-timeout: This parameter sets the timer for batching packets before being sent to the gpu.  Do note this parameter is in microseconds.  For example, if you have a value of 10000(10ms) set, the cpu would buffer packets for the next 10ms before sending them to the gpu.  You will have to play around with this value to find the performance sweet spot.
  • device-id: The device to use.  If you have multiple devices you can specify the device to use here.  You can use suricata --list-cuda-cards to list the configured cards on your system.
  • cuda-streams: Unused
  • cudabuffer-buffer-size: Internally we use a circular buffer to batch packets.  This parameter specifies the size of this circular buffer.  The buffer allocated using this parameter is page-locked.
You'll have to also increase the "max-pending-packets" parameter in the conf.  I have mine set to 65000.
max-pending-packets: 65000

Also do note that we only support pcap-file/autofp at this moment.

Looks like we are done.  Time to take it for a drive.

* Performance

* Card used - GTX 480 - 15 multiprocessors, with 448 cores and Cuda Compatibility 2.0.
* Host PC: AMD 620 quad core at 2.8Ghz, with 6GB of ram.
* Ruleset being tested - etpro without decoder rules.
* max-pending-packets: 65000
* batching-timeout: 10000

I have managed to run it against some pcaps and the gpu has been faster or as fast as the cpu on all occasions.  The alerts are intact on all runs.  Here are the results -

* time in seconds
Pcap_Name - CPU     :   GPU      : % Increase in Performance
Pcap            - 12.5     :   9.4        : 24%
Pcap            - 18.2     :   14.2      : 22%
Pcap            - 11        :   8.4        : 23%
Pcap            - 7.4       :   5.7        : 23%
Pcap            - 12.8     :   9.9        : 22%
Pcap            - 5.2       :   3.9        : 25%
Pcap            - 18.7     :   14.0      : 25%
Pcap            - 28.4     :   20         : 29%
Pcap            - 13.3     :   10.2      : 15%
Pcap            - 25.9     :   18.2      : 30%
Pcap            - 27.9     :   20.1      : 28%
Pcap            - 29.5     :   21.2      : 28%
Pcap            - 29.7     :   21.5      : 27%
Pcap            - 17.3     :   12.9      : 25%
Pcap            - 23.3     :   18.0      : 23%
Pcap            -  5.8      :   5.45      :  6%
Pcap            -  83       :   72         : 13%
Pcap            - 10440  :   9575     : 8%
Pcap            - 7445    :   7172     : 3.7%
Pcap            - 340      :   271       : 20%
Pcap            - 604      :   603       : -
Pcap            - 1480    :   1452     : 1.8%
Pcap            - 16.1     :   16.1      : 0%
Pcap            - 12.9     :   12.5      : 3%
Pcap            - 6.7       :   6.3        : 3%
Pcap            - 6.7       :   5.9        : 12%
Pcap            - 8.1       :   7.3        : 11%
Pcap            - 3.7       :   3.3        : 11%
Pcap            - 9.3       :   8.85      : 5%
Pcap            - 27.5     :   27.5      : 0%
Pcap            - 16.1     :   17.4      : 9%


Please note the code is experimental, and we would love to your hear your feedback on the performance, card you are running, conf settings used and alert accuracy.


* Future Work
  • Use other features provided by cuda, the immediate ones being streams and texture memory.
  • Provide live mode support.
  • Explore the possibility of sending other buffers to the gpu.
  • Explore other cpu intensive tasks that can be offloaded to the gpu.

In the upcoming posts we will discuss suricata's cuda history and the code development that took place for each cuda version we implemented.  We will also discuss the technical aspects behind the current cuda code.

Friday, 17 May 2013

Suricata transaction engine re-designed - Increased performance, better accuracy.


For quite sometime we wanted to improve the stateful detection engine inside suricata.  The previous detection engine although worked fine in a way, had its issues, some of them being these -

  • Repeated inspection of same app state, transactions included, and as a side effect of this we had the pattern matching engine carrying out redundant runs on already matched buffers.
  • FPs resulting from re-inspection of state.
  • FNs.
  • FPs from cross transaction matching.  For example the following sig would FP with the old engine for the below scenario.  Assume we have a flow with 2 requests -
        GET /one.html HTTP/1.1
        GET /two.html HTTP/1.1

        alert http any any -> any any (content:"one"; http_uri; content:"two"; http_uri; sid:1;)


We have now re-designed the way we carry out inspection(master branch), and all the above issues mentioned have disappeared, along with noticeably improved performance.

The corresponding commits being -

commit b0f014124dbf44829ba04ed9d090ff268f7cb0ae
Author: Anoop Saldanha <anoopsaldanha@gmail.com>
Date:   Fri May 3 20:34:58 2013 +0530

    Transaction engine redesigned.....

commit e71de3f98f713fd4fe6cbccf42c51e59b0fca848
Author: Anoop Saldanha <anoopsaldanha@gmail.com>
Date:   Fri May 3 10:03:48 2013 +0530

    Track transaction progress separately......

commit 6ebd360c225ccffab0ec65099e4f0b4882945b25
Author: Anoop Saldanha <anoopsaldanha@gmail.com>
Date:   Fri Apr 12 13:18:17 2013 +0530

    hsbd mpm and packet mpm share same mpm ctx id.....

Here are the stats from a private pcap containing 12826 http requests and which alerts 4033 times with the old engine.

1. The no of times the pattern matching engine was called on buffers

----------------------------------------
mpm - old engine : new engine
----------------------------------------
uri - 179k : 13k
http client body - 222 : 173
http header toserver - 179k : 13k
http header toclient - 174k : 117k
http method - 143k : 10.5k
http cookie - 10.5k : 6.5k
http raw uri - 143k : 10.5k

As you can see the pattern matching runs has drastically reduced.

2. Alert accuracy has been improved, with FPs and FN's disappearing.


-Default yaml-

Old engine - 40.5 seconds
New engine - 33.5 seconds
Performance increase  - 17.28%

I then modified the HOME_NET and EXTERNAL_NET to any, to increase the flows inspected by the engine, and to also increase the no of alerts.  These are the numbers obtained -

-Modified yaml-

Old engine - 70.5
New engine - 51.5
Performance increase - 27%

Suricata users with http heavy traffic and a fairly http heavy ruleset, should see the increase in performance as well.


Please do note the code update requires some rigorous testing, so keep an eye out for missed alerts, segvs and other bugs.  Any form of testing for alert, stability, and performance is appreciated.

=Future Work=

Effort continues to further improve the detection engine for better performance and a lot of cool new features.  Keep an eye out for our mailing list.