Thursday 23 January 2014

Passing an OpenCL cl_mem device address from the host to the device, but not as a kernel argument. Pointers for Suricata OpenCL porters.


This post is not specific to Suricata, but rather a generic one that should help most devs who write OpenCL code, plus the ones who want to implement OpenCL support inside Suricata.  I have been seeing quite a few attempts at porting Suricata's CUDA support to use OpenCL.  Before we experimented with CUDA, we had given OpenCL a shot back in the early OpenCL days, when the drivers were in their infancy and had a ton of bugs, and we, a ton of segvs, leaving us with no clue as to where the bug was - the driver or the code.  The drivers might be a lot stabler today, of course.

Either way, supporting OpenCL in Suricata should be a pretty straightforward task, but there's one issue that needs to be kept in mind while carrying out this port - something most folks who contacted me during their port got stuck at.  It's also a question a lot of OpenCL devs have: how to pass a memory object as part of a byte stream or structure, and not as a kernel argument.

Let's get to the topic at hand.  I will use the example of suricata to explain the issue.

What's the issue?

Suricata buffers a payload and, along with the payload, specifies a gpu memory address (cl_mem) that points to the pattern matching state table the corresponding payload should be matched against.  With CUDA, the memory address we buffer is of type "CUdeviceptr", allocated using the call cuMemAlloc().  The value stored inside a CUdeviceptr is basically an address from the gpu address space (not a handle).  You can test this by writing a simple program like the one I have below for OpenCL.  You can also check this article that confirms the program's findings.

With OpenCL, cl_mem is defined to be a handle against an address in the gpu address space.  I would have expected Nvidia's OpenCL implementation to show behaviour similar to its CUDA library, i.e. the handle being nothing but an address in the gpu address space, but that isn't the case (probably has something to do with the size of cl_mem?).  We can't directly pass the cl_mem handle value as the device address.  We need to extract the device address for a particular cl_mem handle, and pass this retrieved value instead.

Here is a sample program -

==get_address.cl==

__kernel void get_address(__global ulong *c)
{
    *c = (ulong)c;
}

==get_address.c==

unsigned long get_address(cl_kernel kernel_address,
                          cl_command_queue command_queue,
                          cl_mem dst_mem)
{
    unsigned long result_address = 0;
    /* A single work-item is enough to retrieve the address. */
    size_t global_work_size = 1;
    size_t local_work_size = 1;

    BUG_ON(clSetKernelArg(kernel_address, 0,
                          sizeof(dst_mem), &dst_mem) != CL_SUCCESS);

    BUG_ON(clEnqueueNDRangeKernel(command_queue,
                                  kernel_address,
                                  1,
                                  NULL,
                                  &global_work_size,
                                  &local_work_size,
                                  0, NULL,
                                  NULL) != CL_SUCCESS);

    /* Blocking read: copy the address the kernel wrote back to the host. */
    BUG_ON(clEnqueueReadBuffer(command_queue,
                               dst_mem,
                               CL_TRUE,
                               0,
                               sizeof(result_address),
                               &result_address,
                               0, NULL,
                               NULL) != CL_SUCCESS);

    return result_address;
}

* Untested code.  Written with 64-bit hardware in mind, on both the gpu and the cpu.

Using the above get_address() function should get you the gpu address for a cl_mem instance, and the returned value is what should be passed to the gpu as the address, in place of CUDA's CUdeviceptr.  It's sort of a hack, but it should work.
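To make the flow concrete, here's a rough, equally untested sketch of how get_address() could be plugged into the buffering path.  The context, device and command_queue handles, the state table buffer and the payload_buffer/address_offset layout are made-up placeholders for illustration (not Suricata code), and error handling is stripped down -

/* Build the get_address kernel; the source could equally be read in
   from get_address.cl. */
const char *src =
    "__kernel void get_address(__global ulong *c) { *c = (ulong)c; }";
cl_int err;
cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel_address = clCreateKernel(program, "get_address", &err);

/* The buffer whose gpu address we need - e.g. the pattern matching state
   table.  Resolve the handle before populating the table, since the kernel
   above scribbles the address into the buffer's first 8 bytes. */
cl_mem state_table = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                    state_table_size, NULL, &err);
unsigned long gpu_address = get_address(kernel_address, command_queue,
                                        state_table);

/* Embed the raw address into the host-side byte stream that gets copied to
   the gpu, exactly where a CUdeviceptr value would have gone with CUDA. */
memcpy(payload_buffer + address_offset, &gpu_address, sizeof(gpu_address));

The kernel that later consumes the byte stream would read those 8 bytes back and cast them to a __global pointer before walking the state table - relying on the same integer/pointer cast behaviour that the CUDA path already assumes.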

Another question that pops up in my head: would the driver ever change the memory allocated against a handle?  Perhaps any AMD/Nvidia driver folks reading this can answer.

Any alternative solutions (apart from passing all of it as kernel arguments :) ) are welcome.


Monday 13 January 2014

Suricata app layer changes. New keyword - app-layer-protocol introduced

Suricata's current master has undergone a major rewrite of its app layer code, covering both app layer protocol detection and the app layer parsing phase. Along with this, a new keyword - "app-layer-protocol" - has been introduced. There are also changes to how an app layer protocol can now be specified in a rule, and how it interacts with the ip protocol (whether given in the rule header or through the ipproto keyword).

App layer rewrite

Let me start by introducing the app layer rewrite. The old app layer code had the protocol detection and parsing phases all jumbled up, with no proper separation between the two. The internal app layer protocol registration also lacked a hierarchy based on ip protocol, which meant one couldn't register anything against an app layer protocol without modifying the protocol name by appending the ipproto to it.

For example, to specify signatures matching the udp and tcp variants of the dns protocol respectively, one would have to write -

    alert dnsudp ......  /* to match on dns udp */
    alert dnstcp .....   /* to match on dns tcp */

This is now replaced by the cleaner -

     alert dns (ipproto:udp; ); OR 
     alert udp (app-layer-protocol:dns;) OR 
     alert ip (app-layer-protocol:dns; ipproto:udp;)  

New keyword: "app-layer-protocol"

This feature came up from the need to match on negated protocols (feature #727).  As an example, say we want to match on the string "foo" in all app layer streams which are not http -

alert tcp any any -> any any (app-layer-protocol:!"http"; content:"foo"; sid:1;) 

Interaction between app-layer-protocol, ipproto and the protocol specified using alert <protocol>

Let's work with some examples.

- Match on the dns protocol across all ip protocols.

  alert ip (app-layer-protocol:dns;) 
  alert dns ()

- Match on the udp version of the dns protocol.

  alert udp (app-layer-protocol:dns;) 
  alert dns (ipproto:udp;) 
  alert ip (app-layer-protocol:dns; ipproto:udp;)

- Match on both the tcp and udp versions of the dns protocol.

  alert udp (app-layer-protocol:dns; ipproto:tcp; ) /* XXX Nooooooo...... */
  The above is not allowed.  The ipproto keyword can't be combined with a specific ip protocol (udp/tcp) in the rule header - use alert ip or alert dns instead, as in the two rules below.

  alert dns (ipproto:tcp; ipproto:udp;)
  alert ip (app-layer-protocol:dns; ipproto:udp; ipproto:tcp;)

What do all these changes mean for the engine

- We have a neater app layer phase. There's a clear separation between the app layer protocol detection module and the app layer parsing module.

- An ipproto based hierarchy in any registration related to an app layer protocol. From a rule writer/user perspective, this removes the need to tag an ipproto along with the app layer protocol name. For example, we no longer have dnstcp and dnsudp.

- Introduction of the new "app-layer-protocol" keyword allows for richer specification when used along with other keywords like "ipproto".

- Conf yaml changes

  Previously,
  dnstcp:
      enabled: yes
      detection-ports:
          tcp:
              toserver: 53
  dnsudp:
      enabled: yes
      detection-ports:
          udp:
              toserver: 53

  Now,
  dns:
      tcp:
          enabled: yes
          detection-ports:
              toserver: 53
      udp:
          enabled: yes
          detection-ports:
              toserver: 53