Thursday 23 January 2014

Passing an OpenCL cl_mem device address from host to the device, but not as a kernel argument. Pointers for Suricata OpenCL porters.


This post is not specific to Suricata, but rather a generic one that can help most devs who write OpenCL code, plus the ones who want to implement OpenCL support inside Suricata.  I have been seeing quite a few attempts at porting Suricata's CUDA support to use OpenCL.  Before we experimented with CUDA, we had given OpenCL a shot back in the early OpenCL days, when the drivers were in their infancy and had a ton of bugs, and we had a ton of segvs, leaving us with no clue as to where the bug was - the driver or our code.  The drivers are probably a lot stabler today, of course.

Either way, supporting OpenCL in Suricata should be a pretty straightforward task, but there is one issue that needs to be kept in mind while carrying out this port - something most folks who contacted me during their port got stuck at.  It is also a question a lot of OpenCL devs have: how do you pass a memory object as part of a byte stream or structure, and not as a kernel argument?

Let's get to the topic at hand.  I will use the example of Suricata to explain the issue.

What's the issue?

Suricata buffers a payload and, along with the payload, specifies a GPU memory address (a cl_mem, in the OpenCL case) that points to the pattern matching state table the corresponding payload should be matched against.  With CUDA, the memory address we are buffering is of type "CUdeviceptr", allocated using the call cuMemAlloc().  The value stored inside a CUdeviceptr is basically an address from the GPU address space (not a handle).  You can test this by writing a simple program like the one I have below for OpenCL.  You can also check this article that confirms the program's findings.
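
For comparison, here's a minimal (and untested) sketch of the same test on the CUDA side.  I'm using the runtime API here purely for brevity, even though Suricata allocates through the driver API's cuMemAlloc(); the function and variable names are just illustrative.  If the value the kernel writes back equals the pointer the allocator handed out, the device pointer held on the host really is a raw GPU address.

#include <cuda_runtime.h>

__global__ void get_address_cuda(unsigned long long *c)
{
    /* Write the address the kernel sees for the buffer into the buffer itself. */
    *c = (unsigned long long)c;
}

unsigned long long get_address_cuda_host(void)
{
    unsigned long long *dev_buf = NULL;
    unsigned long long result = 0;

    cudaMalloc((void **)&dev_buf, sizeof(result));
    get_address_cuda<<<1, 1>>>(dev_buf);
    cudaMemcpy(&result, dev_buf, sizeof(result), cudaMemcpyDeviceToHost);
    cudaFree(dev_buf);

    /* On the 64 bit setups I have seen, result == (unsigned long long)dev_buf. */
    return result;
}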

With OpenCL, cl_mem is defined to be a handle to an address in the GPU address space.  I would have expected Nvidia's OpenCL implementation to show behaviour similar to its CUDA library, i.e. the handle being nothing but an address in the GPU address space, but that isn't the case (probably something to do with the size of cl_mem?).  We can't directly pass the cl_mem handle value as the device address.  We need to extract the device address behind a particular cl_mem handle, and pass this retrieved value instead.

Here is a sample program -

==get_address.cl==

__kernel void get_address(__global ulong *c)
{
    /* Store the address of the buffer, as seen by the device, into the buffer itself. */
    *c = (ulong)c;
}

==get_address.c==

unsigned long get_address(cl_kernel kernel_address,
                          cl_command_queue command_queue,
                          cl_mem dst_mem)
{
    unsigned long result_address = 0;
    size_t global_work_size = 1;
    size_t local_work_size = 1;

    /* Point the kernel's only argument at dst_mem; the kernel writes the
     * device-side address of that buffer into the buffer itself. */
    BUG_ON(clSetKernelArg(kernel_address, 0,
                          sizeof(dst_mem), &dst_mem) < 0);

    BUG_ON(clEnqueueNDRangeKernel(command_queue,
                                  kernel_address,
                                  1,
                                  NULL,
                                  &global_work_size,
                                  &local_work_size,
                                  0, NULL,
                                  NULL) < 0);

    /* Blocking read of the 8 byte address the kernel just wrote. */
    BUG_ON(clEnqueueReadBuffer(command_queue,
                               dst_mem,
                               CL_TRUE,
                               0,
                               sizeof(result_address),
                               &result_address,
                               0, NULL,
                               NULL) < 0);

    return result_address;
}

* Untested code, written with 64 bit hardware in mind on both the GPU and the CPU (hence the 8 byte ulong/unsigned long).

Using the above get_address() function should get you the GPU address for a cl_mem instance, and that returned value is what should be passed to the GPU as the address, in place of CUDA's CUdeviceptr.  It's sort of a hack, but it should work.
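
To make the usage concrete, here's a rough (again, untested) sketch of how the retrieved address could travel inside the buffered byte stream instead of as a kernel argument.  The buffer layout, offsets and names below are purely illustrative and not Suricata's actual buffer format.

/* Host side (illustrative): stash the raw GPU address of the state table in
 * front of the payload, the way the CUDA path stores a CUdeviceptr. */
void pack_payload(unsigned char *host_buffer,
                  const unsigned char *payload, size_t payload_len,
                  cl_kernel kernel_address, cl_command_queue command_queue,
                  cl_mem state_table_mem)
{
    unsigned long state_table_address =
        get_address(kernel_address, command_queue, state_table_mem);

    memcpy(host_buffer, &state_table_address, sizeof(state_table_address));
    memcpy(host_buffer + sizeof(state_table_address), payload, payload_len);
    /* host_buffer is then shipped to the device with clEnqueueWriteBuffer(). */
}

/* Device side (illustrative): recover the pointer from the byte stream. */
__kernel void match(__global uchar *buf)
{
    __global ulong *header = (__global ulong *)buf;
    __global uchar *state_table = (__global uchar *)header[0];
    __global uchar *payload = buf + sizeof(ulong);

    /* ... run the payload through the pattern matching state table ... */
}

Casting a ulong back to a __global pointer like this is exactly the implementation-defined bit that makes the whole thing a hack, so it's worth re-verifying on every driver/runtime combination you care about.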

Another question that pops up in my head: would the driver ever move the memory allocated against a handle, invalidating the address we just retrieved?  Any AMD/Nvidia driver folks who can answer this?

Any alternate solutions (apart from passing all of it as kernel arguments :) ) are welcome.

