Tuesday, February 23, 2016

OpenCL Embedded Profile (GC2000) - Hello World Application


Based on our last post, let's take a look on a very basic "hello world" application. But before the walkthrough there are some information you need to be aware of. 

OpenCL applications are divided basically in 2 parts: Host and Device codes.

The code for host is responsible for the  hardware initialization, create CL objects (memory, program, kernel, command queues), send signals to device in order to write/read/execute/sync data. 

The code for device is the OpenCL Kernel (C99 based language), which will be accelerated by the GPU. This is a veeeeeeeeery basic description just to keep things as easy as possible, you can find a lot of good  and detailed information on the web to cover it in deep.

Below is the flow for an OpenCL application:




As an example, let's use a very simple CL kernel that only makes a copy from the input  to the output buffer.

Please, note that the pieces of code below was modified from original source (link available down in this post) just to make it more readable , since they are placed in separated functions.

STEP 1 - Defining the Problem Size


OpenCL is meant to solve specific problems, which means that once defined how your kernel will operate, its arguments list and size of your memory objects can't be modified unless releasing all the objects and re-create them with the new desired values.

Defining the problem size means defining our global work-group size and dimension, it can be the length of an array (1D - our use case) or a 2D/3D matrix (we will see it in another sample code on a future post). 

On our Hello World application the Global work-group size is 512, populated with random data (just for testing). Also we need to set our local work-group size (Global work-group data access). Based on the last post, the preferred local work-group size is 16 (per dimension), we will use this value for better performance.

        
        cl_platform_id  platform_id;
        cl_device_id  device_id;
        cl_context  context;
        cl_command_queue cq;
        cl_program  program;
        cl_kernel  kernel;
        cl_mem helloworld_mem_input = NULL;
        cl_mem helloworld_mem_output = NULL;

        // one dimensional work-items
 int dimension = 1;
 
 // our problem size
 size_t global = 512;
 
 // preferred work-group size
 size_t local = 16;
 
 int size;
 
 // input data buffer - random values for the helloworld sample
 char *input_data_buffer;
 
 // output data_buffer for results
 char *output_data_buffer;

 cl_int ret;
 
        // make our size equals our global work-group size
 size = global;
  
 input_data_buffer = (char *) malloc (sizeof (char) * size);
 if (! input_data_buffer)
 {
  printf ("\nFailed to allocate input data buffer memory\n");
  return 0;
 }

 output_data_buffer = (char *) malloc (sizeof (char) * size);
 if (! output_data_buffer)
 {
  printf ("\nFailed to allocate output data buffer memory\n");
  return 0;
 }
 
 // populate data_buffer with random values 
 for (int i = 0; i < size; i++)
 {
  input_data_buffer[i] = rand () % 255;
 }


STEP 2 - Hardware Initialization


This is the basic step to an OpenCL application. Hardware initialization consists in:
  • Listing available platforms (one GC2000)
  • Getting compute device information (Vivante OCL EP device)
  • Create the CL Context
  • Create the Command Queue (Control information from Host to Device)

 
       cl_uint  platforms, devices;
 cl_int error;

 //-------------------------------------------
 // cl_int clGetPlatformIDs (cl_uint num_entries, cl_platform_id *platforms, cl_uint *num_platforms)
 //--------------------------------------------
 error = clGetPlatformIDs (1, &platform_id, &platforms);
 if (error != CL_SUCCESS) 
  return CL_ERROR;

 //--------------------------------------------
 // cl_int clGetDeviceIDs (cl_platform_id platform, cl_device_type device_type, cl_uint num_entries, 
 //   cl_device_id *devices, cl_uint *num_devices)
 //--------------------------------------------
 error = clGetDeviceIDs (platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &devices);
 if (error != CL_SUCCESS) 
  return CL_ERROR;
 
 //--------------------------------------------
 // cl_context clCreateContext (cl_context_properties *properties, cl_uint num_devices, 
 //    const cl_device_id *devices, void *pfn_notify (const char *errinfo, 
 //    const void *private_info, size_t cb, void *user_data),  
 //    void *user_data, cl_int *errcode_ret)
 //----------------------------------------------
 cl_context_properties properties[] = {CL_CONTEXT_PLATFORM, (cl_context_properties)platform_id, 0};
 context = clCreateContext (properties, 1, &device_id, NULL, NULL, &error);
 if (error != CL_SUCCESS) 
  return CL_ERROR;
 
 //----------------------------------------------
 // cl_command_queue clCreateCommandQueue (cl_context context, cl_device_id device, 
 //     cl_command_queue_properties properties, cl_int *errcode_ret)
 //----------------------------------------------- 
 cq = clCreateCommandQueue (context, device_id, 0, &error);
 if (error != CL_SUCCESS) 
  return CL_ERROR;


STEP 3 - Create the OCL Objects (Program, Kernel and Memory)


This step is pretty straight forward, since you already defined your problem size just need to set them on the OpenCL Objects:

 cl_int error = CL_SUCCESS;

 //----------------------------------------------
 // cl_program clCreateProgramWithSource (cl_context context, cl_uint count, const char **strings, 
 //          const size_t *lengths, cl_int *errcode_ret)
 //------------------------------------------------
 program = clCreateProgramWithSource (context, 1, (const char **)kernel_src, &kernel_size, &error);
 if (error != CL_SUCCESS)
 {
  return CL_ERROR;
 }
 
 //------------------------------------------------
 // cl_int clBuildProgram (cl_program program, cl_uint num_devices, const cl_device_id *device_list,
 //   const char *options, void (*pfn_notify)(cl_program, void *user_data), void *user_data)
 //-------------------------------------------------
 error = clBuildProgram (program, 1, &device_id, "", NULL, NULL);
 if (error < 0)
 {
  //---------------------------------------------------
  // cl_int clGetProgramBuildInfo ( cl_program  program, cl_device_id  device, cl_program_build_info  
  //   param_name, size_t  param_value_size, void  *param_value, size_t  *param_value_size_ret)
  //---------------------------------------------------
  clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, kernel_size, kernel_src, NULL);
  printf ("\n%s", kernel_src);
 }
 
 //---------------------------
 // cl_kernel clCreateKernel (cl_program  program, const char *kernel_name, cl_int *errcode_ret)
        // "hello_world" is the name of our kernel function inside the external file.
 //---------------------------
        kernel = clCreateKernel (program, "hello_world", &error );
 if (error != CL_SUCCESS)
 {
  return CL_ERROR;
 }

 //---------------------------
 // cl_mem clCreateBuffer (cl_context context, cl_mem_flags flags, size_t size, 
 //    void *host_ptr, cl_int *errcode_ret)
 //----------------------------
 helloworld_mem_input = clCreateBuffer (context, CL_MEM_READ_ONLY, size, NULL, &error); 
 
 if (error!= CL_SUCCESS) 
 {
                 return CL_ERROR;
 }
  
 helloworld_mem_output = clCreateBuffer (context, CL_MEM_WRITE_ONLY, size, NULL, &error); 
 
 if (error!= CL_SUCCESS) 
 {
   return CL_ERROR
 }


STEP 4 - Set Kernel Arguments

The Kernel arguments must be set in the host code and the sequence of the arguments has to be the same, in our case the inpute and output buffer respectively.

 
 //-----------------------------
 // cl_int clSetKernelArg (cl_kernel kernel, cl_uint arg_index, size_t arg_size, 
 //       const void *arg_value)
 //-------------------------------
 clSetKernelArg (kernel, 0, sizeof(cl_mem), &helloworld_mem_input);
 clSetKernelArg (kernel, 1, sizeof(cl_mem), &helloworld_mem_output);


STEP 5 - Execute the Kernel (Command Queue)


This is the part where all the magic happens. We write the data from host to device , send the start signal to the Device to execute the kernel and finally reads the data back from device to the host. Here is how it is done:

  • clEnqueueWriteBuffer: Write data on the device
  • clEnqueueNDRangeKernel: start the Kernel execution
  • clEnqueueReadBuffer: Reads data back from device

 
 //-------------------------------
 // cl_int clEnqueueWriteBuffer (cl_command_queue command_queue, cl_mem buffer, 
 //        cl_bool blocking_write, size_t offset, size_t cb, 
 //        const void *ptr, cl_uint num_events_in_wait_list, 
 //        const cl_event *event_wait_list, cl_event *event)
 //---------------------------------
 error = clEnqueueWriteBuffer(cq, helloworld_mem_input, CL_TRUE, 0, size, input_data_buffer, 0, NULL, NULL);
 if (error != CL_SUCCESS) 
  return CL_ERROR
 
 //-------------------------------
 // cl_int clEnqueueNDRangeKernel (cl_command_queue command_queue, cl_kernel kernel, 
 //        cl_uint work_dim, const size_t *global_work_offset, 
 //        const size_t *global_work_size, const size_t *local_work_size, 
 //        cl_uint num_events_in_wait_list, const cl_event *event_wait_list, 
 //        cl_event *event)
 //---------------------------------
 error = clEnqueueNDRangeKernel (cq, kernel, dimension, NULL, &global, &local, 0, NULL, NULL);
 if  (ret == CL_SUCCESS)
 {
  //------------------------------------
  // cl_int clEnqueueReadBuffer (cl_command_queue command_queue, cl_mem buffer, 
  //        cl_bool blocking_read, size_t offset, size_t cb, 
  //        void *ptr, cl_uint num_events_in_wait_list,
  //        const cl_event *event_wait_list, cl_event *event)
  //----------------------------------------
  error = clEnqueueReadBuffer(cq, helloworld_mem_output, CL_TRUE, 0, size, output_data_buffer, 0, NULL, NULL);
 }
 else
  return CL_ERROR

STEP 5 - Clean Up the OpenCL Objects


To prevent memory leak or any other issue, the CL objects must be cleaned:

 
 clFlush( cq);
 clFinish(cq);

 clReleaseContext(context);
 clReleaseProgram(program);
 clReleaseCommandQueue(cq);
 clReleaseKernel (kernel);
 clReleaseMemObject (helloworld_mem_input);
 clReleaseMemObject (helloworld_mem_output);


Final Considerations


The sample application presented on this post meant to demonstrate how to create a basic OpenCL based application and it can gets complicated as much as you need it.

The complete and functional source code can be found here !

For information related to the OpenCL EP API, access the Khronos website here.

I hope the information shared in this post can help and give you some directions in your future projects.


EOF !