Geeks With Blogs
Josh Reuben


OpenCL is a GPGPU API that abstracts over acceleration devices (be they CPU, GPU or FPGA) to provide

data-parallelism (as well as task-parallelism) behavior.

heterogeneous portability is achieved by avoiding high level abstractions and exposing the hardware in a context that explicitly defines its work scheduling capabilities.

An OpenCL application consists of two parts:

  • the host program that runs on the CPU - API functions to discover devices and their capabilities & create a context, create command-queues, define memory objects, & submit commands to the queue, optionally with sync constraints

  • kernels (typically simple functions that transform input memory objects {work items} into output memory objects) that run on the GPU. Defined in the OpenCL programming language: an extended subset of ISO C99.

The Host Program

specifies the context for the kernels - defines the NDRange (a 1 to 3 dimensional hierarchical process space) and the command queues that specify how and when the kernels execute on devices and access memory objects. A kernel program object is loaded / generated & built at runtime within the host program. Memory objects are explicitly defined on the host and moved between the host and the OpenCL devices.

Command queues communicate between the host and a device - 3 types of commands:

  • 1) Kernel execution

  • 2) Memory object mapping and data transfer of kernel IO params

  • 3) Synchronization constraints on execution order

Command Queues are Async - When the kernel has completed its work, memory objects produced in the computation are copied back to the host. Default is In-order execution - serialized execution order of commands in a queue, but synchronization commands can be used to control execution order of multiple kernels to achieve load balancing.

Memory Objects

2 types of memory objects:

  • 1) buffer objects - a contiguous block of memory - A programmer can map data structures onto this buffer and access the buffer through pointers --> flexibility.

  • 2) image objects - storage format optimized to a specific OpenCL device. opaque - OpenCL provides functions to manipulate images, but contents are hidden from the kernel program. Can address subregions as distinct memory objects.

memory region hierarchy:

  • Host memory

  • Global WorkGroup memory: R/W access to all work-items in all work-groups. consistent at a work-group barrier.

  • Local WorkItem memory: shared by all work-items in that work-group. values seen by a set of work-items are guaranteed to be consistent at work-group synchronization points.

  • Private WorkItem memory: not visible to the host. cannot be reordered.

memory consistency

  • memory values are NOT guaranteed to be consistent across all work-items at all times.

  • When all the work-items associated with a kernel complete, load / store ops for the kernel's memory objects are completed before the kernel command is signaled as completed. For the in-order queue, this is sufficient for consistency. For an out-of-order queue - 2 synchronization point mechanisms: 1) force consistency at specific synchronization points such as a command-queue barrier. 2) explicitly manage consistency through event mechanisms

OpenCL can interop with OpenGL:

  • sharing a device context

  • sharing memory objects

  • signaling events

The Kernel Programming Language

An extension of C99.

Features removed: recursion, pointers, bitfields

features added:

  • Address space qualifiers

  • Vector types and operations

  • A large set of built-in functions

  • Atomic functions for unsigned integer and single-precision scalar variables in global and local memory

am embedded profile version relaxes its floating-point standards & relative errors and makes double support optional.

A Typical OpenCL Host Code Flow:

Discover & probe component hardware capabilities:

  • clGetPlatformIDs() - query  available platforms and select one (eg 1st) cl_platform_id
  • clCreateContextFromType() - create a cl_context from a cl_context_properties array containing a cl_platform_id
  • clGetContextInfo() - query CL_CONTEXT_DEVICES to get the size_t of the cl_context device buffer array
  • create a new cl_device_id array pointer to allocate memory for the devices buffer array according to the size_t
  • clGetContextInfo() - call again with the size_t and the array pointer to populate the buffer

Create the blocks of instructions (kernels) that will run on the platform:

  • use standard C ifstream to open a .cl kernel source file for reading,  ostringstream to read a  char stream from file via rdbuf() and convert the stream to a char array via good old c_str()
  • clCreateProgramWithSource() - create the cl_program against the cl_context & cl_device from the source code string
  • clBuildProgram() - compile it
  • clCreateKernel() - Create a cl_kernel from the cl_program

Set up and manipulate memory objects involved in the computation:

  • create 3 vanilla float arrays for kernel IO data mapping (eg 2 in, 1 out)
  • create a cl_mem[3] array  of memory buffers.
  • clCreateBuffer() x 3map the buffers: Copy the arrays into memory objects allocated in device memory to be passed to the kernel where they can be accessed. input buffers are created CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR and mapped & initialized from host arrays, while output buffers are created CL_MEM_READ_WRITE and initialized NULL. Create a cl_mem[3] array of memory objects (2 in to be populated from the float arrays, 3rd to be populated when the kernel runs) against the cl_context to be used as arguments to the cl_kernel

  • clSetKernelArg() x 3 - Set the cl_kernel arguments from the cl_mem[3] array

Execute the kernels (optionally with task parallelism synchronization) and collect the results.

  • clCreateCommandQueue() Create a cl_command_queue on the first device available on the cl_context& get a cl_device_id output param
  • calculate processing space extent - size_t global and local workgroup dimensions

  • clEnqueueNDRangeKernel()  - enqueue the c_kernel in the  cl_command_queue, specifying the processing space dimensions
  • clEnqueueReadBuffer() - Read the output buffer (the mapped cl_mem ) back to the Host from the  cl_command_queue
  • check the results in the output array !

A Canonical 2-in 1-out kernel .cl file:

__kernel void sum_kernel( __global const float *x, __global const float *y, __global float *retval){

    int g_id = get_global_id(0);
    retval[g_id] = x[g_id] + y[g_id];

OpenCL Programming Idioms

There are 2 common programming idioms found in OpenCL coding:

  • 1) in C style, most cl_X functions return a cl_int error code (or success code in the case of CL_SUCCESS).

  • 2) context query operations work by populating single item arrays and generally require 2 calls to clGetContextInfo() for a specific context code: the first to retrieve the array size_t for constructing the appropriate storage space to hold the item, the second to populate the single item array – thats how it is !

Posted on Saturday, February 7, 2015 10:20 PM Parallelism | Back to top

Comments on this post: OpenCL - An Overview

# re: OpenCL - An Overview
Requesting Gravatar...
Great information! This is a timely idea for me. I can use this in my recent project. - Mark Zokle
Left by Michael Douglas on Dec 20, 2016 6:11 PM

Your comment:
 (will show your gravatar)

Copyright © JoshReuben | Powered by: