The purpose of this post is not to make you a new OpenCL
expert, but to provide the basic knowledge you need to take advantage of the i.MX6’s
GPGPU support and get your code (or part of it) accelerated by its Graphics
Processing Unit.
First of all, what are GPGPU and OpenCL?
GPGPU:
• Stands for General Purpose Graphics Processing Unit
• Algorithms well suited to GPGPU implementation exhibit two properties: they are data parallel and throughput intensive
• Data parallel: a processor can execute the same operation on different data elements simultaneously.
• Throughput intensive: the algorithm processes lots of data elements, so there is plenty to operate on in parallel.
• Pixel-based applications such as computer vision and video and image processing are very well suited to GPGPU technology, and for this reason many of the commercial software packages in these areas now include GPGPU acceleration
OpenCL
• Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors.
• OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus application programming interfaces (APIs) that are used to define and then control the platforms.
• OpenCL provides parallel computing using task-based and data-based parallelism.
• OpenCL is an open standard maintained by the non-profit technology consortium Khronos Group.
• Apple, Intel, Qualcomm, Advanced Micro Devices (AMD), Nvidia, Altera, Samsung, Vivante and ARM Holdings have adopted it.
There are A LOT of OpenCL tutorials on the web explaining
all its concepts and capabilities. Below you will find only the most important
ones:
Introduction to OpenCL
• In order to visualize the heterogeneous architecture in terms of the API and restrict memory usage for parallel execution, OpenCL defines multiple cascading layers of virtual hardware definitions
• The basic execution engine that runs the kernels is called a Processing Element (PE)
• A group of Processing Elements is called a Compute Unit (CU)
• Finally, a group of Compute Units is called a Compute Device.
• A host system can interact with multiple Compute Devices on a system (e.g., a GPGPU and a DSP), but data sharing and synchronization are only coarsely defined at this level.
• Each item a kernel works on is called a 'work-item'.
• A simple example of this is determining the color of a single pixel (work-item) in an output image.
• Work-items are grouped into 'work-groups', which are each executed in parallel to speed up calculation performance
• How big a work-group is depends on the algorithm being executed and the dimensions of the data being processed (e.g., one work-item per pixel for a block of pixels in a filter)
OpenCL uses a 'data parallel' programming model in which a
kernel runs once for each item in an 'index space'. The index space, called an
NDRange (N-dimensional range), follows the dimensionality of the data being
processed (e.g., 1-, 2-, or 3-dimensional arrays).
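To make the index-space idea concrete, here is a minimal plain-C sketch that emulates a 2D NDRange on the host. It is not OpenCL code and does not come from the white paper: the function names (invert_pixel, run_ndrange_2d, group_id_of, local_id_of) are hypothetical, and the pixel inversion is just an arbitrary per-pixel operation chosen for illustration. In a real kernel the indices would come from get_global_id() and the runtime would be free to execute the work-items in parallel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a kernel body: computes ONE output pixel.
 * In a real OpenCL kernel, gx and gy would come from get_global_id(0)
 * and get_global_id(1) instead of being passed in as arguments. */
void invert_pixel(const uint8_t *in, uint8_t *out,
                  size_t width, size_t gx, size_t gy)
{
    size_t i = gy * width + gx;          /* flatten the 2D global ID */
    out[i] = (uint8_t)(255 - in[i]);
}

/* Host-side emulation of a 2D NDRange: the "kernel" runs exactly once
 * for every (gx, gy) point of the index space. The loops here are
 * sequential; a GPU may run these work-items concurrently. */
void run_ndrange_2d(const uint8_t *in, uint8_t *out,
                    size_t width, size_t height)
{
    assert(in != NULL && out != NULL);
    for (size_t gy = 0; gy < height; ++gy)
        for (size_t gx = 0; gx < width; ++gx)
            invert_pixel(in, out, width, gx, gy);
}

/* Which work-group a global ID falls into (one dimension, zero offset). */
size_t group_id_of(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    return global_id / local_size;
}

/* Position of a work-item inside its work-group (one dimension). */
size_t local_id_of(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    return global_id % local_size;
}
```

With a work-group size of 16 in one dimension, global ID 37 would land in work-group 2 at local position 5, since 37 = 2 * 16 + 5.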
Freescale’s i.MX6Q/D GPU (GC2000) OpenCL EP features
• Vivante GC2000 GPGPU capable of running OpenCL 1.1 Embedded Profile (EP)
• OpenCL Embedded Profile capabilities (that means, for instance, no atomic variables, and no mandated support for 3D images, 64-bit integers or double-precision floating-point numbers)
• 4x SIMD (vec4) shader cores
• Up to 512 general-purpose registers, 128 bits each, for each of the cores
• Maximum number of instructions per kernel is 512
• 1-cycle throughput for all shader instructions
• L1 cache of 4KB
• 168 uniform registers for the vertex shader and 64 for the fragment shader
• Single integer pipeline per core
• In OpenCL Embedded Profile, the requirements for samplers are reduced, with the number of samplers decreased from 16 (FP – Full Profile) to 8 (EP), and the math precision (ULP) is slightly relaxed below the IEEE-754 specification for some functions
• Lastly, in OpenCL EP the minimum image size is reduced to 2048 (from 8192 in FP) and the local memory requirement is reduced to 1KB (from 32KB in FP)
Each of the shader cores functions as a CU. The cores have a
native vec4 ISA, so the preferred vector width for all primitives is 4.
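As a rough illustration of what a vector width of 4 means in practice, the sketch below pairs an OpenCL C kernel (kept as a string, the way host code usually carries kernel source) with plain-C helpers for sizing the work. Everything named here (scale4, vec4_item_count, vec4_padded_len, scale4_reference) is hypothetical and chosen for this example; the kernel has not been run on a GC2000, so treat it as a sketch rather than tuned code.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative OpenCL C kernel source (hypothetical name: scale4).
 * Each work-item handles one float4, i.e. four floats at once. */
const char *const scale4_kernel_src =
    "__kernel void scale4(__global const float4 *in,\n"
    "                     __global float4 *out,\n"
    "                     const float k)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = in[i] * k;\n"
    "}\n";

/* Native vector width stated in the post. */
#define VEC_WIDTH ((size_t)4)

/* Number of float4 work-items needed to cover n floats (rounds up). */
size_t vec4_item_count(size_t n_floats)
{
    return (n_floats + VEC_WIDTH - 1) / VEC_WIDTH;
}

/* Buffer length, in floats, so that the last float4 is fully backed. */
size_t vec4_padded_len(size_t n_floats)
{
    return vec4_item_count(n_floats) * VEC_WIDTH;
}

/* Plain-C reference of what the kernel computes, four lanes per step. */
void scale4_reference(const float *in, float *out,
                      size_t padded_len, float k)
{
    assert(padded_len % VEC_WIDTH == 0);
    for (size_t i = 0; i < padded_len; i += VEC_WIDTH)      /* one work-item */
        for (size_t lane = 0; lane < VEC_WIDTH; ++lane)     /* its 4 lanes  */
            out[i + lane] = in[i + lane] * k;
}
```

For 10 input floats, this would mean 3 work-items over a buffer padded to 12 floats, so the last float4 never reads past the end of the allocation.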
Code Optimization for Freescale’s i.MX6Q/D OpenCL EP
• Vector math inputs in multiples of 4.
− As mentioned previously, the GC2000 in i.MX 6Q is a vec4 floating-point SIMD engine, so vector math always prefers 4 inputs (or a multiple of 4) for maximum math throughput.
• Use full 32-bit native registers for math.
− Both integer and floating-point math are natively 32-bit. 8-bit and 16-bit primitives still use 32-bit registers, so there is no gain (for the math computation) in going with smaller sizes.
• Use floating-point instead of integer formats
− 1x 32-bit integer pipeline (supports 32-bit INT formats in hardware; 8-bit/16-bit are done in software)
− 4x 32-bit floating-point pipeline (supports 16-bit and 32-bit FP formats in hardware)
− To maximize OpenCL compute efficiency, it is better to convert integer formats to floating point to utilize the four (4) parallel FP math units.
• Use 16-bit integer division and 32-bit for other integer math operations
− For integer math (excluding division), there is one 32-bit integer adder and one 32-bit integer multiplier per core. If integer formats are required, use 32-bit integer formats for addition, multiplication, masks, and sign extensions.
− Try to minimize or avoid 8-bit and 16-bit integer formats, since they are calculated in software and the 32-bit INT ALU is not used.
− Integer division: i.MX 6Q hardware supports only 16-bit integer division; 8-bit and 32-bit division are implemented in software.
− It is better to use 16-bit division if possible. There is a performance penalty if 32-bit division is used.
• Use round-to-zero mode
− Floating-point computation supports “round-to-zero” only (round-to-nearest-even is not required for EP if round-to-zero is supported).
• Data accesses should be 16B
− This makes the most efficient use of the GPGPU’s L1 cache.
• Work-group size should be a multiple of thread-group size.
− Work-group size should be an integer multiple of the GPU's internal preferred work-group size (16 for GC2000) for optimum hardware usage.
• Keep global work size at 64K (maximum) per dimension
− Since global IDs for work-items are 16 bits, it is necessary to keep the global work size within 64K (65,536, or 2^16) per dimension.
• Conditional branching should be avoided if possible
− Branch penalties depend on the percentage of work-items that go down the branch paths.
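The two work-size rules above (a multiple of 16 per work-group, at most 64K per dimension) can be combined in a small host-side helper. This is only a sketch under assumptions: the names round_up, launch_count and launch_global_size are made up for this example, and splitting a large 1D problem into several kernel launches is one possible way to stay under the limit, not something taken from the white paper.

```c
#include <assert.h>
#include <stddef.h>

/* Values taken from the tips above. */
#define PREFERRED_GROUP    ((size_t)16)     /* preferred work-group multiple */
#define MAX_GLOBAL_PER_DIM ((size_t)65536)  /* 2^16, from 16-bit global IDs  */

/* Round n up to the next multiple of m. */
size_t round_up(size_t n, size_t m)
{
    assert(m > 0);
    return ((n + m - 1) / m) * m;
}

/* Number of separate 1D launches needed so that no single launch
 * exceeds the per-dimension global work size limit. */
size_t launch_count(size_t total_items)
{
    return (total_items + MAX_GLOBAL_PER_DIM - 1) / MAX_GLOBAL_PER_DIM;
}

/* Global work size for launch number `index` (0-based): the items left
 * for that launch, capped at the limit and padded up to a multiple of
 * the preferred work-group size. The result never exceeds the limit,
 * because the limit itself is a multiple of 16. */
size_t launch_global_size(size_t total_items, size_t index)
{
    assert(index < launch_count(total_items));
    size_t done = index * MAX_GLOBAL_PER_DIM;
    size_t remaining = total_items - done;
    size_t chunk = remaining < MAX_GLOBAL_PER_DIM ? remaining
                                                  : MAX_GLOBAL_PER_DIM;
    return round_up(chunk, PREFERRED_GROUP);
}
```

For 100,000 items this would give two launches, of 65,536 and 34,464 work-items. Padding can create a few work-items past the end of the data, so buffers would need to be padded to match or the kernel would need a bounds check, and each launch after the first would need its own base index (for example, passed as a kernel argument).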
This post is long enough for an “introductory” look at
i.MX6Q/D OpenCL EP. For more information, including a sample
application, take a look at this good white paper provided by Freescale: https://community.freescale.com/docs/DOC-100694
EOF !
Any idea, why the OpenCL local memory is not mapped to HW's local memory (even if it is just 1k anyway)?
From Vivante's OpenCL EP documentation:
-----------------------------------
Using local memory typically is an order of magnitude faster than accessing host memory through global memory (RAM). However, execution cores do not directly access local memory; instead, they issue memory requests through dedicated hardware units. When a work-item tries to access local memory, the work-item is transferred to the appropriate fetch unit. The work-item then is deactivated until the access unit finishes accessing local memory.
Select Vivante cores include local storage registers in the hardware. Local storage registers are 16 bytes each and are shared across all work-items within a work-group.
The total number of local storage registers used by a work group will determine the total number of work groups that can be run concurrently in GPGPU. Having more work groups, allocated concurrently in the GPGPU, generally provides better throughput.
-----------------------------------
regards,
Andre
Thanks for your answer, Andre! I much appreciate it!
However, do I understand correctly that an OpenCL program running on GC2000 would benefit from using local memory? I recall from some site that it was not advisable to use local memory, as it basically used the system memory in the background.
Where can I find this Vivante documentation? I tried to request documentation from Vivante at some point, but received nothing in answer. I feel I definitely should've seen that documentation before I made an image signal processing pipeline with OpenCL on GC2000 :)
And I should've read your blog to know that image2d was not going to make things fly... :)
Sorry, but I can't share the documentation; it is confidential to Vivante. However, you can download the Graphics User Guide from the NXP website; it has a lot of information.
Regards,
Andre
I can't seem to find the document at NXP (I do have an account there) by searching for Graphics User Guide i.MX6. I've found the Graphics Development on the i.MX 6 Series doc, with Vivante optimization tips for OpenGL, OpenVG and OpenCL. However, it is a bit short on hard details, though helpful.
Could you give me the document number that I can use to find it at NXP, please?
Ok, I've found the document: IMX6GRAPHICUG. I have to study the doc.
Thanks for your help, Andre!
Awesome !! I had the document in my hands, but I am sad I couldn't share it. I am glad you found a version that is publicly available !
Cheers !