Introduction to Intel® oneAPI DPC++ Programming | oneAPI | Intel Software

Hi, I’m Sravani. In this video, I introduce Intel’s direct programming language, Data Parallel C++, and the terminology of heterogeneous platforms. This video will help you understand the basic building blocks of a DPC++ program and the simplicity of compiling it to run on an accelerator. Data Parallel C++ is a high-level language designed to target heterogeneous architectures and take advantage of data parallelism. It enables you to reuse code across CPUs and accelerators while permitting custom tuning. DPC++ is based on familiar standard C++ language constructs, such as templates and lambda functions. In addition, it incorporates the SYCL specification from the Khronos Group. Intel has established a Data Parallel C++ open source project to help drive broad adoption, and we also work with the open source community to drive Intel’s extensions into SYCL Next.

Now let’s take a look at the building blocks of a DPC++ program. Before talking about the code, I’d like to quickly go over the terminology of heterogeneous platforms. A heterogeneous platform consists of a host, which is usually a CPU. The host runs your application and offloads kernels to an accelerator using an API or direct programming. Then we have the devices, which are accelerators that execute the compute-intensive parts of the application; in other words, they run the kernels. Together, the host and the devices make up a heterogeneous platform, or several platforms. For example, a CPU together with a GPU can make up one platform.
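To make the terminology concrete, here is a minimal sketch, not from the video itself, that asks the SYCL runtime to enumerate the platforms and devices it can see; the output varies by system:

    #include <sycl/sycl.hpp>   // older DPC++ releases use <CL/sycl.hpp>
    #include <iostream>

    int main() {
        // List every platform the runtime knows about, and each
        // platform's devices (CPU, GPU, or other accelerators).
        for (const auto& p : sycl::platform::get_platforms()) {
            std::cout << "Platform: "
                      << p.get_info<sycl::info::platform::name>() << "\n";
            for (const auto& d : p.get_devices())
                std::cout << "  Device: "
                          << d.get_info<sycl::info::device::name>() << "\n";
        }
        return 0;
    }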
Now let’s look at a DPC++ program that performs the addition of two vectors on the GPU; a complete sketch of the listing appears after this walkthrough. Step 1 is to include the language headers and namespace, which provide the templates and class definitions needed to interact with the runtime library.
Step 2 is to create buffers, which allocate the memory that stores the data. Step 3 is to create a command queue for a specific device using a device selector. The device selector can be a default selector, a CPU selector, or a GPU selector. If the device is not explicitly mentioned during the creation of the command queue, the runtime selects one for you. However, it’s good practice to specify the selector to make sure the right device is chosen.
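As an illustration of step 3, here is a minimal sketch using the SYCL 2020 selector syntax (which may differ from the release shown in the video); it creates a queue and prints the device it was bound to:

    #include <sycl/sycl.hpp>
    #include <iostream>

    int main() {
        // gpu_selector_v requests a GPU and throws if none is present;
        // default_selector_v and cpu_selector_v are the other common choices.
        // Older DPC++ releases used selector objects such as sycl::gpu_selector{}.
        sycl::queue q{sycl::gpu_selector_v};
        std::cout << "Queue bound to: "
                  << q.get_device().get_info<sycl::info::device::name>() << "\n";
        return 0;
    }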
For step 4, you can now start submitting the work to the device by calling the submit method and providing it a lambda object. The next step is to create accessors to the buffers inside the lambda object in order to access the buffer data on the device. This gives the runtime information about how the kernel will work with the buffer data. Step 6 is to then send the kernel for execution by calling the parallel_for method, which executes the work in parallel. The code written inside parallel_for is run by the accelerator, which is a GPU in this example, and the rest is
executed on the host. The results are copied back to vector C through the destruction of buffer C: when the scope ends, the data is copied from the device back to the host. We can see here that everything needed to execute the addition of two vectors on a GPU is written inside a single, complete C++ program, unlike in OpenCL.
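For reference, here is a sketch of the complete vector-add program described above, written against the SYCL 2020 API (the exact syntax in the video’s release may differ slightly); the step comments match the walkthrough:

    #include <sycl/sycl.hpp>   // Step 1: language headers and namespace
    #include <iostream>
    #include <vector>

    int main() {
        constexpr size_t N = 1024;
        std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

        {
            // Step 2: buffers allocate/wrap the memory that stores the data.
            sycl::buffer<float> buf_a{a}, buf_b{b}, buf_c{c};

            // Step 3: a command queue created with a device selector.
            sycl::queue q{sycl::gpu_selector_v};   // throws if no GPU is found

            // Step 4: submit the work to the device with a lambda object.
            q.submit([&](sycl::handler& h) {
                // Step 5: accessors tell the runtime how the kernel
                // will use each buffer's data on the device.
                sycl::accessor A{buf_a, h, sycl::read_only};
                sycl::accessor B{buf_b, h, sycl::read_only};
                sycl::accessor C{buf_c, h, sycl::write_only};

                // Step 6: parallel_for executes the kernel on the device.
                h.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
                    C[i] = A[i] + B[i];
                });
            });
        }   // buf_c is destroyed here, copying the results back to vector c

        std::cout << "c[0] = " << c[0] << "\n";   // expect 3
        return 0;
    }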
Now let’s look at the compilation flow of a DPC++ program. The DPC++ compiler toolchain is capable of compiling both the host and device code into a single binary with a single compiler command.
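As a rough example, assuming a source file named vector_add.cpp, that single command could look like this (dpcpp was the original driver name; newer oneAPI toolkits use icpx with the -fsycl flag):

    dpcpp vector_add.cpp -o vector_add        # older oneAPI toolkits
    icpx -fsycl vector_add.cpp -o vector_add  # newer oneAPI toolkits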
To recap, DPC++ is a single-source, heterogeneous programming model that will help you write portable code across different target accelerators. Watch the overview of DPC++ programming video, access Intel’s open source project for LLVM, and download the DPC++ compiler with the oneAPI toolkit from the links. Thanks for watching. [INTEL JINGLE]
