Overview of oneAPI DPC++ Programming | oneAPI  | Intel Software

Hi, I’m Sravani. In this video I give an overview of DPC++ programming, covering basic data types, the execution and memory models, lambda kernels, and error handling. Make sure to watch the introduction to DPC++ programming video in the links for the heterogeneous-platform terminology used in this video.

First, let’s talk about the
basic data types of DPC++, starting with device selectors. Device selectors are used to choose the device on which kernel code runs. The default selector class chooses the most performant device, falling back to the host device if no accelerator is found. Other predefined selectors are also available.

Next are queues, which
take device selector instances as a constructor parameter and are used to enqueue kernels for offloading. Kernels are scheduled asynchronously at runtime, based on data dependencies, to ensure correct results. Command groups encapsulate the kernel and its data requirements in a lambda.

Then there is the
execution model. Each kernel execution results in work-items and work-groups. A work-item is an instance of the kernel body and is similar to a CPU thread; multiple work-items run the same kernel body code in parallel. A work-group is a group of work-items, and each work-group uses the device’s local memory. The global size is the total number of work-items, and the local size is the number of work-items per work-group.

Next is the memory model. The memory subsystem is divided
into global, constant, local, and private memory. Global memory is accessible to all work-groups and can be both read and written. Constant memory is a read-only part of global memory, initialized by the host. Local memory is shared within each work-group and is allocated for the duration of kernel execution. Private memory is similar to a CPU register: it is per work-item and stores variables created in the kernel.

Now let’s go back to data types. Buffers store arrays
of 1 to 3 dimensions. When we pass a raw pointer to the data, along with the dimensions of the array, to the buffer’s constructor, the buffer instance takes ownership of the memory. This pattern is called Resource Acquisition Is Initialization, or RAII. When the buffer’s scope ends, the buffer instance releases the memory.

Accessors access data
managed by buffers and are used to define read and write operations. Accessors are declared inside a command group so that the kernel code can access the data.

Next are basic lambda kernels. A single task executes the kernel using a single thread on the device; parallel_for, on the other hand, executes the kernel body using multiple threads on the device. There are multiple overloads
of parallel_for in DPC++. Hierarchical kernels are used to specify what to do for a work-group and what to do for each work-item. The number of work-groups and the size of each work-group can be customized. The kernel code within parallel_for_work_group is executed once per work-group, and the code within parallel_for_work_item is executed once per work-item. There is also an implicit barrier after each parallel_for_work_item. Here’s an example of a hierarchical kernel that uses 64 work-groups of eight work-items each.

Next is error handling. There are two types
of errors in DPC++. First, there are synchronous errors, which are C++ exceptions and are reported immediately. Second, there are asynchronous errors, which occur during kernel execution, that is, in device code rather than host code. Asynchronous errors can be caught using an exception handler. Here is an example
of error handling. Now you have a better understanding of how DPC++ programming works. Make sure to check out the links for more resources about DPC++ programming, and don’t forget to download the oneAPI toolkit to try it out for yourself. Thanks for watching.

[INTEL SIGNATURE MUSIC]
