RubyConf 2017: High Performance GPU Computing with Ruby by Prasun Anand

(upbeat music) – Hi, I am Prasun Anand
and I’m here to talk about high performance GPU computing with Ruby. I’m really glad to be
here and I really thank the RubyConf organizers
for having me here. Very few people realize that
even modest computers today have very high-powered GPUs that can be used in parallel with the CPU, or in series with the CPU, to deliver really awesome performance. So in this talk I'd like to talk about two Ruby gems that I have created in the last year. One of them is the ArrayFire gem, and the other is the RbCUDA gem. What these libraries do is help you accelerate your code for number crunching and scientific programming, and gain performance improvements just by adding a few lines of code, maybe four or five per operation.

Before we delve into the topic, let me introduce myself. I am a SciRuby contributor; SciRuby is also known as the Ruby Science Foundation. What we do is create Ruby gems for scientific computing. I worked as a Google Summer of Code student for the Ruby Science Foundation in 2016 and 2017. Currently, I'm associated with the GeneNetwork project, where we create tools for high-performance genome scans on clusters, GPUs and Teslas. Recently I was awarded the Ruby Grant 2017 by the Ruby Association to work on the RbCUDA gem. These are the projects that I did: first, a JRuby port of NMatrix. NMatrix is a linear algebra library, and I ported it to JRuby. Then I created the ArrayFire gem in 2017, and currently I'm working on RbCUDA.

On to scientific computing. Ruby has been around for 25 years, but people still don't prefer it as a go-to tool for scientific computing, for solving these problems, or for number crunching. In the last few years the Ruby Science Foundation and others have created gems for scientific computing. What we do is handle very large sets of data for data analysis, machine learning and so on. Currently, SciRuby has gems like NMatrix, Daru and Nyaplot. NMatrix is for linear algebra, whereas Daru is for data analysis, just like Pandas in Python, and Nyaplot is a plotting library. Also, since Python has a head start, we use Python to solve certain problems which we can't currently do in Ruby. We have gems like pycall.rb that help you call Python from your Ruby code for computing.
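As a quick illustration, a minimal pycall sketch looks something like this (assuming pycall and NumPy are installed on your machine; the import helper shown is pycall's documented one):

    require 'pycall/import'
    include PyCall::Import

    # Import Python's NumPy module into Ruby under the name `np`
    pyimport :numpy, as: :np

    a = np.array([1.0, 2.0, 3.0])  # a NumPy array created from Ruby
    puts np.dot(a, a)              # NumPy does the computation (=> 14.0)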
Now, arrays and matrices. Whenever you have a scientific problem, the data you have and the code you write will need an array or a matrix, and for these arrays and matrices there are only a few libraries. For example, if you have a blob of data that is a matrix of five thousand rows by five thousand columns, and I'd say that's on the small side, then to handle these large arrays and matrices you need specialized libraries. NMatrix, for example, helps you handle matrices on the CPU. These linear algebra libraries that handle matrices must be memory efficient and they need to be fast. You need fast loops to iterate through all the elements of the matrix or array, and you need to save memory: since your RAM size is limited, you have to handle it efficiently so that you don't run out of RAM.
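For example, a small NMatrix session on the CPU might look like this (a minimal sketch; the shapes and values are only for illustration):

    require 'nmatrix'

    # A 2x3 matrix of doubles, stored and operated on entirely on the CPU
    m = NMatrix.new([2, 3], [1, 2, 3, 4, 5, 6], dtype: :float64)

    p m.dot(m.transpose)  # 2x2 matrix product, computed by NMatrix on the CPU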
We have the BLAS and LAPACK libraries: BLAS and LAPACK are FORTRAN libraries that help you do matrix computation by harnessing the multi-core support of the CPU. Whenever you need to do scientific computing or number crunching in C, you need BLAS and LAPACK or other such libraries. Since BLAS and LAPACK are FORTRAN libraries, we have C bindings for them, and NMatrix calls these C bindings for linear algebra. Similarly, Numo is another package, providing an NArray that does the same thing; Numo and NMatrix provide almost the same functionality.

Let's move on to GPU computing, because GPU computing is not easy.
For a beginner trying to do GPU computing in C, you need to handle pointers. Currently, GPU computing is done by writing kernel code: you write C-style kernel code in a .cu or .cl file, you compile it, and then you load that code onto the GPU hardware. Then you need to handle the pointers that you created and perform operations on them. So, CUDA and OpenCL: these are the two platforms that we currently use for GPU computing. CUDA is limited to Nvidia hardware and is a proprietary solution; it can't run on GPUs from other vendors like AMD or Intel. OpenCL, on the other hand, stands for Open Computing Language, and it runs on all GPU hardware regardless of the vendor. In my experience I've seen that CUDA has better performance than OpenCL, even on Nvidia GPUs.
So here comes ArrayFire. ArrayFire is a C library that is used for general-purpose GPU computing. It's an abstraction where you create an Af_Array that lives on the GPU device, and you don't have to bother about what kind of hardware you are using, whether the GPU is from Nvidia, AMD or Intel, or whether CUDA or OpenCL suits your needs better; it just tries to give you the best performance. Recently, ArrayFire also supports the CPU, in case you don't have access to a nice GPU on your machine: you can just use ArrayFire and it will automatically try to run the same code on the CPU. ArrayFire also has wrappers in Python, Go, Julia and Rust, and what I did was create a Ruby wrapper for ArrayFire, and it really makes our work easy.
This is how you create an Af_Array; an Af_Array can have up to four dimensions. This line shows how to create an Af_Array of two dimensions. The highlighted syntax shows an Af_Array with a dimension count of two; the next argument, [2, 2], is the size, meaning two rows and two columns; and then come the elements, 1, 2, 3, 4. When you create a matrix, an Af_Array, using this code, you get the elements shown below. This is column-major format, so you see one, then two, then three, then four.
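Reconstructed from that description, the code on the slide looks roughly like this (a sketch based on the talk; the require name and the ArrayFire::Af_Array namespace are my assumptions about the gem):

    require 'arrayfire'

    # 2 dimensions, shape [2, 2], elements given in column-major order
    a = ArrayFire::Af_Array.new(2, [2, 2], [1.0, 2.0, 3.0, 4.0])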
Next, we add the array to itself and store the result in b: 1, 2, 3, 4 added to itself gives you 2, 4, 6, 8.
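As code, that is just one line, continuing with the array a created above (element-wise addition, as described):

    b = a + a  # element-wise add on the GPU: 1, 2, 3, 4 becomes 2, 4, 6, 8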
Next is matrix multiplication. For anyone unfamiliar with data science: we use matrix multiplication most of the time in our number-crunching code. Here we have two arrays called left and right; one of the arrays has dimensions 3x3 while the other is 3x2. Then we do the matrix multiplication, as simple as this.
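A sketch of what that call looks like, continuing the earlier example; note that the ArrayFire::BLAS.matmul name here is my assumption, mirroring ArrayFire's C af_matmul, and the real gem API may differ:

    # left is 3x3 and right is 3x2; the product is a 3x2 Af_Array computed on the GPU
    left   = ArrayFire::Af_Array.new(2, [3, 3], (1..9).map(&:to_f))
    right  = ArrayFire::Af_Array.new(2, [3, 2], (1..6).map(&:to_f))
    result = ArrayFire::BLAS.matmul(left, right)  # module/method name is an assumption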
So how did we implement it? What I do is create an AF structure type called afarray. Then, in the next highlighted line of code, I cast the values coming from the Ruby VM from Ruby numbers to a C double type. Next, I create an Af_Array using af_create_array, provided by ArrayFire; it copies the host array data into GPU memory. With GPU computing you can't access the data directly: you initially create an array on the host device, that is the CPU, then copy that array from the CPU to the GPU, then on the GPU you run the kernel code that interacts with that array, and when you get the final result you copy the data back from the GPU to the CPU. But in the case of ArrayFire you don't have to worry about that, because it abstracts all of this away and makes it as simple as the code above when you create an Af_Array. In the next example, I just get that pointer and do a matrix multiplication operation. In the first highlighted line we have created an afstruct left; then we also create an afstruct result and allocate device memory to it. Next we call the af_matmul API, which takes the device pointers of left and right, does the multiplication, and stores it into the result.
So these are the BLAS and LAPACK functionalities. The BLAS functionalities are matmul and transpose, whereas the LAPACK functionalities are determinant calculation, inverse calculation, calculating the norm, QR factorization, Cholesky factorization, SVD factorization and LU factorization. ArrayFire also provides you with APIs for calculating the mean, median or variance along different dimensions of your matrix; these are provided by af_mean, af_median and af_variance.
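To give a flavour of how such routines might be called from Ruby, here is a sketch continuing the earlier examples; the ArrayFire::LAPACK and ArrayFire::Statistics module and method names are purely my assumptions, mirroring the C API names, and are not confirmed gem APIs:

    m = ArrayFire::Af_Array.new(2, [2, 2], [4.0, 2.0, 2.0, 3.0])

    det  = ArrayFire::LAPACK.det(m)          # determinant (hypothetical wrapper name)
    mean = ArrayFire::Statistics.mean(m, 0)  # mean along dimension 0 (hypothetical name)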
Next, let's come to the benchmarks: does ArrayFire really provide high performance when it executes your code? I ran the benchmarks on an AMD FX 8350 processor and an Nvidia GTX 750Ti GPU, which is of the Maxwell architecture (the recent one is Pascal), but it's a decent GPU; we used the double dtype and the CUDA backend.

First, calculating the matrix determinant. In this graph, on the x-axis we have the number of elements in the matrix, whereas on the y-axis we have the computation time it took to do that operation; the lower the computation time, the better the performance. We are comparing NMatrix-LAPACK-Ruby, NMatrix-JRuby and ArrayFire. NMatrix-JRuby is what I created, a JRuby port of NMatrix, and NMatrix-LAPACK-Ruby uses LAPACK for the matrix calculation. In this case NMatrix-LAPACK-Ruby takes around 12 seconds for the determinant calculation, whereas ArrayFire takes around two seconds, so we have an improvement of 10X: ArrayFire is faster than NMatrix-LAPACK by 10 times. So we did a nice job. The same goes for lu_factorization: when you do LU factorization, the next step is that you can calculate the determinant from the diagonal elements, so this benchmark looks exactly the same as the matrix determinant calculation.

For matrix addition we have this benchmark, where NMatrix-Ruby takes around six seconds whereas ArrayFire takes around 0.0004 seconds, that is 400 microseconds, so the performance improvement is 10,000X. (audience applauding) Matrix subtraction is similar to matrix addition, because both are element-wise operations; instead of adding the two elements you subtract them, so the figures are exactly the same.

Now comes matrix multiplication. At the crux of any scientific computing code we have matrix multiplication; we call it a lot of times. In this case, NMatrix-Ruby has two ways you can call this BLAS routine for matrix multiplication: you can use either NMatrix-BLAS-Ruby or NMatrix-Ruby. NMatrix-BLAS-Ruby is faster because it uses FORTRAN, whereas NMatrix-Ruby uses C code. In this case, NMatrix-BLAS-Ruby takes around 31 seconds, whereas ArrayFire takes 0.00062 seconds, so 620 microseconds; the performance improvement is 100,000X.
The point of all this is that when you use ArrayFire, you don't have to worry about what kind of GPU hardware you are using: you write your code without worrying about whether you are going to run it on a CUDA platform or an OpenCL platform, on an Nvidia GPU or an AMD GPU. It just tries to give you the best performance out of it, and you can also tune it yourself.

Next, since Nvidia has better performance on GPU devices, and since ArrayFire is an abstraction, I tried to create something that would be even closer to the GPU hardware. So for that I created another project, called RbCUDA.
It runs only on Nvidia devices. ArrayFire was very easy because we didn't have to worry about transferring data from the CPU to the GPU or vice versa; here we need to handle everything. You need to take care of how you create the GPU array pointer, how you copy from the CPU to the GPU, and that the pointer is not garbage collected. So what we do is create a generic pointer, a void*, which just stores the device array location in the VM. Then you copy memory from the CPU to the GPU. And yes, it has been interfaced with NMatrix and NArray: you just add one line of code, NMatrix#to_gpu, and you get the GPU pointer. Similarly we can do this with NArray, but that's under development right now. This is an example of the code.
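That one line looks roughly like this (#to_gpu is the method described in the talk; since RbCUDA is still under development, treat the rest as a sketch):

    require 'nmatrix'
    # require 'rbcuda'   # the gem that adds the GPU interface (under development)

    m = NMatrix.new([2, 2], [1.0, 2.0, 3.0, 4.0], dtype: :float64)
    gpu_ptr = m.to_gpu   # copies the matrix to the GPU and returns the device pointer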
Once you have written your program and you think you can optimize it further, you might be interested in running your own custom kernel code on the GPU. RbCUDA helps you do that: where ArrayFire abstracts the kernels away, with RbCUDA we have created a bridge that can run your custom kernel on the GPU. This is what our kernel code looks like. blockIdx.x refers to one element in the block; when we call this kernel we pass in two arrays, *a and *b, add these two arrays element-wise, and store the result in c.
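Held on the Ruby side as a string, a kernel of the kind described above looks roughly like this (a minimal sketch; the RbCUDA calls that would compile and launch it are not shown):

    # CUDA C source for the element-wise add kernel, held as a Ruby string;
    # each block index handles one element, as described above.
    vector_add_src = <<~KERNEL
      extern "C" __global__ void vector_add(const double *a, const double *b, double *c) {
        int i = blockIdx.x;   // one element per block
        c[i] = a[i] + b[i];   // add the two inputs and store the result in c
      }
    KERNEL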
What makes RbCUDA different from running CUDA kernels in C is that you can run this kind of code live: you can be running your code in pry and just inject this kind of code. What I do is take this code, store it in a temp file, compile it using the Nvidia compiler, and as a result I get a .ptx file which can be run on the Nvidia GPU. So this is the code; it's a bit tough to understand. Running custom CUDA code was already possible with another Ruby gem called SGC-Ruby-CUDA, but what it lacked was support for the other libraries: it didn't have support for cuBLAS, cuSOLVER and cuRAND. In RbCUDA we'll have support for all of these, meaning we'll have ready-made routines for BLAS and LAPACK, so you can do matrix multiplication and even matrix decomposition, and you can also generate random numbers using different random engines.
So these are the benchmarks; again, they were run on the AMD FX 8350 octa-core processor, the Nvidia GTX 750Ti GPU and the double dtype. For matrix multiplication you can see from the lowest line that RbCUDA is even faster: NMatrix-BLAS-Ruby takes around 31 seconds, ArrayFire takes around 0.0006 seconds, whereas RbCUDA takes 0.0004 seconds, so we have a performance improvement of one million times.

So here comes the future work. ArrayFire, being a GPGPU library, a general-purpose GPU computing library, provides ready-made routines for image processing and also helps you write classifiers for machine learning, so I'll be working on these APIs, and on indexers. Currently only the double datatype is supported, so in the future we are going to have support for complex floats, et cetera. ArrayFire is in active development; it's being kindly funded by the Ruby Association, and contributions are welcome. You can check out these repos: the ArrayFire gem and its benchmark code can be found at github.com/prasunanand/arrayfire-rb and its benchmarks. You can try it on your machine. Since I ran these benchmarks on the Maxwell architecture, that is the 750Ti GPU, when you run it on Pascal GPUs, that is the Nvidia 1050 series, you can expect even 10 times more performance.

Now, acknowledgements. I would like to thank Pjotr Prins, who is involved with my Ruby projects and other projects in D and Scala, and Pradeep Garigipati, who is a core contributor to ArrayFire. I would also like to thank SciRuby, Google Summer of Code, and the Ruby Association for helping me continue my work in the field of open source. Thank you. (audience applauding)
