Grid Computing With Dataflow: UBS Tries Option Pricing as an Example Use Case (Cloud Next ’18)

[MUSIC PLAYING] REZA ROKNI: Good
morning, everybody. Thank you very much for
coming along and giving us some of your morning. My name is Reza Rokni. I am one of the
solution architects here at Google, currently based
out of our Singapore office. And part of the
solution architect role is to try and find new
and interesting use cases to apply for our technologies. And one of the things
I’ve been able to do is work with a few
banks, including Neil, and Reuven, who I’m
going to introduce now. So Neil, if you want
to introduce yourself. NEIL BOSTON: Hi,
I’m Neil Boston. I’m head of IB
Technology for UBS, and I also run UK Technology. So I run wealth management and
asset management out of the UK as well. REZA ROKNI: And Reuven? REUVEN LAX: My
name is Reuven Lax. I’m a software engineer
on Google Cloud Dataflow. REZA ROKNI: And so the
project that we’ve been doing is making use of Dataflow,
which is our batch and stream processing system to
do grid computing. And we’ve done some
pretty interesting work with building out some demos and
some near realistic scenarios. But to just describe
what grid is and then some of the challenges
and opportunities, I’m going to hand over to Neil. NEIL BOSTON: Thanks, Reza. So I’ll just talk through
maybe three or four slides very briefly. And it’s from a very
personal perspective as well, so if you
agree or disagree, please come and have
a chat afterwards, and we can debate
some of the points. What is grid computing? I guess, to me– not to read the words
exactly, but basically having a large cluster with
a set of nodes that we’re going to do some kind of
analytic calculation with, is generally the way
that I approach it. Now that could be
deterministic stuff that I’ve coded in myself,
could be elements of ML as well. But basically, I look
at it as a large cluster of calculation agents that
are doing some set of tasks. Generally, hopefully you can
make them relatively stateless and do it in a relatively
independent way, so when you aggregate
and bring the data back, that could be thought
up and thought out in a relatively
straightforward manner. So as with most
things in my life, I started with three wishes. It’s not asking for
infinitely many wishes. It’s not what a wish is. Or a large but finite amount
of wishes is one of the wishes, I guess. So when I approached
and talked to Reza and we were discussing
this, I guess in London, it was actually in a pub when we
first started talking about it. REZA ROKNI: As all good
engineering projects in London are. NEIL BOSTON: As all good
engineering projects in London are, near a place
called Farringdon, which was very good,
I started writing down what my users’
stories were, but they became three wishes basically. So I wanted to write single
threaded simple calculation class as functors or packages. I didn’t have to worry about
things like multi-threading, for example. I wanted to stop thinking
about creating my own directed acyclic graphs,
bitemporal or not. I just basically wanted to
focus on my IP and my analytics in a very simple way. But I like to be able
to tweak my experiments and run my experiments in
a much more seamless way, rather than having to
recompile and rebuild a lot of my analytic classes. So I wanted some environment
I could basically drop my functors inside,
run a set of experiments, get set of results out,
and do that repeatedly, either manually or through a
set of machines, potentially. So three simple wishes, I guess. Challenges. I guess, I’ve been in markets
technology for 150 years now, so a long time. So quants, I wanted
to really focus on my quantitative analysis,
people writing maths. I just wanted to focus on
the maths and the outcomes rather than thinking
about this whole fabric. So over the years I’ve
been in markets technology, you can spend 20% of your time
thinking about the mathematics and 80% of your time thinking
about how you make this thing scalable, how you make
it distributed, how this thing can scale. And actually, one of the
challenges as well is you can spend a lot of your time– I guess my background is C++– is trying to optimize your
C++ code to run on your PC, which is obviously
diminishing returns of scale. So again, I wanted to try
and get away from that. I wanted the people to
focus on the mathematicians and physicists and
engineers, to focus on running experiments,
rather than thinking about the other aspects about
the ecosystem, and much more about the maths, the data
coming back and the outcomes, and analyzing that,
either manually or, again, through machines. And I wanted to be able
to do it very, very fast. So if I didn’t know
which model to choose and I had a category
of models, to be able to get to some reasonable
outcome in a relatively quick way by running
experiments at high velocity was really the challenge. So kind of utopia, I guess,
from a quant perspective. Opportunities, running it on
demand and also event driven. There are two sides very
much in markets technology. I want to sometimes run things
at very large scale and lots of things, which you’ll see
later in the experiment. But I also have things
coming in at high frequency that I want to
deal with as well. So you can think of it as a
bifurcation a little bit, but you don’t want a bifurcation
in ecosystem or fabric. I want to be able
to run a universe of parallel experiments and
then do some assessment of them, either through some
parametrization, through some MLPs,
or again, manually, just looking at these
things in a reasonable way, and then run huge, I
guess, simulations. So if you come from banking
and have a look at the way the market’s gone,
especially with regulation in recent years, running
large Monte Carlo simulations, for example, [? onport ?]
entire portfolios across all the markets that the investment
bank of the markets division has is, again, a
significant challenge. I’ll talk about it
a little bit later, about the organizational
issues that you have around some
of the technology choices the banks make
to try and accommodate some of those solutions. But again, regardless of
the simulation experiment, I wanted a relatively stable
and consistent ETL style environment that I
wasn’t too worried about. Now I’ll hand over to Reza, and
he can talk through Dataflow. REZA ROKNI: Sure. And so this is interesting,
because Dataflow we use predominantly in
things like ETL pipelines. So we’re taking data,
we’re processing it, we are passing that on
to things like BigQuery. So why are we using
this for grid computing? To look at this– and this was
after many conversations when we were sitting down and
storyboarding this stuff and looking at architectures– you kind of have to
deconstruct what a grid is, and then it makes
perfect sense why Dataflow is a very good tool
for solving this problem. So what is a grid? First of all, it’s data. It’s a bunch of math
functions, so smart maths folks like these guys. You need lots of
CPUs, and you need some way of scheduling
jobs on those CPUs. So you need to get data
and functions to the CPUs and get them to do the
work, and you need storage. So let’s go through those
in a little bit more detail. And we start getting
into why Dataflow is particularly useful here. So in terms of data, there’s
quite a few data sources. First, there’s internal
data sources, so the trades, et cetera, that are
happening within the bank. There’s a lot of
information there. There’s external data
sources potentially from market environments, all
of these raw underlying data that we need to then
do some processing. And then there’s derived data. So this is data where you’ve
taken underlying information, you’ve done some
processing on it, and you’ve got some new data
that you need to distribute, for example, yield curves
that are derived based off of things like swaps
and deposit information that you have early on. Now the next piece
around data is that it doesn’t come
in a nice, static file. It’s streamed. It’s continuously being updated. Again, this is starting
to get interesting in terms of what Dataflow
is capable of, which is being able to do stream
processing on the datasets that are coming in. Next, it’s functions. It’s maths functions. It’s a lot of CPUs. One of the interesting
challenges, fun bits, we had is a lot of the quant
code that’s been written in the last 20 years,
I guess, is C++. And Dataflow today does not
support C++ as a standard language. And we had to do some work,
and I worked with Reuven to get Dataflow to
run C++ code for us. So that was a little bit of fun. NEIL BOSTON: So I
was just going to say, I guess the 3.1 wish is I
didn’t want to JNI into the C++. I wanted to think of it
in a much more, I guess, natural way. So I guess that was my 3.1
wishes, rather than just three wishes. REZA ROKNI: And I found out
from Reuven that actually, C++ is the original
Flume, right? REUVEN LAX: Yes, so
internally at Google, we do run a version
of Dataflow on C++. There was just– at the time
we came out with Dataflow, there wasn’t as much of a
cloud market interested in C++. Most people were interested
in Java or Python or other languages. REZA ROKNI: So we basically retrofitted C++ running on Dataflow, even though the original Flume was all C++.
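As an illustration of one way this can be wired up (an assumption for this write-up, not necessarily the exact mechanism used in the UBS work): a Java DoFn can invoke a pre-staged, statically linked C++ binary as a subprocess on each worker. The binary path and argument format below are hypothetical.

```java
import org.apache.beam.sdk.transforms.DoFn;

// Minimal sketch: calling native C++ pricing code from a Beam Java DoFn by
// shelling out to a pre-staged binary. Binary path and I/O format are hypothetical.
class RunNativePricerFn extends DoFn<String, String> {
  private static final String PRICER_BINARY = "/opt/pricer/fd_pricer"; // hypothetical

  @ProcessElement
  public void processElement(@Element String scenarioArgs, OutputReceiver<String> out)
      throws Exception {
    Process proc =
        new ProcessBuilder(PRICER_BINARY, scenarioArgs).redirectErrorStream(true).start();
    String result = new String(
        proc.getInputStream().readAllBytes(), java.nio.charset.StandardCharsets.UTF_8);
    if (proc.waitFor() != 0) {
      throw new RuntimeException("pricer exited with an error: " + result);
    }
    out.output(result.trim());
  }
}
```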
The other piece around this is kind of that ETL work. So you’re having to munge data. You have to shape it, enrich it. And then once you’ve
got that data, there’s a place of storing it. You need to then store– pull from storage within
the pipeline itself. You may potentially need to
do output to storage as well. And these grids
have huge fan-out. So I can take a couple of megs
worth of market information, or even, let’s
say, a gig, and it generates terabytes of output. So we need some way
of actually being able to deal with huge fan-out
from these environments. And then there’s one
secret ingredient that actually means that
one of– the testing we’ve done so far has been
very, very productive. And that secret ingredient
is the movement of data. So if you think about the
fact that we’re pulling source information, we’re doing ETL. Then we’re sending that data
to the next stage, which is doing the processing
with that data, so running those
maths functions. And then we’re sending that
data again to the storage. These are all chunks
of data movement. Now with Dataflow,
we can actually create a DAG, a distributed– a Directed Acyclic
Graph, sorry– of computation. And those connections between
all of the nodes, Dataflow takes care of moving
the elements for me. So this is, again,
why we started thinking, actually,
Dataflow is a very good fit for this environment. And when we think about
this movement of data, one thing I noticed, because I
don’t actually originally come from a finance
background, was you see a lot of organizational
structure coming into how the DAG is created. So by that, I mean there’s a
team that does some ETL work, and they generate a file. This file is then moved to the
team that does the grid work. And then the grid work
outputs another file, which is moved to the team
that does the analytics work. And what that is
doing is essentially creating this
organizational structure on top of how the
technology should look, but that’s not
really what you want. What you want is one
continuous operation from beginning to end. And Neil will explain this in
better detail than I could. NEIL BOSTON: So I
guess this slide here, this schematic, it’s not
specific to the banking industry. But the way that
I’ve perceived it, over many industries
over the years, is they tend to design technology
by organizational construct. So what you tend to find
is people within group A, B, and C will actually
have a view about how their world looks. And maybe they have
specific user stories that are giving them some niche
aspect to the design patterns. Well, what they tend to
do is build the ecosystem above and beyond their
own boundary conditions, and then hand over the
data or the information to the next group. So then you end up in data
translation and ecosystem translation, and
then reconciliations and that style of problem. So again– and it hits– like I said, it’s not just
our industry specifically. There’s other industries. But I think it can be that
people then start thinking about technology choices. Rather than actually
thinking about what’s optimal for their solution front
to back or from left to right, more about thinking about how
their organizational construct is driving that. So again, one of
the things that I wanted to challenge a
little bit was that we don’t need to think about– when we think about
it in an evolution or a revolutionary
sense, I guess– is to think about how
we can offer something out that actually crosses those
boundaries naturally and has a very integrated stream
based and batch based approach as well to the problem. REZA ROKNI: So a little bit more
about BEAM now, and a better person than me to
describe it is Reuven, one of the technical leads
who built BEAM, and Dataflow, the [? runner ?] for BEAM. REUVEN LAX: Yeah, thank
you, Reza and Neil. Oh, thank you for the clicker. This use case is very
interesting to me, because as Reza mentioned,
I think a lot of people see Dataflow as an ETL tool. It’s a very common use case to
use Dataflow to get your data into files or into BigQuery. But Dataflow is actually
a programming model, and it’s a programming model
that fits this grid use case very well. You divide your logic into
massively parallel computation with stateless functions. And then at key points,
you also put aggregations, where you need to bring the
data together to compute a count or to write it out to BigQuery. And the key to this
is, as Reza said, this data movement, which
is one of these things that sounds easy. You just have to
move data around. And it turns out that
writing a system that can move massive
amounts of data, at scale, efficiently,
with strong semantics, never moving an element twice,
never duplicating an element, is surprisingly hard to do. And integrating that all together is one of the advantages you get out of Dataflow that makes it work really well for systems like this.
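A minimal sketch of that shape in the Beam Java SDK (element types, file paths, and the placeholder maths are illustrative assumptions, not the actual UBS pipeline): a stateless, massively parallel step followed by an aggregation point.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class GridShapePipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply("ReadScenarios", TextIO.read().from("gs://my-bucket/scenarios/*"))   // hypothetical input
        // Massively parallel, stateless "calculation agent" step.
        .apply("Price", MapElements.into(TypeDescriptors.strings())
            .via((String scenario) -> scenario.toUpperCase()))                    // placeholder maths
        // An aggregation point where data is brought back together.
        .apply("CountResults", Count.perElement())
        .apply("Format", MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + "," + kv.getValue()))
        .apply("Write", TextIO.write().to("gs://my-bucket/output/results"));      // hypothetical output
    p.run();
  }
}
```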
So the way we have Dataflow designed and presented is as a three-layer system. So we have APIs. Currently, we have a Java
API and a Python API, and we have more APIs coming. So there is a Go API,
which is usable today in an early access. There’s also a
SQL API, so you’ll be able to write Dataflow
pipelines entirely in SQL. Then there’s this programming
model I talked about. All these APIs just live on
top of this programming model, and then multiple runners
to run your pipeline. So there’s the
Google hosted runner called Google Cloud Dataflow. This is this fully
managed runner. We just point it at
our cloud and it runs. We also have open-source
runners, one on top of a system called Apache Flink,
one on top of a system called Apache Spark. So your Dataflow
pipelines are not bound to only
running on Dataflow. So this is interesting
in the hybrid cloud case. We’ve also found in the
financial services industry, this has been interesting
to some people for regulatory reasons. If a job needs to
be rerun on prem, if a regulator
requests it, it can be done with one of these
open source runners. So what is the advantage of
the Cloud Dataflow system? So Cloud Dataflow is one of
the runners for this BEAM API. So Cloud Dataflow just presents
this fully managed, no-op serverless solution. So you don’t have to– the SDK gives you a nice, simple
API to develop these things. But then Dataflow
manages security for you. It manages the cluster for you. You don’t have to spin up a
cluster to run your pipelines. It will auto-scale
your pipelines. So Dataflow will attempt to
learn how big of a cluster you need to run your
pipeline, and it will actually shrink and grow this
cluster over time. As you get towards
the end of your job and there’s less
data left, Dataflow will just start
shrinking the cluster and try to optimize cost
as much as possible. There’s also a lot of
integration with TensorFlow based machine learning models. So there are TensorFlow
APIs that you can integrate inside your Dataflow job, and
there’s a suite of TensorFlow based transforms that
you can run via Dataflow. So there’s a suite
called TFX which is a suite of machine
learning transforms for prepping and evaluating
machine learning models. So a classic
example of that is I trained a new machine
learning model, and I run a simple
analysis on it. It says, OK, it’s 5%
better than my last model. Then you deploy it,
and you find out that, oh, it was 5%
better on average, but it was producing
worse results for everybody from the state
of Florida, for instance. This is actually
not a better model. So it turns out
evaluating these models is fairly tricky to make sure that
the model actually is better. TFX provides suites
of tools to do things. That’s one example of the
tools that TFX provides: good evaluation of models. And this all runs on
top of Cloud Dataflow. So I mentioned that Dataflow
lets you run serverless data analytics, as opposed
to other systems where you have to run a cluster. So what is the benefit of
serverless data analytics? Well, if you look, this
little chart on the left is traditionally what
you would have to do to run your big data analytics. So there’s a tiny little
piece of your work in which you actually
worked on your analysis and your insights, your
business logic, your functions, the actual quant stuff that
Neil wants his people to spend most of their time working on. And then you would
actually spend about 90% of your time on everything else,
building a monitoring solution to make sure your job is
running, performance tuning, figuring out how many workers
do I need to run this on? How much memory should
I give each worker? What particular type
of worker should I use? Should I run on
four core workers? Should I run on
two core workers? Should I run on 16 core workers? Then making sure my
utilization is up to snuff. Then figuring out a deployment
story and a configuration story. And people would often
spend a huge amount of time coming up with
different deployment stories for these systems. Resource provisioning,
handling growing scale. So I came up with
something that worked, and then suddenly, six
months later, my input data is 50% larger. The thing I came up with
before no longer works. Resetting, going
back to the start, and running through this
whole process again. Reliability,
setting up alerting, making sure that these
jobs actually run reliably and complete, as I expect. The advantage of serverless
is to cut out all of this pie except for the
analysis and insights. Focus on the actual
business logic you want to run
in your pipeline, and let the serverless
system, in this case, Google Cloud Dataflow,
handle all the rest for you. And finally, we’ve been
saying that Dataflow is not just a product. It’s a platform. And Google Cloud
platform is the platform that Dataflow is part of. So a platform is actually
not just one technology. It’s a group of
technologies that often act as a substrate to
build higher level systems. So as Reza’s going
to show you here, an interesting thing
here for this problem is not just Dataflow,
but how can I use Dataflow in conjunction
with BigQuery and in conjunction with Google Cloud Storage, and
use all these things together to provide a solution? And since Dataflow flow
has sources and sinks– as we’re going to show here– Dataflow has sources and sinks
to all these other GCP stores and data resources, Dataflow
provides a great substrate to link all of these
things together. So with Dataflow– so you
have sources and sinks. You have an API,
which, in this case, is the Dataflow API, which is
your way of linking everything together, saying
read from this data, run these transforms over
it, write to BigQuery, write to Bigtable, write to
Spanner, write to Pub/Sub, write to whatever
other sink you want to. And your pipeline
is declarative. And this image here is actually
an example of a Dataflow graph, in fact, I believe the one
that Reza will show you running in a second. And now Reza? REZA ROKNI: Yeah, so– if I can have the clicker. REUVEN LAX: Oh yeah, sorry. REZA ROKNI: Thank you. So just a couple of things
around the other bits of technology we use. I, when I was doing the
experimentation with Neil, I don’t actually have access
to UBS Bank’s data, obviously. I have to work in
my own environment. And we wanted to make
this as real as possible. I didn’t just want to make
synthetic data, especially for the inputs and
the ETL layers. So what we were
using was BigQuery. Now BigQuery is Google’s
petabyte scale data warehouse in the cloud,
fully managed and serverless. One of the key aspects that
I was making use of here is that it separates
processing from storage, which means sharing of
data is very, very easy. And this meant I
could get access to external data sources. In particular, Thomson
Reuters were very kind enough to provide me data. So they have a store
in their project where they’re putting
historic tick information and current tick information,
including things like swaps, deposits, et cetera, which I
need to use to build things like yield curves. They also, four
POCs, they can put up to 20 years of historical
data on to BigQuery. And I get the benefit
of just instantly being able to use that
stuff without having to move it around projects. So again, thank you,
Thomson Reuters, for helping me do this. If I can– one of
the other pieces that we won’t show you
in the demo right now, but we will be using in
the real thing, is today I’m actually sending even
the raw data to BigQuery, like very granular data, just
so I can show it to you guys. But in the real
production system, we’d send outputs,
the results, which is still many billions of rows
of information, to BigQuery. But the raw data
that we want to keep is going to be many,
many terabytes. That, I would send to Bigtable. Bigtable linearly scales,
and we use it internally within Google for solving
exactly that kind of problem. And it’s perfect for
this environment, because we have many
thousands of cores, all running, generating data. And if we don’t want
to create a bottleneck, we can just use Bigtable
and linearly scale it out, and have all of those
cores just push data directly into Bigtable at
many, many tens of thousands or hundreds– actually,
hundreds of thousands of QPS. So with that, Neil
is going to walk through what the
pipeline’s actually doing in terms of the finance side. NEIL BOSTON: Yeah, so I guess
based on the 3.1 wishes, I set up a very
basic experiment. I mean, I don’t know how
much background you have, I guess, in investment
banking and markets piece, but basically, the most basic
experiment I could think of was the source market
environments or market data. So that could be things
like FX spot or yield curves of volatilities. Take a portfolio,
a set of trades. In this case, we
just randomly created a million option trades. Basically, take that
as our underlying data. Process it, as you can see here. Apply some transforms to the
market data, which is really– you can select a
set of market data. You then have to apply– in this case, we used an
open source quant library to produce a yield
curve functor, I guess, at the end of the day, an
object, from the input. Process the data. I also wanted to
create scenarios. So what I mean by
that is in the market, you tend to have a set
of points coming in. You also want to
shift things around. So you could end up with a
curve with 10 points on it, for example. And I want to shift each
point independently, and I also want to do these
parallel shifts as well. So you can see I wanted to
try and create scenarios on the fly which I
thought would be good just from an experimentation
point of view. Then I wanted to run a set of
analytics on it, some of which I coded in, which is very basic. And then analyze the experiment. So this was kind of the
most basic experiment I could think of. I could choose a different trade
set to make it easier to price and easier to risk. But again, it was just
personal bias and preference that I wanted to try options
from a personal perspective. So based on the 3.1 wishes, this
was the most basic experiment that I thought we
could start with, just to see what that looked like. Scenarios, again, I just chose
very market standard things, non-parallel shifts, weighted
shifts, independent shifts, and parallel shifts, again,
to create a much larger set of output datasets, more in line
with what I see in the market. REZA ROKNI: Thank you. Could we switch over
to the laptop, please? So we have a good problem
to have with a demo. So we had timed this
so that it is actually doing the scenario running while
I switch over to the laptop. Last night I was working
with one of our performance engineers from
Dataflow, and it’s made it a little bit too fast. So what you won’t see
right now is this stage. And I’ll talk through
what that’s doing. You would have seen the
numbers going across here, but it’s already zipped past it. I’ll walk through what
we’re showing now. So what I have here is
the monitoring interface for Dataflow, so the graph
that I’ve built in code. I’ve submitted a job. At the point I submit the job,
there’s no machines running. Everything is cold. By submitting the
job, Dataflow starts spinning up the number of
workers that I’ve enabled. It starts sending my
code to the environment, and it starts doing all of the
sink and source information that we need. So in particular, here
on the right hand side, we see information
about the job itself. It’s still running,
because it’s now pushing the output to BigQuery. We can see that it spun up 1,000
workers, and in this instance, it’s using 4,000 cores
from our environment. Dataflow does support auto-scaling, so we can actually let it bring up the number of machines it needs and then bring them down. I deliberately don’t use that, because I just want to make this run as fast as possible, which introduces some potential inefficiencies. But for this demo, we’re just like, start with 4,000 and keep going with the full 4,000.
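A sketch of how that kind of fixed-size worker pool is configured on the Dataflow runner (the project, region, bucket, and machine type are placeholder assumptions, not the real settings):

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class FixedWorkerOptions {
  public static DataflowPipelineOptions build() {
    DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setProject("my-gcp-project");           // hypothetical project id
    options.setRegion("us-central1");               // hypothetical region
    options.setTempLocation("gs://my-bucket/tmp");  // hypothetical bucket
    // Pin the worker pool instead of letting the service autoscale,
    // trading some efficiency for the fastest possible wall-clock time.
    options.setNumWorkers(1000);
    options.setMaxNumWorkers(1000);
    options.setWorkerMachineType("n1-standard-4");  // 1,000 workers x 4 cores = 4,000 cores
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.NONE);
    return options;
  }
}
```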
A little bit more information here: I’m able, in my code, to put custom counters that actually tell me what’s going on within the environment. And it is tightly integrated with things like Stackdriver. So you can actually have monitoring going on while your big jobs are running. You can see where things are getting to.
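A minimal sketch of such a custom counter using the Beam metrics API (the DoFn and counter names are illustrative):

```java
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch: a custom counter that shows up in the Dataflow monitoring UI and
// Stackdriver while the job is running. Names are illustrative.
class CountScenariosFn extends DoFn<String, String> {
  private final Counter scenariosProcessed =
      Metrics.counter(CountScenariosFn.class, "scenarios_processed");

  @ProcessElement
  public void processElement(@Element String scenario, OutputReceiver<String> out) {
    scenariosProcessed.inc();
    out.output(scenario);
  }
}
```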
In this particular instance, we were sending a million trades, and the number of scenario-trade combinations comes out to 847 million. So not quite the billion, but that’s our next set of experiments. A little bit more information:
these are some of the options that I am sending in to the job. So if we look at the
DAG itself, so I’m doing some reading
of the information. So there’s the
market data that I’m reading from Thomson
Reuters’ information. We have some swap data. If I expand this out, what
you can do within the code– each one of these
boxes generally corresponds to a transform. And I can collapse logical
things together within my code, and what you see there
is like, for example, we can swap [? OIS. ?] We can
swap [? USDOIS. ?] These are all being put into
a single transform. The way I’ve done
this isn’t how we’d do a production version of this. We’d just have it read swaps, and use code to do the different things you need to do for all the different data types. But I just used this for good illustration.
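A sketch of how several steps can be grouped into one named, expandable box on the graph, using a composite PTransform (the "parsing" steps are placeholders, not real swap-quote handling):

```java
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Sketch: a composite transform that renders as a single collapsible box in the
// Dataflow monitoring graph.
class ReadSwapQuotes extends PTransform<PCollection<String>, PCollection<String>> {
  @Override
  public PCollection<String> expand(PCollection<String> rawRows) {
    return rawRows
        .apply("DropComments", Filter.by((String row) -> !row.startsWith("#")))
        .apply("Normalize", MapElements.into(TypeDescriptors.strings())
            .via((String row) -> row.trim().toUpperCase()));
  }
}
```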
So this is the read stage. If I highlight every single element here, we can actually see the input and outputs of each one of these stages. Now I’m going to move down and– sorry, I’m not using the
touchpad on this device. One thing that we did as an
experiment that’s worked out quite well is rather
than have the trade flow through the DAG with
introspection to the data sources for all the data that
they need, I reversed this. So what we have is all of
the trades come as a lump, as a side input. I do all the ETL and
processing that I need to do, including scenario
generation, from the top. And then what I do is, because
the introspection would have been lots of long RPC calls– not long, fast– but RPC calls,
which are still expensive, what I do by bringing the trades
and all the scenarios together, I just do a really dumb
massive fan-out with a loop. So what it does, it just
generates huge amounts of data tuples. So the data that I
need to run, which includes all of the values
from the yield curves, for example, which is an array
of two and a half thousand, in this case, doubles– I generate all these
values and just let Dataflow do what
it’s good at, which is move lots of data around for
me and distribute that work. So here we have
the trades that are being read coming
in from the side and being added in
as a side input. So if we wanted just to
explain a little bit more on what side inputs are– so if you think of kind
of like a broadcast join, if your data is of
a certain size that will fit in the environment,
rather than do a shuffle join where we’re joining all
the elements together, you just take this– it’s like having a smaller table. We take that smaller table and just make it available to all the environments. And then when my scenarios are coming in, because all that data’s there, they’re just looping through. And any trades that happen to match the currency that I’m looking for come out as a fan-out of the scenarios.
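A minimal sketch of that side-input pattern in the Beam Java SDK (String element types stand in for the real trade and scenario objects, and the currency matching is omitted):

```java
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

/** Sketch: trades broadcast as a side input, scenarios fanned out against them. */
class FanOutScenarios {
  static PCollection<KV<String, String>> expand(
      PCollection<String> scenarios, PCollection<String> trades) {
    // Broadcast the (smaller) trade set to every worker, like a broadcast join.
    final PCollectionView<List<String>> tradesView = trades.apply(View.asList());

    return scenarios.apply("FanOut", ParDo.of(new DoFn<String, KV<String, String>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        String scenario = c.element();
        // Loop over every trade for every scenario: a deliberately dumb, massive fan-out.
        for (String trade : c.sideInput(tradesView)) {
          c.output(KV.of(trade, scenario));
        }
      }
    }).withSideInputs(tradesView));
  }
}
```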
So here we have the scenarios being generated. I actually do a little bit of optimization, in that for each piece of work, we actually batch up 100 scenarios to be processed, because it’s a bit more efficient in the next stage. Here we have– this
is Neil’s code. So this is C++ code
running in Dataflow. Neil has written a– I’ll let him describe it,
because I have no idea. NEIL BOSTON: Well, I guess
basically, I took the– I don’t know how familiar
you are with options pricing. There’s nothing
particularly smart about it. Black and Scholes
wrote a paper in 1973 on how to price options. They managed to show that
it was like the diffusion equation, which is like a
partial differential equation. So all I did is
write a thing called finite difference for a way of
discretizing and solving that. But ultimately, all I’m really doing in the C++ is sparse matrix solves, stepped over time. So it’s nothing particularly smart if you look at the code. It’s probably the least smart piece of the whole ecosystem, I guess.
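For reference, the 1973 Black-Scholes result being described is the partial differential equation below for an option value V(S, t), with S the spot, sigma the volatility, and r the risk-free rate; a finite-difference scheme discretizes it on a grid in S and t and steps it through time, which is where the sparse matrix solves come from.

```latex
\frac{\partial V}{\partial t}
  + \tfrac{1}{2}\sigma^{2} S^{2} \frac{\partial^{2} V}{\partial S^{2}}
  + r S \frac{\partial V}{\partial S}
  - r V = 0
```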
REZA ROKNI: Well, that’s what he thinks. I tried to read it. NEIL BOSTON: I also took
16 days as well to run it. These were the biggest
out of all that as well. REZA ROKNI: Well, that’s where
the real work happens here, right. And this was a good– we use that function
as a good indicator of what
a real thing would look like, because it’s doing real work. And this is where
most of the work is happening for
this environment. So up until now,
we’ve been doing ETL. We’ve been setting
up all the scenarios. I’ve been doing these
joins and distributing this massive fan-out
of work, which generates, as we can see from
the output of this, 847,000 results. Now what we do next is we
go to push this information into BigQuery. And if you’ll notice,
the number of elements multiplies quite a bit. What we do is, the code
that Neil has written, he’s got a spot index,
which is actually the thing we really want. But he also has an array of
information, which is the– NEIL BOSTON: Yeah,
so basically, when I do the matrix calculation,
which iterate over time to some answer, you get
basically a vector back, based against
different spot levels. But I actually want
the one in the middle, because that’s the
current spot level. REZA ROKNI: So that’s
the thing we wanted, and that’s the thing we
should send to BigQuery. However, we want to reco– one of the advantages of being
able to use this data platform is at each stage
of the pipeline, I can output results and
go back and look at them. Because I don’t care
about the IO, right. It’s just, Bigtable
will deal with it. So what we do here
is, I’m actually taking all of those
values as well and dumping them into BigQuery. Now I did that so you guys
can see it in BigQuery. In a real production
system, the spot index is what we take to BigQuery,
because that’s what you’re going to do analysis on. The other values I’d
write to Bigtable. So if you wanted to go back
and check your results, you just go to Bigtable. And then Bigtable, you
can do really fast index and range scan lookups. So here I’m writing 14 billion rows of information to BigQuery at the end of this.
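A sketch of what that final sink looks like in the Beam Java SDK (the table and column names are assumptions, not the real schema):

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.values.PCollection;

/** Sketch of the final sink: pushing result rows straight into BigQuery from the pipeline. */
class WriteResultsToBigQuery {
  static final TableSchema SCHEMA = new TableSchema().setFields(Arrays.asList(
      new TableFieldSchema().setName("trade_id").setType("INTEGER"),
      new TableFieldSchema().setName("scenario_id").setType("INTEGER"),
      new TableFieldSchema().setName("spot_pv").setType("FLOAT")));   // hypothetical columns

  static void write(PCollection<TableRow> results) {
    results.apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:grid_results.option_scenarios")           // hypothetical table
            .withSchema(SCHEMA)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
  }
}
```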
So this is the pipeline running. It’s now finished its work. I know it’s finished because– REUVEN LAX: Actually, it looks a
bit closer to 15 billion to me, Reza, 14.9. REZA ROKNI: OK, OK, he’s
the engineer here, right? NEIL BOSTON: And
the mathematician. REZA ROKNI: So– and
the mathematician. I’m outnumbered. So I’m just going
to show the results as they landed in BigQuery. So this is another point: when this finished, there was no file that I then needed to load
into my analysis system. When this finished,
it’s all done. I can immediately run
a query with BigQuery against that data. So this is actually
using a different table, but just to show the principle
here, we have grid results. The details is 14 billion
rows of information, which actually equates
to 1.6 terabytes of data. That’s been loaded
in this table. And what I’m going to do–
not particularly smart– I’m just going to say, given
the trade number was 8876, give me the max spot index
across all the scenarios. So when I run this, it’s actually going through and doing a query against all 14 billion rows of information, and I have my output.
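For illustration, the same kind of ad-hoc query issued through the BigQuery Java client (the table and column names are assumptions, not the real schema):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

/** Sketch: max spot index for one trade across every scenario. */
public class MaxSpotForTrade {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // Hypothetical table and column names.
    String sql =
        "SELECT MAX(spot_index) AS max_spot_index "
      + "FROM `my-project.grid_results.option_scenarios` "
      + "WHERE trade_id = 8876";
    TableResult result =
        bigquery.query(QueryJobConfiguration.newBuilder(sql).setUseLegacySql(false).build());
    for (FieldValueList row : result.iterateAll()) {
      System.out.println("max spot index: " + row.get("max_spot_index").getDoubleValue());
    }
  }
}
```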
So right at the end, you can actually start exploring your dataset, or generating the reports that ultimately– we’re not
doing this all for fun, right. The bank is actually
running this and want reports
at the end of it. We can immediately start
running the reports. Another thing that
you can explore with and we started to do– so with Datalab, which is a
notebook that we can do Python coding against the data
sources, because all the intermediate stages
of the calculations are being stored in
BigQuery, what I did is just did some exploration
with a notebook and I’ve stored
it here as a PDF. Essentially, what
we’re doing is looking at all of the information
that’s available for the whole pipeline. So here, this is
some underlying data that I had from the Thomson
Reuters environment datasets. I’ve graphed some
of this information. This is what some of
the underlying data is. I actually had a look at
the yield curve creation to make sure that
stuff’s looking OK. And one of the things– so this is the asset and the
numeraire of the things that I will send on to the C++ code. Both are being graphed. One of the interesting things here is– this is the numeraire. And if you notice, most
of them are at the top. There’s one weird one. And this was an accident by me. One of the scenarios,
I typed 0.5 as the amount I want things to
shift instead of 0.05, right. And by looking at
this, I immediately know something’s weird, right. Something’s wrong. And this is, again, something
that the data scientist would then be able to
use throughout the output of the results. Now I’m just going to
go right to the bottom and let Neil start
explaining this piece. Essentially, this is
looking at the result sets. And Neil was
interested in trying to start looking at
how we can show this in three dimensional space. And so Neil? NEIL BOSTON: So
I guess the thing that I was trying to get to was
that I’m a big believer in just looking at a set of results
as well, to see just things that you can see visually. So when we went down the line
of running the experiment, then I asked Reza
if there was stuff we could put on top
to actually start visualizing what
this thing looks like versus various outputs. And when we looked at this,
I was interested to see– we were adding up the PVs,
I guess, the Present Value, of a lot of these trades. So it seemed like they were
relatively canceling out. So actually, when we
looked at the portfolio, we realized we hadn’t randomly
created buys and sells. We had done roughly half
buys and half sells. So they were tending
to kind of net out in some option
sense, which is kind of interesting
in itself and we thought about the
portfolio we’d created. So some of the talks
that I actually saw yesterday with some
other groups, there were some– I guess ultimately, they
really resonated with me. I was trying to get to almost
like having a single inquiry box where I could type a
set of asks of the data. Is this a correlation
against this? Does this looks like
[INAUDIBLE] volatility-wise? How does this look
against this over time? It seemed to me
that it was always, this was kind of evolving, a
visual representation of almost like a single inquiry box. And I saw a couple
of demos yesterday where they had that style of
approach, and a little bit of, I guess, almost like an
NLP layer over the top as well. So you could type in quite
market specific language, and it would give you back a
set of results around that. So I thought as an
inquiry, marrying that to this style
of experimentation output would be
quite a strong thing. So this was really
just a very basic– I tend to plot
time, against spot, against PV, against other
things in a very basic way, just to start looking at what
the nature of the portfolio is. But then given the couple
of days I’ve been here, there seem to be some–
already some evolutions that I was thinking of
around how we could start analyzing the data in
a more effective way, and asking questions
as well in a much more fluid, experimental
sense as well. REZA ROKNI: Thanks, Neil. Could we go back to
the slides, please? Go back to the slides, please? Thank you. So just in terms of creating
the pipeline, in that one, we showed the
preprocessing stages, the running of the stages,
the post processing stages. And the semi joke here is
one pipeline to rule them all, right. We don’t create
separate processes for each one of these things. We just did one, and it went
from the beginning raw data all the way to the output,
including analysis. And one of the other– so the other thing
that we wanted to do was one of Neil’s first wishes,
is I want to experiment. So as we now have this DAG
to be able to do processing, the other thing we wanted
to do is break apart some of the larger, more
monolithic quant code and see if there are
linear components in there that we could distribute again. And Neil and– I’ll let Neil
and Reuven do some of this. NEIL BOSTON: Yeah, so
again, it was something that I guess I should have thought about over my career a little bit more intensively. When we go into
this sense that we have to stop worrying so
much about the ETL piece that we were running here, and
the simulation and ecosystem piece, we could
then start really thinking about where
there are parts, certainly within the code base
we were writing, that we could actually look at in
a more linear way from a parallel
processing perspective. And again, even when
we were starting to look at that, the
solution that we’ve just been talking about, that
even had other implications that I hadn’t expected, where
we started looking at the way that we could break these
problems apart from what we’ve seen in the past,
just as someone writing some quantitative code. REUVEN LAX: This
is a common thing we’ve seen as people move
from their old systems to new systems, is
people have code that was written assuming
I’m running on one machine. And it would be highly
optimized in ways that make it very difficult to
parallelize, like oh, sometimes I’ll spin up threads and
then join on these threads, or interleave many
different functions throughout my calculation. So it sometimes requires
people to rethink how they wrote the code, or
maybe on a mathematical sense, look at your functions,
say, can these functions be [? linearalized ?]
as a combination of multiple other
functions, so that we can parallelize all
these things out and get a finer result later? REZA ROKNI: And so what
that means is that when you’re drawing out the graphs,
the more access I, as a data engineer, have to the
various functions and being able to
split them apart, the more I can try
and think of ways of optimizing the flow and the
parallelism within the system. So the more I can have
separate functions that I can pass
data to and from, the easier this becomes for me. And this gets another–
an interesting point that we found very
useful when we were doing this
project together, and that’s how we
talk to each other. So as you could tell, I
have no idea half the time what Neil’s talking about
when he’s doing the maths. It’s not my area, and
neither is the finance area. However, when we break
the problem down, for me as that
engineer, to bytes– so he can tell me, I need
these bytes to move from here, and I need to run this
function against these bytes. This output needs to move here. When you start breaking
this down into a way that I understand, it
becomes very easy for us to start communicating. And one of the things that
we used as a common language is protobufs. So we started talking
in terms– so protobufs, I’ll let Reuven do
what protobufs are. REUVEN LAX: Yeah,
a protocol buffer, it’s a common structured
format for data that has been used at Google
since probably around 2001, 2002. So it’s been used at
Google for quite a while. It’s very similar to other
forms of structured data. So some people use Avro. Some people use Kryo. Some people use Thrift. There are other forms
of structured data. Protobufs are an optimized one
that Google has traditionally used and is also open sourced. REZA ROKNI: And that
made it very easy for me to start
working with Neil, because I could just give
him a definition of how I’m going to pass things
through, or he could give me the definition. And then I know
what I need to do in terms of doing this process. And the other thing it
opens up is this ability now to actually break
apart the functions. Because if all you’re passing
as the parameter to the function is a protobuf, and
within your code, you start just using protobufs
to pass the information around, when you break apart
the big pieces of code, it becomes very easy to
plug it into the DAG, because all I’m doing is sending protobufs across those nodes, in that data movement that we talked about.
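As a purely illustrative sketch (the message and field names are hypothetical, not the schema actually used), the shared protobuf contract between the quant code and the data engineering code might look something like this:

```proto
syntax = "proto3";

package grid;

// Hypothetical contract: everything a single pricing task needs travels as one message.
message PricingRequest {
  int64 trade_id = 1;
  string scenario_id = 2;
  repeated double curve_points = 3;  // e.g. ~2,500 yield curve values for this scenario
}

message PricingResult {
  int64 trade_id = 1;
  string scenario_id = 2;
  double spot_pv = 3;           // the value at the current spot level
  repeated double pv_grid = 4;  // the full vector across spot levels
}
```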
So with that, next steps: as we said, running C++ on BEAM is not standard, so we’ve got an example that allows you to do some of that. I’ll be updating that code a
little bit more after this. Please have a go, and thank
you very much for your time. [APPLAUSE] NEIL BOSTON: Thank you. [MUSIC PLAYING]
