Cloud computing for Higher Education (Google Cloud Next ’17)

BRAM BOUT: All right,
good afternoon, everybody. My name is Bram Bout, and I lead the
Google for Education team. And I’d like to
talk to you today a bit more about
what we’re doing in higher education,
for higher education, with higher education to
create more Brittanys, and more amazing projects like the
one you just saw up there. We are going to
talk a little bit about the cloud for
universities– what it means, what it can do for you. Then do a deeper
dive into genomics, one area in
particular where we’ve seen tremendous use and power
of the cloud in higher education research. We’ll do a few breakouts to get
your input and your thoughts on what we can do better, what
we can do to work with you. And then we’ll wrap it
up in about an hour. So the Cloud for
Higher Ed, we’ve been in the cloud for
a long time as Google. We were born in the cloud, as we
would like to say, back in 1999 or so. And we’ve been operating
as a cloud company from its very
inception ever since. So in many ways, we’ve
been investing in the cloud from the very beginning. But it’s only
relatively recently that we’ve begun to take
all that engineering and all that power out into the
market and give access to companies and academia
to do something with that. If you had an opportunity to
see the keynote yesterday, I think you heard Eric talk
about the opportunities ahead of us. You heard Fei-Fei Li talk about
the power and the potential of artificial intelligence that
can be powered by the cloud. And what really excites me
about that is that to me, the academic world
plays a tremendous role in making that vision
happen for all of us. There’s a lot of work to be done
to make the cloud work for all of us, to democratize it, and
to really deliver on the promise that it inherently has. So we’re very committed. I hope you walked
away with that message at the end of these
couple of days. We already have more than
a billion end users that are using the Google Cloud. You may not know or
see this directly, but those are companies that
are using the cloud to power services for their users. We’ve invested about $30 billion
over the past three years in Capex to build out
the cloud network, which is a substantial
multiple actually of what our competitors
are doing in this space. And we have a global network. About six regions right now,
and adding nine more this year. That gives you, again, an
idea of how aggressively we are expanding and
investing in this space. And we’re investing heavily in
the networking aspect of this, too. So when you are operating
in a Google Cloud, you essentially are
everywhere worldwide through Google’s
own private network. As an example, we just laid
another 60-terabit-per-second cable between the United States
and Japan to build out that network even further. And we will continue to
invest in that going forward. And then lastly, specifically
for the higher education community, we just recently
in 2016 joined Internet2– obviously a well-known
organization of research universities in
the United States. And we are working
with them on a number of what I think are going to
be very interesting projects and ideas to bring the
Google Cloud more easily and more powerfully to
the higher education community in the United States. And we’re working on similar
types of partnerships internationally to make
that happen as well. So what does this
all mean for you? If you are in higher
education, if you’re a university or a
research institute, how can you use this
to your advantage? Well, we see essentially
three ways that you can use it. The first way is in the
teaching environment, just using it to teach the
next generation of students, where cloud computing
arguably should be a part of every
curriculum going forward, because it is going to be
so pervasive in the future workforce that
anyone who comes out of school in a couple
of years really ought to know about the
cloud, about data science and what it can do for
you to make projects, products, anything better. The second one is
around administration– how you run your university. And that can be the more basic
infrastructure type things, but it can also be how do you
use the data that you collect, the data that you have on
the variety of interactions you have with your
students, your faculty, your administrative staff. How do you use
the data to create insights to run your operation
better– to do new things? And lastly, of
course in research, we will talk much
more about that. So for education, it’s
really about helping students build what’s next. You saw a great
example of Brittany just now in the video
at the start of this. And as I said, this
is something that we should do much more often. Since
about the summer of last year, we have had a program in
place for education grants for Google Cloud Platform. So this essentially
allows any faculty to apply for credits
for their students. Currently, it’s limited or
restricted, if you like, to faculty of universities
that are providing a computer science type of education. We’re kind of looking
at expanding that. And up until now, it has been
restricted to the United States as well. But I’m very happy to
announce that as of this week, we’re opening that
up to most of Europe, as well as a few
countries outside of that. And we will continue to
expand that globally. So the list of eligible
countries is here. And if you’re from
any one of those, please take a look at
cloud.google.com/edu and see if you can get your
folks, your faculty, interested in signing
up for credits. Get some exposure,
get some experience with what the cloud can do for
you in the computer science environment, and
obviously beyond that. An example of how this is being
used to make it kind of come to life a little bit is in
California State University in San Bernardino. One group of students
there is using the cloud to really understand networking. So they’re using the cloud to
configure networks, firewalls, different routing
mechanisms, creating simulated denial-of-service
attacks to see how networks react to this and so forth. And they can do all of that just
sitting behind a computer using the cloud as their,
essentially, virtual network that they’re running
to see and bring a lot of what they are learning
in school into practice.
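
To make that concrete, here is a minimal sketch of the kind of exercise those students run, using the Compute Engine API through the google-api-python-client library. The project name, rule name, and source range here are placeholders, not the actual class setup:

    # A minimal sketch, assuming google-api-python-client is installed and
    # application default credentials are configured; all names are placeholders.
    from googleapiclient import discovery

    compute = discovery.build('compute', 'v1')

    # Define a firewall rule that allows inbound SSH from a campus range.
    firewall_body = {
        'name': 'allow-ssh-from-campus',
        'network': 'global/networks/default',
        'sourceRanges': ['203.0.113.0/24'],  # example range (RFC 5737)
        'allowed': [{'IPProtocol': 'tcp', 'ports': ['22']}],
    }

    operation = compute.firewalls().insert(
        project='my-class-project', body=firewall_body).execute()
    print(operation['name'], operation['status'])

Another example– you see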
the pictures off here– is for students that are not
actually computer science students, but
engineering students that are building a little Rover
with a Raspberry Pi in it, with some Python scripts
that is talking to the cloud. And it allows them to
essentially remotely control that device using a computer
anywhere on campus– just a really great
interesting project to illustrate the
power of the cloud and what it can do for you.
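
One plausible way a rover like that could report back to the cloud is through Cloud Pub/Sub; this is a hedged sketch rather than the students’ actual code, and the project, topic, and payload are made up:

    # A minimal sketch, assuming the google-cloud-pubsub library is installed
    # on the Raspberry Pi and credentials are configured; names are placeholders.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path('my-class-project', 'rover-telemetry')

    # Publish a small telemetry reading; Pub/Sub payloads are byte strings.
    future = publisher.publish(topic_path, b'heading=90;speed=0.4')
    print('Published message ID:', future.result())

Then on the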
administrative side, this is all about being
more efficient, and being more
effective, quite frankly, with your infrastructure. We have a massive
number of products out there that will
allow you to supplement, or if you want to replace
parts of your infrastructure, in a very efficient manner– whether that is basic things
like storage or disaster recovery approaches,
or whether that’s a more advanced
use of computing, like, for instance, burst capacity
for your high performance computing network. All of those things
are possible. And I think what we
are very proud of is we have a very cost-effective
and efficient way to do that. We also want to
make sure that you think about what the
cloud can do for you to make operations run better. How can you go to
the next frontier? How can you use the data
you have to do new things? And one of the topics
that has come up a lot in our conversations
with universities is, for instance, the
topic of student success. Like, how can you improve the
success rate of your students? How can you intervene
earlier in the process if you feel that
they’re beginning to go off the rails a
little bit and catch them before they truly fall off? So there’s a lot of
opportunities to do that, and we have a number of
products to work with that. On the efficiency side, we
have very flexible billing arrangements. We actually acquired a
company called Orbitera, if you are familiar with
that, a couple of months ago. They offer, essentially,
a billing overlay across multiple cloud
providers so you can have it all in one
place and really see who is spending what where
without necessarily having to get in the way as a
central IT department and sort of become the
bottleneck for everything that’s happening in
the university, which is from what we understand
pretty much impossible. We also offer a lot
more flexibility. We have, for instance,
customizable VMs. You don’t have to choose
two, four, or eight cores. You can choose five, if
that’s what you need. Any type of memory
configuration, bandwidth, and so forth– it’s all very
flexible, billed by minutes, so you don’t pay
in large chunks.
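
As an illustration of what a custom shape looks like in practice, here is a sketch of requesting a custom machine type through the Compute Engine API; the project, zone, and image are placeholders, and the vCPU and memory values are subject to the platform’s per-family constraints:

    # A minimal sketch, assuming google-api-python-client and default
    # credentials; identifiers are placeholders, not a real deployment.
    from googleapiclient import discovery

    compute = discovery.build('compute', 'v1')
    project, zone = 'my-project', 'us-central1-a'

    instance_body = {
        'name': 'research-vm-1',
        # A custom machine type: 6 vCPUs and 12 GB of memory (in MB).
        'machineType': 'zones/%s/machineTypes/custom-6-12288' % zone,
        'disks': [{
            'boot': True,
            'autoDelete': True,
            'initializeParams': {
                'sourceImage':
                    'projects/debian-cloud/global/images/family/debian-11',
            },
        }],
        'networkInterfaces': [{'network': 'global/networks/default'}],
    }

    operation = compute.instances().insert(
        project=project, zone=zone, body=instance_body).execute()
    print(operation['status'])

And we have automatic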
sustained use discounts, which means that you don’t
have to reserve an instance to get good pricing. If you use an instance
a lot, you simply get a better price,
because that’s what a sustained use
discount does for you– as well as future price drops. If we change our pricing,
and I’m sure we will, you actually benefit from
that immediately, regardless of the agreements
you have with us.
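
As a back-of-the-envelope illustration of how the sustained use discount worked as publicly documented around this time (the tiers and rates may have changed since): each quarter of the month is billed at a decreasing fraction of the base rate, which nets out to roughly a 30% discount for an instance that runs the whole month:

    # A rough sketch of the sustained-use discount math; the base rate here
    # is illustrative and the tier fractions are from the public docs of
    # this era, so treat the numbers as an example, not current pricing.
    BASE_HOURLY = 0.0475      # example per-hour rate in USD
    HOURS_IN_MONTH = 730

    tier_rates = [1.00, 0.80, 0.60, 0.40]   # per quarter of the month
    quarter = HOURS_IN_MONTH / 4.0

    discounted = sum(rate * BASE_HOURLY * quarter for rate in tier_rates)
    undiscounted = BASE_HOURLY * HOURS_IN_MONTH
    print('%.2f vs %.2f (%.0f%% discount)' % (
        discounted, undiscounted, 100 * (1 - discounted / undiscounted)))
    # -> roughly a 30% discount for a full month of usage.

On the Big Data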
side, we have a lot of tools that are
very powerful here– BigQuery, TensorFlow for
machine learning and new kinds of algorithms, particularly good
for structured and unstructured data. So combinations of
things that are typically difficult to manage in
a more classic database environment, that’s
an area where I think our models and
our products excel.
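
For a taste of what that looks like from Python, here is a minimal BigQuery sketch against one of the public sample datasets; it assumes the google-cloud-bigquery library and default credentials:

    # A minimal sketch: run a standard SQL query over a public dataset and
    # print the rows; BigQuery handles the parallelism behind the scenes.
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 5
    """
    for row in client.query(query).result():
        print(row.name, row.total)

Hopefully you’ve had a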
chance to learn a bit more about that in any
of the sessions here over the last
couple of days. And if you have
any questions, we can talk much more about
that a little bit later. Lastly, and perhaps
most excitingly, we want to give you the
power to change the future. We met with a large research
university a couple of weeks ago and we were talking
about the power of the cloud, and what can it do, and
how can we work with you. And they were
saying, yeah, we love this, because really we
want a partnership where we can change the world. We think there are things we
can do in the medical fields or in the field
even of humanities where we can have a dramatic
impact if we work together and bring the power of the
cloud to our research community to really do amazing things. Again, we have the basics
in terms of affordability and access, but we have
also very innovative tools around machine learning
and advanced APIs. And I think we are very
confident and proud that we have these kinds of
bleeding-edge innovations, and that we bring
them into production and make them
available to everyone as quickly as possible. So recent innovations
like TensorFlow, but also Cloud
Spanner, for instance, as a globally
consistent SQL database, are all things
that make it easier for researchers to use
the power of these tools very efficiently.
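
For instance, here is a minimal Cloud Spanner read in Python; the instance, database, and table names are placeholders, and it assumes the google-cloud-spanner client library:

    # A minimal sketch against an existing Spanner instance and database;
    # every identifier here is a placeholder.
    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance('my-instance').database('my-database')

    # Run a strongly consistent SQL read.
    with database.snapshot() as snapshot:
        rows = snapshot.execute_sql(
            'SELECT SampleId, Status FROM Experiments LIMIT 10')
        for row in rows:
            print(row)

And lastly, you can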
start where you are. We don’t advocate
the replacement of what you have, and just
kind of go into the cloud and leave everything
else behind. We really want to
work with you to see where we can begin to work with
you to expand what you can do. Again, overflow for your high
performance computing center. Or even just like a weekend-type
burst for something very specific– something we did
with CERN a while back– can be extremely powerful,
and actually surprisingly affordable. We have a ton of products. I’m obviously not going to
go through each of those. But to just give
you a feel for that, and if you walk
around here and have been able to attend
some of these sessions, I’m sure you’ve learned
about some of these. We have a ton of products
in data and analytics. We have products in
application development, all the way from
completely managed– you just give us
the source code– all the way back to you do it
all yourself, and pretty much everything in between. We have lots of tools and
products around infrastructure and operations to make your life
easier, whether that’s logging, or log analysis, or
billing and accounting. So there’s a lot of places
where you can start. And I think one of the
challenges in some ways is to figure out
exactly how we’re going to go about that, how
we’re going to work with you. Now for many of you, a lot of
these tools may appear new. And to some extent, they
are, in that we’ve only made them available externally
relatively recently. But take something
like MapReduce, for instance, which was
essentially developed by Google engineers in 2004,
a very long time ago in the form of white papers,
and has been in production within Google since 2004, that
sort of spun off into Hadoop– an open-source platform
for that particular thing. The same thing with Bigtable. BigQuery is internally
used at Google, where it’s called Dremel, and has
been there since 2008, and is still heavily used today. Even the Cloud Spanner that was
just announced very recently has been in production
in Google since 2012. So the benefits you have here
is that you have products that are cutting-edge, really
state of the art kind of ways to think about some of
the big computer science problems of our time, but
that have been tested, improved, and hardened
at massive scale within Google– because
we process more than a few bytes of data. And you get the benefit of that
in your production environment. The moment you switch it
on, you have the knowledge that this has been
tried and tested in a very harsh
environment already for a substantial
amount of time. So that’s one of
the main things I think sets us apart from
some of our competitors, is that we have tremendous
innovation, particularly in the field of big
data and machine learning– and that innovation has
been tried and tested and will continue to
accelerate over time. One example of
that is with CERN. This is one of their
collider units, I think. I’m not sure exactly
what this is, but it looks very impressive. It’s what? Sorry? AUDIENCE: The particle accelerator. BRAM BOUT: There you go. So what we did with
them– obviously, these generate massive
amounts of data that need to be processed,
need to be visualized, need to be dealt
with in some way. So we’ve become a partner
of their high energy physics cloud. And we’re now helping them
and the researchers that want to use this data to process
this and to churn through this. And we’ve done experiments
and projects with them where we’ve literally spun up
100,000 cores over a period of, let’s say, a weekend, to
do a very rapid burst of data analysis that would
have taken months in any other environment,
quite frankly, to deliver these
kinds of results that we were able to do in
a very short period of time.
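
Mechanically, a burst like that is often driven by resizing a managed instance group up for the run and back down afterwards; this sketch is illustrative, not CERN’s actual setup, and all names and sizes are placeholders:

    # A minimal sketch of programmatically scaling a worker pool, assuming
    # google-api-python-client; this is not CERN's actual configuration.
    from googleapiclient import discovery

    compute = discovery.build('compute', 'v1')

    operation = compute.instanceGroupManagers().resize(
        project='my-project',
        zone='us-central1-a',
        instanceGroupManager='batch-workers',
        size=2000,               # placeholder burst size; scale back after
    ).execute()
    print(operation['status'])

And that’s the really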
exciting stuff. And there’s a lot more that
we can do together with them. The last example I’ll
give you of a partnership is the one with the
National Science Foundation. We just announced this
a couple of weeks back, a partnership for their Big
Data Sciences and Engineering project, where they issued
a call for proposals, essentially. And Google is one
of the providers. We’re committing $3 million
in credits for grants that the NSF is going to give
for research in this field, where you can use the Google
Cloud and all of its power, and all of the
products that we talked about to continue to develop
Big Data and the science behind this. So we’re very
excited about this. We’re very excited about working
with the higher ed community. We’ve obviously been
partnering with many of you for a very long time already,
largely in the area of G Suite. And we’re very excited about
opening this next chapter around the cloud. We see tremendous
opportunities for innovation, whether it’s in teaching,
whether it’s administration, whether it’s research,
there’s a lot of things we can do together. We’re very committed. I hope you walk away with
that from not just this talk, but from this entire conference. And we really see some fantastic
opportunities to work with you. So with that, I’ll
hand it over to Benny who’s going to talk
more about genomics. Thank you. BENNY AYALEW: Thank you
all for having me today. I just want to punch
through a number of things here in the next
20 or so minutes– everything ranging from
security and compliance to the reality of dealing
with both research and academic
environment, as well as clinical environments
as it pertains to the world of genomics. And then if the demo
gods are with me, I’m going to go ahead and
walk through a petabyte worth of cancer genomic
data sort of live, and tempt the network here. So we’ll have a bit of fun. So the very first thing I
wanted to sort of touch on is the typical question
that we get out from folks that we deal with
in academia and also in clinical environments. This is a bit of an
elephant in the room, so we hear this all the time. You’ve probably attended
your fair share of sessions around security and compliance
in all of the other breakouts that we’ve had, so I won’t
belabor those points. But it’s really
important to understand when we get into this space
of data ownership and data locality, any time you start to
deal with very sensitive data sets– phenotypic data sets,
things that are actually subject to HIPAA– folks quite
understandably get very concerned around what cloud providers
actually have to offer, and what sort of
capabilities and where the lines of demarcation are
for whose responsibility it is to store and
safeguard that data. So we’ll run through most
of the answers for these. But the main thing to
remember is that, number one, we do sign BAAs (business associate agreements). And we do, in fact, have some
examples for you on that. But very often when
you start to look at the regulatory and compliance
templates to which Google
holds itself accountable through external audits, it very often meets and
frequently exceeds the letter and the spirit of the
controls stipulated in a lot of these
data protections. So the very first
example I’ll talk through is the Clinical Genomics
Service at Stanford University. I’ve had the
tremendous privilege of being able to work
with the folks there over the last few years. And this was sort of a
collaboration that started out in very early days
for Google in terms of sort of foraying
into that space, and really resulted in this
ongoing and very virtuous collaboration between both
the clinical environments, as well as the School of
Medicine within Stanford. So just a quick recap around the
use cases in Dr. Merker’s lab and in that
environment, the pilot kicked off sort of in the
January 2014 time frame. And this is an area
where, obviously, Stanford being world-renowned, is
an area that actually folks turn to when other care
providers have been completely stumped in terms of
investigating very rare diseases, familial
diseases, things that have been very, very difficult to diagnose
and get to the root cause of. And obviously, genomics
has been shedding light into how some of
those can actually be a little better diagnosed
and a little better treated. So some of the challenges
that they actually ran into was not knowing what
sort of N’s– what case volumes– they were dealing with right off the bat. I mean, people can estimate
to see how many cases you might get in
a particular area, but imagine having to
build capacity and think about putting out the
capital expenditure with an uncertain amount
of investment in mind. The other aspect also
is that there’s just a tremendous amount of
dynamism between the folks who are in clinical environments
on the hospital side providing care, but also
very actively collaborate on the School of Medicine side. So a person with
a given Google identity might navigate across regulated
and controlled boundaries, and so how do you sort
of solve for that? And those were some
of the challenges that we overcame together. The bit around elasticity
was a significant problem to overcome. And of course,
obviously, for that, the Google Compute Engine
and the Cloud Platform were able to be a fit there. The other aspect also was
flexibility in the language in our terms of service. Of course, obviously,
the lawyer cats, as we like to call
them in Googleland, have a good bit
of back and forth in terms of the red
lines that typify the sensitivities around this. But we were just in a
fantastic partnership together and were able to forge
this alliance going forward in tackling both the
technical and the legal issues that were required for making
this service actually light up. And so for that, there
were a number of solutions that actually came into being. Obviously, Google Cloud Platform
solved the storage aspect of issues. Google Genomics is
a managed service that actually allows
for a genomic sequence analysis on our
platform where you don’t have to spin up
machines and download your favorite
bioinformatics tool, and then compile it, and then
run it, and go through all of that common back and forth. You just basically bring your
data and then invoke the API, and get your reads and variants
in that particular area.
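
As a rough sketch of what invoking that API looked like in the v1 surface of this era (the field names here are from memory, and the IDs and coordinates are placeholders):

    # A rough sketch against the Google Genomics v1 API via
    # google-api-python-client; treat the request shape as illustrative.
    from googleapiclient import discovery

    genomics = discovery.build('genomics', 'v1')

    # Search for variants in a region of chromosome 17.
    body = {
        'variantSetIds': ['my-variant-set-id'],   # placeholder
        'referenceName': 'chr17',
        'start': 41196311,
        'end': 41277499,
        'pageSize': 10,
    }
    response = genomics.variants().search(body=body).execute()
    for variant in response.get('variants', []):
        print(variant['referenceName'], variant['start'])

BigQuery obviously was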
also a huge part of this, because in the
world of genomics, which is a very sort of
staged analysis world– the tertiary
analysis, or the so-called “so what?”
phase, is an area where we’ve been hosting
a tremendous amount of public data sets that
we’ve made available that I’ll walk you through. And so this is basically a
conglomeration of both managed services, as well
as unique capability that exists on our platform
that we’re able to provide the solution for Stanford. And I reckon that
speaks for itself. From my good friend
Somalee Datta in terms of tackling
these problems, these are the typical
issues that we tend to see in
campus environments where, yes, the campus
cluster was sufficient for a number of analyses,
and will continue to be viable and useful for
the time period going forward. But when it starts to get to
burst analysis and being able to deal with
multiples of hundreds of terabytes’ and often into
the petabytes’ worth of space, that’s really the
space for cloud, and it’s really an end
world where the data sets and the analysis pipelines
are straddled between both environments, and it’s not an
or solution where everything shifts over from
one to the other. Another example of
what we were able to do around genomic research
was our collaboration with Autism Speaks. And here’s an area where
it’s actually multinational. There are cohorts of autistic
children and their families– quads and
quints of data sets– that actually are contained
in this environment that are used by researchers both
inside the US and, of course, abroad as well. And so what we were
able to do there was almost analogous
to the Stanford case where we provided both
solutions, as well as the open access to that
dataset for researchers to be able to come in
and make use of it. Another area that I’ll just
jump into in the spirit of the opening
video to the session is really around the
area of cancer research. The Institute for Systems
Biology have, in my mind, done a very exemplary job of
managing that really fine line between allowing an environment
for researchers and clinicians to come in with very
sensitive and private data, but then be able
to do that analysis in the context of public data
sets like the Cancer Genome Atlas, which, by the
way, in its natural form sits on a government site
nicely zipped up and weighs in at a petabyte. And so if you’re a
cancer researcher and you want to do that unit
of work, which is a tumor-normal pair analysis, your
step zero of that pipeline is go ahead and
download a petabyte. That may make you very unpopular
with some of your storage admins on campus. But we’ve actually made
that a very simple exercise to go through. And so I’m going to
go ahead and jump through a couple of live
demos just to give you a sense of what that
actually looks like and what that feels like. So this is the ISB site, and
it’s actually open to the public. And I think the part I was
saying that was very exemplary is really depicted in this
part of the web page done by the team over at ISB. And this really serves the
three key constituents, which are the principal
investigator, the computational
researcher who’s very, very comfortable
in Python, or R, or just in SQL, and then of course,
the algo developer who doesn’t want to see anything
but usually a command prompt. The red boxes
there are obviously signifying user data,
which is completely private to the individual
researcher, not for group sharing
or public sharing. And then the availability
of open access data, which is obviously
the Cancer Genome Atlas, and a number of
other data sets of interest and of high value for
analysis are signified there in the green. And then the orange
one is a peculiar one, and that actually
goes into the database of genotypes and phenotypes. It’s called dbGaP. And it’s essentially a
controlled access dataset that has sometimes identifiable
data intended for consumption by qualified researchers. So you can now
imagine a world where the storage is taken care of. And for qualified
researchers, they can now have sort
of a workbench area where they can
actually do their work and do their analysis
in the context of both their private data sets
that might actually belong to a patient
that they’re treating or a clinical trial that they
may be a part of, but also in context of larger public data
sets, as well as controlled-access datasets, without having
to go and change their work bench or retool in any way. So they can do that for storage,
they can do that with BigQuery for tertiary analysis,
and then they can do it using our
Google Genomics. And they’re able to
pull off this magic without having to give up
any level of flexibility to the lower left
that you see there, which is that world of
complete open pipelines where you can now bring in
your favorite Docker container and install this newest
version of an analysis tool and still be able to take
part in this virtuous cycle without having to
really break anything. So as we jump into that,
you can see the first thing that happens is it’ll prompt
you for your OAuth credentials. And it just opens up
with this blank space. And it has the notion
of a workbook concept. And that workbook can use all
sorts of different analysis types and different plot types
that you can now put into it. But just as interestingly,
the business of creating cohorts from
very large data sets is a unit of work
in cancer research. So don’t blink for this
part, because we’re going to click this bit,
and what you’ll see is– OK, that just
happened way too fast. I didn’t even get
a spinner on it. That just loaded and
visualized the entire TCGA all on a single page,
which is extraordinary, because this sort of passes
the [INAUDIBLE] test. If you can get a CS guy to walk
through cancer genomic data and make some sense of
it, and navigate it, then I think it’s met
the usability test. So if I now want to
go ahead and create a cohort of, say, all females– and I’m interactively
walking through this and it’s just spinning it
up as it’s going along– whose cancer was diagnosed
between those age ranges and who are smokers, as
just a simple example. And now I’ve whittled this down
to 115 smokers who are female and between the
ages of 10 and 39. Now I can go ahead and
walk through and figure out if these cohorts were
part of any given study– which of them
are alive or dead, sample types, where the
tissue was actually picked up. And I can now go in and
save this as a cohort, and then share it with
another collaborator.
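
Under the hood, a cohort filter like that comes down to a query over the clinical tables; here is a sketch of the equivalent BigQuery call, where the ISB-CGC table and column names are illustrative rather than exact:

    # A sketch of the kind of cohort query this interface generates,
    # assuming google-cloud-bigquery; table and column names are illustrative.
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT case_barcode, age_at_diagnosis
        FROM `isb-cgc.TCGA_bioclin_v0.Clinical`    -- illustrative table
        WHERE gender = 'FEMALE'
          AND age_at_diagnosis BETWEEN 10 AND 39
          AND tobacco_smoking_history IS NOT NULL  -- illustrative filter
    """
    for row in client.query(query).result():
        print(row.case_barcode, row.age_at_diagnosis)

One of the other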
things also is that you have some additional data,
like gene expression and mRNA, and all sorts of
things that are clearly of interest to people who
would work with this data over and over. So now I’ll go ahead and choose
to save this as a cohort. And I’ll just say– OK, e-mails. Maybe if I switched off the
search page, that might work. Hang on here. I’ll just type the
word test in there. So once this cohort has now
been created, what I can do is go ahead and share it
with another collaborator. And say, hey, for that clinical
trial that you’re working with, how does this particular cohort
look like in your analysis? The other aspect also
is if I’ve got multiples of these I can now
start to do set operations on these–
not in, or complement, or intersection, and so on. So these are units
of work that would have required lots and lots
of coding, and lots and lots of data wrangling, which
are now made very, very interactive and useful. And then the part that
I was mentioning around that dbGaP credential is that
for the same cohort that I just now created, if I am
a qualified researcher in an academic environment, I
can now go in and put my dbGaP details into this
page and authenticate into that secure environment. And now I can get additional
data brought back, additional results brought
back, from my same query. I don’t need to go and
recreate it all over again. It’s a very, very
seamless integration. So those are just a
couple of examples I wanted to walk you through. We’ve gone ahead and
created on Google Genomics a tremendous amount
of content intended for consumption by researchers
and folks in academia. There’s our GitHub page. I meant to actually
go to this page. The very first item
that you’ll see– the keywords are Google
Genomics cookbook– is this very rich and
continuingly growing data set that,
again, really removes the friction out of having to
do a bunch of heavy lifting and copying and duplicating and
localizing of data to just get the work done of
continuing the research, rather than dealing
with the data wrangling. So those are just a
couple of examples I wanted to go ahead
and walk you through. And obviously, within research
and collaborative environments, notebooks are a very popular
item and environment. So one thing I wanted to just
sort of quickly leave you with is that we also have Google
Cloud Datalab, which is essentially hosted Jupyter. And it allows folks to be
able to connect naturally into BigQuery
and everything else in our environments,
utilizing this environment.
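
As a parting sketch, a notebook cell in that kind of environment can pull query results straight into a DataFrame for interactive work; this uses the google-cloud-bigquery client (with pandas installed) and an illustrative public dataset, and runs in Datalab or any Jupyter kernel:

    # A minimal notebook-style sketch: query BigQuery and land the results
    # in a pandas DataFrame for plotting or further analysis.
    from google.cloud import bigquery

    client = bigquery.Client()
    df = client.query("""
        SELECT year, COUNT(*) AS births
        FROM `bigquery-public-data.samples.natality`
        GROUP BY year
        ORDER BY year
    """).result().to_dataframe()

    print(df.head())

So that’s my piece. Thank you.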
