“In Search of Impact” – Dr. Francine Berman

I’ll try not to stand there,
because I’m only this high, so you can’t really
see anything. And I want to tell
you what a pleasure it is to be here, not merely
for the really important focus of what we’re
talking about today, but I want to tell you that
the people in this room have been inspirations
for decades. And I know there’s
a lot to be done, but so much of the work that’s
been done in our community has been done by all of you. And so it’s a thrill to get
to be with you, to stand before you, and from
the bottom of my heart and on behalf of, I’m sure,
so many in the community, I want to thank you
for all you’ve done. It’s really a big deal,
and thanks very much. So let’s get started,
and I thought what I might bring
to this meeting today is a perspective of the user. That’s been my whole
career as a researcher, and even when I’ve been
in leadership positions, it’s the user community
that I really represented. So I thought it might be fun
to look at metrics of success for repositories and the
issues around repositories, from the user perspective. Now let me start with a
slide that’s way too simple, but it’s good to frame it. Why do we care
about repositories in the first place? And from the user perspective,
what we really care about is the purple stuff. So we care about innovation. Innovation drives what we do. That’s what we’re
rewarded for, that’s why we became researchers
in the first place. And in the 21st
century, data drives innovation in so many
areas, whether you’re thinking about astronomy,
or space, or biology, or the spread of the Zika virus,
or whatever area you’re in. These are data driven areas,
and our data needs a home. Data that is not stored
somewhere, taken care of, may cease to exist. And so repositories really
provide an incredibly important home for our data, services
for our users, and really are the foundation on
which this innovation goes. So how might we think
about all of that? And from a user’s perspective,
you can think of repositories as having a really high
goal, and that goal is, to the first approximation,
that data of value is available for access and
use now and in the future. Yes, that’s roughly what
you would like to see. And so, today I thought I would
focus on three of these areas. What do we mean
by data of value, what do we mean by access and
use, what kind of services are useful to communities,
and what do we mean by now and in the
future, so stewardship and preservation,
how does that matter. So let’s start with data value. Obviously, one of the
most important pieces of data of value is
digital images of space and all around the blue planet. So what data is of value? And really, the
right question, when you think about what data
is of value, is to whom? So as a society, we very much
care about historical records. We care about records that we
need to keep, by legislation. We care about cultural assets,
like the Shoah Collection of Holocaust Survivor Testimony,
that you cannot reproduce if you lost that data. Now we’re collecting
tweets and e-mails, and all of this kind of stuff. So that’s really important
to us as a society and you can say, arguably, that’s
in the public good. There’s a lot of data that’s
really important to me, my tax records, pictures
of my kids’ graduations– those are some of my
kids in those pictures– etc. Unless my kids become really
famous or in the public eye, no one is going to be interested
in keeping those pictures. But I’m very interested
in keeping those pictures, and so in some sense,
I take care of that. That’s my responsibility. As a society, we try to
take care of the things that are important to us. The research community is when
we start getting into trouble. So what’s important to
the research community? And if you think about value– and these are types of values
that researchers told us in the Stewardship Gap Project,
that Myron Gutmann and Jeremy York, and myself, and
a number of advisors, including Trisha, who’s
here, were thinking about, is what is the value
to researchers? And it turns out that it’s a
number of different things, not just one thing. So data’s valuable to you
for your own research. Data’s valuable to you if it’s
in a highly cited publication. Data is valuable
if your funding agency is making you make
it available to the public. Data is valuable for you if it
underlies assessment reports. And so there’s a
variety of reasons that data might be
valuable to you, and you might want somebody to
help you with the stewardship and preservation of that data. Now if we look around,
value is really in the eye of the beholder. There are reference
collections like– [INAUDIBLE] collection,
that Ava is doing, and PDB, that Helen was involved
in, and a number of people are involved in. There are irreplaceable
collections, those that are really valuable to us. And then, this is just
the tip of the iceberg on a number of groups from the
Research Data Alliance, all who have valuable data,
in various ways, and are storing them and using
them in various ways, as well. And you see, there’s just a
wide spectrum of groups from [? Rice ?] data, to
[? OnPharm ?] data, to health data, to linguistics
data, to marine data, to small, unmanned aircraft
systems data, etc. All of those translate
into collections and data sets that are really
important for the research of individuals, projects,
communities, etc. So your mileage is
varied, but all of these are considered valuable. Now one of the things that, in
the Stewardship Gap Project– what we did was we asked
roughly under 50 respondents, 46 respondents. They had about 120 data sets,
from nearly 80 domain areas. They were mostly
NSF and NIH data sets, which the
data sets are NAFA, and NAH, and Sloan, and the
Bureau of Reclamation, etc. And we did a rather
in-depth survey of them to try to understand
the relationship between their interaction with
the data, and their stewardship choices. And I’ll show you
some more graphs later on, as they’re
useful in the talk. But there were a number
of reasons they gave us– these were not things
we gave them, but they gave us– for
why the data was valuable, including use by others
in their own research, and difficult to recreate,
they had to store it by mandate, or audit, etc. And the colors here
correspond to how long they thought the data was valuable. So blue is indefinite. A lot of people
thought their data was going to be indefinitely
valuable, forever. A lot of people thought it
was greater than 10 years, or less than 10 years, etc. But one of the things
this should tell you is that forever is a hard word. It’s a hard word to fund,
it’s a hard word to justify. But a lot of researchers really
have a more sophisticated sense of the timeliness of their
data, and as time goes on, these things change. So this graph will evolve
five years from now, etc. Now one question
you might ask, which is really the key
question if you’re looking at it from an
organizational perspective, is what value of data is worth
what amount of stewardship? And in that sense, we’re
trying to trade off what we need the data for, how
much we’re going to invest, and how long we’re
going to invest it. And so that’s a
really key question. It’s the question that
every organization that tries to host or steward
data has to deal with. And typically, our
discussions in the community of value and investment
are largely decoupled. So oftentimes we talk
about stewardship of data, we talk about support of data,
but we’re not coupling it with this notion of valuable. It’s like everything’s
valuable or nothing’s valuable, it’s valuable forever or
it’s valuable not forever. And the fact is there’s a lot of
sophistication in those terms, and it gives us
something to grab onto as we start
thinking about, well, what can we actually
do about it? One simple way of
thinking about it is– and I apologize if this
looks like it’s small font, so those of you at
the end of the table, just ask me what
these things are– is that you start
with a data collection and imagine that we have some
community value criteria that tell us where it’s going to go. And where it’s
going to go means, is somebody else going to
take care of it, or am I going to
take care of it? And so, does it meet
those criteria or not? And then over time,
because you might want to periodically
reassess things, has it gotten more popular,
has it gotten less popular, is it more valuable to the
community in some way, shape, or form. Then you want to
reassess it, and maybe change where it gets hosted
or where it gets stewarded. So in some sense, if we knew
what the criteria were for where it goes, and what the bar was,
how high does it have to be, and how low does it have to
be, to be in the [? series ?] boxes, and what these time
periods are; it might give us a way towards
economic models that were more manageable
than forever or not forever, completely
valuable or not valuable at all. So we have to get
our heads around it to be able to do something
that’s actually actionable. So what metrics are of value
to users and user communities? Then you can think about– there
are the popularity metrics, and that’s data that’s
associated with highly cited publications, and data
that might be downloaded a lot of times, or they
have a lot of hits, or they have other
distinct users, or they have a lot
of return users, etc. You can think about
things that are our responsibility,
and that’s data that’s expected to be retained
by somebody, for example, maybe your funder
expects you to retain it– data that might be
unique or hard to replace, and that is of
community value, data that you’re storing as a dark
archive for somebody else. And then you can
think of some reasons we want to store the data,
because it empowers us– so data associated with prizes
and awards in our community, science breakthroughs of
the year, things like that. We need to keep that
data, that’s really important to our community. Community reference collections
are really important to keep. And this is a bit hard to
quantify, but if we could it would give us a little bit
of a way of moving forward. So what’s the data
on which it’s likely that new results will depend? And if we can start
figuring out what that is, we can get away
from these extremes to start building
models that help us host the data in this gray area. So let’s move to the next
thing, which is access and use, and we have data value. And we have to figure out
what we mean by value, and how that
relates to stewardship, and how that relates
to investment. Now let’s figure out what access
and use should mean to us. And first of all, to a user,
access means where’s the data, and does the data I need exist. And in some sense, if I go to
ICPSR, or I go to a repository, I’m going to be able to
look through the data and find out what’s there. But what if I want
to go up a level? And what if I want
to ask the question, does the data even exist,
where would I go to find it? And it turns out that
that’s a hard question, and we’ve looked at
it as a community in a bunch of different ways. One way we’ve looked
at it is to say, well, can we make some sort of, like,
a Whole Earth Catalog of all the data that’s out there,
and all the repositories that are out there. But the problem with that
is as soon as you even get that stuff on a web
site, it’s out of date. And so the refresh
problem is really hard. You could also
say, well, let’s do it like we do everything
else in modern life. Let’s see if we can google
it or Bing it, or Internet Explore it or DuckDuckGo it,
or pick your favorite browser. But the problem is that for
the scientific community, the data is associated
with publications and data collection, not
a set of keywords that typically, my browser has. So it’s really hard to
use that, unless there’s something that is done to
collections and to browsers that help me do it. Then there’s the
keyword problem itself. So when I ran the San
Diego Supercomputer Center, you could have called
me a steroid computing guy, or a high
performance computing guy, or a parallel processing
guy, or a grid computing guy. And if you looked at
any one of those terms, you might miss some paper, some
reference, or some something, where I was identified
with the other term. And this is the tiniest tip
of a huge iceberg for a lot of the things that we do. We have terms. We don’t have really
mainstream, consistent terms that absolutely
almost everybody uses. A lot of times, we
make up our own terms. And so, how do we find stuff
when we want to find stuff? And lastly, we have the
problem of timelines. If I’m searching using
a regular search engine, for a particular data set
that I saw in a publication, how do I know which version
of the data set it is? And one thing that one of the
groups in the Research Data Alliance has is
a really nice way of putting a time
stamp on data sets. So the Dynamic
Data Citation Group is able to use a
bunch of guidelines to say how you refer
to the data set, etc., so that if I read the paper
and I want to reproduce the results, I’m reproducing
it with the same data set they used, and not the
updated version of it. So some of those problems
you can kind of solve, but some of them are
technical problems that need to be solved in order
to use what we think of as normal searching technique. So access turns out to be
sort of a social problem, but also a technical problem
that we need to solve. The next thing to think about
is access is not good enough. So I may know that pile
of data is somewhere, but I actually want to
do things with that data. And as a researcher, that
data isn’t useful to me if I don’t know what it
means, if I can’t find it, if it’s not in the
right form for analysis, if it’s not preserved
in a way that I can reproduce the results. So all of that turns out to be
really important usage tools and services that I
would like to have associated with my data. So this talk will be available,
and I know you can’t read it, but let me encourage you to
read this slide at some point. And let me tell you what
this slide is about. So recently, through
a set of situations, we’ve been talking
with the Internet Archive, who is very interested
in doing hosting of data: scientific data,
not just web crawl. And they think of themselves
as the big hard drive. We’ll be the big hard drive. We’ll be there for free,
we’ll be there forever. And so I asked some of– not
a statistically appropriate sample– I asked some of the
RDA groups that I thought might need
stewardship, would this be of interest to
you, would you be interested in putting your
stuff in the Internet Archive? I asked four people. One of
them said, yes, we’ll definitely do that. The other three had
caveats, you know, we’re only interested if there’s
some persistent identifier, or there’s a service
level agreement, or there’s some sort
of services that help me use this
frequently, and by the way, they have to promise not to go
out of business soon and dump my data back to
me, like happened in my home country of x. So users don’t just
want a big hard drive, they want to make that
data useful for themselves. And so that’s a really
important thing for repositories to think about– what are the typical
cohorts, what do they want from their
data, and how easy will it be for them to create
these use case scenarios. So what are general things– Just in this room
a few months ago, the CCIS advisor for the CCIS
advisory committee meeting– CCIS had a data
science subcommittee looking at data
science and what CCIS could do over the next decade. One of the images from
the report from that group was a cartoon about
the data life cycle, or a data life cycle. Everywhere along here, there
are services that users want. They’re services that take
the data from instruments and fieldwork, and
surveys and devices, and put it in some
reasonable form for you. You need to organize it,
or filter it, or clean it, or attach metadata
to it, whatever. You want to use it
in various ways. Maybe you want to analyze it, or
mine it, or visualize it, or compute with it, or
do a lot of other things. You want to publish it. Maybe it’s there, available
for the community, in a portal. Maybe you create a database
that people can use, you couple it with literature–
there’s all kinds of things you want to do. And of course, what you want
to do is as you go along, make sure that the appropriate
policy is relevant, we’re taking care of it for
stewardship and preservation, there’s a record of
its provenance, etc. All of these things
involve services that users want
around their data. And it’s really important for
us to think about how do we make that easy, because at
the end of the day, the goal for a researcher
is for the innovation to be the roadblock, not for the
technology to be the roadblock. Technology should
be the enabler, and just the
challenge of solving a problem that’s never
been solved before should be the roadblock. So many service models are
specific to communities– and I grabbed this off
Maggie’s and the PDB website. These are things
that are advertised. They’re really useful
for those communities, and my assumption
is that communities have gotten together and talked
about what’s useful for them, and there’s been
a good dialogue. One of the RDA
groups that I think has been doing such cool
stuff is the Wheat Data Interoperability Working Group. If I want to answer
the question, what kind of strain of wheat is
going to thrive right here, I might want to look at the
air quality, and the rainfall, and the bio– the
germplasm of the wheat, to sort of figure
out what’s going on. And they’re– and to be able to
even interoperate among those data sets, I need a
common vocabulary. So they’re doing simple
infrastructurey kind of things, which is
how do we develop a common vocabulary,
what are data formats and standards that we might
want to look at, what’s the prototype interoperability
framework for questions like the one I just asked, etc. So the kind of
services they’re looking at are how do I look at standards. So let’s talk about
standards for a second. So we all love standardization,
because standardization really simplifies things
for communities. And whether you’re looking at
the Astronomical Visualization Metadata Standard, which helps
people from the astronomy and astrophysics community,
to the oscars.org, you all probably saw that very
interesting show, especially the last few minutes. One of the things that
the science and technology part of the Oscars
is doing is really trying to figure out how, if I
make a film, the colors can be the same, whichever
device I’m showing it on. And they actually
have some standards that they’re looking at to make
sure that the colors are going to come out to be the
same, whether it’s on my iPhone, or my
laptop, or whatever. So the standards are
great, and they really help us understand things. But standards before
the right time are really bad, because
what we need to do is experiment with
what’s useful. And so here’s another thing
that professional societies, and repositories, and
communities can get together on to try to figure out when is
that tipping point, when we’ve had enough experience with
various ways of approaching things that we can settle
on something that’s going to be pretty reliable and
pretty good, most of the time. And that’s our standard. And that actually takes some
thought and some evolution of the community. And then another question to ask
is what role should all of us play in that process. Should it be, you know,
just funders and publishers, or professional
societies and domain communities, the
Research Data Alliance and other kinds of groups? What is the appropriate
role for all of them? And so those are things we need
to be thinking about as well. It isn’t a frank talk unless I– you know, things cost money. We have to pay, and all
these services cost. So at SDSC, back in the 00s,
Helen might remember this, we put together a repository
called Data Central. Natasha Balac ran it,
and what we wanted to do was provide a
place with services and hosting for community
data sets of any type. So we had about 100
data sets and they went from art collections,
to seamount collections, to chemistry collections,
to astronomy collections. And we did it for
free, nice for us. We eventually went out of
business doing this for free. I mean, at least this
was with respect to this, because the cost of services
and storage are prohibitive, and especially if
you’re responsible and you make several
copies of each data set, and we had several
petabytes worth. At some point you
just can’t absorb it. You need a viable
business model. And then what we
tried to do is provide a lot of extra stuff
associated with it, so you could create
a portal, or we would help you with data
analysis and mining, or schema design, or
consulting or training or other kinds of things. So it was a valiant
effort for us. I’m really proud of it. But ultimately, the economics
will come to bite you, and I think all of you know
that really, really well. So when I start thinking about
measures of access and use, what does success look like? If
you go to PLOS Biology and some of the PLOS publications, you kind of see
the yelpization– I don’t know if
that’s even a word– where I can share it, or
comment on it, or write on it. This is a paper that
Philip Bourne and I wrote on gender diversity in
data science a while ago, and our score is
good or bad, and you can decide for yourself. But the fact that
I can look at that like I look at the new
sushi restaurant in Troy, New York is a pretty
interesting and new thing for our community. And so this provides us one
way of getting at the metrics that we hadn’t gotten
before, but they’re generally popularity metrics. And I think we would all say
that much data, some data is still valuable even if
no one ever downloads it, or if people don’t
download it now, but they may find
it valuable later. So stewardship and
preservation shouldn’t just be about popularity,
and we have to find a way of thinking about it,
and at the end of the day, organizations that
provide hosting of data really need to figure
out how you allocate the real estate of your budget– how much of it needs
to be put into sort of organizational
stuff, and how much it needs to be in a stewardship
and preservation activity, and how much it
should be in services, and what are you going to
get return on investment for. And those are hard
decisions to make, and they’re often custom
and individual decisions. But they’re worth thinking about
because they get us beyond the yes or no, forever or
not, valuable or not. They get us into the area
where we actually live. So let me talk, just briefly,
about the practices for data repository, stewardship,
and preservation– my favorite data
gone wrong pictures, that I’m sure you’ll recognize. And first of all, let
us say that stewardship and preservation
challenges are real. A lot of the data is not
stewarded well, gets lost. I think there’s a lot of good
efforts in the private sector. We can’t assume
that they’re going to take care of the whole
problem, they’re not. No one sector is going to take
care of this whole problem. And so it’s really
important for us to come up with multi-sector,
diverse solutions that really help us provide the
stewardship that we need. So what are the practices
for data stewardship? And from an organizational
perspective, there have been a lot
of, in some sense, rubrics and ways
of thinking about, do I run a good repository. And so you can see, are
you TRAC certified, and are things reasonable
with respect to community best practices. From users’ point
of view, they’re thinking about different things. So they want to know that
you don’t lose much data, and you definitely
don’t lose their data. Do not lose my data. You can lose other
people’s data enough so I can respect you in the
morning, but don’t lose mine. They want to know
that you’re reliable, that you’re sustainable, that
you’re going to be there. They want to know
that things are easy, and that there are
services if they need them. They want to know that
they can link data to their publications, because
that’s what researchers are rewarded for. And they want to know that
the repository is respected by its peers, and
so you don’t want to give it to a place
that the community is not going to respect. One of the things that Myron and
Jeremy, and I wanted to look at is can we tease apart
the various ways that repositories can satisfy
users, and get better. What does that gap look like? Is it all just a money issue? Is it just like we dump in a lot
more money, and problem solved? And so we asked, in these
in-depth interviews that Jeremy did, we asked users, and what
we wanted to try to understand is this space between well
stewarded data and data at risk. What are the gaps there? And it turns out that there’s
a bunch of different gaps, but there’s sort of three
different buckets of gaps. One bucket is probably
what you would expect, is that people don’t
have enough resources. And that resource
might be money, it might be insufficient
tools, technical issues, it might be insufficient
facilities or staff, etc. So those are resources that you
wish you had, but you don’t. One bucket is really
what you can think of as cultural and social gaps. So there’s an intention
of stewardship, but there’s no commitment. There are differing expectations
between researchers, and funders, and repositories. There are gaps between
what happens at Fran’s repository and good practice. So there are cultural
distinctions. And then there are political
gaps, and those are sort of– there’s no wind in the sails. There are no stakeholders
willing to stand up and say, this is a real problem,
and it’s important for us to cover those other
social and resource gaps. And so the political problems
are often quite important. Now this explains a little bit
about what our 46 respondents, with 120 datasets, thought. And the interesting thing is
you think about this value. 40% say their data is of
indefinite value, but only 6% of the respondents
had a commitment. So you think
there’s a disconnect there between what
the commitment is and what we think the value
is, sort of back to that. And back when I was
involved, as some of you were, with the Blue
Ribbon Task Force for Sustainable Digital
Preservation and Access, one of the things
to recognize is there is a bunch of players
in the digital world. There’s people who
make the data, and who benefit from the data, and who
select it and own it, or have rights to it, and preserve
the data and pay for it. In the private sector, those
groups are largely the same. If I’m Google, I collect the
clicks, I prize the clicks, the clicks give me
competitive advantage. I store the clicks, I use the
clicks, I pay for the clicks, I’m good with that. If I’m in academia, those are
often really different groups. The folks in the field generate
the data and everybody benefits from it. And the folks who generate the
data, often select the data. Maybe the university has
the rights to the data, sometimes it’s a little
hard to figure out. Maybe a third party
preserves the data. Maybe somebody else
entirely pays for the data. And when those groups
are not together, oftentimes you have this
stakeholder problem. The last thing I want to
say in this section is– so we all know it’s
a problem, you all know this more than anyone
in, like, the universe– and you guys are smart
and you’re out there beating the bushes all the time,
how come nothing is happening? And my own sense of that is
this stuff is hard to sell. Plumbing is hard to sell. This is sort of the
plumbing of the data world. And you don’t have
newsworthiness. When the water runs in the
sinks and the bathrooms, and I get services that serve
it up to me the way I need it– no one thinks about
that or remarks on that until something breaks. So oftentimes, providing good
infrastructure to support data is just not newsworthy. Now, on the other
hand, supercomputers. We’ve got it good there,
because somebody gives me a bunch of money, I run– I have a supercomputer, it’s
on the top of the top 500 list, I do some breakthrough
calculation, everybody’s happy about it. Never mind the fact it doesn’t
have to run consistently every single day. It can be down, I can get
a different slot of money for a different
supercomputer, and there doesn’t need to be continuity. So some kinds of
infrastructure kind of fit in our newsworthiness
and opportunity costs, and some do not. And data typically does not. Oddly enough, this
is a really good time for data preservation. And data preservation
has been, a lot, in the news with the
change in administration, and there are a lot of people
in the scientific community and elsewhere who are
spending a lot of time thinking about preservation. How many archiving and
preservation articles have you seen in the New
York Times and Washington Post, and other
science magazines, and other kinds of publications? I think something that
will be interesting for us is, OK, so
data we thought might go away is
now being preserved, but where’s it going? And how are we going
to find it now, and who’s going to sustain it? And so I think this is
sort of a first step in an evolutionary process,
with respect to that data. But in some sense,
maybe the next question is what are our
responsibilities around that? What can we do for our community
so that data will be there? So I guess I want to get back
to the original question. And so what repository metrics
speak to usage and users? And there’s kind of
four buckets of them. There’s a bunch of
things we can do to measure the number
of collections, the users or return
users, the web hits, or whatever,
users and usage. Empowerment, which
is– these are really collections that
bring about really interesting and
important new ideas. And so what do we
need to keep around that’s going to be of
value for doing that? We need to be responsible
for keeping things– our records collections,
or unique or hard to replace collections. And then I think, with
respect to repositories and the organizations that
are storage for our data, we have to have adequate, what I
think of as being the ilities– reliability, predictability,
sustainability, affordability, accountability. All those things
are characteristics that it’s good for our
repositories to have. So last but not least,
here’s my goal– data of value is available
for access and use, now and in the future. And if I level up and
think about the ecosystem in which this lives, how
does that ecosystem really– how do I create a
healthy ecosystem in which to support
these things? And I would offer to you that
I think a healthy ecosystem has three characteristics. First of all, I think it has
stakeholder bio-diversity. I think if we expect the
government, or Google, or Elsevier, or anyone
else to keep all the data, we’re not going to survive. I think it’s got to be
a shared responsibility. And I think the more we can
coordinate and make sure that we have the appropriate
kinds of partnerships, the better. Second of all, I
think that we have got to come to terms that the
provision of infrastructure to host our valuable
data is not research. I mean there’s research
involved in making these technologies
better and all of that– it is infrastructure. When I drive my car,
I am not worried about whether my
asphalt is innovative. I am worried about whether
my asphalt has potholes. I’m worried about whether
the road is smooth. I’m worried about whether
I can get from point A to point B. That’s how
my road is measured. This infrastructure should be
measured on its usefulness to the community. And that means that we
might need a longer time frame, and a different
way of measuring it, and a different way
of supporting it. And I think we’ve just got
to get realistic about that. I don’t think about
my light company as doing novel, new things. I just want lights in my house. And then, the last
thing I think we have to start thinking about,
which we’re already thinking about, is culture change. And these are not just
technical problems, they’re social problems as well. They’re political problems,
they’re community problems. And I think it’s
really important for us to think of those
more holistically. And I think that
really will help us get to where we
want to go, which is data of value is available
for access and use, now and in the future. And with that, I thank you.
