Robust-first computing: Beyond efficiency

Robust-first computing: Beyond efficiency

For the safety of society, and to build really big computers, we should pursue
robust-first computing. A system is robust if it tends to do
something sensible even when its design assumptions are under stress or violated. Which sounds like a
good thing — you see ‘Robust’ on feature checklists —
but at the same time, a lot of computer science is about using computer hardware efficiently. And that also sounds like a good
thing, but efficiency and robustness are at
odds over redundancy which robustness requires, but efficiency eliminates. So for computers out in the real world, for systems meant
to have actual responsibility, we should put robustness
first, and view maximizing efficiency not
as an inherent virtue, but more as a regrettable sometimes
necessity. So that’s the core message of an essay
I’ve got in the October issue of CACM, and this
is a little video intro. So I want to start with a Why do computers crash? I’ve been asking that question around
the country over the last year and a half and people say: bad hardware, bugs, user
errors, malware. But the deeper reason is: we made ’em that way. We design computers to freeze up at the first surprise. Why do we do that? Well, as computer programmers, we were counting on there being no
surprises– everything being completely reliable, so
that we could just put all the steps that we needed in a row, and we basically never gave a second thought to
what would happen if any of those steps went wrong. That attitude goes all the way back to the earliest days of digital computing. And at the very
beginning it’s like there was a contract between the computer engineers and manufacturers on
one side and the computer scientists and programmers on the other side — though they weren’t
really called that — and the idea was: Hardware shall provide reliability. It
takes the unruly physical world and turns it into logic. Software’s job is to take logic and turn it into functions that
are valuable; turn into something that’s desirable
enough to provide enough revenue to pay for the hardware and the
software and everything together. And that’s the way the deal worked. At first, hardware was incredibly expensive and really weak. And that created incredible
pressure on the software to get as much desirability as possible
out of the sunk cost of whatever hardware you were dealing
with. And so there was incredible pressure to be correct for whatever desirability meant, and then to be
as efficient as possible. And this sort of ‘CEO’ idea
— that software’s about Correctness and Efficiency Only — is still
the backbone of computer science — algorithm design
and so forth. The world has changed from then. Hardware is now much much
cheaper, it’s often more powerful than we need
for some given task, and there are bugs everywhere. Correctness — it’s not even
clear what correctness means, in most cases. So the world has changed but in many ways we’re still carrying
this original software CEO attitude with us. For the safety of society, and to let us
build really big computers, we should change that attitude, and think
about putting robustness first and figuring out what that would mean.
Building a body of science and engineering knowledge around computing robust-first. So, that’s
the pitch Let’s do a demo and then wrap-up. So the example I do is sorting, specifically, pairwise comparison
sorting. So let’s, let’s see what that is
quickly. Alright, I’ve got a deck of cards here.. We’ve got cards, they’re sort of magic cards.
And we can %uh shuffle them and square them up %uh like that, okay? So if we take some of
these cards and whups.. The thing that makes these cards
weird is they are like alien cards from Roswell New Mexico, something like that. So if we were to try
to put these in order, how would we do it? You know, is
the two suns, with the thing, better than the green and the red.. Who knows? Well it turns out, in addition to
this alien card deck, we got an alien card comparison device. And that’s
what this thing up here is. And the way it works is, we stick a couple of
cards in it, and it crunches over them and it decides. Okay this one is higher than this one. So then we can stick
this one in and compare it to that guy. Alright well this one’s higher,
and so on. Do this, and alright this one’s higher. Alright so in fact that guy.. this guy just kind of
worked his way down to the bottom So now do we know that these guys are all in
order yet? Well we don’t exactly because we haven’t compared all of them to all of them, but we can try that.
It’s easier to move the comparison thing. Alright, those guys seem all right. These guys see they also wanted to swap. And finally
these guys.. Alright, they’re happy the way they are. So is that it? Are we done? Now it said they’re both equal. Well,
let’s say it’s all right. Those guys are still in order and those guys are still in order, OK.
So there it is. So what we’ve done here is what’s called ‘bubble sort’. or sort of a
part of the bubble sort algorithm. And bubble sort is kind of the black sheep of sorting algorithms. All we do is, well we take our comparison thing
and we compare adjacent cards, and if they come out out of order we swap them. Then we go back to the beginning
and we do it again, and we do it enough times until we’ve gotten everyone, so that’s enough time for — if
this guy on the right-hand end actually belonged all the way on the
other end, he has time to bubble one by one up to the other side: Bubble sort. Now, computer scientists love to hate
bubble sort because it’s so inefficient. But here’s the
twist: What if it’s the case that our alien
card comparison device fails sometimes? Sometimes gets it wrong. Kind of saw where things go up, and then they were equal If our comparison routine is not
guaranteed to be right, we can’t guarantee that we’re actually
going to get the correct answer in the ‘strict correct’ sense. So we need some kinda way of saying: Well
if you got it wrong, did you get it _really_ wrong? And so we made up this thing which we
called ‘positional error’. Okay? And so the idea is, if we were sorting
just these few numbers. 1234, and those are in correct order: 1 is
where it’s supposed to be, 23 and so on. So one is off by zero positions, 2 is
off by zero positions, and so on, they all add up. Here, 2 and 1 have been swapped, so 2
is one position off, it wants to be here. 1 is also one position off–wants to be
there, the others are fine, and so on. OK? If we get ’em exactly backwards that’s
one of the cases were it’s the worst possible positional error. The four wants to go 1,2,3
spots, the 1 wants to go 1,2,3 spots. The 2
are 3 want to reverse. Okay? So we could take our alien comparison device — which in
fact did act a little bit strange, we had that case where it seemed to score things differently when we tried it
times, now it seems like it’s all right now. And suppose instead of doing bubble sort
we did a more efficient sorting algorithm, in this
case. And the one I want to look at — oops — the
one I want to look at is called ‘merge sort’. And merge sort is one of the class a very efficient algorithms the
computer science has developed over the decades %um, and it’s, it’s being used a lot, in
fact it’s still getting more popular lately. almost
surely, if you have a general-purpose computer or phone, what-have-you, in your house there’s
probably an example of merge sort in your software
someplace and the idea is: You take whatever you need to
sort you break them into two groups you break those into two groups, until you
get down to tiny groups of two. You sort those, just sorting them by
swapping if need be, and then you start merging the little
groups of two into groups of four and then groups of four into groups of eight and
so on until you get all the way to the end. Okay? Now this is a clever idea for at
least two reasons. So the efficiency win #1 is once you
have these sorted groups that you’re putting together you only have to check the lead guy. “Is
this guy less than that guy?” Because you know all the rest of them are
bigger. So you can use the front guy as sort of a proxy for comparing against all the rest of each group. It’s very clever. Second win is, especially when we’re
getting to the end to the algorithm, is we’ve got these big groups that we’re merging
together when we decide to take this guy and then
this guy and then this guy, we’re moving those items long distances in the array. So if you think in terms of
positional error, like is this guy way out of position, well when we make a long swap, we’re reducing
the positional error a lot. And that’s what makes it efficient. All of the efficient sorting algorithms
for pairwise comparisons are based on that idea. But on the other
hand, the flip side of making it efficient is: What
if it doesn’t work right? What if there’s a failure in the
comparison? So here’s a case where it works. We have, we’ve split them and now we’re
comparing the twos. 7 is less than 8. 2 is going to be less than 4 so it
goes first. That ends the round of twos. Now we’re merging twos into
fours. And then we’ll do two twos into a four,
another two twos into a four and then finally we will take two fours and
merged them into an eight. Now, especially in this last round here,
the 2 was going a long distance over the ways. OK? So the increased efficiency comes from the fact that merge sort is
designed so that it can end up making these long distance moves. It does one comparison
then moves guys a long way so it has a large leverage. And the increased efficiency implies increased leverage and so if that’s the
case that the comparison routine might fail sometimes, then that leverage which just
helped us so much, can now come around and hurt us so much.
And here’s the data that we got: Imagine shuffling a deck of 52 cards and
trying to sort them using the alien comparison device, which gets the comparison right — less
than, greater than, or equal — ninety percent of the time but in this
example 10 percent of the time it just gives a crazy random answer. If we look at merge sort — that’s the guy in the
middle — or quick sort — another one of these efficient algorithms —
they get pretty well slaughtered by this. And on the y-axis going up and down we
have the total card positional error. Now, depending on how the shuffling
actually works, sometimes it’s worse and sometime it’s better, so these are averages. But, look at bubble sort. People look at this and they think: Well, you
know, this is easy to fix. We can just, you know, try it a buncha times
before deciding whether to swap or not, and so on. There’s more details in the essay The point I want to make is just this:
That because we have the software CEO mindset, that says “Reliability is someone else’s
problem!” and we can feel free to screw the
efficiency down as hard as we can, and feel like it’s a good day’s work — maybe that’s not the best approach
anymore. Maybe we have to say what we’re really looking for is not the
high-strung prima donna piece of software but team player software that gives better
than they get. So OK. It comes down to a question about what do we imagine the future of
computing? And not to put too fine a point on it
but what we have now, inside a single
computer, a CPU and RAM, is essentially sort of centrally
planned dictatorship. The CPU was there controlling everything
and all the peons are just cowering in the
memory, doing nothing, until they’re told exactly what to do and then they do it or they don’t. And it’s
easy to understand how to make such a thing; it seems logical from
one point of view, but it’s very fragile, and it only gets
so big. Eventually you can’t have that one
guy in the center saying what to do. He can’t go any faster. The alternative is
something more like free-market capitalism, something more
like democracy. Now I mean if you know, we look at in the
world and there’s like three coffee shops on every block. That’s clearly redundant. Most those
coffee shops are not operating at the peak efficiency
that they could if we scheduled when everybody was supposed to go get their
cup of coffee. But on the other hand, if one of those
coffee shops goes out of business, or poisons some of
its customers or whatever the other coffee shops are more than
happy to pick up the slack. The redundancy makes it robust. And if we
think about organizing computation on a sort of democratic capitalist, individual autonomy kinda
basis, then potentially, we can make a computer
system that scales indefinitely. All right. That’s it. That’s the robust-first computing concept. If you get it, you got it. Thanks for watching.

14 thoughts to “Robust-first computing: Beyond efficiency”

  1. This is a fifteen minute overview of the essay "Beyond Efficiency", which is appearing in the October issue of Communications of the ACM.  It's a little emphatic!  What do you think?

  2. Thanks for this video.
    I love to hear and think about new ideas especially if they break up this way of thinking which takes everything for granted and question the way things work right now.
    You're a really pleasant speaker.

    Please keep doing those videos.

  3. I love watching your videos, you're really smart. By the way I'd like to know if you have a case to share where Quicksort or Mergesort would fails.

  4. Swaps reduce spatial position error… Interesting! This reminds me of an computational analogue of Erik Verlinde's theory of gravity.

  5. Could we think of the Alien Card Comparison Device to a test of the relative entropy of pairs of Alien Cards?

  6. Great video. I was in Prof Ackley's programming class back in 2000 when I was a CS undergrad, I enjoyed his lectures a lot. Prof Ackley, do you remember the blocks program homework? We had to move blocks around in a grid until they bumped into something. My friends still tease me to this day how I was banging my head trying to figure out that homework. It's been 13 years now for me as a network driver developer, thanks much to your teaching and much grace.

  7. Thanks Dave.  I think I've always lived with or worked around the errors and limitations of other's contributions to a finished system.  I've never thought of system development so much in terms of dynamic and changing interactions of subsystems which may change over time and are created in companies with office politics and social and political tyranny, corruption, misconceptions and false paradigms.  My eyes are more open now.  I think I got it.

  8. Great food for thought. I took a couple of your classes back in 2006 and they were very foundational for both my studies and my career in software. Thank you for educating me both then and now.

  9. Have you got any feeling as to the kind of mathematics that would describe the robust systems that you are presenting? Maybe probabilistic logic of some sort, or even linear logic if you assume that performing a computation uses power that may not be available for another computation later. I have no idea.

Leave a Reply

Your email address will not be published. Required fields are marked *