DMS: Software Tool Infrastructure

>> Ira is the principal founder of a company called Semantic Designs in Austin. And he’s been working for over a decade on tools to analyze and maintain large software systems. So, I invited him over to give us a talk about the stuff that he’s been doing, to see if any of his ideas or tools could be applicable to software systems at Google. So, Ira.
>>BAXTER: Hi. Good morning. I’m Ira Baxter, CEO/CTO, chief bottle washer of a small company
called Semantic Designs in Austin, Texas. We’re very interested in dealing with large
software systems, all right? We all see that software systems exist. Okay, if you look
at the ones that run large corporations you discover they’re actually very big. And they’re
not getting smaller, they’re growing. And so, the problem, as we kind of see it, is sort of like what’s indicated by this galaxy. You know, you’re at some point here where you can see this very large piece of software, and it’s enormous and it’s getting bigger.
And the question is, “How are we going to deal with that over the long run?” So, I’m
going to talk to you today about tools that we have been designing and building for about 15 years now. Okay, we’re actually some 15 years into a 5-year technical plan. Okay, to give you some idea of what we think these tools might look like and what kinds of things you can do with them. So, Semantic Designs: we think our goal
is to change the economics of how software is built, at least for large-scale software.
Okay, if you build things by hand the way the Egyptians did, it’s going to take you
30 years to build a pyramid with 30,000 slaves. By anybody’s standards, that’s truly expensive.
If you want to change the architecture of the pyramids, that’s just not a good answer.
Okay, that’s not going to work. You need to do something different about it. The method
is obvious. Bring machinery to the table. Okay, you need automation to make this kind
of thing work. So, we say, “All right, bring automation, and what would it look like?” Okay, the company was founded in 1995, okay, built around a core idea called DMS, which I’ll talk about a lot during this presentation. Okay, and since then, we’ve been trying to enhance this engine, putting features and facilities into it to help us deal with these large systems. We’re small, about 10 people, okay, but half of the staff are PhD-level computer
scientists with various backgrounds in operating systems, software engineering, AI, theorem
proving, so on and so forth. All right. So, it’s an interesting crop, they’re a good crowd
to work with. Okay. Now, a lot of companies, a lot of investors ask us, “What do you specialize
as a company? You specialize in software, what kind? They want to know if we specialize
in banking software or embedded software or medical software.” No, no, we specialize in
software, okay. In big pieces of stuff, because there’s a whole lot of machinery required
to deal with just the software in terms of its scale. So, we’re not specialists to any
problem area. We’re specialists with software systems in general. All right. Now, the tools we apply, and the activities we apply them to, are broad-spectrum as a consequence. Okay. They go from embedded systems, to aviation and military software, to large-scale banking; it’s all across the map. It’s kind of surprisingly broad, and you’ll see a whole bunch of examples I’ll go through here today. And we use these tools to carry off activities like architecture extraction: taking a piece of software and changing its shape, restructuring it, right? Translating it from one language to another, and carrying out things like code generation.
That’s a very high-level view. So, we see software everywhere. Interesting systems have
oceans of code. It’s easy to find million-line systems. All right, when you find them, what you discover is they’re million-line systems but they’re coded in five languages, all connected together in complicated ways. And they’re not only running on this machine but they’re spread across other machines, and they communicate through distributed methods, mailing files, doing FTP, everything you can imagine. So, the challenges when you face these kinds of systems are: how do you build this in a timely way? This is a problem software engineering still doesn’t know how to do very well. All right.
How do you understand the artifact that you built, if you’re going to make any kind of
changes to it? Almost every organization that has a big piece of software will tell you
that nobody here knows what the software does because all the people that built it left
or are dead, right? So, how do you deal with a thing you don’t understand? What
do you do? All right. How do you deal with the quality of this stuff? If this thing is
running your corporation and it’s broken, your corporation’s going to hurt, right? This
is a pretty strong theme for upper management. Software quality needs to be good with your
large systems or it’s not good for the company. How do you make small changes, the kind of change that people do every day? How do I hammer on some little piece of it? Okay. And how do I make massive changes? How do I reshape the system to meet the needs of the future? How do I switch from a mainframe-based system, say, to an Internet-based system, or go from one data model to another kind of data model? Or take two data models from two different kinds of systems and merge them together, so I can get a system which is smart across two kinds of data? All right. If you want to make all these changes of some kind, you
need automated tools to help. So, how do we do that? Well, we fundamentally offer two
kinds of support, activities and tools, to help people deal with these kinds of issues. The first one is analysis, okay? Help organizations with large amounts of software kind of get it under control. Help them understand how the pieces are hooked together, how information is flowing from one part to another, all right? And the second thing is, help them understand what the quality levels are. Help measure: is this code good
for some definition of good? Is it good because you’ve tested it well? Is it good because
it’s structured well? Is it good because it has a good architecture? Is it good because
it’s layered well? So, analysis is pretty big. Most of our customers when they first
come talk to us want to know something about what they have. Tell me about what I have.
But the question’s why, all right? Analysis is an interesting topic but nobody does it
unless they are afraid of the answer. When you go to a doctor, okay, and you’re feeling
ill, you want an analysis. “Doctor, doctor, tell me why I’m ill. Tell me what disease
I have.” And he can give you a very long, complicated definition of your disease, a very long, complicated name, and he can describe it to you, okay. But if at that point you leave the doctor’s office, the visit’s really unsatisfactory. Nobody really wants to know the name of the disease; the name is just a label, so that you can do something about it. It’s the cure that matters, not the analysis. So, analysis is done to support
change. So, the last thing that we do, and I think the thing that makes us really, really unique, okay, is carrying out mass change to large systems, using analysis to drive change. Okay. So: do code generation, okay, from some kind of a specification, which might require, say, domain-specific analysis up front to invent a DSL so you can do code generation from that. All right. Restructuring: take your system and re-architect it, okay, to make it more useful. Modernization: rip out some kind of underlying piece of technology which this piece of software depends upon and replace it. Remove green screens and replace them, okay, with HTML; remove a hierarchical database, replace it with a relational database; switch from a single-threaded system to a multi-threaded system. These are all massive changes and they’re not the kind of things that are easy to do by hand. Okay. The last one is migration, okay, converting from one system to another and changing platforms while you’re at it. So, I’m going to give
you some examples of analysis and change that we’ve done in various kinds of tasks, some very quick ones. Okay. And let me talk about the technology we use to do that. So, on the analysis side, okay, here’s a customer that came to us with an IBM mainframe system. Well, they keep the mainframe in Australia. And they said, “We have 10 million lines of COBOL,” okay, “3,000 JCL scripts,” okay, “6,000 input-output screens, 500 databases, and we don’t know how any of it’s tied together.” So, when they want to do a simple impact analysis, say “we want to change the way interest rates are paid to our customer base,” they first of all struggle to find a place to start. Well, we think this database holds interest rates. The next question is what other parts of the software talk to that database, so they can think about modifying it, so they can actually make the modification, or ask about the architecture: what pieces touch it? So they can actually start thinking about the impact analysis. If you cannot do this on a very large scale, you cannot schedule things well, you cannot time it well, you don’t know what your risk levels are, it’s going to hurt you with quality, you’re going to get hurt when you install it because you didn’t think it through, because you didn’t find everything. Just finding all the pieces and talking about how they’re connected is a really hard problem. So, one of the things
we did for them was we built them a custom analyzer that reads COBOL at the 10-million-line scale. That’s 8,000 COBOL programs. It reads all the JCL, reads a peculiar thing called a Hogan DB. This particular application system is called Hogan. It’s a banking system that you can buy commercially from CSC. And the Hogan system happens to have an architecture that makes it really hard to see how anything works. It contains a bunch of metadata that says, “All these programs talk to each other in the way this metadata predicts.” So, in order for them to see what’s happening, you have to understand how the COBOL programs talk to the ledger, how they talk to the databases, and how they’re directed by the metadata in the Hogan database. And so, we built them a custom tool that will read all of that stuff and give them an answer. And I’ll show what that looks like in a little bit, okay, to give you a sense of it. The main point is it’s 10 million lines of code. It’s bigger than you are; if you try and pick it up, it’ll squash you, all right? So, you need machinery
to do that. So, that’s the analysis side, okay? They are struggling just to understand what they have, and their next question’s going to be, how do we make this thing better? That’s
the next step. This is a more interesting example. Okay. You’re not supposed to be able
to see that airplane. All right. That’s the B2 Stealth Bomber. Okay. B2s are made out
of essentially stealth materials and software, all right? There are about 150 microprocessors in the airplane. Okay. It’s a 30-year-old design at this point, which means it’s using the best microprocessors you could get in 1975, right? It has 16-bit CPUs that actually do floating point in the hardware. I didn’t know you could get CPUs like that in 1975. But if you’re in the Air Force and you’re building this thing, you could get that kind of stuff. So, people wrote software to run this airplane,
all right? So, there’s software to run the ailerons, there’s software to run the engines,
there’s software to manage the airplane: fly to Baghdad, drop a bomb, come home. The pilot doesn’t do anything except say no, right? If the plane goes down some path, or somebody says, “Abort the mission,” he says, “No, go around.” The mission software basically runs the airplane. The Air Force thinks this is a really interesting device for, sort of, force projection. Okay. And they would like to integrate this device into their “battle sphere.” It’s amazing what kind of terms you come up with when you go into these different arenas. They want to integrate it into their battle sphere, and all they have to do to do that is to take the software which is running in the airplane, which was designed in 1975 and didn’t know anything about the Internet, okay, or your radio links or communications, and replace it with something they can enhance. So, this software’s written in something called JOVIAL, Jules’ Own Version of the International Algorithmic Language, designed by Jules Schwartz. It was the best systems programming language you could get in 1973. That was before anybody had ever heard of C, right? Let’s see the next slide. The good news was they had 1.2
million lines of code that ran the airplane, did the mission software, all written in this
thing, it all works, it runs the airplane, everything’s fine. The bad news: the people who wrote it are retired or dead. The development tools on which it was built are [INDISTINCT]. So, the development tools are software for a machine you can’t buy. You can’t even find one anymore. Okay. So, this thing was a disaster of the first magnitude in terms of dealing with it. They had to get out of JOVIAL, and if you like, anything but JOVIAL would have been their goal, but they were happy to go to C. Here’s a complication. This is black code. Top secret: we, SD, aren’t allowed to know what’s in it or see it. “Can you please convert this to C without making any mistakes?” Ugh. The answer is yes. Okay. All you need is enough automation and you can do that. Okay. So, it’s a 100% automated conversion. It’s now flying in the B2s that are in use. We’re very happy with this particular thing. Now, I have to kill you since you’ve seen the airplane. Okay. So, we’ve seen an
analysis example and a transformation example, okay, to give you an idea of the kind of things
that we do. So, let’s talk about how we do that, okay? Now, if you had to build each one of those things, the analysis engine or the B2 translator, from scratch, you’d simply die, okay? There’s just too much machinery involved in doing that. We have a very simple insight, and the very simple insight says that when you build a large, complicated tool for dealing with software, much of the infrastructure it has is the same as every other tool you’ve built. Okay. You need parsing, you need analysis, you need this and that. You’re going to hear that theme a lot in this talk. So, we did a simple thing. We built an engine that has a lot of shared infrastructure. Think of this as an operating system supporting application programs, which are software engineering tools. So, that’s what DMS is. It has this long, complicated name, and it’s another hour of talk as to how it got this long, complicated name, okay?
So, its official name is the DMS Software Reengineering Toolkit; we call it DMS for short. Okay. So, what is this? It’s not a tool that does something directly. It never does anything by itself. Okay. What it does is manufacture the tool that does what we need. So, you can think of it as something that takes in a bunch of descriptions, a bunch of raw materials describing what has to be processed, and spits out a device for carrying off the specific task at hand. So, what we actually do is we take DMS, we manufacture a tool with DMS, and then we apply the manufactured tool to the problem task. Okay. And the good thing is we can use the raw materials for DMS over and over again. Now,
there’s a list of applications over here on the left. Okay. It talks about various kinds
of things we’ve done. Formatters, migration tools, test coverage tools, code generation
from C++, and I’m not going to describe those in any detail here. What I want you to see is that that list is long and they look really different. Okay. And you wouldn’t have guessed, if I hadn’t probably preloaded the question, that those might all be highly related, but they are, okay? You people are related to fruit flies by your chromosomes; you share about 85% of them. It takes a certain amount of stuff just to be alive, okay? And that’s the observation: 85% of those tools are the same. In fact, in our experience it’s more like 95%, and that’s the win here. So, we’ll talk about what’s in this engine
to give people some idea of how we can do this. It’s a very simple idea. There are two things we put together. We put together the compiler technology that the compiler engineers have been building for the last 50 years, okay? If you’re going to deal with large pieces of software, you ignore what the compiler guys do at your peril. It’s real simple: you need this machinery. So, we’ve taken a lot of this machinery and integrated it into DMS. Okay. Now, that machinery is mostly compiler stuff. Compilers take in source code, they analyze
it, they look for special cases, they do code generation from it, they spit out something
else. That’s what they do. Okay. Now, if you specialize it to a particular input language
and a particular code generation style and a particular kind of set of optimizations
and a particular kind of binary output, you get a particular compiler like GCC, okay,
that’s a nice thought. Okay, it’s pretty useful. There’s lots of other compilers built around
that. The difference between what we do and those kinds of compilers is we’ve generalized
the daylights out of the boxes, right? So, you need to parse the input languages, whatever languages you need to read; you need to read them from the external textual form that people deal with into an internal form, which is essentially compiler data structures, symbolized by abstract syntax trees up here, right? And you’d like to parameterize that by a language definition, so you could feed it lots and lots of different languages and have it process them all. You need an analysis engine because you need to ask questions about
your software. If you’re going to remediate a problem, first of all, you need to know
where. So you need analysis machinery, and you can set up general kinds of analysis,
okay, you configure them to ask the questions that you want. All right. Now, most compilers
will emit error messages. In this diagram, you see them coming out over on the right, called analysis results. So you might have the tool just focus on analysis and print a report. And we’ll see examples of that further on. But more interestingly, what you want in an analysis tool is to drive a signal from these analysis results, if I can find it. This arc here, from analyze to transform, this is the key interesting arc, okay: analyze to find my problem, and use the analysis to drive a change, to cause an effect on my system. And we feed that to a transform engine. What the transform engine does is map compiler data structures to compiler data structures. Okay. If you look at the way compilers work internally, that’s what they do. This is the middle five chapters in any compiler book you’ll see: how to map from a high-level form to a low-level form. That’s what it is. But you can generalize that, okay, with the notion called program transformations, which we do, and I’ll talk about that in the next slide. Then there’s the formatting, which in our case is mostly about taking the compiler data structures and converting them back to source code. Because what programmers want is control of their system: put my code in, have something happen to it, have it come back out, and have it mostly read like my code, with improvements in it. So our formatters basically will generate source code. So, they’re kind of funny compared to the kinds of outputs you get from a regular compiler. Now, the way we deal with this is
we parameterize these boxes, okay, with the various kinds of things that it takes to do that. And if you look down here, it looks like we have a damaged arrow in this diagram. There’s supposed to be an arrow from this list of language definitions into this rule compiler, and I don’t know why it’s damaged. All right. In any case, the rule compiler accepts two things. It accepts language definitions, in the form of what you might think of as BNF augmented with a bunch of other things about how to analyze particular properties of that language. And it accepts a tool definition, okay, which is what we want the tool to do: what kind of analysis we want it to perform, how we want it to use that analysis to drive the transformational change. All right. So, the stuff in the gray box there stays constant, in the same way your operating system stays constant. These pieces are just there. We feed in these two things and wiggle them in order to get the effect that we’re looking for. And that’s the principal value we think we bring to the table. Here’s a piece of machinery that lots of organizations could use, okay, by wiggling these inputs for their particular tasks. All right. So, what makes this interesting? Well, there’s the fact that the tool could
provide some kind of understanding. Help me understand the source code: beyond low-level parsers, the compiler data structures make it possible to do algorithmic analysis over the code and get some kind of answers, okay, and we actually do deep information flow analysis. If you really want to understand what happens in the system, you have to know where information flows in, where it goes, and what happens to it. That’s a flow analysis question. So, fundamentally, flow analysis is at the core, right. All right. The heart of this is the transformation engine; I’ll talk about that in more detail in just a moment. Okay. And the other thing that makes this tool interesting is not just the sort of theory picture. Okay, it’s a nice picture, that’s cool, the kind you could find in an academic paper. Okay, I wrote one like that about 25 years ago. Okay, but rather that it actually works on real systems; it works at scale. We’ve actually dealt with systems that have millions of lines of code, multiple languages, mixed languages. There’s a computational level underneath: there’s a parallel processing engine to support the size of the computations we’re doing, because doing symbolic computation at the million and 10-million-line scales is expensive. There’s a parallel processing system underneath DMS. That’s all I’m going to say about it today, but I’m happy to talk about it offline. Right. And it’s been in active use in our hands for over a decade, okay, by a small but very clever team, I think. Okay.
So, let’s talk about program transformations. This is not going to be my day. I need to advance this slide to the point where we can see it all. I don’t know why I’m not getting it projected the way I want. All right. Let’s start here. There was an idea back in 1969 that says, “It would be really cool if we could synthesize programs from scratch.” All right. And what this guy said was, “You imagine something,” that’s the requirements bubble over there on the left, “and I’ll write down some formal specification for it in my formal language,” whatever that was; dozens of formal languages have been proposed. And then we’ll get some magic engine that takes that formal language spec, generates code out the back side, and produces the program, okay. That idea has been both famously a failure and a success, okay. The high-level version of it, really high-level requirements and specifications, has been a complete disaster. I don’t know if anybody can really do this very well. The low-level version of it has been a gorgeous success, to the point where everybody forgets about it. They’re called compilers, right? Everybody has one, everybody uses one, nobody thinks about it, it just works. But if you push on the high-level version of it, okay, what you realize is that going from a spec to that final program in a single step, like, look, it takes in two million lines of spec and it produces 10 million lines of program and it does that in one step, seems pretty hard. Really hard to believe. The only hope you have of doing something like this is to say, why don’t we do this in stages? Let’s somehow incrementally convert that specification into the final implementation by applying a whole bunch of knowledge, which we’ll call rules, that maps that spec to the code. And in the lower part of the slide, I’m going to show you a
little simple version of that. As you can imagine, down here in the lower left you can see we have a little specification. It’s written in algebra because, this way, I don’t have to explain the spec language to you. I assume you all have had algebra, right? So, that’s a specification, and what we want to do is compile it to a more efficient program. Okay, compilation means more efficient for some definition of more efficient. Most people think more efficient means compiled to machine instructions; that’s one definition. Another one is it just takes less time to run. That’s a version of compiled, all right? Because that’s what everybody is really looking for. So, on this slide, we’re going to do the second one: it takes less time to run. We’re going to do that by basically simplifying this program, applying optimizations to it. And the optimizations we apply are all the ones you learned in 9th grade. So, this first step goes from our simple spec here to the second spec, and it applies the distributive law as you learned it in the 9th grade. And then we use the unit multiplier to get rid of one times Y and convert it to Y. And then we apply a few more of these very simple 9th-grade transformations you know about to get to our final program. This is a compiled program in the sense that it’s better. It’s smaller; if you build a dumb interpreter to interpret the left one and the right one, the right one will run faster. So, it’s compiled. So, this achieves the idea. It doesn’t achieve the sort of scale that you want, think about 20 million transformational steps, but I couldn’t get those on a slide. So, I tried to keep it simple. But this is the sort of idea. We want to take this analogy and apply it to software, where the thing over there on the left is a formal spec or a piece of code, or a piece of object code, any kind of formal document, and the rules are pieces of knowledge which legitimately tell you how you can manipulate this program to make it change. So, the issue is: how can you give it different language definitions, and how can you feed it a lot of rules?
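To make the staged-rewriting idea concrete, here is a minimal Python sketch of rules applied to an expression tree until no rule fires. It is a toy, not DMS; the tuple encoding of expressions and the rule names are purely illustrative.

```python
# Toy program transformation: rewrite an expression tree with
# 9th-grade algebra rules until a fixed point is reached.

def distribute(e):
    # a * (b + c)  ->  a*b + a*c   (distributive law)
    if isinstance(e, tuple) and e[0] == '*' and isinstance(e[2], tuple) and e[2][0] == '+':
        a, (_, b, c) = e[1], e[2]
        return ('+', ('*', a, b), ('*', a, c))
    return None

def unit_multiply(e):
    # 1 * a -> a,  a * 1 -> a      (unit multiplier)
    if isinstance(e, tuple) and e[0] == '*':
        if e[1] == 1: return e[2]
        if e[2] == 1: return e[1]
    return None

RULES = [distribute, unit_multiply]

def rewrite(e):
    # Rewrite subtrees first, then try each rule at this node;
    # keep going until no rule applies anywhere.
    if isinstance(e, tuple):
        e = (e[0],) + tuple(rewrite(x) for x in e[1:])
    for rule in RULES:
        out = rule(e)
        if out is not None:
            return rewrite(out)
    return e

# x * (y + 1)  simplifies to  x*y + x
print(rewrite(('*', 'x', ('+', 'y', 1))))   # ('+', ('*', 'x', 'y'), 'x')
```

The same fixed-point loop is the shape of the real thing; the difference is that DMS matches rules against compiler data structures for real languages and carries thousands of rules, not two.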
All right, well, what does that really look like? This is what transformations look like when you go from the algebra world to the programming world. All right, so this is a transformation that maps from the source domain, in COBOL. Right, that’s some kind of COBOL program, and what I’d like to do is convert this COBOL program into CSharp. Okay, we can talk about whether this is a good idea or not, but it’s certainly a popular idea with lots of large mainframe guys right now, all right? So, there’s lots of interest in that, and being, well, a money-grabbing company, we help people do this kind of thing. So, how do you get from COBOL to CSharp? Well, the answer is: convert each of the constructs in COBOL into a corresponding construct in CSharp, and that will give you equivalence. And so, here’s a rule that does a piece of that. This rule takes care of the special case in COBOL where you can say add some variable to some other variable. That’s a COBOL phrase, it’s legal, for some source expression and some target variable, okay. That “ADD v1 TO v2” in COBOL rewrites to: the slot called v2 in some object, okay, is updated by the value that you computed from v1 and whatever object it came from. So, hiding behind this thing is an analysis: that “object” thing is metacode that says I need to go off and do an analysis to figure out where we decided to put this object in the target system. So, it’s hiding an analysis there. There’s a second analysis there that says you can only do this if v2 is actually represented in the target system as an integer. It doesn’t work if it’s a decimal number, because CSharp doesn’t do decimal add with plus. It won’t do that, okay? So, what we see in this particular transformation is syntax-directed on the source: if you see this piece of syntax here, replace it by that piece of syntax there. That’s mathematics. It’s called equational equivalence, A=B. Replace this by that, okay, do some analysis to figure out what to replace it with, and add some conditions, like “if represented as integer.” If you take that transformation and you apply it to the “Before” piece of COBOL down there at the bottom, and you run it with some other transforms that are allowed to actually do the object look-ups and things like that, you can get out something like Invoice.ShipTotal += Order.WidgetCount. So, there’s a transformation. All right. Now, we write those transformations as source text because we’re engineers and we don’t know how to read compiler data structures easily. As engineers, we want to work with text. But the tool can’t work with text. It has to work with compiler data structures, and so the way it handles this kind of stuff is it takes those rewrite rules and translates them into the same kind of compiler data structures as the actual source code, so it can match them all up. And that’s how it works internally. So, it maps from that structure to that structure.
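As a rough sketch of how such a rule behaves mechanically, here it is in Python with a toy symbol table standing in for DMS’s real analyses. DMS rules are written in its own rule language over grammars, not Python; `object_for`, `is_integer`, and the `SYMBOLS` table are invented for illustration.

```python
# Sketch of the "ADD v1 TO v2" rule: a pattern over the source form,
# a side condition supplied by analysis, and a target-language template.

SYMBOLS = {'WIDGET-COUNT': ('Order', 'int'),     # toy stand-in for analysis
           'SHIP-TOTAL':   ('Invoice', 'int')}

def object_for(v):            # which target object holds this COBOL field?
    return SYMBOLS[v][0]

def is_integer(v):            # is this field an integer in the target system?
    return SYMBOLS[v][1] == 'int'

def csharp_name(v):           # WIDGET-COUNT -> WidgetCount
    return ''.join(part.capitalize() for part in v.split('-'))

def rewrite_add(stmt):
    # match ('ADD', v1, 'TO', v2); fire only if the side condition holds
    if stmt[0] == 'ADD' and stmt[2] == 'TO' and is_integer(stmt[3]):
        v1, v2 = stmt[1], stmt[3]
        return f"{object_for(v2)}.{csharp_name(v2)} += {object_for(v1)}.{csharp_name(v1)};"
    return None               # rule does not apply; some other rule must

print(rewrite_add(('ADD', 'WIDGET-COUNT', 'TO', 'SHIP-TOTAL')))
# Invoice.ShipTotal += Order.WidgetCount;
```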
Okay. Now, if I give you one rule, you’re not a mathematician, okay. Knowing X + 0 = X won’t get you a job, okay, at Johns Hopkins, right? So, if you want to be really good at mathematics, you have to take a lot of courses. You have to learn the various kinds of mathematical systems there are, the various ways to write down formulations, the various ways to manipulate those things; there’s a bunch of rules. And if you’re going to be really good at it, you’re going to spend a long time doing this. And if you do, you can do spectacular things with mathematics. DMS is the same way. If you give it one rule, you get a cute example like the one I had on the previous slide. But it’s not really impressive by itself. You give it three rules, nah, you don’t get much. You give it 50 rules, you start to get to a place where you can do things like building test coverage tools, and we’ll see some examples of that down the road. And you give it several thousand rules and you can do spectacular things. You can move the pyramids. Okay. In particular, you can move the B2, all right? It’s just a matter of scale. So, the number of rules matters. Now,
we talked about the transformational aspect of the tool, okay. It’s supported by a bunch of analysis engines. What I said before was that flow analysis is important: how does information flow throughout your system? We don’t do any magic here. Okay, what we do is implement the compiler technology you can find in the Stanford Computer Science Bookstore. Go to the back wall, there’s a bunch of compiler books, pick up the first book on the left, implement that, put that one away and pick up the second book, implement that, put that one away, repeat. We’re about halfway down the wall, right? Fundamentally, what they say is you need to figure out how information flows throughout your system, okay? So, computing flow graphs, information flows, use-definition chains, definition-use chains, points-to analysis, all these things are interesting. So, this is an example of a flow graph, okay, decorated with data flows, okay, for that little tiny spot up there at the very top, which I’m sure you can’t read even if your eyesight’s far better than mine, okay? That’s a Fibonacci program. All right? This graph that you get is all the information flows that happen inside that Fibonacci program. My first comment is, “Man, no wonder programming is hard.” Okay, you have to understand all these kinds of relationships, in some sense, in order to believe this program is right, all right? If you can collect these kinds of information flows, then you can discover that some event here in your code, at this point, can have an impact downstream on there, whereas that event doesn’t have any downstream impact on this part. And being able to simply separate out the part of the program which is important for the task you care about is fundamental, just to focusing your attention. Right. So, you need the flow analysis just to help find out where things happen, to help separate the things that are irrelevant to you from the things that are not irrelevant to you, and to understand how information flows around the entire system. So, this machinery is built into DMS, okay, and it’s been applied to a number of different languages. It’s a generic subsystem where we connect a different language by providing some extra information with the language definition: if you have this language definition, here’s how you feed the flow analysis machinery.
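As a tiny illustration of the kind of computation involved, here is the textbook reaching-definitions dataflow: iterate over a flow graph to a fixed point, asking which assignments can reach each block. The four-block graph is made up for the example.

```python
# Textbook reaching-definitions dataflow: which assignments can reach
# each block?  Iterate the transfer equations to a fixed point.

blocks = {                      # block -> (definitions it makes, successors)
    'B1': ({('x', 'B1')}, ['B2', 'B3']),
    'B2': ({('x', 'B2')}, ['B4']),
    'B3': (set(),         ['B4']),
    'B4': (set(),         []),
}

IN  = {b: set() for b in blocks}
OUT = {b: set() for b in blocks}

changed = True
while changed:
    changed = False
    for b, (defs, _) in blocks.items():
        preds = [p for p, (_, succ) in blocks.items() if b in succ]
        new_in = set().union(*(OUT[p] for p in preds)) if preds else set()
        killed = {var for var, _ in defs}   # a new def kills older defs of that var
        new_out = defs | {d for d in new_in if d[0] not in killed}
        if (new_in, new_out) != (IN[b], OUT[b]):
            IN[b], OUT[b], changed = new_in, new_out, True

print(sorted(IN['B4']))   # [('x', 'B1'), ('x', 'B2')]: both defs of x reach B4
```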
Here’s the other analyzer; this is the range analysis. How big is this value? Okay, why would I want to know this? Well, I want to know if I’m getting overflow, I want to know if I, you know, exceed my storage bound on some array, I want to know how much storage to allocate for this particular variable. I’d like to know all kinds of things, right? So, this analysis is done using abstract interpretation over symbolic range constraints. A symbolic range constraint is a very short formula, okay, involving three constants, A, B and C, and two program variables, X and Y, that says I can discover this particular property is true of this pair of variables at some point in the code. And we have examples of that up here. Okay, there’s a small program. Okay, going into the program, we know nothing at all, okay. Coming out of this first conditional, what we know is minus I is less than or equal to minus four; that’s a complicated way, when you write it down, of saying I is greater than or equal to four. That shouldn’t surprise you, from the conditional, right? And likewise, I is less than or equal to three if you come out this way on the branch. If you take these range constraints and you propagate them through the various kinds of statements that occur here, you end up collecting more information as you go through the program. Think of this as a symbolic simulation of what’s happening in the code. What do I know about the answers? In this case, we’ve got a fork here: coming off this way, it collects information, and we have a fork coming out this way. And down here we have a join, where these two sets of facts come together. And so, all we know at this point is the intersection between those two sets of facts. We take that intersection and we end up down here with this particular fact: minus K is less than or equal to minus three. And that might not seem very exciting until you think about what it really means: K must be greater than or equal to three, okay. And there’s no upper bound on K. What that means is that this array access down here is going to fail, because there has to be an upper bound on K. So in essence, this kind of analysis helps you find things like subscript faults, all right? So it’s just an example. All right, so that’s an analysis.
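Here is a minimal sketch of that interval reasoning in Python, with the branch conditions hard-coded to mirror the slide. The real analysis derives the constraints from the code and tracks relations between pairs of variables, which this toy does not.

```python
# Interval version of the slide's range facts: refine [lo, hi] bounds
# at a branch, and at a join keep only what both paths guarantee.
INF = float('inf')

def join(a, b):
    # The facts at a join are the intersection of the two paths' fact
    # sets, so each bound weakens to the looser of the two.
    return (min(a[0], b[0]), max(a[1], b[1]))

# Mirroring the slide's shape:  if i >= 4: k = i   else: k = 3
then_k = (4, INF)          # on this path, i >= 4, so k >= 4
else_k = (3, 3)            # on this path, k is exactly 3
k = join(then_k, else_k)   # (3, inf): k >= 3, and no upper bound on k

table_size = 10
if k[1] >= table_size:     # k's upper bound runs past the array:
    print("possible subscript fault at table[k]")
```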
So, what’s in the engine? Well, it’s a pile of gears. It’s a toolkit. You screw them together the way you want to get the output that you’re looking for. It contains parsers; it contains pretty printers, which are anti-parsers, that’s all they are, just inverse functions. Okay, it contains a surprising number of mature front ends for tough languages, okay. There are about four or five working C++ parsers on the planet. We have one of them, all right, because of the machinery we built underneath it. So: C++, C, Python, Fortran, Ada, ABAP, okay? We have a large number of well-tested front ends that we’ve applied in serious commercial contexts, all right? And it takes a long time to get the details on those things right; we’ve spent about the last five years busily tuning our C++ front end to try and deal with all the various weird peccadilloes that show up when you deal with C++ and you deal with the dialects that come with C++. Okay, Visual Studio’s C++ is not the same. Okay, there’s GCC’s C++. They differ in various ways, and your reasoning turns out to be different depending on which compiler you’re using, right? There’s a part down here called the procedural API. This is what you find in the compiler books: I have a tree, I have a symbol table, I have a flow graph, and here’s a bunch of APIs I can use to manipulate those things. That’s the very bottom of our system; it shouldn’t surprise anybody, okay? It’s what you do if you need a compiler. We try not to use it, okay? What we try to use instead are these various kinds of analyzers we’ve talked about: control flow, data flow, symbolic range analysis. And we try and use the rewrite engine. Okay, it’s just a pattern-matching language: if you see this, replace it by that. Okay, either from one language to another, or from the same language to itself for optimization purposes, right. And at the bottom of the stack, I lied, I said I was only going to say this once, I lied, there’s a parallel programming language at the bottom of the stack. The main point is: it’s a big pile of gears; you screw them together to get the effect you’re looking for. So, this is kind of another view of it,
okay? So, we’ve got that compiler front end. It actually reads these various definitions, dumps them into a set of internal databases, and those essentially drive the evaluators and the transformers. And then we have a set of separate subsystems down here that carry off this analysis in the abstract, and can be tied to a particular programming language by the description that we gave it on that input over there. Okay. So, we’re going to take a look at some other applications of DMS, just to give you some sense of the kinds of things we’ve done with it. Okay. We found that, well, a lot of people want to have really good information about their system and they want to have it now. But sometimes you can’t get really good information fast. So, maybe it’d be okay to get pretty good information really fast, right? So, we built a thing called
the Search Engine. This is a tool for fishing around in large bodies of text. Okay, it doesn’t have any really deep knowledge of the text; in fact, basically what it does is take the source code and lex it according to the lexemes of the language. And when you’ve got 15 different languages, you can lex them all according to the specific language. Then you can build a query language over those lexemes. And now you can search for things. This is a search in a language called, I think this is Natural? This is Natural. This is a 4GL. This is a programming language for business stuff, okay, from the 1970s and 80s. Okay. And what we wanted to find out was: how many input screens does this thing have, okay, across this 1.3 million lines of code? Because a customer came to us and said, “We think we have…,” I forget what they said. “We think we have something like 1,600 input screens,” right, “in our application. And we’d like you to convert it. Now, please give us a fixed-price bid for this based upon 1,600 screens.” Would you trust them? Right? The answer’s no. Go check it out for yourself, okay? And this is what we found: 1,112 of these things, okay? These are not the input screens they were talking about, but they’re input screens. There are a thousand screens that they didn’t tell us about. So, this is really good info for us. So, this is good information really fast. So, this is a search across a system. You can kind of think of this as Google for code. Okay, it’s only internal as opposed to external. I think you guys already have a tool that does something like this, that’s used for fishing around internally. So, this is okay information, but really fast.
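A minimal sketch of the lexeme-search idea: lex each file, then match token patterns with wildcards. The mini lexer, the query syntax, and the file contents are all invented for illustration; the real Search Engine uses each language’s actual lexical definition.

```python
# Lexeme-level search sketch: lex each file with a toy lexer, then
# match token patterns, where '?' stands for any single token.
import re

def lex(text):
    # Toy lexer: identifiers (hyphens allowed), numbers, single chars.
    return re.findall(r"[A-Za-z][\w-]*|\d+|\S", text)

def search(files, pattern):
    pat = pattern.split()
    for name, text in files.items():
        toks = lex(text)
        for i in range(len(toks) - len(pat) + 1):
            if all(p == '?' or p == t for p, t in zip(pat, toks[i:i + len(pat)])):
                yield name, i

files = {
    'ACCT.NSN': "INPUT USING MAP 'SCRN1'",
    'LOAN.NSN': "INPUT USING MAP 'SCRN9'",
}
# Count the input-screen idiom across the whole code base.
hits = list(search(files, "INPUT USING MAP"))
print(len(hits), hits)   # 2 [('ACCT.NSN', 0), ('LOAN.NSN', 0)]
```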
We talked about the mainframe problem, the guy with the 10 million lines of COBOL. Okay, what they wanted essentially was a picture like this, and this, in fact, is exactly the picture we gave them, okay? So, this is a COBOL program here in the center. Okay. And it reads this database, and it reads this database, it writes to this flat file, okay, and it’s controlled by this piece of JCL. So basically it says: you point me at a module and I’ll tell you how it’s connected to its neighbors, okay? And they’ve got, of course, a search mechanism over here so they can figure out which components they want to look at. This allows them to look at their 18,000 components one small subgraph at a time and get some sense: if I make a change to this program, who do I have to worry about? So, it’s a very simple answer, both in concept and in terms of delivery; a very hard answer to get, because you have to read all this stuff and get all the details right, and do this flow analysis across these 10 million lines of code. All right. Here’s a second kind of analysis done with
DMS, okay. People have clones in their code. Anybody here not clone any code? No hands. Okay. It’s no surprise. Everybody knows they have clones. They don’t know how much. Okay, here’s a bet I’ll take with anybody: you’ve got 10% or more clones in your code. Anybody, I’ll take the bet, right. All right. The reason we know that is we have been running this tool for detecting clones for the last 10 years over everything you can imagine, from Python to Fortran to Visual Basic 6, okay, and it comes out 10% to 20% or worse on everything we see. Okay, the Sun guys are the only guys that beat it; it was 9.87%, okay, with the JDK. Somebody there is actually working hard. I’m impressed. So, here’s a tool for locating duplicated code. Somebody wrote something that was useful and they cloned it someplace else. The good news about that is it’s software reuse; it makes them more effective. The bad news about it is its [INDISTINCT], okay: whatever you cloned has some kind of problem in it in the future. Maybe it’s not actually wrong, okay, but if you change your mind about the architecture, and it contains an architecture decision and you replicated it, you’re in
trouble. There’s another way to think about it. Imagine your source code base is 20% redundant. What that means is, if you pick a random line of code in it somewhere, there’s a 20% chance that someplace else in that system that same line of code exists. And say you decide you have to change that random line of code and improve it. That means there’s a 20% chance that someplace else in the system there’s another line of code that you should fix. How many of you know where that other line is? So, what you’d like to have is a tool for locating where all these clones are, telling you where they are and showing them to you.
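Here is a toy sketch of the detection idea: normalize away formatting and comments, then report any window of code that occurs more than once. The real clone detector works over ASTs and also finds near-miss clones, which this exact-match toy does not.

```python
# Toy clone detector: normalize away comments and layout, then report
# any window of lines that occurs in more than one place.
from collections import defaultdict
import re

def normalize(line):
    return ' '.join(re.sub(r'#.*', '', line).split())

def find_clones(sources, window=4):
    index = defaultdict(list)
    for fname, text in sources.items():
        lines = [normalize(l) for l in text.splitlines()]
        for i in range(len(lines) - window + 1):
            chunk = tuple(lines[i:i + window])
            if any(chunk):                      # skip all-blank windows
                index[chunk].append((fname, i + 1))
    return {c: locs for c, locs in index.items() if len(locs) > 1}

src = {
    'a.py': "x = load()\nif x is None:\n    x = default()\nsave(x)\n",
    'b.py': "# copied helper\nx = load()\nif x is None:\n        x = default()\nsave(x)  # same\n",
}
for chunk, locs in find_clones(src).items():
    print(locs, '->', ' | '.join(chunk))
# [('a.py', 1), ('b.py', 2)] -> x = load() | if x is None: | x = default() | save(x)
```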
All right. So, here’s an example from Python. Okay, this is the BioPython test suite, something like that. And people can see down here if they look: it says 11.7% redundant, okay, across something like 202,000 lines of code. Okay, this is a small system from our point of view; these sorts of things get bigger, and this number goes up. So, this is just a summary, the summary report from the tool, okay. As a sample of a clone, I didn’t take one from Python; I took one from PHP, just to show that we’re different. We have a PHP front end for DMS. So, this is a cloned piece of code. Okay, you can see it’s occurred in three places. Okay, it’s in three separate files. Okay, these clones happen to be exactly the same, one hundred percent identical copy-paste, okay. And most of them are not, okay; most of them are slightly different, and [INDISTINCT] how to detect that. Okay. This kind of tool works on any language that DMS can process. And if we run into a language that DMS can’t process for this, well, we define the input to it and then this runs. We had a customer in Austin that said, “I think I’ve got clones in my code, and I’d like to manage them.” And they came to us and they worked with us for a while. And after they did this exercise, they drew a size curve of their system, and there’s the size curve, and it went up and it went up, and I’m not very surprised by this, systems get bigger. And then it did something that’s shocking in the software world: it went down. Okay. And guess what that inflection point is. They started using the clone detector, all right? So, here’s a way to manage your system size. Here’s a way to cut your engineering cost on large-scale systems. Go find the clones, go manage the code, control the clones. Make sure you’re managing the clones, okay? He was very happy
with this as a tool. All right. So, that’s kind of how things are similar in the big. All right, it’s also interesting to see how things are similar in the small. Everybody uses the diff tool, okay? Take two files, one of which I’ve edited, and show me how they’re different. Now, what diff does is show you how they’re different on a line-by-line basis. Nobody I know of uses a line editor. So, it seems like really the wrong answer, okay? If you think about the way people work with code, what they do is they say: here’s an expression or a statement or a block, and I need to do something to this. I need to move it, I need to modify it, I need to change it, I need to delete it, I need to insert it, I need to copy it. They think like that. At least I think like that; I don’t know about the rest of the world. And so a tool that told you the differences between two pieces of code in that vocabulary seems like it would be much more effective: you changed this variable, you moved this block of code, not “this line’s different, somehow.”
Okay, so an engine that we built called the Smart Differencer uses the same machinery the clone detector does, in the small: look for two things that are the same. Okay. And instead of showing you what’s the same, it shows you what’s not the same. It shows you the complement, just the other answer. Right. And that gives you essentially a Levenshtein difference between the abstract syntax trees: what’s the smallest set of changes you might make to get from one tree to the other? And you can think about it as being a plausible editing path: how did he modify this program to get to that? And it comes out and hands you things like: this block of code got moved. One of its more interesting answers is “you renamed all the variables in this scope this way,” as opposed to “I edited 47 places.” Okay. It’s completely insensitive to, you know, the formatting and the comments, so if you change the formatting of the text, it doesn’t get fooled by any of that stuff. So this is a really useful tool in the small. Okay. Here’s kind of a quick example. Okay, this one is in COBOL. All right. So we look over here on the left, we have an original piece of code. We look over on the right, we have a modified piece of code. And what you’ll see is that TERM-NAME got renamed consistently to TERM-TITLE throughout this piece of code, so essentially the differencer painted up these two screens so that you’ll know what the differences are, so that you can see them. If you’re a code reviewer and you have to see how your code got changed, this is a much, much easier way to see what actually happened.
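Here is a minimal sketch of the rename-detection part of that idea at the token level. The real Smart Differencer computes an edit distance over ASTs; the token regex here is a crude stand-in for a per-language lexer.

```python
# Token-level sketch of rename detection: ignore layout, then check
# whether all token differences form one consistent renaming.
import re

def toks(text):
    return re.findall(r"[A-Za-z][\w-]*|\S", text)

def consistent_rename(old, new):
    a, b = toks(old), toks(new)
    if len(a) != len(b):
        return None                 # structure changed; not a pure rename
    mapping = {}
    for x, y in zip(a, b):
        if x != y and mapping.setdefault(x, y) != y:
            return None             # one old name mapped two different ways
    return mapping

old = "MOVE TERM-NAME TO OUT-REC.\nDISPLAY TERM-NAME."
new = "MOVE TERM-TITLE TO OUT-REC.  DISPLAY TERM-TITLE."
print(consistent_rename(old, new))  # {'TERM-NAME': 'TERM-TITLE'}
```

Note that reformatting the text (the newline turned into spaces above) does not disturb the answer, which is exactly the property the talk describes.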
Okay. What about migrations? Okay, and so we used this nice scene we’ve seen from nature shows, about this guy leading these birds on migration, and then we have kind of the stealth bomber up here in the corner. Whoop. All right, so we’ve talked about migration, I think I’ve made that point, right? So if you need to move from one system to another, change platforms, change languages, this does all that. All right.
Now, that was transformation as opposed to analysis. You can use transformation to support analysis, okay? In particular, you can use transformation to instrument your code to collect data as it runs, to help you decide how good it is. So a fundamental question before I ship my code is: did I test it? Everybody says, “You should do unit testing”; we all buy that, right? You all write unit tests. But do you know how much of your code you tested? Okay. If you tested five lines of your code with your thousand unit tests, I don’t want to ship it; I should be afraid. So you’d like to have a tool that essentially says, “Tell me what part of my code got executed by my tests.” If the tests all pass and most of my code got executed, I might feel comfortable about shipping it. If the tests all pass and hardly any of the code got executed, I should be scared to death, because I have no idea what that line of code that I never executed does; I just don’t know, I have no test. So what a test coverage tool does is instrument the code to collect execution information: I got here. It does that massively, over the entire system; you run the code, it collects all the probes, and it displays them in a nice format, paints them up in a nice picture like this. So the red part here is a piece of the C++ code that did not get executed, red for stop, and green means the stuff got executed, green for go. All right, you can use that kind of instrumentation to do not only test coverage but profiling, both counting and timing profiling, okay, and the style’s the same: instrument the program with whatever kinds of probes you need for this particular language. So the transformations vary a little bit, but the style doesn’t change at all.
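As a small illustration of probe insertion, here is a sketch that instruments Python by rewriting its AST to call a probe before every statement, runs one test, and reports the lines that never fired. The probe and the reporting are simplified stand-ins for what a real per-language coverage tool does.

```python
# Coverage-instrumentation sketch: insert _probe(lineno) before every
# statement, run the code, then list the statements that never ran.
import ast

HITS = set()
def _probe(line):
    HITS.add(line)

class Instrument(ast.NodeTransformer):
    def generic_visit(self, node):
        super().generic_visit(node)            # instrument nested bodies first
        for field in ('body', 'orelse', 'finalbody'):
            stmts = getattr(node, field, None)
            if isinstance(stmts, list) and stmts and isinstance(stmts[0], ast.stmt):
                new = []
                for s in stmts:
                    probe = ast.Expr(ast.Call(ast.Name('_probe', ast.Load()),
                                              [ast.Constant(s.lineno)], []))
                    new += [probe, s]
                setattr(node, field, new)
        return node

src = """
def pay(amount, bonus):
    total = amount
    if bonus:
        total += 10
    return total
"""
tree = ast.fix_missing_locations(Instrument().visit(ast.parse(src)))
env = {'_probe': _probe}
exec(compile(tree, '<demo>', 'exec'), env)
env['pay'](5, False)                 # one "test" that skips the bonus branch

stmt_lines = {n.lineno for n in ast.walk(ast.parse(src))
              if isinstance(n, ast.stmt) and not isinstance(n, ast.FunctionDef)}
print("never executed:", sorted(stmt_lines - HITS))   # the 'total += 10' line
```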
Okay. Some of our customers ship intellectual property to each other. They want the receiver to use it, but they don’t want them to understand it, and they don’t want them to ship it on again. So, what they actually want to do is take the code which is shipped and scramble it in a way that makes it impossible for a human engineer to know what’s going on. Oops. So here’s a piece of Verilog; it’s a chip design. Some guy says, “I design my chip, and I will sell it to you, and I want you to integrate it into your chip. But I don’t want you to ship it to a third party.” So what they’ll do, instead of shipping that thing, is they’ll ship you something like this. This is code obfuscation: take the code and scramble it, change all the names, remove the formatting, remove all the stuff that provides cues for people. Okay. So they can ship this and have some kind of technical support for intellectual property safety.
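Here is a toy sketch of the renaming half of obfuscation, using Python’s ast module (3.9+ for ast.unparse). A real obfuscator must respect scoping and externally visible names, which this deliberately ignores.

```python
# Toy obfuscator: consistently rename identifiers and drop the
# formatting cues (comments vanish when the source is parsed).
import ast, itertools

class Obfuscate(ast.NodeTransformer):
    def __init__(self):
        self.names = {}
        self.fresh = (f"v{i}" for i in itertools.count())

    def rename(self, name):
        # same old name always maps to the same new name
        return self.names.setdefault(name, next(self.fresh))

    def visit_Name(self, node):
        node.id = self.rename(node.id)
        return node

    def visit_FunctionDef(self, node):
        node.name = self.rename(node.name)
        for arg in node.args.args:
            arg.arg = self.rename(arg.arg)
        self.generic_visit(node)
        return node

src = '''
def ship_total(order, invoice):
    # running total of shipped widgets
    return order + invoice
'''
print(ast.unparse(Obfuscate().visit(ast.parse(src))))
# def v0(v1, v2):
#     return v1 + v2        <- names and comments gone
```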
All right, there’s a lot of other applications of DMS that we’ve done, okay. I’m not going to go through the detail; I’ll be leaving the slides behind, so people can take a look at the slides in general. Okay. But they cover the spectrum from embedded systems, okay, to SOA, to other migrations, okay, to generating vector machine code, okay, for parallel machines. These are all done with DMS, basically by carrying out transformations on source code. We continually get asked, “Can you do other new things?” People show up and ask us all kinds of questions; for some we say, “Yeah, we think so,” and some we can’t do yet. These are the sorts of things that we’re talking about with people in sort of serious ways at the moment. Okay. Migrating applications off mainframes: we have a bid in to the University of California at Santa Barbara to migrate their student information system off of Natural on a mainframe into CSharp, all right, a particular task that they want done, right? Controlling development cost by minimizing the software base, okay: we have a customer that’s got 55 million lines of Java code; they’re pretty sure half of it’s dead, they don’t know which half. Help them find it, help them remove it; that’s a big task to do by hand. Okay. I’ve got multi-core CPUs: we’re talking to, let’s say, a big packet switching vendor. Okay. They’ve built themselves a piece of packet switching software which they make a lot of money off of. And it was designed around a single-threaded execution model. Suddenly they have 32 CPUs and their code’s not ready for it. What now, brown cow? Okay. You need to analyze, at scale, to understand where you can break it apart and what you might do to parallelize the thing, right? Well, the U.S. Navy came to us a while ago and said, we’re buying software from Czechoslovakia because we hate being [INDISTINCT]. Okay. And the good news is it’s cheap, and the bad news is we don’t trust it. So now the question is, can you find malicious software? That’s not an easy task to do, but you need to build tools to analyze the stuff and help them poke around, so on and so forth. We’re currently talking to a university who wants to take C code, okay, for high-speed image processing algorithms, pick out the inner loops, and convert these inner loops into FPGAs, so they can build basically a co-design system, okay, in which they can have low-speed C code and high-speed FPGA for the core of the algorithms. That whole trick is just basically driven off the flow analysis stuff; that graph you saw for the Fibonacci is perfect for what they want to do. So I got invited out here after a conversation
with some other folks at Google. Okay. And so I thought I’d put up one possible Google application of this. Okay. And as I understand it, there’s, you know, a build process, a construction process, that says: I have a very large library of support utilities, I have a large number of applications, and there’s a dependency network between them. And what we’d like to have is a build process that says, okay, when I touch one of these things on which something else is dependent, okay, all of the things downstream get compiled. That’s good. We all call it make, or whatever it is you happen to use. Okay. And so here we have a model in which, if you change Z, we’d like to recompile Y and Q and G and then A. Okay. That guarantees that when somebody modifies Z, A gets updated. The problem with it is its granularity. Okay. The dependency is based not upon the details of Z but rather upon Z itself, as a thing. Okay? Y depends upon Z as a whole, so if anything in Z changes, you rebuild Y, even if Y didn’t use the thing in Z that changed. So here we have an example. We look inside, and what you see is Z has some subcomponents, Z1 and Z2, and Y really depends on Z1, okay? M depends upon Z2. Now, if I change Z2, the question is: do I want to rebuild A? The answer’s no. I really don’t want to do that. And the secret to doing that is essentially something like the Smart Differencer. What you want to understand is: what’s the difference between the old Z and the new Z, so you can say, “Hey, this is where changes got made; it’s Z2.” And then you want to look up the dependence relationships between Z and Y, and Z and M, and say, “Does Y actually use Z2? No. You don’t have to rebuild this.” So you could build a smarter make by looking at finer grain, okay, in the code, okay, and looking at dependencies. It doesn’t change the fundamental model of how you build things. But change the details of how you do it, and it cuts the cost radically; well, we think, anyway. Especially if you do this at the scale I think you guys are doing. Okay.
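Here is a minimal sketch of that finer-grained rebuild decision, with the module names taken from the slide; the dependency table itself is invented for the example.

```python
# Smarter make sketch: depend on subcomponents (Z.Z1, Z.Z2), not whole
# modules, and rebuild only what used the part that actually changed.

uses = {                     # target -> the things it actually uses
    'Y': {'Z.Z1'},
    'M': {'Z.Z2'},
    'Q': {'Y'},
    'G': {'Q'},
    'A': {'G'},
}

def dirty(changed):
    """Given changed subcomponents, return every target needing rebuild."""
    out, frontier = set(), set(changed)
    while frontier:
        hit = {t for t, deps in uses.items()
               if (deps & frontier) and t not in out}
        out |= hit
        frontier = hit           # rebuilt targets may invalidate their users
    return out

print(dirty({'Z.Z2'}))   # {'M'}: A, G, Q, Y are untouched
print(dirty({'Z.Z1'}))   # {'Y', 'Q', 'G', 'A'}: the whole chain rebuilds
```

The Smart Differencer plays the role of computing the changed set: diff the old Z against the new Z in editing-operation terms, and the answer “only Z2 changed” falls out.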
All right. So we’ve talked about a piece of machinery, okay, but we haven’t talked about sort of the ideas behind it and the drivers behind it, okay? We think that not only is the machinery useful, but our perspective as a company is useful. Okay, we’ve been doing this for 15 years. We think about this differently than most people do at this stage of the game, because we ask, “What kind of hammer can I hit it with?”, not “How can I do it with a team of people?”, okay. And it takes a while to get used to that idea, because it’s amazing how many different ways you can hit things with this kind of hammer if you think about it hard. Okay. The engine: we bring the idea that you need semantically precise analysis as opposed to people-precise analysis; people are good, but they’re not good at scale. All right. That you want to deal with very large artifacts with a lot of automated engines, okay? That you should build an engine, okay, one uniform piece of machinery, to carry off this kind of work, and you can actually reuse that engine, you can amortize its cost across lots and lots and lots of different kinds of tools. Okay, it’s surprising how
different kinds of tools, right. And it’s also interesting that by using these tools
we manage to carry off some tasks that most people consider to be almost impossible. Most
of our customers have come to us after they've tried to do it themselves and failed. Okay,
in the B2 exercise they tried to do it internally by hand twice before they came to us.
And they said, basically, that they were going to give us the business because we looked
weird, right? We didn't look like we'd do it the way other people did it. And the way other
people did it didn't work. Right? It was really a strange reason to get a piece of business,
but it worked out for everybody. Right. All right, my point is we do hard things with this
kind of machinery, sort of very surprising things, things that people would normally think
are impossible.
All right. In practice, we're a small company. Okay. And this only works when we find a
partner that has a specific task of his own, okay, and who wants to bring something to
the table. So we actually work with the integrators, or the other guy on the other side
of the wall, to carry off the task. We usually give them lots of advice; here's a piece
of machinery; we might build some pieces of it, they might build some pieces of it; we
might go through lots of training. It's an interactive process, okay, with our partners.
All right. So, lessons.
One, software is bigger than you are. Worse, it's getting bigger, all right; it's expanding
at the speed of light. Can you catch it? If you just run at light speed out toward it,
you're never going to catch the edge of the galaxy. Okay. The key technology for dealing
with it is really simple, really simple. You've got to be able to parse stuff. Okay, you
have to be able to analyze it. You have to worry about data flows at scale, all kinds of
data flows. You kind of fundamentally have to implement all that and make it available
to build tools on top of. You need to be able to carry out transformation for change. You
don't want to go to the doctor's office for analysis; you want to go there for a cure.
Tie analysis to change. If there's anything you take away from this talk, it's: tie analysis
to change. That's what tools are really good at, right? You're always going to need a
custom tool, because your situation is different than everybody else's. You grew up the
way you did, that's why your perspective is the way it is, and you have a bunch of
historical baggage. You need a tool that deals with the world at your end. That means you
need a custom tool, and you're never going to find it lying in the gutter, okay? You'll
have to build it. All right, so I think this is another insight: it's hard to build. You
don't want to build all that infrastructure by hand.
It's way too expensive. I will scream if I hear another guy at the Working Conference on
Reverse Engineering saying, "I'd like to redesign the system, and if I just had a
parser…" I'll scream if I hear that again, right? They're always there. They don't seem
to understand that even if you have a parser, okay, it's like climbing the Himalayas;
it's like climbing Mt. Everest. The first 8,000 feet is easy; anybody can do it with a
backpack. The last 19,000 feet requires a completely different technology to get up to
the peak; a completely different game. Right? It looks easy. Yeah, the first 8,000 feet
is easy. Any clod can do it, but that doesn't matter. So the infrastructure is expensive
to build; it's hard; it has scale. You need generic analysis, you need robust language
front ends, and the real stuff has to fit together. Okay. The bottom line here is that
DMS is an engine for doing
this. The reason I go on and proselytize is that there are no other engines on the planet
that look anything like this at this kind of scale, that I know of. There are some research
systems called Stratego and TXL, which you can download for free, and they tell the same
stories, but they don't apply them like this. Okay? They aren't going off and fighting
the battles that people actually have, the real battles, you know: can you really parse
C++? Can you really do 10 million lines of code? Can you do this flow analysis? Can you
do a call graph of this entire system? They don't do those kinds of things. It took an
investment in the machinery of 15 years to get here. Okay. I'm open for questions. I
understand there's an audience here, okay, with mutes, so you may have to turn your mic
on if you want to ask your question. And we have a mic here for people to step up and
ask if they wish.
>>I was curious when you were talking about looking for duplicate sections in the code
or, like, copy-pasted code. Are you able to do that even if it’s not identical code, if
it’s just similar in structure or maybe close to the same code when you look at it?
>>BAXTER: Yeah. We detect what we call "exact clones" and what are called "near-miss
clones." Now, what we don't detect are semantically equivalent clones. It doesn't detect
two pieces of code which compute the same thing but do it in radically different ways;
it detects pieces of code that people have literally copied and modified, okay? It's a
really weird perspective to realize that what makes clones findable is the fact that
somebody stole them, and the act of stealing them made them visible. Okay. So it's not
finding semantically equivalent things, it's finding syntactically similar things that
somebody has stolen. So the smart part of the clone detector, oddly enough, isn't the
clone detector; it's the guy that stole the code. He made it visible; he went and
identified this blob of code, this thing with its formless boundary. He said, "No, no.
Here's the line. Here's the part which is good. Pull it out. Set it over here." And the
moment he does that, now you can see the boundaries, and they're drawn in black, all right?
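
(To make "near-miss" concrete, here is a minimal sketch in Python: treat two fragments as clones when their token sequences are similar above a threshold. This token-level stand-in is an assumption for illustration; the detector described in the talk works over syntax trees, and the tokenizer and threshold below are invented.)

import re
from difflib import SequenceMatcher

# Simplified near-miss clone check: two fragments are "clones" when their
# token sequences are similar above a threshold, rather than identical.

def tokens(code):
    # Crude tokenizer, for illustration only.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def similarity(a, b):
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()

original  = "total = price * quantity + shipping"
near_miss = "sum = price * qty + shipping"        # copied, then renamed
unrelated = "if user.is_admin(): grant_access()"

THRESHOLD = 0.7
print(similarity(original, near_miss) > THRESHOLD)   # True: a near-miss clone
print(similarity(original, unrelated) > THRESHOLD)   # False: not a clone
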
>>[INDISTINCT]
>>BAXTER: Microphone.
>>The example of smart differencing that you showed, it didn't seem very different from
a line differencer. So I guess that's probably because you chose to show a simple example.
But in cases of semantic differences, have you thought about how to show that visually?
Because…?
>>BAXTER: Well, so let's talk about the example for a moment, okay? I mean, you're right.
If you were to do this line by line, a pure line differencer might actually detect this.
Okay. The question is, what did it say? Okay. A line differencer wouldn't draw those green
patches. It would simply say that line 10900 was different and line 11030 was different.
What our tool says is, "The variable name term name has been renamed term title throughout
this block of code." That's all it says, and it says it once. So it really isn't a line
differencer. Okay, you can't see it in this particular one. Okay, go back and ask the
other part of the question again.
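
(A minimal sketch of what "says it once" can mean: compare the two versions' syntax trees, and if they differ only by a consistent one-to-one renaming, report a single rename edit rather than a list of changed lines. This uses Python's own ast module and underscored stand-ins for the identifiers in the example; it is illustrative, not how the Smart Differencer is implemented.)

import ast

# Sketch: detect that two versions differ only by a consistent identifier
# renaming, and report it as a single edit, the way a tree differencer can.

def rename_map(old_src, new_src):
    old_nodes = list(ast.walk(ast.parse(old_src)))
    new_nodes = list(ast.walk(ast.parse(new_src)))
    if len(old_nodes) != len(new_nodes):
        return None                      # structures differ: not a pure rename
    mapping = {}
    for a, b in zip(old_nodes, new_nodes):
        if type(a) is not type(b):
            return None                  # structures differ
        if isinstance(a, ast.Constant) and a.value != b.value:
            return None                  # a literal changed, not just a name
        if isinstance(a, ast.Name) and mapping.setdefault(a.id, b.id) != b.id:
            return None                  # inconsistent renaming
    return {k: v for k, v in mapping.items() if k != v}

old_version = "result = term_name + term_name * factor"
new_version = "result = term_title + term_title * factor"
print(rename_map(old_version, new_version))
# {'term_name': 'term_title'} -- one reported edit, not many changed lines
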
>>The question that I was getting to is: if you're finding differences, and you show the
difference in an abstract syntax tree representation, how do you actually surface that
back so that it makes sense to people?
>>BAXTER: We don't, we don't show it. You
never show an abstract syntax tree to a person. That doesn’t work. Okay. That’s the reason
for the pretty printers. What the pretty printers do is allow us to take an arbitrary internal
data structure and convert it back to source code. And so the kind of thing you see coming
out of the smart differencer, if you tell it to, is, "I moved this block of code from here
to there." And it'll actually show the block of code; it's this block of code. And it
might start in the middle of a line and end in the middle of a line, because it's pulling
out the abstract syntax tree, not the lines. All right. So it shows you the source text
the way you would see it as a programmer. I can only put up one slide here in the time I
have, right? So…
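
(A pretty printer in miniature, as a sketch: walk an internal tree and regenerate source text from it, so the tool never has to show the tree itself. The toy arithmetic-expression printer below is invented for illustration; a production pretty printer handles vastly more.)

from dataclasses import dataclass

# Toy pretty printer: convert a small expression tree back to source text,
# inserting parentheses only where the structure requires them.

@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def pretty(node, parent_prec=0):
    if isinstance(node, Num):
        return str(node.value)
    prec = PRECEDENCE[node.op]
    text = f"{pretty(node.left, prec)} {node.op} {pretty(node.right, prec + 1)}"
    return f"({text})" if prec < parent_prec else text

# The tree for (1 + 2) * 3, as a transformation engine would hold it:
tree = BinOp("*", BinOp("+", Num(1), Num(2)), Num(3))
print(pretty(tree))   # prints: (1 + 2) * 3
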
>>What's the largest system you've ever worked with? You're talking about, you know,
applying this to Google. I don't know how many lines of code we have; maybe 50, maybe 100
million, things like that. Have you done anything like that?
>>BAXTER: Well, I think the right way to ask
the question is, what's the largest system we've worked with? Okay. I mean, my suspicion
is that you don't have a single monolithic 50-million-line system. You probably have a
lot of five-to-ten-million-line systems. Okay, you may prove me wrong, okay? We did have,
well, the packet-switching customer had 35 million lines of C code as a single system.
Okay. And they stretched us to the hilt. And one of the problems there was to do what's
called "points-to analysis," where what you want to know is, for each pointer in the
system, what could it point to? This is what's called a "may analysis": it may point to
X, it may point to Y, okay? And we essentially had to do that analysis across all 35
million lines of code. Okay, the university papers weren't good enough. Okay. We actually
built something to do this; it ran for, I think it was, six and a half days on an 8-core
machine, okay? And it used 90 gigabytes of RAM to compute this answer. Probably not big
by your standards, but it was big by ours, okay. That's the biggest thing we've dealt
with as a single monolith.
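
(A concrete reading of "may analysis," as a minimal sketch: the textbook inclusion-based formulation propagates points-to sets along assignments until a fixpoint. The four statement forms and the tiny program below are invented for illustration; the production analysis over 35 million lines of C is a very different engineering problem.)

from collections import defaultdict

# Tiny inclusion-based "may points-to" sketch: iterate until a fixpoint,
# growing each pointer's may-point-to set. Statement forms:
#   ("addr", p, x)   p = &x
#   ("copy", p, q)   p = q
#   ("load", p, q)   p = *q
#   ("store", p, q)  *p = q

def may_points_to(stmts):
    pts = defaultdict(set)
    changed = True
    while changed:
        changed = False
        for kind, a, b in stmts:
            if kind == "addr":
                new = {b}
            elif kind == "copy":
                new = set(pts[b])
            elif kind == "load":
                new = set().union(*(pts[t] for t in pts[b])) if pts[b] else set()
            else:  # store: whatever a may point to may now point where b points
                for t in list(pts[a]):
                    if not pts[b] <= pts[t]:
                        pts[t] |= pts[b]
                        changed = True
                continue
            if not new <= pts[a]:
                pts[a] |= new
                changed = True
    return {p: s for p, s in pts.items() if s}

program = [("addr", "p", "x"), ("addr", "q", "y"),
           ("copy", "r", "p"), ("copy", "r", "q")]
print(may_points_to(program))   # {'p': {'x'}, 'q': {'y'}, 'r': {'x', 'y'}}
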
I had a conversation, an eye-opening conversation, with the CIO at Metropolitan Life back
in 1995, when I first started the business. Okay. At that time we had no technology at
all, and so we walked around proselytizing, okay? So we went to visit these guys because,
well, they seemed like interesting customers. And I asked this guy, "So how much code do
you have?" And he says, "I have a hundred million lines of COBOL," all right? I said,
"Oh." He points out the window, we're in New York, he points out the window at this
building three buildings away: "See that building over there?" He says, "It's full of
COBOL programmers. Okay. I need to control that." I said, "What's the problem?" He says,
"It isn't the hundred million lines of code that makes me crazy. It's that it's growing
at five million lines a year." That was 1995; God knows where that man is now. He's
probably retired. But his successors have got 200 million lines of COBOL. We went out to
visit the Social Security Administration about two months ago. They do have two hundred
million lines of COBOL. Okay. And that's just the mainframe part; then there's the
external-facing part, where they've been doing all this distributed stuff in C# and Java.
Two hundred million lines of COBOL,
it's breathtaking. I will say that we're probably not very good at this. We're probably
better than anybody else on the planet, okay, but it's a tough chase, and one of our
struggles as a technology company is to try and make sure that we stay on that curve,
that we follow people up as their systems get big, right? Because otherwise we're all
headed for collapse, okay? Black holes happen, okay? When the amount of mass in there
gets to be too big, pfft, then you're dead. You don't want to be there when you hit the
Schwarzschild radius, right? Anybody else?

3 thoughts to “DMS: Software Tool Infrastructure”

  1. I can appreciate the amount of work that would need to go into such a system.  A parser really is just the beginning.

Leave a Reply

Your email address will not be published. Required fields are marked *