Introduction – Intro to Computer Science

[Dave] Welcome back for Unit 2! I hope everyone is gaining confidence in the things you learned in Unit 1. We're going to continue to build on those in this unit, as well as in the rest of the course. Our main goal for this unit is to make the web crawler find all the links on a page, instead of just the first one, so that we can follow those links and collect more and more pages from the web.

To do that, we need two big new ideas in computer science. The first is procedures. Procedures are a way to package code so that we can reuse it more easily. The second is control. Control structures give us a way to make decisions and to do repetition, so we can keep going and find all the links on a page.

What we saw at the end of Unit 1 was a way to extract the first URL from a web page. That's great, we could find the first target. But if we want to build a good crawler, we don't just care about the first link, we care about all of the links on the page. We need to extract all of those links and figure out where they point to, so we'll find many more pages to crawl than just the first one. That's the goal for this class as far as building the web crawler.

So, let's remember the code we had at the end of Unit 1, which solved the problem of extracting the first URL from the page. We assumed the variable "page" was initialized to the contents of some web page. We initialized the variable "start_link" to the result of invoking "find" on "page", passing in the start of the link tag. Then we initialized the variable "start_quote" to the result of finding, in the page, the first quote following that link tag. Then we initialized the variable "end_quote" to the result of invoking "find" on "page" to find the first quote following the start quote. Finally, we assigned to the variable "url" the part of the page from the character after the "start_quote" to the character just before the "end_quote", and we could print out that URL.

This worked to find the first URL on the page. If we wanted to find the second one, we could do it all again. We could advance so we're only looking at the rest of the page, by assigning to the variable "page" the rest of the page starting from "end_quote". Remember, when there's a blank after the colon, that means select from this position to the very end. Then we do all the same stuff again: "start_link", "start_quote", and so on. Now we've got code that prints out the first URL, keeps going by updating the variable "page", and then does the exact same thing to print out the second URL. If we wanted to print out the first three, we could do it yet again, and this could go on forever. The reason we have computers is to save humans from doing lots of tedious work. Certainly, typing this out over and over again would be really tedious, and it wouldn't even work that well.
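Here is a minimal sketch of that code in Python. The sample page contents below are made up purely for illustration; the lecture assumes "page" is already initialized to the contents of some real web page.

```python
# A made-up sample page, standing in for real web page contents.
page = ('<html>Here is <a href="http://example.com/one">one link</a> '
        'and <a href="http://example.com/two">another</a>.</html>')

# Extract the first URL, as at the end of Unit 1.
start_link = page.find('<a href=')            # position of the link tag
start_quote = page.find('"', start_link)      # first quote after the tag
end_quote = page.find('"', start_quote + 1)   # first quote after the start quote
url = page[start_quote + 1:end_quote]         # characters between the quotes
print(url)

# To get the second URL: advance past what we've seen, then do it all again.
page = page[end_quote:]   # a blank after the colon means "to the very end"
start_link = page.find('<a href=')
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
url = page[start_quote + 1:end_quote]
print(url)
```

Running this prints the first URL, then the second. Printing the first three URLs would mean pasting a third copy of the same five lines, and so on for as many links as we ever expect to see.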
Some pages have hundreds of links, while other pages have only one or two. It wouldn't make sense to copy this code hundreds of times. There's always going to be some web page that has more links than we have copies of the code, so we'd miss some of its links. And on any page that has fewer links than we have copies, we're going to run into problems, because the extra copies won't find a link at all; the sketch below shows what goes wrong. Our goal today is to solve all of those problems.
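To make that failure concrete, here is a small sketch (my own example, not from the lecture) of what happens when a copy of this code runs on a page with no more links. Python's "find" returns -1 when it can't find its target, so the extracted "url" ends up as garbage rather than a link.

```python
page = '<html>No links here.</html>'

start_link = page.find('<a href=')            # -1: no link tag on this page
start_quote = page.find('"', start_link)      # -1: no quote found either
end_quote = page.find('"', start_quote + 1)   # -1 again
url = page[start_quote + 1:end_quote]         # page[0:-1]: almost the whole page!
print(url)                                    # prints '<html>No links here.</html'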
