
Lecture 22

Mapping the Web

References:

Stross, C., "The Web Architect's Handbook", Addison-Wesley, 1996, Chapter 8.



Searching The Web

The Web is ENORMOUS!
Some of the strengths of the web's structure as a 'free for all' publishing house are also its weaknesses for information retrieval.








Search Engines


Robots, Spiders, Wanderers and Crawlers

The software responsible for searching the web is often known as a robot, web-bot, spider or by a similar bug-like name.

Some robots may:

These robots are useful for verifying the hyperlinks of a website and checking the validity of the HTML on the site. (They may be useful for a webmaster with a large site to maintain.)


The beginnings of Dorin's URL Muncher ro-Bot (DUMB)
function DUMB(d)
{
	// Open (or reuse) a small window in which to list the links.
	var newWindow = window.open("", "linklist", "width=300,height=300");

	// window.open returns null if the browser blocks the pop-up.
	if (!newWindow) return;

	newWindow.document.open("text/html");

	newWindow.document.writeln("<B>URL:</B> " + d.URL + "<BR>");
	newWindow.document.writeln("<B>Title:</B> " + d.title + "<BR><BR>");
	newWindow.document.writeln("Number of links in this document: " + d.links.length);

	newWindow.document.writeln("<OL>");

	// Write each link as one list item. Building the tag in a single call
	// keeps stray newlines out of the HREF attribute.
	for (var i = 0; i < d.links.length; i++)
	{
		var href = d.links[i].href;
		newWindow.document.writeln("<LI><A HREF=\"" + href + "\">" + href + "</A>");
	}

	newWindow.document.writeln("</OL>");
	newWindow.document.close();
}

A call to DUMB is given at the end of this document's source.

DUMB(document);

It is responsible for popping up the new window when this page is first loaded!

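Pretty nifty, huh? Don't get carried away though. JavaScript is paranoid about security: a script may only read documents that come from the same origin (scheme, host and port) as its own page. So this method, as it stands, won't allow you to search the whole web.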
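The security restriction mentioned above is usually called the same-origin policy. The sketch below only illustrates the rule and is not part of DUMB: sameOrigin is a hypothetical helper name, the example.com addresses are placeholders, and it assumes an environment that provides the standard URL object.

```javascript
// Sketch of the same-origin rule: two URLs share an origin when their
// scheme, host and port all match. sameOrigin is a hypothetical helper.
function sameOrigin(urlA, urlB)
{
	var a = new URL(urlA);
	var b = new URL(urlB);

	// URL.port is the empty string when the scheme's default port is used,
	// so "http://example.com:80/" and "http://example.com/" compare equal.
	return a.protocol === b.protocol &&
	       a.hostname === b.hostname &&
	       a.port === b.port;
}
```

A robot like DUMB could use a check of this kind to predict which linked documents the browser will let it read before it tries to open them.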



Some other robots:


These robots are useful for creating large databases of material stored on many different machines around the world for use as web catalogues.

Robots usually work by:




Robots must avoid:




Traversing the List of Hyperlinks

Following the hyperlinks outward from a starting location on the web reveals a branching structure which, if cycles are avoided, is essentially a tree.

There are two methods for traversing a tree:

Depth First
Breadth First

Of course there is nothing to stop you writing a robot which selects between these methods depending on whether the intent is to search widely or deeply!
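The two traversal orders can be sketched with a single loop in which the only difference is which end of the list of pending pages is taken next. This is a minimal illustration under stated assumptions, not a real robot: exampleWeb is a made-up in-memory "web" with page names A to E, and traverse is a hypothetical function name. A real robot would fetch each page and extract its links instead of looking them up in a table.

```javascript
// Hypothetical in-memory "web": each page maps to the pages it links to.
var exampleWeb = {
	"A": ["B", "C"],
	"B": ["D", "A"],	// the link back to A forms a cycle
	"C": ["D", "E"],
	"D": [],
	"E": []
};

// Visit every page reachable from start and return the order of visits.
// method is "depth" or "breadth". The visited set is what avoids cycles.
function traverse(web, start, method)
{
	var frontier = [start];
	var visited = new Set();
	var order = [];

	while (frontier.length > 0)
	{
		// Depth first takes the newest pending page (a stack),
		// breadth first takes the oldest pending page (a queue).
		var page = (method === "depth") ? frontier.pop() : frontier.shift();

		if (visited.has(page)) continue;
		visited.add(page);
		order.push(page);

		var links = web[page] || [];
		for (var i = 0; i < links.length; i++)
		{
			if (!visited.has(links[i])) frontier.push(links[i]);
		}
	}
	return order;
}
```

With the table above, traverse(exampleWeb, "A", "breadth") visits A, B, C, D, E (everything one link away before anything two links away), while traverse(exampleWeb, "A", "depth") visits A, C, E, D, B (following one branch to its end before backing up). Neither run loops forever on the link from B back to A.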


There are many traps and pitfalls (not all described above) for the novice robot writer to fall into.

It is a certified *BAD THING* to send a robot out to someone's web server if you don't know what you are doing!

Consult the plethora of web literature concerning standards for robot authoring before proceeding!



Keeping Robots at Bay







All material accessed from www.cs(se).monash.edu.au/~aland is
Copyright © 1994-1998 Alan Dorin.
All rights reserved.