FIT3084 lect2

FIT3084: Behind the (World Wide) Web

In the previous lecture:

A medium is something through which something (information, an idea, an experience, a message from the dead) can pass.
Working with multiple media requires an understanding of:
- the different media
- the way the media work together

In this lecture:

What is the World Wide Web (WWW)?
How does the WWW simplify Internet information retrieval?
What is a URL and how is it used to retrieve data from the WWW?

What is the Internet?

Began as a decentralised network of computers funded by the US defence department (late 60s).
Government agencies, scientific research labs, universities etc. started connecting their computers to the network (the 80s).
Eventually spanned a substantial portion of the developed world (the 90s onwards).

Selected Internet use statistics from 1990 to 2006 (from Gap Minder).

Have a look at Gap Minder in your own time to answer the following questions.

Which country currently has the largest percentage of Internet users?
Which country has the largest number of Internet users? How long has this been the case?
Which country has the lowest percentage of Internet users?
What do these statistics mean for you as a web site publisher?
How does wealth relate to Internet usage around the world?

What is TCP/IP?

Transmission Control Protocol/Internet Protocol is the low-level family of protocols by which Internet computers of different makes, models and operating systems communicate.

How is information retrieved from the Internet?

Use one of the (many) high-level protocols and its software user interface.

FTP - File Transfer Protocol, for downloading and uploading files
telnet, rlogin, ssh - for access to remote hosts
NNTP - Usenet bulletin board and news posting protocol
SMTP - email protocol, one-to-one or one-to-many message sending

As well as Gopher, WAIS... find out what these are (were) by doing a little web surfing!

File formats: storing information on the Internet

There are thousands of different file formats.

A file format is a particular way of storing or ordering information in a file.

The specification of a file format includes information regarding what goes into a file, and the order it is written/read.

Here are some you might find on the web:

PostScript / EPS
RTF
LaTeX
troff
SGML
PDF
Plain text
Proprietary word-processor formats

AIFF, MP3
GIF, JPEG, PSD, PICT, PIC, PNG, RGB, SGI, TGA, BMP, RAW, SUN, TIFF... these are a few of many file formats for images.
Quicktime
VRML, XHTML

...the list goes on and on.

Special software is needed to view, hear, play, read, interpret or edit any file format.
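
The idea that a format fixes what goes into a file, and in what order, shows up in the first few bytes of many formats. The sketch below is an illustration only: it recognises four of the formats listed above by their well-known leading byte signatures. The names SIGNATURES and guess_format are made up for this example, and real software checks far more than the first few bytes.

```python
# Leading byte signatures for a few of the formats listed above.
# SIGNATURES and guess_format are illustrative names, not part of any library.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"%PDF-", "PDF"),
]


def guess_format(data: bytes) -> str:
    """Return a format name based on the leading bytes, or 'unknown'."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return "unknown"


print(guess_format(b"GIF89a\x01\x00\x01\x00"))  # GIF
print(guess_format(b"%PDF-1.4\n"))              # PDF
print(guess_format(b"Hello, world"))            # unknown
```

Plain text has no signature at all, which is one reason a reader (human or program) sometimes has to be told what format a file is in.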

The problems for Internet information retrieval.

Q. Where do I look?
Q. What software do I use to look for and retrieve the files?
Q. How do I use that software?
Q. What file formats do I need to be able to interpret/decode to find the answer?

How does the WWW relate to the Internet?

The WWW began in 1989 at the CERN laboratory as a way to simplify the retrieval of information from the net.

See the WWW's 20th birthday celebration page.

The idea underlying the WWW is that a user is able to transparently jump around the global Internet retrieving information without worrying about the four problems posed above.

Now, to answer the questions above...

Q. Where do I look?
A. The WWW

The Web glosses over the hundreds of individual computers, directories etc.

Q. What software do I use?
A. A Web browser

Only a single piece of software! The browser communicates using several high-level protocols and eliminates the need to master numerous pieces of software.

Q. How do I use the software (web browser)?
A. By clicking the mouse on hyperlinks or selecting them from a menu.

What could be simpler? Previously, software was used by typing cryptic commands into command-line user interfaces.

Q. What file formats do I need to decode?
A. None, the web browser handles that for you!

The modern browser will (sometimes with the help of plugins and helper applications) display images, play sounds, lay out text and interpret a diversity of file formats without you needing to lift your finger from the mouse button! You take this for granted now, but things were not always this simple.

The size of the indexed WWW

[Chart: Size of the Web statistics, showing the number of webpages. GYWA = sorted on Google, Yahoo!, Windows Live Search (MSN Search) and Ask; YGWA = sorted on Yahoo!, Google, Windows Live Search (MSN Search) and Ask.]

The Indexed Web contains at least 22.53 billion pages (Monday, 20 July, 2009)
See World Wide Web Size for details of their estimation technique

Making a personal mark on the WWW.

Originally, the WWW contained information posted by a few companies, research organisations and university academics who had, or could hire, the resources and skills to build a web site and set up a web server.

		Personal homepage. A homepage was the original way to make a personal mark on the WWW. These were always "under construction" and often out of date due to the amount of time it took to maintain them. Nevertheless, they are still popular due to their flexibility and the availability of software to edit HTML web pages easily and in a WYSIWYG fashion.
		Social networking. Sites such as MySpace, and lately Facebook, have become the most popular ways to make a mark and to interact with other like-minded people. These sites allow users to quickly and easily establish and maintain an online presence as long as they are happy with the restrictions the sites prescribe.
		Blog. Rather than just having a space online, people with something to say use a web-log (blog) to make posts that may include text, images, music, and links to other information. Readers follow these like they'd read a daily newspaper, or by subscribing to a feed. Readers can also comment on the posts.
		Micro-Blog. Sites like Twitter permit people to publish even the most mundane aspects of their lives in short snippets called tweets.
		Content sharing. Del.icio.us, Flickr and YouTube are content-sharing websites to which users can upload links, images and movies, respectively, for others to view and comment on.

Other Notable Applications of the WWW.

	Find and buy goods from large retailers anywhere around the world and have them shipped to your door.
	Find and buy (especially second-hand) goods from small retailers anywhere around the world and have them shipped to your door.
	Pay for things securely over the Internet using a credit card.
	Find places, look at street views and aerial photographs of (nearly) anywhere!
	Find web pages, images, scholarly papers and books online.
	Do your banking and pay your bills online.
	Receive current data on currency exchange rates, stock prices, traffic flow, weather, sporting results...

Identifying files on the Internet.

The Internet is a global network (of networks) of computers.

Every computer on it has a unique numerical address (an IP address) and a people-friendly equivalent. You can find out the IP address of a machine using the UNIX host command (type man host at a UNIX prompt to see how it works).

130.194.64.140 ...is the numerical address for our department's old web server... shelob.csse.monash.edu.au

The Internet is divided into domains, and subdomains.

shelob	is the machine name.
csse	is the Computer Science and Software Engineering subdomain.
monash	is the Monash University domain.
edu	indicates the address is educational. What other extensions are there for different types of institutions?
au	indicates the address is Australian. What other extensions are there for different countries?
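
The breakdown above can be reproduced mechanically: a people-friendly name is a list of labels separated by dots, with the machine name first and the most general domain last. A minimal sketch (split_hostname is a name invented for this example):

```python
def split_hostname(hostname: str):
    """Split a dotted name into the machine name and its enclosing domains.

    The domains are returned in the order they are written, that is,
    from most specific to most general.
    """
    labels = hostname.split(".")
    return labels[0], labels[1:]


machine, domains = split_hostname("shelob.csse.monash.edu.au")
print(machine)  # shelob
print(domains)  # ['csse', 'monash', 'edu', 'au']
```

Reading the second list from right to left walks from the country, through the institution type and institution, down to the department.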

Every file on a computer has a filename unique for that machine. Appending that filename to the address of its host computer therefore gives every file on the Internet a unique name.

Steps for Retrieving Documents from the Web.

Computers on the Internet called name servers keep lists of numerical IP addresses & people-friendly names and translate between them.
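
As a toy illustration of that translation, the sketch below keeps a table holding the single name/address pair quoted earlier in these notes and looks it up in either direction. It is only a model: real name servers are distributed across the Internet and use their own protocol, and the names NAME_TO_ADDRESS, ADDRESS_TO_NAME and lookup are invented for this example.

```python
# Toy name-server table: the one name/address pair quoted in these notes.
NAME_TO_ADDRESS = {"shelob.csse.monash.edu.au": "130.194.64.140"}

# The same table inverted, for translating in the other direction.
ADDRESS_TO_NAME = {address: name for name, address in NAME_TO_ADDRESS.items()}


def lookup(query: str):
    """Translate a name to an address, or an address to a name.

    Returns None when the query is not in the table.
    """
    key = query.lower()  # host names are not case-sensitive
    if key in NAME_TO_ADDRESS:
        return NAME_TO_ADDRESS[key]
    return ADDRESS_TO_NAME.get(key)


print(lookup("shelob.csse.monash.edu.au"))  # 130.194.64.140
print(lookup("130.194.64.140"))             # shelob.csse.monash.edu.au
print(lookup("no.such.host"))               # None
```

For a real lookup, Python's socket.gethostbyname(name) asks the system's resolver to do the name-to-address translation.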

1) A web browser (client) sends a request using HyperText Transfer Protocol (HTTP) for a document, specified by its unique name, to a remote (server) machine.

The unique file name is specified within a Uniform Resource Locator (URL)...

Protocol://server_domain_name/file_path

The protocol may be omitted within some web browsers in which case HTTP is assumed.

Absolute URLs

http://www.csse.monash.edu.au/~aland/index.html

ftp://ftp.cs.monash.edu.au/pub/

are absolute because they include a protocol, a domain name and a path.
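
To see the three parts of the general form in the two absolute URLs above, they can be pulled apart with Python's standard urllib.parse module. This is a sketch for illustration (url_parts is a made-up helper name); note that the library calls the protocol the "scheme" and the server domain name the "netloc".

```python
from urllib.parse import urlsplit


def url_parts(url: str):
    """Return (protocol, server domain name, file path) for an absolute URL."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path


print(url_parts("http://www.csse.monash.edu.au/~aland/index.html"))
# ('http', 'www.csse.monash.edu.au', '/~aland/index.html')

print(url_parts("ftp://ftp.cs.monash.edu.au/pub/"))
# ('ftp', 'ftp.cs.monash.edu.au', '/pub/')
```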

Relative URLs

index.html

../index.html

are relative because they specify a path and domain name by reference to (usually) the URL of the file currently open in the browser (often referred to as the base).
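
How a relative URL is filled in from the base can be sketched with urljoin from Python's standard urllib.parse module. The base used here is an assumption made for the example: it pretends the absolute URL quoted earlier is the page currently open in the browser.

```python
from urllib.parse import urljoin

# Assumed base: the page currently open in the browser.
BASE = "http://www.csse.monash.edu.au/~aland/index.html"

# A bare file name is looked for in the same directory as the base.
same_directory = urljoin(BASE, "index.html")
print(same_directory)
# http://www.csse.monash.edu.au/~aland/index.html

# "../" climbs one directory above the base before looking for the file.
parent_directory = urljoin(BASE, "../index.html")
print(parent_directory)
# http://www.csse.monash.edu.au/index.html
```

In both cases the protocol and domain name are inherited from the base; only the path changes.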

Locations within documents

http://www.csse.monash.edu.au/~aland/index.html#chapter

index.html#fred

The text after the # symbols indicates a location within the document specified by the URL.

These locations are named whilst the document is being created. The #location is an optional part of a URL. When would it be useful?
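
The split between the document and the location within it can be shown with urldefrag from Python's standard urllib.parse module (the library calls the part after the # the "fragment"). A small sketch using the two examples above:

```python
from urllib.parse import urldefrag

document, location = urldefrag(
    "http://www.csse.monash.edu.au/~aland/index.html#chapter"
)
print(document)  # http://www.csse.monash.edu.au/~aland/index.html
print(location)  # chapter

# The same split works on a relative URL.
relative_document, relative_location = urldefrag("index.html#fred")
print(relative_document)  # index.html
print(relative_location)  # fred
```

The browser requests only the document part from the server, then uses the location to decide where in the retrieved document to position the view.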

2) A web server program on a remote machine always listens on a well-known port for incoming requests (port 80 for HTTP).

3) The web server checks the client's access privileges; if all is well, it sends the requested document.

4) A browser displays the document retrieved from the server on the client machine in human-readable form.

A web document is anything accessed with a single request from a client to a server.

Try this in your own time...

Commands to type.	Explanation.
telnet www.csse.monash.edu.au 80	Telnet to the school's WWW server (on port 80)
GET /index.html HTTP/1.0	Access the web page "index.html" using the GET command which the browser would normally do for you. Follow your command with two carriage returns.
>> The server should send you the HTML of file "index.html"	See? The protocol isn't magic, you can participate in it manually.
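
The same manual exchange can be sketched in a few lines of Python using a plain TCP socket in place of telnet. This is a minimal sketch under assumptions, not how a browser is actually written: the function names build_get_request and fetch_raw are invented here, and the server named in the table may no longer answer on port 80.

```python
import socket


def build_get_request(path: str) -> bytes:
    """Build the request line typed in the telnet session.

    The two line endings play the role of the two carriage returns:
    the empty line tells the server the request is complete.
    """
    return f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii")


def fetch_raw(host: str, path: str, port: int = 80, timeout: float = 10.0) -> bytes:
    """Send the request over TCP and return everything the server sends back."""
    with socket.create_connection((host, port), timeout=timeout) as connection:
        connection.sendall(build_get_request(path))
        chunks = []
        while True:
            chunk = connection.recv(4096)
            if not chunk:  # the server has closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


print(build_get_request("/index.html"))
# b'GET /index.html HTTP/1.0\r\n\r\n'
```

Calling fetch_raw("www.csse.monash.edu.au", "/index.html") would attempt the exchange shown in the table; the reply, if any, consists of the server's response headers followed by the HTML.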

This lecture's key point(s):

The WWW is a document network linked by hyperlinks.
The WWW and Web browser mask the complexities of accessing computers and files on the Internet and therefore...

... the WWW simplifies the task of retrieving information from remote computers.
