Here is a little experiment for you to try:
Action | Explanation
---|---
telnet www.csse.monash.edu.au 80 | Telnet to some world wide web server (80 is the usual port number).
GET /index.html HTTP/1.0 | GET followed by the local address of some HTML page; GET is normally issued by a web browser but you can do it yourself. HTTP/1.0 is the original and simplest version of the http protocol. Terminate with two carriage returns.

The web server will send the contents of the page and then close the connection.
And what you get back is:
HTTP/1.1 200 OK
Date: Thu, 10 Jul 1997 04:01:56 GMT
Server: Apache/1.2.0
Last-Modified: Tue, 08 Jul 1997 05:17:20 GMT
ETag: "6e94-19d1-33c1cd60"
Content-Length: 6609
Accept-Ranges: bytes
Connection: close
Content-Type: text/html
<HTML>
<HEAD>
<TITLE>Department of Computer Science,
Monash University: Home page</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000"
LINK="#0000EE" VLINK="#0000EE"
ALINK="#FF0000">
<CENTER>
<IMG SRC="/images/cs/hp_title.gif" WIDTH=523 HEIGHT=209
ALT="Department of Computer Science, Monash University">
...
...
Note in particular the line Content-Type: text/html.
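The telnet experiment above can also be scripted. A minimal sketch in Python of building and sending such a request over a raw socket (the host name is the one from the notes and may no longer respond, so the network part is left commented out):

```python
import socket

def build_get(host, path="/"):
    # An HTTP/1.0 request is plain text terminated by a blank line
    # (two carriage-return/line-feed pairs).
    return ("GET %s HTTP/1.0\r\n"
            "Host: %s\r\n"
            "\r\n" % (path, host)).encode("ascii")

# Usage against a live server (host from the notes; may no longer exist):
# s = socket.create_connection(("www.csse.monash.edu.au", 80))
# s.sendall(build_get("www.csse.monash.edu.au", "/index.html"))
# reply = b""
# while True:
#     chunk = s.recv(4096)
#     if not chunk:
#         break          # server closes the connection when done
#     reply += chunk
# s.close()
```

The reply bytes would contain the header lines and page contents shown above.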
When someone clicks on a link in a web-browser, or requests an HTML page from a given URL, a connection is made over the internet to the specified port on a host or server-computer. The connection is to a web-server program, such as Apache, running on the server computer. The browser and the server communicate using the http protocol.
The browser, running on the client computer, sends a GET request for the page. The server returns some header information - the date last modified, the content-type, etc. - followed by the contents of the page (file).
http://www.csse.monash.edu.au:80/index.html

Part | Meaning
---|---
http | protocol
www.csse.monash.edu.au | server-computer, host
80 | port number; 80 is the default
/index.html | file or page name
The default port for http is `80' and it is usually omitted and taken as read.
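A URL can be taken apart programmatically; a small sketch using Python's urllib.parse, with the URL from the notes (the `or 80` handles the usual case where the port is omitted):

```python
from urllib.parse import urlparse

u = urlparse("http://www.csse.monash.edu.au:80/index.html")
scheme = u.scheme    # the protocol, 'http'
host = u.hostname    # the server-computer, 'www.csse.monash.edu.au'
port = u.port or 80  # the port; defaults to 80 when omitted
path = u.path        # the file or page name, '/index.html'
```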
The connection and GET can even be done without a web browser, just using telnet - see above.
The web-server program runs as some user on the server computer, e.g. as user `www'. This user does not have any special privileges; it is not root! It presents to the outside world a view of some directory structure(s) starting at the first slash (/) after the host and port in the URL. This will not in general be equal to "root" (/) within the file system of the computer on which the web-server program runs.
In the diagram above, what the operating system on the server computer sees as ~www/ is seen by the outside world through http as /.
However, web servers can be configured to determine which file systems and directories are displayed through the web.
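The mapping from a URL path to a file under the server's chosen document root can be sketched as follows. This is an illustration only, not how any particular server implements it; the document-root value and the index.html default are assumptions:

```python
import posixpath

def url_to_file(doc_root, url_path):
    """Map a URL path onto a file under doc_root (illustrative sketch)."""
    # Normalise the path and reject attempts to escape the document
    # root with ".." components.
    clean = posixpath.normpath(url_path.lstrip("/"))
    if clean.startswith(".."):
        raise ValueError("path escapes document root")
    if clean == ".":
        clean = "index.html"  # assumed default page name
    return doc_root.rstrip("/") + "/" + clean
```

So a request for / or /index.html is served from within the document root, and the rest of the server computer's file system stays hidden.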
Cgi programs (see later) also run as the user `www'. Other "real" users need to understand this and to note that such a program will look for any files it uses among www's own files unless they are referred to by their full path names.
If the requested web page belongs to the user www, the server will find it within www's own files. If the page belongs to some other "real" user, the URL contains a tilde (~) and that user's login name, e.g.
http://www.csse.monash.edu.au/~lloyd/
It is customary for there to be some convention whereby other users place their world wide web files in a particular location, e.g. in a subdirectory called public_html, so that these and only these files are allowed to be transmitted as part of the world wide web.
Unix, and many other operating systems, have protection bits for all files covering, e.g., read, write and execute permission for self, group and "world". However, "world" means "any user on this computer"; it does not really mean the wide world. In many ways one would like to be able to set permissions covering self, group, other local users and (really) the world.
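The "world" bits mentioned above can be inspected programmatically; a sketch using Python's stat module:

```python
import os
import stat

def world_readable(path):
    """True if the file's "other" (world) read bit is set."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IROTH)  # S_IROTH: read permission, others
```

A web-server user like `www' can only serve a "real" user's file if a check like this succeeds, which is why public_html files are normally made world-readable.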
The process of getting a page is usually more complex than was sketched above - all on the grounds of efficiency: The web browser keeps a cache of pages, images etc. and will use a copy of the page from the cache if it can (A).
Browsers can also use a proxy server. This is a pseudo-server, usually on a local computer, that keeps what is in effect a very large collective cache for those who use it (B). Results from cgi-bin programs are generally not kept in a cache or a proxy.
If a page or image is not in the proxy, the proxy will contact the relevant web-server, pass on a copy to the browser (C) but keep a copy for itself, for a certain time, in case other readers access it.
The use of caches and proxies reduces the amount of traffic on the internet. It does mean that there are multiple copies of data and raises the question of consistency. For many applications it is not essential that all copies of data be absolutely consistent and up to date, but for some applications it is.
The browser can be configured not to use a proxy for URLs from certain domains - usually local ones where access is quick.
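The cache behaviour described above can be sketched as a toy in-memory cache with a per-entry expiry time; the class and its names are invented for illustration:

```python
import time

class PageCache:
    """Toy cache: keeps page bodies until their expiry time passes."""

    def __init__(self):
        self.store = {}

    def put(self, url, body, ttl_seconds):
        # Remember the body together with the time it expires.
        self.store[url] = (body, time.time() + ttl_seconds)

    def get(self, url):
        entry = self.store.get(url)
        if entry is None:
            return None          # miss: must fetch from server or proxy
        body, expires = entry
        if time.time() > expires:
            del self.store[url]  # stale: discard and re-fetch
            return None
        return body
```

A browser cache and a proxy's collective cache both follow this pattern; the proxy just serves many readers from one store.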
It is possible to set an expires date on web pages, images, etc. This can be done by the web-server or by an author including a META tag in the HEAD section of an HTML page, e.g.
<META
HTTP-EQUIV="Expires"
CONTENT="Mon, 22 Sep 1997 01:50:52 GMT">
The content can also be a number of seconds, measured from the file last being modified, or from now.
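A date in the required format can be generated with Python's standard library; a sketch, where the one-hour lifetime is an arbitrary choice:

```python
import time
from email.utils import formatdate

# An expiry date one hour from now, in the "Mon, 22 Sep 1997 01:50:52 GMT"
# style of date that HTTP headers use.
expires = formatdate(time.time() + 3600, usegmt=True)
header = "Expires: " + expires
```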
Once a file has expired, a cached copy of it will no longer be used; the file must be fetched again.
This mechanism lets an author ensure that readers get reasonably up-to-date versions of files. However, it can slow down apparent access speed and should only be used "appropriately". After all, caches and proxies are there for a reason.
Some web-sites set short or even zero expiry deadlines,
not because their pages change frequently,
but simply to increase the number of "hits".
Because of this, some ISPs now set their proxies to ignore the deadlines.
A web server might keep various logs recording its http traffic - access log, referer log, agent log, error log. Keeping these logs is not a part of the http protocol; it is just something that most servers can optionally be configured to do.
Updating the logs takes a certain amount of time, but more importantly takes a lot of file space. The logs can be used for page counters, referral statistics and browser statistics.
The logs are often read-protected because they contain information that might be sensitive (e.g. cgi-bin parameter values) or of commercial value (e.g. page hits).
The access log keeps records of what machine
accessed what HTML pages and when.
It contains entries such as:
ascend-2-16.cc.monash.edu.au - - [14/Sep/1997:00:16:56 +1000] "GET /~lloyd/tilde/CSC3/CSC3252/Notes/assessment.html HTTP/1.0" 200 10282
Field | Meaning
---|---
ascend-2-16.cc.monash.edu.au - - | The requesting machine
[14/Sep/1997:00:16:56 +1000] | The time and date
"GET | The http request
/~lloyd/tilde/CSC3/CSC3252/Notes/assessment.html | The page requested
HTTP/1.0" | The protocol version
200 | The http code
10282 | Number of bytes transmitted
Page-counter values can be computed by scanning the access log.
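Such a page counter can be sketched by scanning the log with a regular expression; the pattern assumes log lines shaped like the example above:

```python
import re
from collections import Counter

# One access-log line: host, two "-" fields, [date], "METHOD path PROTO",
# status code, byte count.
LINE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-)')

def page_counts(lines):
    """Tally successful (code 200) requests per page."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m and m.group(6) == "200":
            counts[m.group(4)] += 1  # group 4 is the requested page
    return counts
```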
Note that the access log shows the accessing computer, but does not show the user.
The referer log indicates what page was being viewed when another page was requested. Not all browsers set the referring page correctly. For example:
http://www.csse.monash.edu.au/~lloyd/tilde/CSC3/CSC3252/ ->
"GET /~lloyd/tilde/CSC3/CSC3252/Notes/assessment.html HTTP/1.0"
http://www.csse.monash.edu.au/~lloyd/tilde/CSC3/CSC3252/ -> | The referring page
"GET /~lloyd/tilde/CSC3/CSC3252/Notes/assessment.html HTTP/1.0" | The page referred to
The referer log can be used to find out where your "hits" are coming from.
The agent log, if enabled, records which browsers (or other agents) requested pages. This information can be used to determine which browser platforms you need to support most strongly.