The
explosive growth of the World Wide Web since its inception in 1993 has
seen growing expectations from the user base, in terms of the
capabilities of tools and the means of representing information. First
generation web standards, such as early variants of HTML and those
beloved mainstays of web imagery, JPEG and GIF87/89, are less than ideal
for the delivery of the kind of content many users now seek.
Of particular importance are the limitations of existing
standards in controlling the presentation of data and imagery, the
frequently poor density of image data, and the difficulties
experienced in machine driven analysis of web content.
The latter has proven to be a major issue, as it limits what
can be usefully done by automated systems such as search engines. We
have all had the experience of typing into a search engine what would
appear to be unambiguous specifiers, only to be inundated with mostly
irrelevant URLs. What is an annoyance to a human reader, who has the
intelligence and the ability to use heuristics to filter out irrelevant
content, amounts to an extremely difficult if not intractable problem
for automated systems. Without the ability to use heuristics as a human
being does, automated systems sink very quickly indeed.
Other problems with the existing HTML 1-3 based web standards
are the cumbersome handling of mathematical expressions, a major
hindrance to the wider use of hypertext for publishing scientific
papers, and the frequently highly inconsistent presentation of markup
across different browsers. We have all had the experience of trying to
produce a web page layout, only to find that it doesn't quite look
the same on browsers other than the one used while testing the
page.
Which of these is the most pressing concern at this time?
Judging from activity in web standards development, all of these areas
are evidently considered to be serious issues.
In this month's feature we will briefly survey some of the
current draft standards or recommendations being explored by the World
Wide Web Consortium (W3C at http://www.w3.org/TR/), to provide the
reader with some idea of current trends in web publishing standards.
Vector Graphics Standards
One of the most glaring inadequacies of the existing web
standards base is the inability to support vector graphics particularly
well. Vector graphics are images which are represented by mathematical
descriptions of lines, shapes, curves, and areas, rather than fields of
pixels set to individual colours.
Vector graphics have some highly attractive qualities for
presentation:
-
They are very dense, in the sense that often complicated
entities can be described with fairly simple and terse expressions. Even
without lossless compression, an image of considerable complexity can
be presented in a file of modest size. With lossless compression the
size is frequently comparable to bitmap imagery of much lower
resolution.
-
They provide quality which is limited only by the
rendering device, such as a display running the browser or laser
printer. Doubling the rendering resolution (ie zooming) of a bitmap
image produces something of significantly inferior quality, whereas a
vector graphic representation of the same image will usually lose no
quality if we zoom in.
-
Lack of ambiguity in representation. Many bitmap formats
vary in presentation quality with the quality of the algorithm used to
decode and render the image. JPEGs are a typical example. Vector images
do not suffer this problem as frequently.
-
The ability to cleanly translate the image into another vector
representation with no loss of information and thus image quality. If an
alternate vector representation standard can support the same graphics
primitives, an exact mapping can be performed.
There are many vector graphics based file formats in
existence, examples being PostScript/EPS, Computer Graphics Metafile
(CGM), HPGL, Adobe Illustrator and a host of others.
Vector graphics representation is the basis of all engineering
drawing packages, and it is by far the best way of representing charts,
plots, graphs, line drawings and line illustrations.
For a web environment, this technique is close to ideal, since
the quality of the image is limited primarily by the quality of the
browser and the display or printer. The currently widespread use of GIF
and JPEG for this purpose usually does a disservice to most such
imagery.
At this stage two schemes are being explored as the basis of
future web standards.
The first is the Scalable Vector Graphics (SVG) language,
derived from the XML standard. The aim of SVG is to provide a genuine
vector graphics language, capable of representing conventional graphical
shapes, such as lines, areas, curves, text and embedded bitmap images,
which can be grouped, styled, transformed and composited. Facilities
exist for shading and colour transitions. The SVG model goes further
than conventional vector representations, since it embeds features
allowing access to global attributes, as well as providing facilities
for scripting. Animation can be incorporated, extending the established
XML model.
SVG is now at the Candidate Recommendation phase in the W3C
standards scheme, while effort continues on the related SMIL animation
scheme.
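To give a flavour of the language, the following is a hand-written sketch of a small SVG document; the shapes, dimensions and attribute values are invented for illustration, and reflect the draft syntax at the time of writing.

```xml
<svg xmlns="http://www.w3.org/2000/svg" width="200" height="130">
  <!-- a grouped, styled pair of primitives -->
  <g stroke="black" fill="none">
    <circle cx="60" cy="60" r="40" fill="yellow"/>
    <line x1="110" y1="20" x2="190" y2="100"/>
  </g>
  <text x="60" y="120" text-anchor="middle">a circle</text>
</svg>
```

A file like this describes the image exactly, at any rendering resolution, in a few hundred bytes of text.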
Another proposal proceeding through the same standardisation
system is the WebCGM standard, a web optimised profile of the widely
used ISO Computer Graphics Metafile (ISO/IEC 8632:1992) vector
graphics standard. CGM has been a mainstay of the CAD/CAM industry, and
is also the only industry standard vector format accepted by most
Microsoft tools (... hint for xfig users who need to submit work in MS
Word or PPT).
The attractiveness of WebCGM lies in its huge established
industry base, which makes the development of robust output filters a
very economical proposition, since it amounts to little more than
tweaking existing and often very mature code. The WebCGM proposal is
largely derived from the CGM Open consortium's CGM model (see
http://www.cgmopen.org/) and the ATA (Air Transport
Association) CGM profile, the latter simplified to conform to
existing W3C standardisation requirements.
While SVG is a newer and arguably a more flexible scheme than
WebCGM, the latter is almost off-the-shelf. Incorporation of either or
both into future browsers will provide an unprecedented increase in the
quality of web graphics, and hopefully will also see the eventual
disappearance of those tedious, oversized GIFs and JPEGs so popular with
graphics rich websites.
Bitmap Graphics Standards
Another important graphical standard to recently emerge is the
Portable Network Graphics (PNG) standard, designed to replace the GIF87a
and GIF89a standards, without the legal encumbrance of the compression
standard in GIF. PNG is however a much more sophisticated standard than
GIF, and in some respects is considered a replacement for the Adobe
TIFF format.
PNG retains many of the nice properties of GIF: it supports
progressive display during a download, transparency, lossless
compression and 256 colour indexed images. However, it also
provides new features, such as 48 bit per pixel true colour, 16 bit per
pixel grayscale, a per pixel alpha channel for transparency information,
embedded gamma correction to accommodate arbitrary displays, reliable
error detection using a 32-bit CRC code, and faster initial rendering
than a GIF.
In many respects, PNG fills the gap between the legacy
technology GIF, evolved in the era of 8-bit graphics adaptors, and JPEG,
which as a lossy compression scheme frequently damages fine detail in
bitmap images. Web page designers who are fussy about bitmap image
quality, this writer included, have long suffered the indignity of
putting up with either miserable colour palettes in GIFs or loss of
sharpness in JPEGs. PNG provides a TIFF-like large colour palette, but
does so with lossless compression, so high image quality is always retained.
Markup Languages - HTML 4.01
The mainstay of the current web is the trusty Hyper Text
Markup Language, or HTML. A number of HTML versions remain in use, with
sites being written around versions 1, 2 and 3. The latest incarnation
of basic HTML is HTML 4.01, yet another incrementally improved variant
of the basic product.
Like earlier versions of HTML, HTML 4 is built around the well
loved mechanisms of the HTTP protocol, hypertext, and the Uniform
Resource Identifier (URL for traditionalists). While earlier
standardisation efforts were aimed primarily at forcing
interoperability, HTML 4 adds numerous pieces of extra functionality:
-
style sheets.
-
scripting.
-
frames.
-
embedding objects.
-
text read in directions other than left to right.
-
better table support.
-
better forms support.
-
ISO/IEC 10646 standard support for international character
sets.
The extensions to the basic HTML we love and know so well are
quite extensive in a number of areas.
The changes to the table scheme are built around the IETF RFC
1942 model, and are designed to add column groups and column widths, the
latter to allow display to begin as the table data is received by a browser.
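As an illustration, a hypothetical table using the new column group markup might look like the following; the column widths are invented, and it is these declared widths which allow a browser to begin laying out the table before all of its data has arrived.

```html
<TABLE border="1">
  <COLGROUP>
    <COL width="120">   <!-- fixed width in pixels -->
    <COL width="2*">    <!-- relative width -->
  </COLGROUP>
  <TR><TH>Format</TH><TH>Type</TH></TR>
  <TR><TD>PNG</TD><TD>bitmap</TD></TR>
  <TR><TD>SVG</TD><TD>vector</TD></TR>
</TABLE>
```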
The IMG and APPLET elements in earlier HTML are now replaced
by the generic OBJECT, which is intended for displaying images, video,
sound, mathematical equations, and includes provisions for alternate
renderings where the browser cannot handle the intended rendering.
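A sketch of that fallback behaviour, using an invented file name; the browser renders the image if it can, and falls back to the nested content otherwise.

```html
<OBJECT data="clock.png" type="image/png">
  <!-- rendered only if the browser cannot display the image -->
  An analogue clock showing the current time.
</OBJECT>
```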
A big enhancement in HTML 4 is the adoption of a style sheet
mechanism to control the layout of a document. Style sheets will be
either embedded in the HTML document, or provided via an external style
sheet document, and will cover elements or groups of elements in an HTML
document. With a style sheet mechanism, HTML authors will be able to
locally and globally control attributes such as font information,
alignment and colours in an HTML document.
Scripting support is improved in HTML 4, the intent being to
allow for dynamic web page forms which adaptively change as the reader
types responses into them.
In perspective, the biggest gain to be seen from HTML4 is the
style sheet mechanism, since it provides a means of putting some
consistency into website presentation. This has been, arguably, the
greatest weakness of the HTML markup mechanism to date.
XHTML 1.0 - The Extensible HyperText Markup Language
The next step in the evolution of HTML is XHTML, which in the
simplest of terms is a reformulation of the HTML 4 standard in XML. The
long term aim of the W3C standards community is a transition to XML, and
XHTML is an important transitional phase in this process, since it provides a
bridge between what will be established HTML 4 applications and the
future generation of XML based browsers and production tools.
XML is a more powerful SGML derivative than the relatively
lightweight HTML, itself also derived from SGML. SGML is considered to
be extremely complex and powerful, and its complexity has largely been
the reason why it has not been widely adopted in practical tools.
HTML, as a minimal subset of SGML, could not keep up with the
expectations of web users, and the strategy of migrating to XML is
intended mainly to bypass the plague of proprietary enhancements to
HTML, which has been a feature of the web in recent years. By providing
a standard which is powerful enough to make all proprietary HTML
variants irrelevant, the W3C aims to discourage proprietary players from
contaminating the standard with incompatible modifications.
One of the basic aims of the XML standard is to make it easy
to define new types of markup, and XHTML is intended to allow
straightforward inclusion of XML additions into XHTML documents. Another
important aim is to provide mechanisms which allow web servers to
optimise the presentation of web site content, depending upon the type
of browser and display device being used to access it.
The XHTML proposal describes three categories of compliance
for a document: Strict, where only mandatory features are used, and
Transitional and Frameset, which correspond to their HTML 4
equivalents. The XML namespace mechanism will be supported.
The XHTML proposal also details important differences from
HTML4:
-
Documents must be well formed, i.e. closing tags must be
used, elements must be properly nested, and it is not permitted to
overlap elements in the manner tolerated by many browsers (i.e. sloppy
HTML syntax).
-
Element and attribute names, i.e. tags, must be in lower
case, since XML is case sensitive. In XML, <li> and <LI>
amount to different things.
-
End tags are mandatory, and the common HTML practice of
implied end tags is not permitted in XHTML. The classical
<p>blah<p>blah<p>blah construct is illegal and must
be presented as
<p>blah</p><p>blah</p><p>blah</p>.
-
All attribute values must be quoted. The example cited is
<td rowspan="3"> against <td rowspan=3>.
-
Attribute minimisation is not permitted. Constructs like <dl
compact> must be written out in full, i.e. <dl compact="compact">.
-
Empty elements must be explicitly terminated, either with an
end tag or the shorthand form, e.g. <br/><hr/> against the
illegal <br><hr>.
-
Leading and trailing whitespace is stripped from
attribute values.
-
An element associated with a script or style should be
declared as having #PCDATA content.
-
The SGML exclusion mechanism is not supported in XHTML.
-
Fragment identifier naming is changed. In HTML 4, elements
of types a, applet, form, frame, iframe, img and map carry the
name attribute. In XHTML, the id attribute is used as the fragment
identifier instead, to comply with XML syntax.
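Pulling these rules together, a minimal well formed XHTML 1.0 Strict document might look like the following sketch; note the XML declaration, the namespace attribute, the lower case tags and the explicitly closed empty element. The title and text are invented for illustration.

```html
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>A minimal XHTML page</title>
  </head>
  <body>
    <p>Well formed, lower case, explicitly closed.<br /></p>
  </body>
</html>
```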
The XHTML proposal provides guidelines for making XHTML documents
viewable using HTML compliant browsers. A number of syntactic tricks
are applied, to accommodate problem areas such as empty elements,
element minimisation, embedding style, line breaks, fragment
identifiers, character encoding, boolean attributes, etc. Careful use of
the syntax will allow the creation of documents which are XHTML
compliant, yet also compatible with most HTML 4, and possibly earlier,
HTML browsers.
Document Object Model
The Document Object Model (DOM) is intended to provide an
interface mechanism through which programs and scripts can dynamically
access and update the content, structure and style of documents. The DOM
is to be independent of both the platform and the programming language used.
The model is to incorporate a family of core interfaces
used to create and manipulate the structure and contents of a
document, but also optional modules with interfaces aimed at supporting
XML, HTML, generic style sheets and Cascading Style Sheets.
Level 2 of the DOM specification is currently in the
midst of some argument over the mechanism for handling namespace URIs;
given the complexity and ambitious aims of the DOM, this should come as
no surprise.
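To give a flavour of what DOM scripting looks like, the following hypothetical page uses the core interfaces to graft a new node into the document tree; the element id and the text content are invented for illustration.

```html
<html>
  <body>
    <ul id="readinglist">
      <li>HTML 4.01 specification</li>
    </ul>
    <script type="text/javascript">
      // Build a new node via the core DOM interfaces and
      // append it to the existing list in the document tree.
      var item = document.createElement("li");
      item.appendChild(document.createTextNode("XHTML 1.0 specification"));
      document.getElementById("readinglist").appendChild(item);
    </script>
  </body>
</html>
```

The same interfaces are intended to work identically from any language with a DOM binding, not just from a browser's script engine.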
Cascading Style Sheets
The Cascading Style Sheets (CSS) specification is one of
the most interesting and potentially useful ideas in the new crop of web
standard proposals.
The intent of CSS, currently at Level 2 (CSS2), is to
wholly divorce the presentation style of a web document from its
content. Features are to include content positioning, downloadable
fonts, table layout, features for internationalisation, automatic
counters and numbering, in addition to support for visual browsers,
aural devices, printers, braille devices, and handheld devices. An
inheritance mechanism allows style properties to propagate from
ancestor elements in the document tree to their descendants.
Like other proposal specifications in this crop, CSS is
both ambitious and far reaching, and is designed to fit with HTML4,
XHTML and XML. In many respects, it aims to emulate the well established
LaTeX style file model, but also aims to accommodate greatly differing
media.
The basic model for CSS processing is that an agent such
as a browser reads in a document, parses its content and creates a tree
structure to describe it. It identifies the intended media for the
document, and finds all of the style sheets either embedded in the
document or pointed to. Every element found in the document tree
structure will have particular properties for the presentation medium in
question, and each of these is assigned values given by cascading and
inheritance rules, and the contents of the style sheets. Given this
information, the agent then generates a formatting structure, to
describe exactly how it will render the document, and then renders the
document.
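The cascade and inheritance rules can be sketched with a pair of hypothetical style rules; where two selectors match the same element, the more specific rule wins, and properties not set on an element are inherited from its ancestors in the document tree.

```css
/* a hypothetical fragment of an external style sheet */
BODY    { color: black; font-family: serif } /* inherited by descendants */
P       { color: black }   /* matches every paragraph */
P.aside { color: maroon }  /* more specific, so it wins
                              for <P class="aside"> */
```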
How CSS2 will appear in a HTML document is best
illustrated by purloining the example presented in the CSS2
specification document. The starting point is a tiny but complete HTML
document:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<HTML>
  <HEAD>
    <TITLE>Bach's home page</TITLE>
  </HEAD>
  <BODY>
    <H1>Bach's home page</H1>
    <P>Johann Sebastian Bach was a prolific composer.</P>
  </BODY>
</HTML>
Using CSS2 to import an
external style sheet, we alter the document thus:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<HTML>
  <HEAD>
    <TITLE>Bach's home page</TITLE>
    <LINK rel="stylesheet" href="bach.css" type="text/css">
  </HEAD>
  <BODY>
    <H1>Bach's home page</H1>
    <P>Johann Sebastian Bach was a prolific composer.</P>
  </BODY>
</HTML>
The <LINK rel="stylesheet" href="bach.css" type="text/css">
element identifies the stylesheet, points to its location as bach.css
and identifies its format as text/css. In this manner, the document's
appearance can be globally changed by altering a single line in the
file. We need not be that aggressive, and the CSS2 specification does
allow embedding of local style information (not unlike LaTeX). The cited
example is:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<HTML>
  <HEAD>
    <TITLE>Bach's home page</TITLE>
    <STYLE type="text/css">
      H1 { color: blue }
      BODY {
        font-family: "Gill Sans", sans-serif;
        font-size: 12pt;
        margin: 3em;
        color: red;
      }
    </STYLE>
  </HEAD>
  <BODY>
    <H1>Bach's home page</H1>
    <P>Johann Sebastian Bach was a prolific composer.</P>
  </BODY>
</HTML>
Cascading Style Sheets are likely to become one of the
most popular and widely used features of the new look package of web
standards, and web designers would be well advised to watch developments
in this area very carefully.
Mathematical Markup Language
The Mathematical Markup Language or MathML is clearly an
instance of the web assaulting that previously inviolate domain of LaTeX
and TeX, the markup of mathematical notation and structure.
The intent of MathML is to produce an XML based markup
language for the accurate representation of mathematical notation, which
is human readable, with the assumption that in most instances
conversion tools or WYSIWYG equation editors would be used to generate
the MathML source (we can safely assume that the first conversion tool
will be a LaTeX to MathML translator).
A detailed discussion of MathML syntax is best left to the
W3C website paper (http://www.w3.org/TR/REC-MathML/), but it is
illustrative to again purloin a W3C example from the specification
document:
(a + b)^2
can be represented in MathML as:
<msup>
  <mfenced>
    <mrow>
      <mi>a</mi>
      <mo>+</mo>
      <mi>b</mi>
    </mrow>
  </mfenced>
  <mn>2</mn>
</msup>
Summary
It is clear that the current standards development effort
in markup languages, and associated web oriented vector and bitmap
graphics standards, will transform the web we so love over the next
decade. The process will see a gradual transition to XML, through the
intermediate XHTML, and the proliferation of standards such as CSS2,
MathML, SVG, WebCGM and PNG will see web users enjoying a quality of
presentation and interoperability difficult to imagine with today's
kludged package of standards.
Is there a downside to this process? Arguably yes,
insofar as simple text editor based production of web pages will
become increasingly tricky to get right, as the complexity and syntactic
tightness of these emerging standards asserts itself.
However, the benefits of the new standards clearly
outweigh the drawbacks, and the tired cliche "the best is yet to come"
definitely holds true here.