PLINTH, HTML and the WWW: The World Wide Web

2. The World Wide Web

The World Wide Web (WWW) [5] began as a project at CERN in Switzerland to make sharing documentation between their research partners easy and efficient. In the last couple of years, however, it has grown to truly world-wide proportions, and continues to grow at a phenomenal rate, with dozens of new academic, commercial and governmental WWW sites appearing every day at the time of writing.

2.1 How does the WWW work?

The WWW is essentially a distributed hypermedia system which uses the vehicle of the Internet, the vast international network of networks made up of hundreds of thousands of computers and accessible to millions of users. The basic idea is for each WWW site to store its publications, information about its research, academic or commercial interests and activities, and any other 'resources' that might be useful to others, in the form of hypertext documents marked up using HTML (see below). The site then runs a server program which accepts electronic requests for document pages from clients elsewhere on the Internet, using http, the hypertext transfer protocol [6]. The server transfers the requested page to the client, where it can be viewed using a special browser program (e.g. the widely used NCSA Mosaic) that sends http requests and displays the results. Each retrieved page may have embedded pointers to other pages, and activating one of these (e.g. by clicking on a highlighted word or phrase in the text with the computer's mouse) causes the client software to send off a new http request, possibly to an entirely different server.
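The request and reply exchange described above can be sketched in a few lines of Python. This is only an illustration: the HTTP/1.0 wire format used here and the made-up server reply are assumptions, not details taken from this paper, and the function names build_request and split_response are hypothetical.

```python
def build_request(path):
    """Return the text of a minimal HTTP/1.0 GET request for one page."""
    lines = [
        f"GET {path} HTTP/1.0",  # request line: method, page, protocol version
        "",                      # blank line marks the end of the request
        "",
    ]
    return "\r\n".join(lines)


def split_response(raw):
    """Split a raw server reply into (status line, headers dict, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers, body


# The client would open a connection to the server and send this text.
request = build_request("/~andrewc/home.html")

# A made-up reply, standing in for what a server might send back.
reply = (
    "HTTP/1.0 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<TITLE>Home</TITLE>"
)
status, headers, body = split_response(reply)
```

A real browser would additionally open a network connection to the server named in the link, which is left out here so that the sketch runs without network access.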

2.2 HTML: the hypertext markup language

HTML is the standard ASCII markup language for WWW hypertext documents, and all WWW browsers must be able to handle it. HTML provides a simple format for describing interlinked information resources, by means of tags embedded in texts, and it can be used to represent structured documents with inlined graphics, menus of options, hypertext views of existing bodies of online information, database query results, and many other types of information [7].

HTML is defined in terms of the ISO Standard Generalised Markup Language (SGML) [8,9], which is a language for describing structured document markup schemes. Every SGML document starts with a header consisting of a declaration and prologue, which describe both what constitutes an element in the ASCII data that follows, and how those elements may legally be combined (i.e. the 'morphology' and 'syntax' of the data and markup tags, to use a linguistic metaphor). An SGML parser program will check that the following text and markup tag sequence (the document instance) is well-formed with respect to the initial header. However, since all HTML documents share the same SGML header (by definition), this does not need to be included in every HTML resource on the WWW, or transmitted as part of every http transaction; instead it can be built into the client browsing software.
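To give a feel for what checking a document instance against rules for combining elements involves, here is a toy sketch in Python. It is not an SGML parser, and the ALLOWED table is a hypothetical, much-simplified set of rules invented for illustration rather than the real HTML definition.

```python
# Hypothetical content rules: parent element -> permitted children.
# "#text" stands for character data. This is NOT the real HTML definition.
ALLOWED = {
    "html": {"head", "body"},
    "head": {"title"},
    "title": {"#text"},
    "body": {"h1", "p", "ul"},
    "h1": {"#text"},
    "p": {"#text", "em"},
    "em": {"#text"},
    "ul": {"li"},
    "li": {"#text", "em"},
}


def check(element):
    """Return a list of rule violations for a (name, children) tree."""
    name, children = element
    problems = []
    for child in children:
        child_name = "#text" if isinstance(child, str) else child[0]
        if child_name not in ALLOWED.get(name, set()):
            problems.append(f"<{child_name}> not allowed inside <{name}>")
        if not isinstance(child, str):
            problems.extend(check(child))  # recurse into nested elements
    return problems


good = ("html", [
    ("head", [("title", ["Demo"])]),
    ("body", [("p", ["Hello ", ("em", ["world"])])]),
])
bad = ("html", [("body", [("li", ["stray item"])])])

good_problems = check(good)
bad_problems = check(bad)
```

Because the rule table is shared by every document, it can live in the checking program itself rather than travel with each document, which mirrors the point made above about building the common header into the browser.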

It is important to note that HTML markup is not intended to describe the exact visual appearance of a hypertext document (e.g. specific fonts, point sizes, paragraph layouts and margin widths). Rather it describes the logical structure of the document (e.g. emphasised text, headings, bulleted lists), and it is up to HTML browsing software like NCSA Mosaic to decide how best to present this to the user.
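The separation of logical structure from presentation can be illustrated with a small Python sketch in which the same markup is rendered under two different style tables. The TextRenderer class and both style tables are hypothetical; they are not how NCSA Mosaic or any other browser is actually implemented.

```python
from html.parser import HTMLParser


class TextRenderer(HTMLParser):
    """Render a few logical elements using a caller-supplied style table."""

    def __init__(self, styles):
        super().__init__()
        self.styles = styles  # maps tag name -> function(text) -> text
        self.stack = []       # currently open tags, innermost last
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Apply the style of the innermost open tag, if one is defined.
        tag = self.stack[-1] if self.stack else None
        style = self.styles.get(tag, lambda text: text)
        self.out.append(style(data))


def render(markup, styles):
    renderer = TextRenderer(styles)
    renderer.feed(markup)
    renderer.close()
    return "".join(renderer.out)


markup = "<h1>Results</h1> The test <em>failed</em>."

# Two hypothetical presentation policies for the same logical structure.
style_a = {"h1": str.upper, "em": lambda t: f"*{t}*"}
style_b = {"h1": lambda t: f"== {t} ==", "em": lambda t: f"_{t}_"}

print(render(markup, style_a))
print(render(markup, style_b))
```

The markup never changes; only the table that decides how a heading or an emphasised phrase should look does.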

In any case, the most important feature of HTML is not its ability to describe the internal structure of a document, but its facility to embed cross-references to other WWW resources. This is done by means of link anchor tags, which specify the target resource by means of a universal resource locator, or URL. Full details of URL syntax are available in [10], but basically a URL has the form:


        protocol://location/file#destination

where protocol is http, ftp, or gopher (see 2.3 below), location is the Internet machine address (e.g. www.aiai.ed.ac.uk), file is the pathname of the resource on the remote machine (e.g. /~andrewc/home.html) and the optional destination is the name of a source anchor tag marking a point within the target file.
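As a rough illustration, the four parts named above can be pulled out of a URL with Python's standard urllib.parse module. The function name split_url and the destination name section2 are made up for this example; the machine address and pathname are the ones quoted in the text.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Break a URL into the four parts used in the description above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "location": parts.netloc,
        "file": parts.path,
        # The destination is optional, so report None when it is absent.
        "destination": parts.fragment or None,
    }


with_destination = split_url(
    "http://www.aiai.ed.ac.uk/~andrewc/home.html#section2"
)
without_destination = split_url(
    "http://www.aiai.ed.ac.uk/~andrewc/home.html"
)
```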

Again, it is up to the WWW browsing software to decide what to do with this information. NCSA Mosaic, for example, highlights the word or phrase with which the link anchor is associated, and responds to a mouse click on the highlight by loading the resource and displaying it, whereupon the process may be repeated for links from the new page.
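A browser must first find the link anchors in a page before it can highlight them. The following Python sketch shows one way this might be done with the standard html.parser module. The AnchorCollector class is hypothetical, the sample page is invented, and ftp.example.org is a placeholder address.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (URL, highlighted text) pairs from link anchors in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # URL of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from HTMLParser.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


page = (
    'See the <A HREF="http://www.aiai.ed.ac.uk/~andrewc/home.html">'
    'home page</A> or the '
    '<A HREF="ftp://ftp.example.org/pub/README">README file</A>.'
)

collector = AnchorCollector()
collector.feed(page)
collector.close()
links = collector.links
```

Each pair gives the browser what it needs: the phrase to highlight and the resource to load when the user selects it.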

2.3 Other WWW data formats and access protocols

The resources stored at a WWW site may not all be in HTML format, and the http server can handle files of many different types, including plain (unformatted) text, PostScript ready for laser-printing, diagrams and pictures in X bitmap, gif and jpeg encoding, mpeg movies, and various audio formats. The server detects the file type automatically and reports it to the client, using the MIME (Multipurpose Internet Mail Extensions) [13,14] resource classification, before sending the data itself.
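How a server might classify a file before sending it can be sketched as a simple lookup in Python. Deciding the type from the filename extension is an assumption made for this example, since the text says only that detection is automatic. The table is illustrative, and the function name content_type_header is hypothetical.

```python
import os

# Illustrative extension-to-MIME-type table covering the formats listed above.
MIME_TYPES = {
    ".html": "text/html",
    ".txt": "text/plain",
    ".ps": "application/postscript",
    ".xbm": "image/x-xbitmap",
    ".gif": "image/gif",
    ".jpeg": "image/jpeg",
    ".mpeg": "video/mpeg",
}


def content_type_header(filename):
    """Return the header line a server could send ahead of the file data."""
    extension = os.path.splitext(filename)[1].lower()
    # Fall back to a generic binary type for anything unrecognised.
    mime_type = MIME_TYPES.get(extension, "application/octet-stream")
    return f"Content-Type: {mime_type}"


postscript_header = content_type_header("paper.ps")
gif_header = content_type_header("LOGO.GIF")
unknown_header = content_type_header("data.bin")
```

On receiving such a header, the client can choose between displaying the data itself and handing it to a suitable external viewer.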

In addition, there is a huge amount of information on the Internet that is available not through http servers, but through other data transfer protocols and programs such as ftp and gopher, which have been around much longer than the WWW. Therefore ftp and gopher access have been integrated into the WWW data model, as we saw in the description of URLs above.
