Where would you like to go tomorrow?™

HTML Pro DTD

The HyperText Markup Language composite Document Type Description

DRAFT version 0 revision 8

The HTML composite DTD is Copyleft (c) 1996 by Silmaril Consultants and is protected by the terms of the GNU General Public License, a copy of which is included in the distribution of this software as file gnugpl.html. You may freely distribute and use this material, provided that nothing is done to prevent its further distribution or use. It may not be distributed without a similar condition, including this condition, being imposed on the subsequent user. Modifications should be reported to the author.

This document describes version 0.8.


Background

The first HTML Document Type Description was devised to support the original versions of World-Wide Web software in the early 1990s. A revised and more widely-debated version was codified as HTML 2.0 by the Internet Engineering Task Force (IETF)'s Working Group on HTML, and adopted by them as a draft standard, RFC1866, in November 1995.

A more advanced but experimental version, HTML+ (now obsolete), had been under discussion for several years, and many of its features were republished as HTML3 in an Internet Draft (March 1995). On expiry, this draft was taken back into consideration by the World-Wide Web Consortium (W3C), which had in the meantime succeeded the IETF as principal motive force in HTML development.

The W3C has attempted to reconcile the sometimes conflicting aspirations of some of its members by publishing two experimental versions, HTML 3.2 (largely HTML 2.0 with the addition of stylesheet and scripting support), and Cougar (a less structured version of HTML3) in May and July 1996 respectively.

Companies implementing Web software, particularly browsers, and individuals implementing HTML pages have throughout this period continually sought additional markup facilities for their own purposes. In some cases element names were simply invented on an ad hoc basis, with no attempt to define when or how they can be used; in other cases, market position was available as leverage to add support for them ahead of their inclusion in a DTD. While this is understandable from a marketing point of view, it has led to a further Balkanization of the DTD position, with separate versions describing different implementations by Microsoft, Sun, and Netscape, as well as the implementation of more conservative and practical versions by SoftQuad and other editor makers.

The edition of HTML included here, codenamed Aardvark, is a composite of all known versions, containing all the elements published in the various forms of HTML in the last five years, in a manner which can be used by editors, browsers, parsers, databases, search engines, formatters, and any other conforming application of ISO 8879 - Standard Generalized Markup Language (SGML, the language in which HTML is written).

Versioning

The versions of HTML included in this edition were taken from public copies of DTDs and fragments on the Web:

  • HTML (CERN)
  • HTML 2.0 (RFC1866, November 1995)
  • HTML3 (Internet Draft, March 1995)
  • HTML 3.2 (W3C, May 1996)
  • Cougar (W3C, July 1996)
  • Microsoft Internet Explorer DTD (March 1996)
  • Netscape Navigator DTD (October 1994)
  • SoftQuad HoTMetaL Pro 2.0 DTD (1995)
  • Form-based file uploads (RFC1867)
  • HTML Tables (RFC1942)

    My thanks to all the many authors and contributors to these DTDs, whose notes and comments have made it easier to work out what to do.

    The current version of this DTD is 0.8. It is released for comment, with discussion among the relevant interested parties to be conducted on the mailing list.

    Changes made

    Very few changes to structure have been made, although those familiar with the internals of previous versions will notice the large amount of additional material from the other versions.

    The major change is to the way in which content models are used. The established mechanism was for body.content to contain both structural and descriptive elements as peers, so that there was no distinction between, say, a list and some text in italics. The implication was that inline markup, which was identified as flow in earlier versions, should be contained in a paragraph element, and this is in fact what SoftQuad's HoTMetaL does when it imports existing HTML.

    There are now four classes of elements, defined by the parameter entities following:

    Parameter entity, followed by the elements represented:

    structure
        Elements which are generally used to contain the continuous text of the document:
        DIV, H1 to H6, P, UL, OL, DL, LI, LH, DIR, MENU, PRE, BLOCKQUOTE, BQ, FORM, TABLE, ADDRESS, BDO, FIG, CENTER, XMP, and LISTING

    insertions
        Elements which usually contain special-purpose material, or no text material at all:
        BASEFONT, APPLET, OBJECT, SCRIPT, MAP, MARQUEE, HR, ISINDEX, and BGSOUND

    text
        Elements which directly hold text:
          • Descriptive or analytic markup: EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE, Q, LANG, AU, PERSON, ACRONYM, ABBREV, INS, DEL, and SPAN
          • Visual markup: S, STRIKE, I, B, TT, U, NOBR, WBR, BR, BIG, SMALL, FONT, STYLE, BLINK, and TAB
          • Hypertext and graphics: A and IMG
          • Mathematical: SUB, SUP, and MATH
          • Documentary: COMMENT, ENTITY, ELEMENT, and ATTRIB

    formula
        Mathematical content:
        BOX, ABOVE, BELOW, VEC, BAR, DOT, DDOT, HAT, TILDE, ROOT, SQRT, ARRAY, SUB, SUP, B, I, T, and BT

    The most significant distinction is that the insertions class can be peer with text as well as with structure, whereas text can nest only within structure.
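
    The distinction can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the prose, not a rendering of the DTD's actual content models: the function names are invented for the example, the formula class is left out, and per-element exceptions (such as the mixed content permitted in list items) are ignored.

```python
# Element classes transcribed from the four classes listed above;
# "H1 to H6" is expanded. The formula class is omitted for brevity.
STRUCTURE = frozenset(
    "DIV H1 H2 H3 H4 H5 H6 P UL OL DL LI LH DIR MENU PRE BLOCKQUOTE BQ "
    "FORM TABLE ADDRESS BDO FIG CENTER XMP LISTING".split()
)
INSERTIONS = frozenset(
    "BASEFONT APPLET OBJECT SCRIPT MAP MARQUEE HR ISINDEX BGSOUND".split()
)
TEXT = frozenset(
    "EM STRONG DFN CODE SAMP KBD VAR CITE Q LANG AU PERSON ACRONYM ABBREV "
    "INS DEL SPAN S STRIKE I B TT U NOBR WBR BR BIG SMALL FONT STYLE BLINK "
    "TAB A IMG SUB SUP MATH COMMENT ENTITY ELEMENT ATTRIB".split()
)


def peer_of_structure(name: str) -> bool:
    """True if the element may sit alongside structural elements."""
    return name.upper() in STRUCTURE | INSERTIONS


def peer_of_text(name: str) -> bool:
    """True if the element may sit alongside text inside a structural element."""
    return name.upper() in TEXT | INSERTIONS
```

    Under these assumptions HR is accepted in both positions, EM only as a peer of text, and UL only as a peer of structure.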

    Some changes have been made to the content models of the text elements to accommodate this, but the most noticeable are SUP and SUB, which may now contain math elements even when used in non-math mode. The exclusion exceptions for MATH have been significantly reduced because it would appear unobjectionable for math to contain many of the elements previously proscribed.

    The exclusion exceptions for PRE now operate for TT and BR, as neither has any relevance in preformatted fixed-width material, but SUB and SUP are now permitted, as it seems perfectly reasonable that an author might want to represent typewritten material containing subscripts and superscripts.
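
    An SGML exclusion applies to every descendant of the element that declares it, not only to its direct children. The following Python sketch shows roughly what that means for PRE, using the standard html.parser module as a stand-in for a real SGML parser. The EXCLUSIONS and EMPTY tables and the class name are assumptions made for the example; they are not taken from the DTD file, and the end-tag handling is deliberately simplistic.

```python
from html.parser import HTMLParser

# Assumed for illustration: only the PRE exclusions described above.
EXCLUSIONS = {"PRE": {"TT", "BR"}}
# Elements with no end tag, so they are never pushed on the stack.
EMPTY = {"BR", "HR", "IMG"}


class ExclusionChecker(HTMLParser):
    """Record (ancestor, element) pairs that violate an exclusion."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.violations = []

    def handle_starttag(self, tag, attrs):
        name = tag.upper()
        # An exclusion applies at any depth below the declaring element.
        for ancestor in self.open_elements:
            if name in EXCLUSIONS.get(ancestor, ()):
                self.violations.append((ancestor, name))
        if name not in EMPTY:
            self.open_elements.append(name)

    def handle_endtag(self, tag):
        name = tag.upper()
        if name in self.open_elements:
            # Close everything back to the most recent matching start tag.
            while self.open_elements.pop() != name:
                pass


def find_violations(markup: str):
    checker = ExclusionChecker()
    checker.feed(markup)
    checker.close()
    return checker.violations
```

    For example, find_violations("<pre>H<sub>2</sub>O <tt>x</tt><br></pre>") reports TT and BR inside PRE but accepts SUB.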

    The infamous problem of mixed content in list items has been tackled head on by simply permitting it: text data is allowed as well as paragraph-level markup. This may cause some less well-endowed editors a little grief, but the advantages of being able to tweak the performance characteristics of various browsers are too great to pass up.

    The ICADD fixed attributes which were added by the late and very much missed Yuri Rubinsky have been reinserted for those elements to which they were attached in RFC1866.

    The astute reader will have noted the new elements ELEMENT and ATTRIB, which I have added to this version to test their use in documentation, as they accompany the existing (although otherwise intentioned) ENTITY and COMMENT. It is not the intention that they should remain past v1r0 unless a formal approach is made to the then controllers of the HTML standard.

    There is one new attribute, ROLE on MATH, which can be INLINE or DISPLAY. This corresponds with established TeX usage. See below for details of the character entity files referenced: there is as yet no inclusion of ISOams* (although that can easily be done), as I want to discuss the math aspects with the experts to see if it is better (given HTML's limited math model) to use only those entities referring to the TeX-defined symbols, or if the whole thing should be replaced with ISO 12083.
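
    For readers less familiar with TeX, the two ROLE values line up with plain TeX's inline and display math modes. The small Python sketch below makes the correspondence concrete; the function name and the choice of the plain TeX dollar delimiters are assumptions made for the example, and nothing here describes how any browser or formatter actually processes MATH.

```python
def tex_wrap(role: str, body: str) -> str:
    """Wrap a TeX math body in the delimiters matching a MATH ROLE value."""
    role = role.upper()
    if role == "INLINE":
        return f"${body}$"      # inline math, set within the running text
    if role == "DISPLAY":
        return f"$${body}$$"    # display math, set on a line of its own
    raise ValueError(f"ROLE must be INLINE or DISPLAY, not {role!r}")
```

    So tex_wrap("INLINE", "x^2") gives $x^2$, and the same body with DISPLAY gives $$x^2$$.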

    The HTML3 concept of HTML-specific character entity files has been ditched, and this version includes the whole of ISOlat1, ISOlat2, ISOnum, ISOpub and ISOtech. This should have been done years ago, but browser authors are understandably wary of the font problems involved.

    Outstanding items

    The machine-generated status of the DTD file means that there was a substantial amount of legacy comment from the assorted versions used in compositing the DTD. The majority of this has been edited out so that irrelevant and obsolete parameter entities are not left to confuse the unwary reader, but users who locate undetected residua are asked to report them.

    The elements for which no ICADD fixed attribute exists need analysing and the relevant values adding from the International Committee for Accessible Document Design DTD ("-//EC-USA-CDA/ICADD//DTD ICADD22//EN").

    There are undoubtedly errors, both of omission and of commission, and I would be very grateful for details.

    Peter Flynn