
Character Content

An HTML user agent should present the body of an HTML document as a collection of typeset paragraphs and preformatted text. Except for the PRE element, each block structuring element is treated as a paragraph: the user agent takes the data characters in the element's content and in the content of its descendant elements, concatenates them, and splits the result into words separated by space, tab, or record end characters (and perhaps hyphen characters). The sequence of words is then typeset as a paragraph by breaking it into lines.
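The paragraph model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed algorithm: the function name `typeset_paragraph`, the `width` parameter, and the greedy line-filling strategy are all hypothetical, and both CR and LF are treated as record end characters. Hyphen handling is left out.

```python
import re

# Word separators named in the text: space, tab, and record end.
# Assumption: both CR and LF count as record end characters.
SEPARATORS = re.compile(r"[ \t\r\n]+")


def typeset_paragraph(data_chunks, width=40):
    """Concatenate data characters, split them into words, and break
    the word sequence into lines of at most `width` characters.

    `data_chunks` stands for the data characters of an element and of
    its descendants, in document order (hypothetical input shape).
    """
    text = "".join(data_chunks)
    words = [w for w in SEPARATORS.split(text) if w]

    lines, current = [], ""
    for word in words:
        if not current:
            current = word
        elif len(current) + 1 + len(word) <= width:
            current += " " + word  # word still fits on the current line
        else:
            lines.append(current)  # line is full, start a new one
            current = word
    if current:
        lines.append(current)
    return lines
```

For instance, `typeset_paragraph(["An HTML\tuser ", "agent\r\nshould"], width=12)` yields the two lines `"An HTML user"` and `"agent should"`: the original tab and record end are not preserved, only the word sequence is. A word longer than `width` is placed on a line of its own in this sketch.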

The ISO Latin 1 Character Repertoire

The minimum character repertoire supported by all conforming HTML user agents is Latin Alphabet Nr. 1, or simply Latin-1. Latin-1 includes characters from most Western European languages, as well as a number of control characters. Latin-1 also includes a non-breaking space, a soft hyphen indicator, 93 graphical characters, 8 unassigned characters, and 25 control characters.


In SGML applications, the use of control characters is limited in order to maximize the chance of successful interchange over heterogeneous networks and operating systems. In HTML, only three control characters are allowed: Horizontal Tab (HT, 9 decimal), Carriage Return (CR, 13 decimal), and Line Feed (LF, 10 decimal), as encoded in US-ASCII and ISO-8859-1.
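A checker for this restriction might look like the following sketch. The function name `disallowed_controls` is hypothetical, and the choice of which code positions count as control characters (0 to 31 and 127 to 159) is an assumption made for illustration; only the three allowed positions 9, 10, and 13 come from the rule above.

```python
# Code positions of the three allowed control characters: HT, LF, CR.
ALLOWED_CONTROLS = {9, 10, 13}


def disallowed_controls(text):
    """Return (offset, code) pairs for control characters that are
    not allowed in HTML.

    Assumption: positions 0-31 and 127-159 are treated as control
    characters; everything else is treated as graphical.
    """
    found = []
    for offset, ch in enumerate(text):
        code = ord(ch)
        is_control = code < 32 or 127 <= code <= 159
        if is_control and code not in ALLOWED_CONTROLS:
            found.append((offset, code))
    return found
```

A document containing only tabs, carriage returns, and line feeds as control characters produces an empty list, while a form feed or NUL is reported with its offset and code position.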

The HTML DTD references the Added Latin 1 entity set, to allow mnemonic representation of Latin 1 characters using only the widely supported ASCII character repertoire. For example:

Kurt G&ouml;del was a famous logician and mathematician.
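The mapping between entity names and Latin 1 characters can be tried out with Python's standard `html` module, as in the sketch below. Note the assumptions: Python's entity tables cover a larger set than "Added Latin 1", the helper `to_ascii_entities` is a hypothetical name, and it rewrites only non-ASCII characters (it does not escape markup characters such as `&` or `<`).

```python
from html import unescape
from html.entities import codepoint2name


def to_ascii_entities(text):
    """Rewrite non-ASCII characters as entity references so the
    result uses only the ASCII repertoire (hypothetical helper)."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)  # already ASCII, keep as-is
        elif code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")  # mnemonic name
        else:
            out.append("&#%d;" % code)  # fall back to a numeric reference
    return "".join(out)


# Resolving the entity in the example sentence gives the character
# at code position 246 (o with diaeresis).
resolved = unescape("Kurt G&ouml;del was a famous logician and mathematician.")
ascii_only = to_ascii_entities(resolved)
```

Here `resolved` contains the single character at code position 246 in place of `&ouml;`, and `ascii_only` round-trips back to the original ASCII-only sentence.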

See section ISO Latin 1 Character Entity Set for a table of the "Added Latin 1" entities, and section The ISO-8859-1 Coded Character Set for a table of the code positions of ISO-8859-1.
