SGML, HTML, DHTML, XML, XHTML, etc.



But computers don't know the difference between content and presentation.

Computers don't even know the difference between letters and numbers.

Computers process all text - letters or numbers - as series of binary numerical codes - 1s and 0s. When a computer writes the letter 'A' onto your hard drive, it doesn't store an image of the letter 'A'; it writes a series of 1s and 0s that represents the letter 'A' in a code table. When your computer "reads" the letter 'A' from your hard drive, it actually reads that series of 1s and 0s and then consults a font file to select the character shape of 'A' that it shows on the monitor.
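A minimal Python sketch of that lookup. It assumes the code table built into Python's ord and chr functions, in which 'A' has the same value as in ASCII; the variable names are only illustrative.

```python
# Look up the numeric code a computer stores for the letter 'A'.
letter = "A"
code = ord(letter)            # position of 'A' in the code table
bits = format(code, "08b")    # the same number written as eight binary digits

print(code)   # 65
print(bits)   # 01000001

# Reading reverses the lookup: the bits are mapped back to a character.
# Drawing the character's shape on screen is a separate step done with a font.
restored = chr(int(bits, 2))
print(restored)
```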

Bob Bemer developed the American Standard Code for Information Interchange, ASCII. In 1960, there was no such standardization. IBM's equipment alone used nine different character sets. "They were starting to talk about families of computers, which need to communicate. I said, 'Hey, you can't even talk to each other, let alone the outside world,'" says Bemer, who worked at IBM from 1956 to 1962.

Bob Bemer's home page

ASCII is a seven-bit code whose 128 values, numbered 0 through 127, are assigned to letters, digits, punctuation marks, control codes, and the most common special characters. "Extended ASCII" is not a single standard: various eight-bit character sets keep the ASCII values and use the further 128 values, 128 through 255, for additional special, mathematical, graphic, and non-English characters.
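The two ranges can be checked with a short Python sketch. Latin-1 is used here only as one example of an eight-bit extension; that choice, the sample string, and the variable names are assumptions, not something the text above prescribes.

```python
# Every ASCII character has a code from 0 through 127, so it fits in seven bits.
text = "Markup, 1986!"
codes = [ord(ch) for ch in text]
all_seven_bit = all(c <= 127 for c in codes)

# 'é' (written as an escape) is not in ASCII; Latin-1, one eight-bit
# extension, gives it a code in the 128 through 255 range.
e_acute = "\u00e9".encode("latin-1")[0]

# Encoding the same character as plain ASCII fails.
try:
    "\u00e9".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(codes)
print(all_seven_bit, e_acute, ascii_ok)
```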

 

UNICODE

During the 1980s, researchers at Xerox began mapping every character to a 16-bit code. They developed a "unique, universal and uniform character encoding" - UNICODE.

universal - encompasses all world languages
uniform - fixed-width codes
unique - each bit sequence has only one interpretation

Unicode provides a consistent way of encoding multilingual text and helps the exchange of text files internationally. The design of Unicode is based on the simplicity and consistency of ASCII, but goes far beyond ASCII's limited ability to encode only the Latin alphabet. The Unicode Standard provides the capacity to encode all of the characters used for the written languages of the world. To keep character coding simple and efficient, the Unicode Standard assigns each character a unique numeric value and name.

The original goal was to use a single 16-bit encoding that provides code points for more than 65,000 characters. While 65,000 characters are sufficient for encoding most of the many thousands of characters used in major languages of the world, the Unicode standard and ISO/IEC 10646 now support three encoding forms (UTF-8, UTF-16, and UTF-32) that use a common repertoire of characters but allow for encoding as many as a million more characters. This is sufficient for all known character encoding requirements, including full coverage of all historic scripts of the world, as well as common notational systems.

Unicode home page
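The difference between a character's code point and its encoded bytes can be seen in a short Python sketch. The sample characters are arbitrary choices, and the big-endian codec variants are used only so that no byte-order mark is added to the output.

```python
# One code point per character, three encoding forms for storing it.
# The last sample lies above U+FFFF, beyond the original 16-bit range.
samples = ["A", "\u00e9", "\u4e2d", "\U0001D11E"]

for ch in samples:
    code_point = ord(ch)
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    utf32 = ch.encode("utf-32-be")
    # Print the code point and the number of bytes each form needs.
    print(f"U+{code_point:04X}", len(utf8), len(utf16), len(utf32))

clef_lengths = tuple(
    len("\U0001D11E".encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")
)
```

A character above U+FFFF no longer fits in one 16-bit unit, so UTF-16 stores it as a pair of units.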

 

Text Markup    <strong>Text Markup</strong>
Text Markup    <strong style="background-color:yellow">Text Markup</strong>

The digital processing of text requires distinguishing the "content" text from flags or signs embedded in the text that signal how the content text should be processed.
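As a rough illustration of that distinction, the sketch below uses the HTMLParser class from Python's standard library to pull the markup and the content text apart. The MarkupSplitter class is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser


class MarkupSplitter(HTMLParser):
    """Collect markup (tags) and content text in separate lists."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.content = []

    def handle_starttag(self, tag, attrs):
        # Called for each opening tag; attrs is a list of (name, value) pairs.
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        # Called for the text found between tags.
        self.content.append(data)


splitter = MarkupSplitter()
splitter.feed('<strong style="background-color:yellow">Text Markup</strong>')
splitter.close()

print(splitter.tags)     # the signals that say how to process the text
print(splitter.content)  # the content text itself
```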

  • 1967 - William Tunnicliffe distinguished the content of documents from their format at a meeting of the Canadian Government Printing Office.
  • 1969 - IBM researchers invent the Generalized Markup Language (GML).
  • 1978 - An American National Standards Institute working group was formed to provide a format for text interchange and a markup language for future processing. It introduced the new concept of structural markup: titles were marked as <title> rather than <bold> and <center>. Because a title was marked as <title>, database searches could be limited to titles. This was the beginning of the Standard Generalized Markup Language, SGML, which represents the structure of a document.
  • 1980 - First draft of SGML
  • 1986 - SGML approved as ISO international standard 8879

Word processing before Windows and WYSIWYG ("what you see is what you get") editors

The PC-WRITE Example

Computers and Composition 2(4), August 1985, "PC-WRITE: Quality Word Processing at a price that's hard to beat." Bob Wallace, the author of PC-WRITE, has been designing text processors since 1969. In 1978, he joined Microsoft (the company that wrote MS-DOS for IBM) when the company had only ten employees. Five years later, Bob decided to break with Microsoft (which by then had grown to a company of over 300 employees) and establish his own company: Quicksoft.


 

SGML - Standard Generalized Markup Language

SGML differs from other markup languages in that it does not simply indicate where a change of appearance occurs, or where a new element starts. SGML sets out to clearly identify the boundaries of every part of a document. To allow the computer to do as much of the work as possible, SGML requires users to provide a model of the document being produced. This model, called a Document Type Definition (DTD), describes each element of the document in a form that the computer can understand. The DTD shows how the various elements that make up a document relate to one another.
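The idea of a document model can be sketched in Python. This is not DTD syntax and not an SGML parser: the MODEL dictionary, the element names, and the violations function are all hypothetical, and Python's XML parser stands in for an SGML one.

```python
import xml.etree.ElementTree as ET

# Hypothetical document model: each element lists the child elements it may
# contain. A real DTD uses SGML's own declaration syntax; this only mimics it.
MODEL = {
    "memo": {"title", "para"},
    "title": set(),
    "para": {"emph"},
    "emph": set(),
}


def violations(element, model=MODEL):
    """Return (parent, child) pairs that the model does not allow."""
    found = []
    for child in element:
        if child.tag not in model.get(element.tag, set()):
            found.append((element.tag, child.tag))
        found.extend(violations(child, model))
    return found


good = ET.fromstring(
    "<memo><title>Status</title><para>All <emph>fine</emph>.</para></memo>"
)
bad = ET.fromstring("<memo><title>Status <para>misplaced</para></title></memo>")

print(violations(good))
print(violations(bad))
```

Because the model states how elements relate to one another, the computer rather than the author can catch the paragraph that was placed inside a title.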



HTML - HyperText Markup Language

HTML is a document-layout and hyperlink-specification language. It defines the syntax and placement of special, embedded directions that aren't displayed by the browser, but tell it how to display the contents of the document, including text, images, and other support media.

"Yield to the browser. Let it format your document in whatever way it deems best. Recognize that the browser's job is to present your documents to the user in a consistent, usable way. Your job, in turn, is to use HTML effectively to mark up your documents so that the browser can do its job effectively. Spend less time trying to achieve format-oriented goals. Instead, focus your efforts on creating the actual document content and adding the HTML tags to structure that content effectively." Chuck Musciano & Bill Kennedy, HTML: The Definitive Guide, O'Reilly, 1997



DHTML - Dynamic HTML

"Adding effective Dynamic HTML (DHTML) content to your pages requires an understanding of other technologies, specified by additional standards that exist outside the charter of the original HTML Working group...DHTML is an amalgam of specifications that stem from multiple standards efforts and proprietary technologies that are built into the two most popular DHTML-capable browsers, Netscape Navigator and Internet Explorer, beginning with Version 4 of each browser." Danny Goodman, Dynamic HTML: The Definitive Reference, O'Reilly, 1998

Technologies covered by Goodman: (1) Cascading stylesheets and (2) JavaScript.

[Note: This web page is an example of DHTML]



XML - Extensible Markup Language

XML is a text-based markup language that permits authors to invent their own tags to describe what the content means, hence the term Semantic Markup.

<?xml version="1.0" encoding="UTF-8" ?>
<pets>
	<dog>
		<name>Fido</name>
	</dog>
	<cat>
		<name>Fluffy</name>
	</cat>
</pets>

Example: An organization chart in XML

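One consequence of permitting authors to invent their own tags is that XML must be well-formed: every opening tag needs a matching closing tag, and elements must nest properly. A parser cannot guess at the rules for tags it has never seen, so broken or missing tags are errors.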
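A short Python sketch of that strictness, using the standard library's ElementTree parser on a compact copy of the pets document. The parse_pets function is a hypothetical helper, and the XML declaration line is left out for brevity.

```python
import xml.etree.ElementTree as ET

well_formed = """<pets>
    <dog><name>Fido</name></dog>
    <cat><name>Fluffy</name></cat>
</pets>"""

# The same document with the closing </name> tag for Fido left out.
broken = well_formed.replace("Fido</name>", "Fido")


def parse_pets(text):
    """Return (tag, name) pairs, or None if the text is not well-formed XML."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return None
    return [(animal.tag, animal.findtext("name")) for animal in root]


print(parse_pets(well_formed))
print(parse_pets(broken))
```

The parser reads the invented tags without any prior knowledge of them, but it rejects the whole document as soon as one closing tag is missing.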

Associated technologies are XSLT (Extensible Stylesheet Language Transformations), which transforms XML documents into other formats, and XML Schemas. Schemas act as definitions for XML documents by declaring their structure. An XML schema validates an instance of an XML document. Validation is important because it lets you be sure that the XML instance you have is correctly structured according to its definition.

Jon Bosak is Sun's XML architect. He organized and led the working group that created XML and served for two years as chair of the W3C XML Coordination Group. He is a founding member of OASIS, the Organization for the Advancement of Structured Information Standards, and of its predecessor, SGML Open. At Sun he holds the position of Distinguished Engineer.

Something to read: The Birth of XML: A Personal Recollection by Jon Bosak



XHTML - Extensible HyperText Markup Language

XHTML reformulates HTML so that it is XML compliant. This permits standard XML tools to view, edit, and validate XHTML documents. "The XHTML family is the next step in the evolution of the Internet. By migrating to XHTML today, content developers can enter the XML world with all of its attendant benefits, while still remaining confident in their content's backward and future compatibility." XHTML 1.0, W3C Recommendation, January 26, 2000
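To illustrate what XML compliance buys, the sketch below reads a small XHTML page with Python's general-purpose XML parser. The page content is made up for the example; the namespace URI is the one XHTML 1.0 defines.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

page = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Text Markup</title></head>
  <body>
    <p>First line<br />second line</p>
  </body>
</html>"""

# An ordinary XML parser can read the page; element names carry the namespace.
root = ET.fromstring(page)
title = root.findtext(f"{{{XHTML_NS}}}head/{{{XHTML_NS}}}title")
print(title)

# The same paragraph in the looser HTML style, with an unclosed <br>,
# is rejected by the XML parser.
try:
    ET.fromstring("<p>First line<br>second line</p>")
    html_style_ok = True
except ET.ParseError:
    html_style_ok = False
print(html_style_ok)
```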



Postscript: "Ink Data"

All Tablet PC computers have a digitizer beneath the screen that accepts pen input. Ink is a new data type designed for use on the Tablet PC that provides real-time visual feedback for pen-based input.

The InkCollector object stores collected data in an Ink object. Ink objects contain collections of other objects such as the Strokes collection, which, in turn, contains one or more Stroke objects. A stroke is a set of data captured in a single pen-down, pen-move, and pen-up sequence.
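The containment described above can be modeled in a few lines of Python. This is only a sketch of the data structure, not the Tablet PC programming interface: the Stroke and Ink classes, the collect function, and the event format are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]


@dataclass
class Stroke:
    """Points captured between one pen-down and the following pen-up."""
    points: List[Point] = field(default_factory=list)


@dataclass
class Ink:
    """Container holding the strokes collected so far."""
    strokes: List[Stroke] = field(default_factory=list)


def collect(events):
    """Group (kind, x, y) pen events into an Ink object, one Stroke per sequence."""
    ink = Ink()
    current = None
    for kind, x, y in events:
        if kind == "down":
            current = Stroke([(x, y)])          # pen-down starts a new stroke
        elif current is not None:
            current.points.append((x, y))       # pen-move or pen-up adds a point
            if kind == "up":
                ink.strokes.append(current)     # pen-up completes the stroke
                current = None
    return ink


events = [("down", 0, 0), ("move", 1, 2), ("up", 2, 4),
          ("down", 5, 5), ("up", 6, 5)]
ink = collect(events)
print(len(ink.strokes), [len(s.points) for s in ink.strokes])
```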