From: Hopkins, Don [mailto:Hopkins, Don] 
Sent: Friday, January 16, 1998 1:27 AM
To: 'Tom Lord'; rms@gnu.org
Cc: Hopkins, Don
Subject: RE: markup languages (XML)

> Tom Lord wrote:
>
> Whenever possible, people should use plain ASCII text and stick to 7
> bit characters.
>
> People commonly think they want a larger character set, but in every
> case they are wrong.  The correct answer is to invent spellings for
> all other languages using the existing standard 7 bit character set.

Are you saying that just to be a Luddite? I had a professor who looked
like Garrison Keillor (the Prairie Home Companion guy), and who for the
life of him couldn't understand why anyone would want a
non-rectangular window.  To each his own. However many billion Chinese
can't be all wrong.
  
> Sometimes plain text isn't good enough -- you need something with
> more structure.  Whenever possible, your language for describing
> structured text should be isomorphic to the S-expression language of
> Scheme.

As opposed to the S-expression syntax of Lisp? (Do atoms have
properties, or are they just interned strings?) Scheme
S-expressions and atoms don't have properties. All XML elements do,
and they can contain plain text and other entities (references to
other documents, or special "user defined" characters like
&CompanyLogo; or &TheArtistFormerlyKnownAsPrince;). So XML is a
superset of Scheme S-expressions. Scheme doesn't have anything like a
document type definition, or standard metadata. In XML, as opposed to
SGML, the DTD is optional -- you can have a "well formed" document
without a DTD that is syntactically correct and can be parsed, and you
can also have a document that refers to (or contains, or elaborates) a
DTD, which is "valid".  One of SGML's problems is that there are
ambiguities that prevent it from parsing any document without knowing
the DTD. The <p> separator in HTML is an example of an un-nested token
that is ambiguous (it doesn't require an end tag), so an SGML tool
can't parse HTML without a DTD. In XML, a naked <p> is illegal, but it
can be written either as a self-terminating empty tag <p/>, or as
<p>paragraph</p> with an explicit end tag, so the parser always knows
which start tag each end tag balances. (SGML's empty end tag </> is a
SHORTTAG feature that XML does not allow.)
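
A minimal sketch of the "well formed without a DTD" point, assuming
Python's standard xml.etree.ElementTree module as the parser. The <memo>
document and its priority attribute are made up for illustration. It
parses a DTD-less document, reads an element's properties and text, and
shows that an unclosed <p> is rejected.

```python
import xml.etree.ElementTree as ET

# A well-formed document with no DTD: every start tag is explicitly
# closed, so the parser needs no outside knowledge to find the structure.
doc = '<memo priority="high"><p>first paragraph</p><p/></memo>'

root = ET.fromstring(doc)
print(root.tag, root.attrib)                 # element name and its properties
print([p.text for p in root.findall("p")])   # the empty <p/> has no text

# A naked <p> with no end tag is not well formed, so parsing fails.
try:
    ET.fromstring("<memo><p>unclosed</memo>")
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)
```

Note that ElementTree only checks well-formedness; it does not validate
against a DTD, so "valid" in the sense above needs a validating parser.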

> For applications where people have to type structured text directly,
> and there is more text than structuring annotation, a convenient
> surface syntax is needed.  Wherever possible, standard S-expression
> syntax should be a subset of that surface syntax.  For example, the
> surface syntax might look like texinfo or troff, except that when a
> formatting command requires arguments, those arguments would be
> written as s-expressions.
>
> There should be a generic library for reading and writing
> s-expressions.  It should have two modes of operation: garbage
> collected and batch-allocated (for applications that don't link with
> a GC).

Microsoft provides the source code in Java to a generic XML parser
library on their web site, for free. You should take a look at the
copyright notice and source code, and make up your own mind if it's
good or evil, unless your preconceived notions allow you to
short-circuit that exercise. They ship two XML parsers (the Java one
and a C++ one) with IE4.0, as ActiveX components (the non-visual kind,
that you just use as a library, from any language). Now you can write
web pages with vbscript and javascript that read XML documents from
the net, munch them however you like, fart the data out into a dynamic
web page, and even edit it and send it back to the server.

> Now if you want to do something like express invariants over the
> allowable set of structured texts for some application, you should
> state those invariants strictly in terms of the s-expression reading
> of the structured text -- not in terms of the surface syntax.

That's what DTDs are for. XML doesn't require that you type documents
in as text, you can convert them from other syntaxes, and edit them in
a graphical drag-and-drop pointy clicky environment, of which there
are a bunch for SGML (and most of them have been updated to work with
XML).

> Programs such as texinfo, groff, tex, whatever -- can be drastically
> improved by stripping them of their horrible parsers and having them
> read the generic s-expression syntax (or convenient surface syntax)
> instead.
>
> I don't necessarily want my structured documents to conform to just
> _one_ set of invariants (e.g. HTML).  So, there shouldn't be any
> such thing as a document "in HTML".  Documents would be in "generic
> structured text format" -- each representing a big s-expression.
> Some documents would "conform to HTML invariants" -- and that might
> mean, for example, that all lists and sublists in the S-expression
> begin with a symbol which is a valid HTML tag name, and that those
> lists have an appropriately long tail with elements of the correct
> types.  There is nothing to prevent the same document from
> "conforming to the groff/texinfo invariants".

That's what XML is all about. It's not at all HTML++, it's SGML--.
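
To make the correspondence concrete, here is a rough sketch that maps an
S-expression onto XML elements, assuming the S-expression is modeled as
nested Python lists whose first item is the tag name. The function name
sexp_to_element and the tiny HTML_TAGS set are hypothetical, chosen only
for illustration; the "conforms to HTML invariants" check is reduced to
a tag-name test.

```python
import xml.etree.ElementTree as ET

def sexp_to_element(sexp):
    """Turn a list like ["tag", child, ...] into an Element.

    Strings become text; nested lists become child elements.
    """
    tag, *children = sexp
    elem = ET.Element(tag)
    last = None  # most recent child element, if any
    for child in children:
        if isinstance(child, str):
            if last is None:
                elem.text = (elem.text or "") + child
            else:
                # Text that follows a child element is stored as its tail.
                last.tail = (last.tail or "") + child
        else:
            last = sexp_to_element(child)
            elem.append(last)
    return elem

doc = ["html", ["body", ["p", "Hello, ", ["b", "world"], "!"]]]
root = sexp_to_element(doc)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# Hypothetical, far from complete list of allowed tag names.
HTML_TAGS = {"html", "body", "p", "b"}
conforms = all(e.tag in HTML_TAGS for e in root.iter())
print(conforms)
```

The same tree could be checked against a different set of invariants
without changing how it is written down, which is the job a DTD does for
an XML document.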

> RMS should put his foot down and insist on the development and
> implementation of such a syntax in all GNU applications, in my
> opinion.
>
> -t

I hope RMS has better things to do with his foot and his time than
start global flame wars about surface syntax standards.

The last thing the world needs now is an alternative to XML. I'm glad
the SGML people finally cleaned up their act and reached completion on
XML, because the HTML weenies, who didn't realize that HTML was an
*application* of the much more general purpose SGML meta-language that
they never heard of, kept inventing short-sighted, half-assed kludges,
proprietary tags, and dead-end shoehorn schemes in a million attempts
to nickel-and-dime away problems that a lot of smart people already
solved a long time ago. SGML is an industrial strength overkill, but
XML is a rational simplification of it that solves the problems neatly
and on purpose in a pre-meditated fashion, and arguing about the
syntax at this point is an outlandish waste of time, that would make
anyone who's been following the situation for more than a few years
roll their eyeballs up into the back of their head in dismay. What's
needed are good free tools to create and manipulate XML, not
alternative syntaxes that don't address the real problems which have
already been solved.  If you like the copyright notice on Microsoft's
free XML parser, then use that, or if not, use another one or write
your own. But don't try to invent GnuML just because you think
parentheses are a better syntax than angle brackets. But if you must
do it, do it simply to fan your ego, not to save the world.

XML is certainly as general as S-Expressions. Elements can nest, and
they can also have properties and contain text. The DTD (document type
definition) defines the rules about which elements are allowed and
required, how they are allowed to nest, what properties are required
or optional, which external entities are referenced, etc. So any XML
editor worth its salt can read in the DTD, validate documents, and
provide structured editing of the data with amenities like menus for
enumerated types, etc. S-expressions were not designed to solve the
same problems as XML. How do you link to external entities? How do you
assign IDs to S-expressions and atoms so you can link to them
internally or externally? At least ScriptX had a way to write out and
read back in circular lists and arbitrary graph structures. How do you
deal with text? Is there a standard format for meta-data?
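
As a small illustration of the ID question, here is a sketch of internal
linking in an XML document, again assuming Python's standard
xml.etree.ElementTree. The element names (doc, section, xref) and the
id/target attributes are invented for this example; without a DTD
declaring them as ID and IDREF they are ordinary attributes, and the
lookup table is built by hand.

```python
import xml.etree.ElementTree as ET

doc = """
<doc>
  <section id="intro"><title>Introduction</title></section>
  <section id="tools"><title>Tools</title>
    <xref target="intro"/>
  </section>
</doc>
"""
root = ET.fromstring(doc)

# Index every element that carries an id attribute.
by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}

# Resolve each cross-reference to the title of the element it points at.
titles = [by_id[x.get("target")].findtext("title") for x in root.iter("xref")]
print(titles)
```

A validating parser with a DTD could additionally check that every
target names an existing ID, which plain S-expressions have no standard
way to express.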

	-Don