Online Book Reader


Beautiful Code [114]


Fully maintaining XML correctness normally involves two redundant checks on the data:

Validation occurs on input. As a parser reads an XML document, it checks the document for well-formedness and, optionally, validity. Well-formedness checks purely syntactic constraints, such as whether every start tag has a matching end tag. This is required of all XML parsers. Validity means that only elements and attributes specifically listed in a Document Type Definition (DTD) appear, and only in the proper positions.

Verification happens on output. When generating an XML document through an XML API such as DOM, JDOM, or XOM, the API checks all strings passing through it to make sure they're legal in XML.

While input validation is more thoroughly defined by the XML specification, output verification can be equally important. In particular, it is critical for debugging and making sure that the code is correct.
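The input-side check can be illustrated with the SAX parser that ships with the JDK. This is only a minimal sketch: the class and method names (`WellFormednessSketch`, `isWellFormed`) are hypothetical, and it tests well-formedness only, not validity against a DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of JDOM or any other library.
final class WellFormednessSketch {

    /** Returns true if the parser accepts the text as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false); // well-formedness only, no DTD validation
            // DefaultHandler rethrows fatal errors, which is how
            // well-formedness violations are reported.
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // malformed document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Under these assumptions, a document whose start tag lacks a matching end tag is rejected while a properly nested one is accepted. Output verification has no such ready-made entry point, which is why an API has to supply its own checks.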


5.2. The Problem

The very first beta releases of JDOM did not verify the strings used to create element names, text content, or pretty much anything else. Programs were free to generate element names that contained whitespace, comments that ended in hyphens, text nodes that contained nulls, and other malformed content. Maintaining the correctness of the generated XML was completely left up to the client programmer.

This bothered me. While XML is simpler than some alternatives, it is not simple enough that it can be fully understood without immersing yourself in specification arcana, such as exactly which Unicode code points are or are not legal in XML names and text content.

JDOM aimed to be an API that brought XML to the masses. JDOM aimed to be an API that, unlike DOM, did not require a two-week course and an expensive expert mentor to learn to use properly. To enable this, JDOM needed to lift as much of the burden of understanding XML from the programmer as possible. Properly implemented, JDOM would keep the programmer from making mistakes.

There are numerous ways JDOM could do this. Some of them fell out as a direct result of its data model. For instance, in JDOM it is not possible to overlap elements (<p>Sally said, <quote>let's go to the park.</p> Then let's play ball.</quote>). Because JDOM's internal representation is a tree, there's simply no way to generate this markup from JDOM. However, a number of other constraints need to be checked explicitly, such as whether:

The name of an element, attribute, or processing instruction is a legal XML name

Local names do not contain colons

Attribute namespaces do not conflict with the namespaces of their parent element or sibling attributes

Every Unicode surrogate character appears as part of a surrogate pair consisting of one high surrogate followed by one low surrogate

Processing instruction data does not contain the two-character string ?>

Whenever the client supplies a string for use in one of these areas, it should be checked to see that it meets the relevant constraints. The details vary, but the basic approach is the same.
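A few of the simpler constraints in the list above can be sketched as small predicates. This is an illustration under stated assumptions, not JDOM's actual code: the class and method names are hypothetical, and only three of the listed rules are covered.

```java
// Hypothetical helper names; a sketch of three of the checks listed above.
final class StringChecksSketch {

    /** Local names must not contain colons. */
    static boolean isLegalLocalName(String name) {
        return name.indexOf(':') < 0;
    }

    /** Processing instruction data must not contain the string "?>". */
    static boolean isLegalProcessingInstructionData(String data) {
        return !data.contains("?>");
    }

    /** Every surrogate must be part of a high-then-low surrogate pair. */
    static boolean hasOnlyPairedSurrogates(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed immediately by a low one.
                if (i + 1 >= text.length()
                        || !Character.isLowSurrogate(text.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }
}
```

Each predicate scans the supplied string once and reports only whether it is acceptable. A real API would presumably also say what was wrong, for example by throwing an exception with a descriptive message.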

For purposes of this chapter, I'm going to examine the rules for checking XML 1.0 element names.

In the XML 1.0 specification (part of which is given in Example 5-1), rules are given in a Backus-Naur Form (BNF) grammar. Here #xdddd represents the Unicode code point with the hexadecimal value dddd. [#xdddd-#xeeee] represents all Unicode code points from #xdddd to #xeeee.

Example 5-1. BNF grammar for checking XML names (abridged)


BaseChar    ::= [#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6]
NameChar    ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
Name        ::= (Letter | '_' | ':') (NameChar)*
Letter      ::= BaseChar | Ideographic
Ideographic ::= [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]
Digit       ::= [#x0030-#x0039] | [#x0660-#x0669] | [#x06F0-#x06F9]
              | [#x0966-#x096F] | [#x09E6-#x09EF] | [#x0A66-#x0A6F]
              | [#x0AE6-#x0AEF] | [#x0B66-#x0B6F] | [#x0BE7-#x0BEF]
              | [#x0C66-#x0C6F]
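To make the grammar concrete, here is a direct and deliberately naive translation of the abridged productions above into Java. It is a sketch under stated assumptions, not the verifier this chapter goes on to discuss: the class and method names are hypothetical, only the ranges shown in the excerpt are included, and CombiningChar and Extender are left out because the excerpt does not define them. It therefore rejects many names that the full XML 1.0 grammar accepts.

```java
// Hypothetical sketch covering ONLY the abridged productions shown above.
final class AbridgedNameCheckSketch {

    private static boolean in(char c, int low, int high) {
        return c >= low && c <= high;
    }

    static boolean isBaseChar(char c) {
        return in(c, 0x0041, 0x005A) || in(c, 0x0061, 0x007A) || in(c, 0x00C0, 0x00D6);
    }

    static boolean isIdeographic(char c) {
        return in(c, 0x4E00, 0x9FA5) || c == 0x3007 || in(c, 0x3021, 0x3029);
    }

    static boolean isLetter(char c) {
        return isBaseChar(c) || isIdeographic(c);
    }

    static boolean isDigit(char c) {
        return in(c, 0x0030, 0x0039) || in(c, 0x0660, 0x0669) || in(c, 0x06F0, 0x06F9)
            || in(c, 0x0966, 0x096F) || in(c, 0x09E6, 0x09EF) || in(c, 0x0A66, 0x0A6F)
            || in(c, 0x0AE6, 0x0AEF) || in(c, 0x0B66, 0x0B6F) || in(c, 0x0BE7, 0x0BEF)
            || in(c, 0x0C66, 0x0C6F);
    }

    // CombiningChar and Extender are omitted: the excerpt does not define them.
    static boolean isNameChar(char c) {
        return isLetter(c) || isDigit(c) || c == '.' || c == '-' || c == '_' || c == ':';
    }

    /** Name ::= (Letter | '_' | ':') (NameChar)* */
    static boolean isName(String s) {
        if (s.isEmpty()) {
            return false;
        }
        char first = s.charAt(0);
        if (!(isLetter(first) || first == '_' || first == ':')) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!isNameChar(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Every code point in the abridged grammar lies in the Basic Multilingual Plane, so comparing individual `char` values is sufficient for this sketch.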
