Beautiful Code [114]
Fully maintaining XML correctness normally involves two redundant checks on the data:
Validation occurs on input. As a parser reads an XML document, it checks the document for well-formedness and, optionally, validity. Well-formedness checks purely syntactic constraints, such as whether every start tag has a matching end tag. This is required of all XML parsers. Validity means that only elements and attributes specifically listed in a Document Type Definition (DTD) appear, and only in the proper positions.
Verification happens on output. When generating an XML document through an XML API such as DOM, JDOM, or XOM, the API checks all strings passing through it to make sure they're legal in XML.
While input validation is more thoroughly defined by the XML specification, output verification can be equally important. In particular, it is critical for debugging and making sure that the code is correct.
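To make output verification concrete, here is a minimal sketch of a node class that checks its data before accepting it. The class name VerifiedComment, its methods, and the choice of IllegalArgumentException are assumptions made for illustration; this is not the API of DOM, JDOM, or XOM. It enforces only the two hyphen rules for comments (no "--" inside the data, no trailing hyphen) and leaves out the character-legality checks a real verifier would also need.

```java
// Hypothetical sketch: a comment node that verifies its data on construction.
// The names here are illustrative and do not belong to any real XML API.
final class VerifiedComment {
    private final String data;

    VerifiedComment(String data) {
        String problem = check(data);
        if (problem != null) {
            // Reject bad input at the API boundary instead of emitting malformed XML later.
            throw new IllegalArgumentException(problem);
        }
        this.data = data;
    }

    // Returns null if the data passes the hyphen rules, otherwise a short reason.
    static String check(String data) {
        if (data == null) return "comment data is null";
        if (data.contains("--")) return "comment data contains \"--\"";
        if (data.endsWith("-")) return "comment data ends with a hyphen";
        return null;
    }

    // Serializes the comment; safe with respect to hyphens because of the check above.
    String toXML() {
        return "<!--" + data + "-->";
    }
}
```

Because the check runs in the constructor, a client that passes bad data learns about it at the line that caused the problem, which is what makes this kind of verification useful for debugging.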
5.2. The Problem
The very first beta releases of JDOM did not verify the strings used to create element names, text content, or pretty much anything else. Programs were free to generate element names that contained whitespace, comments that ended in hyphens, text nodes that contained nulls, and other malformed content. Maintaining the correctness of the generated XML was completely left up to the client programmer.
This bothered me. While XML is simpler than some alternatives, it is not simple enough that it can be fully understood without immersing yourself in specification arcana, such as exactly which Unicode code points are or are not legal in XML names and text content.
JDOM aimed to be an API that brought XML to the masses. JDOM aimed to be an API that, unlike DOM, did not require a two-week course and an expensive expert mentor to learn to use properly. To enable this, JDOM needed to lift as much of the burden of understanding XML from the programmer as possible. Properly implemented, JDOM would keep the programmer from making mistakes.
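There are numerous ways JDOM could do this. Some of them fell out as a direct result of its data model. For instance, in JDOM it is not possible to overlap elements: an element cannot open inside one parent and close after that parent has ended, as can happen when a sentence such as "Sally said, let's go to the park." is marked up carelessly by hand. Other constraints do not fall out of the data model and must be checked explicitly, such as whether: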
The name of an element, attribute, or processing instruction is a legal XML name
Local names do not contain colons
Attribute namespaces do not conflict with the namespaces of their parent element or sibling attributes
Every Unicode surrogate character appears as part of a surrogate pair consisting of one high surrogate followed by one low surrogate
Processing instruction data does not contain the two-character string ?>
Whenever the client supplies a string for use in one of these areas, it should be checked to see that it meets the relevant constraints. The details vary, but the basic approach is the same.
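Three of the constraints listed above are simple enough to sketch directly. The class and method names below are hypothetical, chosen for this illustration only, and each method assumes a non-null argument. The surrogate check relies on Java strings being sequences of UTF-16 code units, so a high surrogate must be followed immediately by a low surrogate, and a low surrogate must never appear on its own.

```java
// Hypothetical helper methods illustrating three of the listed constraints.
final class ConstraintSketch {

    // Local names do not contain colons.
    static boolean isColonFree(String localName) {
        return localName.indexOf(':') < 0;
    }

    // Processing instruction data does not contain the two-character string ?>
    static boolean isLegalPIData(String data) {
        return !data.contains("?>");
    }

    // Every surrogate appears as part of a high-then-low surrogate pair.
    static boolean hasOnlyPairedSurrogates(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate needs a low surrogate right after it.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low half of the pair just verified
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate with no preceding high surrogate is malformed.
                return false;
            }
        }
        return true;
    }
}
```

The name check is considerably more involved than these, which is why the rest of the discussion concentrates on it.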
For purposes of this chapter, I'm going to examine the rules for checking XML 1.0 element names.
In the XML 1.0 specification (part of which is given in Example 5-1), rules are given in a Backus-Naur Form (BNF) grammar. Here #xdddd represents the Unicode code point with the hexadecimal value dddd. [#xdddd-#xeeee] represents all Unicode code points from #xdddd to #xeeee.
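The range notation maps naturally onto inclusive comparisons of code points. The sketch below, with hypothetical class and method names, translates only the ranges that appear in the abridged Example 5-1. It is deliberately incomplete: most BaseChar ranges, most Digit ranges, and the CombiningChar and Extender productions are omitted, so it rejects many names that the full grammar accepts. It is meant to show the shape of the translation, not to serve as a conforming checker.

```java
// Hypothetical, deliberately partial translation of the abridged name grammar.
final class AbridgedNameCheck {

    // [#xdddd-#xeeee] becomes an inclusive comparison on code points.
    static boolean inRange(int cp, int low, int high) {
        return cp >= low && cp <= high;
    }

    // Letter ::= BaseChar | Ideographic (only the BaseChar ranges shown in the abridged grammar)
    static boolean isLetter(int cp) {
        return inRange(cp, 0x0041, 0x005A) || inRange(cp, 0x0061, 0x007A)
            || inRange(cp, 0x00C0, 0x00D6)
            || inRange(cp, 0x4E00, 0x9FA5) || cp == 0x3007 || inRange(cp, 0x3021, 0x3029);
    }

    // Digit: only the first three ranges of the production.
    static boolean isDigit(int cp) {
        return inRange(cp, 0x0030, 0x0039) || inRange(cp, 0x0660, 0x0669)
            || inRange(cp, 0x06F0, 0x06F9);
    }

    // NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' (CombiningChar and Extender omitted)
    static boolean isNameChar(int cp) {
        return isLetter(cp) || isDigit(cp) || cp == '.' || cp == '-' || cp == '_' || cp == ':';
    }

    // Name ::= (Letter | '_' | ':') (NameChar)*
    static boolean isName(String s) {
        if (s == null || s.isEmpty()) return false;
        int first = s.codePointAt(0);
        if (!(isLetter(first) || first == '_' || first == ':')) return false;
        // Step by code point, not by char, so supplementary characters are read whole.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isNameChar(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

Under these assumptions, a name such as para or _x-1.y passes, while 1abc fails because a digit cannot start a name and a name containing a space fails because the space is not a name character.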
Example 5-1. BNF grammar for checking XML names (abridged)
BaseChar ::= [#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6]
NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
Name ::= (Letter | '_' | ':') (NameChar)*
Letter ::= BaseChar | Ideographic
Ideographic ::= [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]
Digit ::= [#x0030-#x0039] | [#x0660-#x0669] | [#x06F0-#x06F9]
| [#x0966-#x096F] | [#x09E6-#x09EF] | [#x0A66-#x0A6F]
| [#x0AE6-#x0AEF] | [#x0B66-#x0B6F] | [#x0BE7-#x0BEF]
| [#x0C66-#x0C6F]