            }
            else {
                throwIllegalNameException(name, "0x"
                  + Integer.toHexString(c) + " is not a legal NCName character");
            }
        }
    }
}
Name character verification is now an O(1) operation, and verification of a full name is O(n), where n is the length of the name. You can fiddle with the code to improve the constant factors, as I have, but it's hard to see how this could be faster while still making the necessary checks. However, we're not done yet.
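To make the mechanism concrete, the per-character check is just an array load. The following is a minimal sketch, not XOM's actual code: the table here is filled from Character.isLetterOrDigit plus a few punctuation characters, which only approximates the real NCName character class, and it ignores the stricter rule for a name's first character.

final class NCNameSketch {

    // One flag per char value in the Basic Multilingual Plane.
    private static final boolean[] NAME_CHAR = new boolean[65536];

    static {
        // Rough stand-in for the real table, which XOM precomputes
        // from the exact NCName character ranges.
        for (int c = 0; c < 65536; c++) {
            NAME_CHAR[c] = Character.isLetterOrDigit((char) c)
              || c == '.' || c == '-' || c == '_';
        }
    }

    // O(1): a single array load replaces the cascade of range checks.
    static boolean isNameChar(char c) {
        return NAME_CHAR[c];
    }

    // O(n) for a name of length n.
    static boolean isName(String name) {
        if (name.isEmpty()) return false;
        for (int i = 0; i < name.length(); i++) {
            if (!NAME_CHAR[name.charAt(i)]) return false;
        }
        return true;
    }
}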
5.8. Version 6: Fourth Optimization: Caching
If you can't make the verification go any faster, the only remaining option is not to do it, or at least not do so much of it. This approach was suggested by Wolfgang Hoschek. He noticed that in an XML document the same names keep coming up over and over. For instance, in an XHTML document, there are only about 100 different element names, and a few dozen of those account for most elements (p, table, div, span, strong, and so on). Once you've verified that a name is legal, you can store it in a collection somewhere. The next time you see a name, you first check to see whether it's one of the names you've seen before; if it is, you just accept it and don't check it again.
However, you do have to be very careful here. It may take longer to find some names (especially shorter ones) in a collection such as a hash map than it would take to check them all over again. The only way to tell is to benchmark and profile the caching scheme very carefully on several different VMs using different kinds of documents. You may need to tune parameters such as the size of the collection to fit different kinds of documents, and what works well for one document type may not work well for another. Furthermore, if the cache is shared between threads, thread contention can become a serious problem.
Consequently, I have not yet implemented this scheme for element names. However, I have implemented it for namespace URIs (Uniform Resource Identifiers), which have even more expensive verification checks than do element names, and which are even more repetitive. For instance, many documents have only one namespace URI, and very few have more than four, so the potential gain here is much larger. Example 5-11 shows the inner class that XOM uses to cache namespace URIs after it has verified them.
Example 5-11. A cache for verified namespace URIs
private final static class URICache {

    private final static int LOAD = 6;
    private String[] cache = new String[LOAD];
    private int position = 0;

    synchronized boolean contains(String s) {
        for (int i = 0; i < LOAD; i++) {
            // Here I'm assuming the namespace URIs are interned.
            // This is commonly but not always true. This won't
            // break if they haven't been. Using equals() instead
            // of == is faster when the namespace URIs haven't been
            // interned but slower if they have.
            if (s == cache[i]) {
                return true;
            }
        }
        return false;
    }

    synchronized void put(String s) {
        cache[position] = s;
        position++;
        if (position == LOAD) position = 0;
    }

}
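To see where this cache sits in the verification path, here is a hedged sketch of the check-then-verify pattern; NamespaceVerifier and verifyURIReferenceSlow are hypothetical names, not XOM's API, and the sketch assumes URICache is visible from here (in XOM it is a private nested class).

final class NamespaceVerifier {

    private static final URICache cache = new URICache();

    static void checkNamespaceURI(String uri) {
        // Fast path: a URI verified earlier is accepted immediately.
        if (cache.contains(uri)) return;
        // Slow path: run the full, expensive check once, then remember
        // the URI so later occurrences take the fast path.
        verifyURIReferenceSlow(uri);
        cache.put(uri);
    }

    // Hypothetical stand-in for the real RFC-compliant verification.
    private static void verifyURIReferenceSlow(String uri) {
        if (uri.isEmpty()) {
            throw new IllegalArgumentException("empty namespace URI");
        }
    }
}

Note that put() is only called after verification succeeds, so the cache can never admit an illegal URI.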
There are a couple of surprising features in the URICache class. First, rather than using the obvious hash map or table, it uses a fixed-size array with a linear search. For lists this small, the constant overhead of computing a hash code makes a hash lookup slower than simply iterating through the array.
Second, if the array fills up, it is not expanded. New data just overwrites the old, starting at the first position. This behavior leaves the data structure open to an attack that could decrease performance; it would still function correctly, but more slowly. It's extremely unlikely that any real-world XML document would have such a problem, though. Few have more than six namespaces, and in the rare cases when that happens, the namespaces tend to be localized, not randomly placed throughout the document. Performance hits that result from resetting the array should be very temporary.
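As a concrete illustration of that worst case (the URIs here are hypothetical, for demonstration only): a document that cycles through LOAD + 1 distinct namespace URIs defeats the round-robin replacement completely, because each entry is overwritten just before it would be hit again.

// Hypothetical worst case: seven URIs in a six-slot FIFO cache.
// contains() never hits, because each URI was evicted one step
// earlier, so full verification would run on every occurrence.
URICache cache = new URICache();
String[] uris = {
    "http://example.com/ns1", "http://example.com/ns2",
    "http://example.com/ns3", "http://example.com/ns4",
    "http://example.com/ns5", "http://example.com/ns6",
    "http://example.com/ns7"
};
for (int round = 0; round < 3; round++) {
    for (String uri : uris) {
        if (!cache.contains(uri)) {
            cache.put(uri);  // evicts the URI that is needed next
        }
    }
}

Even then, as noted above, the document still parses correctly; it merely loses the benefit of the cache.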
The one case I can imagine