[UPDATE: Added a link to the categorized CSV file below]
Here’s some of the nitty-gritty behind DocBook Elements in the Wild. We’re trying to get a count of all of the element names in a set of 49 DocBook 4.4 <book>s.
First, ask the O’Reilly product database for all the books that were sent to the printer in 2006. Because I’m better at XML than at Unix text tools, run the query with mysql -X to get the results as XML. Now we’ve got something like:
<resultset statement="select...">
<row>
<field name="isbn13">9780596101619</field>
<field name="title">Google Maps Hacks</field>
<field name="edition">1</field>
<field name="book_vendor_date">2006-01-05</field>
</row>
<row>
<field name="isbn13">9780596008796</field>
<field name="title">Excel Scientific and Engineering Cookbook</field>
<field name="edition">1</field>
<field name="book_vendor_date">2006-01-06</field>
</row>
<row>
<field name="isbn13">9780596101732</field>
<field name="title">Active Directory</field>
<field name="edition">3</field>
<field name="book_vendor_date">2006-01-06</field>
</row>
...
Next, fun with XMLStarlet:
$ xml sel -t -m "//field[@name='isbn13']" -v '.' -n books_in_2006.xml
9780596101619
9780596008796
9780596101732
9780596009441
...
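If XMLStarlet isn’t handy, the same XPath can be run from Ruby. This is only a rough sketch using REXML from the standard distribution; the helper name isbn13s and the inline two-row sample are made up for illustration, and in practice you would read books_in_2006.xml instead.

```ruby
require 'rexml/document'

# Return the text of every <field name="isbn13"> in a `mysql -X` result set.
# (isbn13s is a hypothetical helper name, not part of any library.)
def isbn13s(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//field[@name='isbn13']").map { |field| field.text }
end

# Small inline sample shaped like the query output above (two rows only).
SAMPLE = <<~XML
  <resultset statement="select...">
    <row>
      <field name="isbn13">9780596101619</field>
      <field name="title">Google Maps Hacks</field>
    </row>
    <row>
      <field name="isbn13">9780596008796</field>
      <field name="title">Excel Scientific and Engineering Cookbook</field>
    </row>
  </resultset>
XML

puts isbn13s(SAMPLE)
```

To run it against the real file, something like isbn13s(File.read("books_in_2006.xml")) should do.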
Now, pull the content down from our Atom Publishing Protocol repository and make a big document with XIncludes:
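#!/usr/bin/env ruby
# Download each Atom entry named on the command line, then write an
# aggregate document that XIncludes every downloaded file.
require 'cgi'  # for CGI.escape; harmless if kurt already loads it
require 'kurt'
require 'rexml/document'

OUTFILE = "aggregate.xml"
files_downloaded = []

ARGV.each {|atom_id|
  entry = Atom::Entry.get_entry("#{Kurt::PROD_RESOURCES}/#{CGI.escape(atom_id)}")
  # Strip non-word characters so the Atom ID is safe to use as a filename
  filename = atom_id.gsub(/\W/, '') + ".xml"
  File.open(filename, "w") {|f|
    f.print entry.content
  }
  files_downloaded << filename
}

# Build <books> with one <xi:include href="..."/> per downloaded file
agg = REXML::Document.new
agg.add_element("books")
agg.root.add_namespace("xi", "http://www.w3.org/2001/XInclude")
files_downloaded.each {|file|
  xi = agg.root.add_element("xi:include")
  xi.add_attribute("href", file)
}

File.open(OUTFILE, "w") {|f|
  agg.write(f, 2)
}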
Resolve all of the XIncludes into one big file:
$ xmllint --xinclude -o aggregate.xml aggregate.xml
It’s now pretty huge (well, huge in my world):
$ du -h aggregate.xml
102M aggregate.xml
At this point, we’re ready to do the real counting of the elements (slow REXML solution commented out in favor of a libxml-based solution):
#!/usr/bin/env ruby
require 'rexml/parsers/pullparser'  # only needed for the commented-out REXML version
require 'rubygems'
require 'xml/libxml'

start = Time.now
ARGV.each {|filename|
  counts = Hash.new

  # Slow REXML pull-parser version:
  # parser = REXML::Parsers::PullParser.new(File.new(filename))
  # while parser.has_next?
  #   el = parser.pull
  #   if el.start_element?
  #     element_name = el[0]
  #     if counts[element_name]
  #       counts[element_name] += 1
  #     else
  #       counts[element_name] = 1
  #     end
  #   end
  # end

  # libxml SAX version: bump a counter on every start tag
  parser = XML::SaxParser.new
  parser.filename = filename
  parser.on_start_element {|element_name, _|
    if counts[element_name]
      counts[element_name] += 1
    else
      counts[element_name] = 1
    end
  }
  parser.parse

  # One "name",count row per element name
  File.open(filename + ".count.csv", "w") {|f|
    counts.each {|element_name, count|
      f.puts "\"#{element_name}\",#{count}"
    }
  }
}
(Hooray for stream parsing, as this 100MB file was cranked through in 27 seconds on a 700MHz box!)
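For anyone without the libxml bindings installed, here is a minimal sketch of the same start-tag tally written as a REXML stream listener. It’s an illustration of the streaming idea rather than the script that produced the numbers here, and it will likely be slower than the libxml version; the ElementCounter class and count_elements helper are names I made up for the example.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Stream listener that tallies start tags. Only the counts hash is kept,
# so memory use does not grow with the size of the document.
class ElementCounter
  include REXML::StreamListener

  attr_reader :counts

  def initialize
    @counts = Hash.new(0)  # missing keys start at zero, so no if/else needed
  end

  # Called by REXML once for every start tag (including empty elements).
  def tag_start(name, _attributes)
    @counts[name] += 1
  end
end

# source may be a String of XML or an open File.
def count_elements(source)
  counter = ElementCounter.new
  REXML::Document.parse_stream(source, counter)
  counter.counts
end

if __FILE__ == $0
  ARGV.each do |filename|
    File.open(filename) do |f|
      count_elements(f).each { |name, count| puts "\"#{name}\",#{count}" }
    end
  end
end
```

Using Hash.new(0) is the main simplification over the listing above: every element name starts at zero, so the counter can just be incremented.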
Finally, we’ve got CSV and we can do some graphing. Here’s the full CSV and the categorized CSV. Rather than working on a code-based graphing solution, I just messed with Excel. The result:

Here’s my favorite, a drill-down based on a categorization I just made up (click through for the drill-down):

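I did the sorting and charting in Excel, but if you’d rather rank the counts in code before graphing them, a sketch along these lines should work on the "name",count rows that the counting script writes. The ranked_counts helper is a made-up name, and the three sample rows are invented purely to show the shape of the data; they are not real results.

```ruby
require 'csv'

# Parse `"name",count` rows and return [name, count] pairs,
# largest count first (ties broken alphabetically by name).
# (ranked_counts is a hypothetical helper name.)
def ranked_counts(csv_text)
  CSV.parse(csv_text)
     .map { |name, count| [name, Integer(count)] }
     .sort_by { |name, count| [-count, name] }
end

# Invented sample rows, for illustration only.
SAMPLE_CSV = <<~CSV
  "para",12
  "emphasis",30
  "book",1
CSV

ranked_counts(SAMPLE_CSV).each { |name, count| puts "#{name}\t#{count}" }
```

For the real data, pass in File.read("aggregate.xml.count.csv"), which is the name the counting script would produce for aggregate.xml.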
Books used:
- Google Maps Hacks, 1e
- Excel Scientific and Engineering Cookbook, 1e
- Active Directory, 3e
- RFID Essentials, 1e
- Visual Basic 2005 in a Nutshell, 3e
- PSP Hacks, 1e
- Baseball Hacks, 1e
- Mind Performance Hacks, 1e
- Repairing and Upgrading Your PC, 1e
- Web Site Cookbook, 1e
- Flickr Hacks, 1e
- Fixing Access Annoyances, 1e
- Fixing PowerPoint Annoyances, 1e
- Programming SQL Server 2005, 1e
- Learning C# 2005, 2e
- Photoshop CS2 RAW, 1e
- Web Design in a Nutshell, 3e
- Google: The Missing Manual, 2e
- Don’t Get Burned on eBay, 1e
- The Art of SQL, 1e
- Fixing Windows XP Annoyances, 1e
- iPhoto 6: The Missing Manual, 1e
- iPod & iTunes: The Missing Manual, 4e
- Ajax Hacks, 1e
- Flash 8: The Missing Manual, 1e
- MySQL Stored Procedure Programming, 1e
- Flash 8: Projects for Learning Animation and Interactivity, 1e
- XAML in a Nutshell, 1e
- Linux Annoyances for Geeks, 1e
- Programming PHP, 2e
- Flash 8 Cookbook, 1e
- Learning SQL on SQL Server 2005, 1e
- Programming Excel with VBA and .NET, 1e
- iMovie 6 & iDVD: The Missing Manual, 1e
- Enterprise SOA, 1e
- Perl Hacks, 1e
- Java I/O, 2e
- Enterprise JavaBeans 3.0, 5e
- Building Scalable Web Sites, 1e
- MCSE Core Required Exams in a Nutshell, 3e
- DNS and BIND, 5e
- Learning PHP and MySQL, 1e
- Computer Security Basics, 2e
- Active Directory Cookbook, 2e
- Ubuntu Hacks, 1e
- Unicode Explained, 1e
- Digital Photography: The Missing Manual, 1e
- Ajax Design Patterns, 1e
- Python in a Nutshell, 2e