Archive for May, 2007

The Code Behind DocBook Elements in the Wild

Tuesday, May 1st, 2007

[UPDATE: Added a link to the categorized CSV file below]

Here’s some of the nitty-gritty behind DocBook Elements in the Wild. We’re trying to get a count of all of the element names in a set of 49 DocBook 4.4 <book>s.

First, go ask the O’Reilly product database for all the books that were sent to the printer in 2006. Because I’m better at XML than at Unix text tools, I ask for XML output with mysql -X. Now we’ve got something like:

<resultset statement="select...">
    <row>
        <field name="isbn13">9780596101619</field>
        <field name="title">Google Maps Hacks</field>
        <field name="edition">1</field>
        <field name="book_vendor_date">2006-01-05</field>
    </row>
    <row>
        <field name="isbn13">9780596008796</field>
        <field name="title">Excel Scientific and Engineering Cookbook</field>
        <field name="edition">1</field>
        <field name="book_vendor_date">2006-01-06</field>
    </row>
    <row>
        <field name="isbn13">9780596101732</field>
        <field name="title">Active Directory</field>
        <field name="edition">3</field>
        <field name="book_vendor_date">2006-01-06</field>
    </row>
    ...
</resultset>

Next, fun with XMLStarlet:

$ xml sel -t -m "//field[@name='isbn13']" -v '.' -n books_in_2006.xml
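For anyone without XMLStarlet handy, the same XPath extraction can be sketched in Ruby with the stdlib REXML library. The resultset below is an inlined, made-up sample standing in for books_in_2006.xml:

```ruby
require 'rexml/document'

# Tiny stand-in for the mysql -X output in books_in_2006.xml.
xml = <<-XML
<resultset statement="select...">
  <row>
    <field name="isbn13">9780596101619</field>
    <field name="title">Google Maps Hacks</field>
  </row>
  <row>
    <field name="isbn13">9780596008796</field>
    <field name="title">Excel Scientific and Engineering Cookbook</field>
  </row>
</resultset>
XML

doc = REXML::Document.new(xml)
# Same XPath the xmlstarlet command uses: every isbn13 field, anywhere.
isbns = REXML::XPath.match(doc, "//field[@name='isbn13']").map {|el| el.text }
puts isbns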

Now, pull the content down from our Atom Publishing Protocol repository and make a big document with XIncludes:

#!/usr/bin/env ruby
require 'cgi'
require 'kurt'
require 'rexml/document'

OUTFILE = "aggregate.xml"
files_downloaded = []
ARGV.each {|atom_id|
  entry = Atom::Entry.get_entry("#{Kurt::PROD_RESOURCES}/#{CGI.escape(atom_id)}")
  filename = atom_id.gsub(/\W/, '') + ".xml"
  File.open(filename, "w") {|f|
    f.print entry.content
  }
  files_downloaded << filename
}

agg = REXML::Document.new("<books/>")
agg.root.add_namespace("xi", "http://www.w3.org/2001/XInclude")
files_downloaded.each {|file|
  xi = agg.root.add_element("xi:include")
  xi.add_attribute("href", file)
}
File.open(OUTFILE, "w") {|f|
  agg.write(f, 2)
}

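Before resolution, aggregate.xml is just a shell of xi:include references, roughly like this (the root element name and filenames here are illustrative, not the real ones):

```xml
<books xmlns:xi='http://www.w3.org/2001/XInclude'>
  <xi:include href='9780596101619.xml'/>
  <xi:include href='9780596008796.xml'/>
  ...
</books>
```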
Resolve all of the XIncludes into one big file:

$ xmllint --xinclude -o aggregate.xml aggregate.xml 

It’s now pretty huge (well, huge in my world):

$ du -h aggregate.xml
102M    aggregate.xml

At this point, we’re ready to do the real counting of the elements (slow REXML solution commented out in favor of a libxml-based solution):

#!/usr/bin/env ruby
require 'rexml/parsers/pullparser'
require 'rubygems'
require 'xml/libxml'

start = Time.now
ARGV.each {|filename|
  counts = {}
#  Slow REXML pull-parser version:
#  parser = REXML::Parsers::PullParser.new(File.new(filename))
#  while parser.has_next?
#    el = parser.pull
#    if el.start_element?
#      element_name = el[0]
#      if counts[element_name]
#        counts[element_name] += 1
#      else
#        counts[element_name] = 1
#      end
#    end
#  end
  parser = XML::SaxParser.new
  parser.filename = filename
  parser.on_start_element {|element_name, _|
    if counts[element_name]
      counts[element_name] += 1
    else
      counts[element_name] = 1
    end
  }
  parser.parse
  File.open(filename + ".count.csv", "w") {|f|
    counts.each {|element_name, count|
      f.puts "\"#{element_name}\",#{count}"
    }
  }
}
puts "Elapsed: #{Time.now - start} seconds"

(Hooray for stream parsing: this 100MB file was cranked through in 27 seconds on a 700MHz box!)
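The counting logic itself is easy to exercise on a toy document. Here’s a self-contained sketch of the same start-tag tally using REXML’s stream listener (stdlib) instead of libxml, purely for illustration:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Counts start tags as they stream past, never building a tree.
class ElementCounter
  include REXML::StreamListener
  attr_reader :counts
  def initialize
    @counts = Hash.new(0)
  end
  def tag_start(name, attrs)
    @counts[name] += 1
  end
end

listener = ElementCounter.new
xml = "<book><chapter><para>a</para><para>b</para></chapter></book>"
REXML::Document.parse_stream(xml, listener)
listener.counts.each {|name, count| puts "\"#{name}\",#{count}" }
```

Same shape as the libxml version: a callback per start tag, a hash of tallies, constant memory no matter how big the input.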

Finally, we’ve got CSV and we can do some graphing. Here’s the full CSV and the categorized CSV. Rather than working on a code-based graphing solution, I just messed with Excel. The result:

DocBook Elements from 49 Books
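If you’d rather sort the counts in code before handing them to a spreadsheet, it’s a few lines of Ruby. The input lines below mirror the "name",count format the counting script writes (the parse is naive and assumes no commas in element names, which holds for DocBook):

```ruby
# Hypothetical sample rows in the "name",count format produced above.
lines = ['"para",5000', '"emphasis",1200', '"chapter",300']

rows = lines.map {|line|
  name, count = line.split(',')
  [name.delete('"'), count.to_i]
}
# Most frequent elements first, as you'd want for a chart.
rows.sort_by! {|_, count| -count }
rows.each {|name, count| puts "#{name}: #{count}" }
```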

Here’s my favorite, a drill-down based on a categorization I just made up (click through for the drill-down):

DocBook Elements from 49 Books, Categorized

Books used: