Archive for the ‘DocBook’ Category

O’Reilly Release ePubs

Tuesday, July 15th, 2008

As of today, 30 O’Reilly titles are available as ebook bundles, and many will be in the Kindle Store later today:

As promised last month, O’Reilly has released 30 titles as DRM-free downloadable ebook bundles. The bundles include three ebook formats (EPUB, PDF, and Kindle-compatible Mobipocket) for a single price — at or below the book’s cover price.

I’ve spent a reasonable chunk of my year helping make this happen, both on the O’Reilly side and by adding .epub support to the DocBook-XSL stylesheets with Paul Norton of Adobe. Hopefully, our customers will be happy with the new formats.

XML for Publishers at TOC

Thursday, February 14th, 2008



Keith Fahlgren at his TOC Tutorial, XML for Publishers

Originally uploaded by duncandavidson

I just got back from New York and the second annual O’Reilly Tools of Change for Publishing (TOC) Conference. In just two years it has become a very impressive conference, with strong attendance and speakers this year. There’s good blog coverage from George Walkley and pointers to more from the new TOC blog.

I had the honor of doing a tutorial on the last day and had a great time talking with and teaching an energized, question-happy audience about XML in the publishing industry. If you weren’t able to make it to TOC this year, you can pre-order the DVDs of four of the eight tutorials, including mine, and get 30% off with discount code TOCD3. Here’s the link: XML for Publishers.

DocBook-XSL Stylesheets have >600 Parameters

Wednesday, June 13th, 2007

Norm Walsh writes:

Stylesheets can have literally hundreds of parameters. The DocBook XSL Stylesheets have more than six hundred.

All I can say at this point is: wow. Grepping the core of our own customization shows 121 <xsl:param>s (about 20 of which we introduced) and 52 <xsl:attribute-set>s (20, again). Thinking about it now (as I haven’t before), we’ve probably minimized that number by completely overriding 13 of the “regular” fo/ stylesheets directly (rather than using params or smaller, single-template overrides). The DocBook-XSL stylesheets are a truly impressive, complex project.
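For the curious, that kind of grep-style tally can be sketched in a few lines of Ruby. This is only an illustration, not the command I actually ran: the `count_declarations` helper is a made-up name, the sample stylesheet is invented, and like grep it counts every matching start tag (template-level params and anything inside comments included).

```ruby
# Hypothetical helper: count <xsl:param> and <xsl:attribute-set> start tags
# in stylesheet source, grep-style (plain text matching, no XML parsing).
def count_declarations(xslt_source)
  {
    "xsl:param"         => xslt_source.scan(/<xsl:param\b/).length,
    "xsl:attribute-set" => xslt_source.scan(/<xsl:attribute-set\b/).length
  }
end

# Invented sample; a real customization layer would be read from disk.
sample = <<~XSL
  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:param name="body.font.family">serif</xsl:param>
    <xsl:param name="page.margin.inner">1in</xsl:param>
    <xsl:attribute-set name="section.title.properties">
      <xsl:attribute name="font-weight">bold</xsl:attribute>
    </xsl:attribute-set>
  </xsl:stylesheet>
XSL

count_declarations(sample).each { |tag, count| puts "#{tag}: #{count}" }
```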

Their complexity brings me to the other DocBook-related news item from today, in which Bob DuCharme argues that XHTML 2:

will hit a sweet spot between the richness of DocBook and the simplicity of XHTML 1

I’m certainly hopeful that our work in the DocBook SubCommittee for Publishers will move a subset of DocBook closer to that “sweet spot”.

The Code Behind DocBook Elements in the Wild

Tuesday, May 1st, 2007

[UPDATE: Added a link to the categorized CSV file below]

Here’s some of the nitty-gritty behind DocBook Elements in the Wild. We’re trying to get a count of all of the element names in a set of 49 DocBook 4.4 <book>s.

First, go ask the O’Reilly product database for all the books that were sent to the printer in 2006. Because I’m better at XML than Unix text tools, ask for mysql -X. Now we’ve got something like:

<resultset statement="select...">
  <row>
        <field name="isbn13">9780596101619</field>
        <field name="title">Google Maps Hacks</field>
        <field name="edition">1</field>
        <field name="book_vendor_date">2006-01-05</field>
  </row>
  <row>
        <field name="isbn13">9780596008796</field>
        <field name="title">Excel Scientific and Engineering Cookbook</field>
        <field name="edition">1</field>
        <field name="book_vendor_date">2006-01-06</field>
  </row>
  <row>
        <field name="isbn13">9780596101732</field>
        <field name="title">Active Directory</field>
        <field name="edition">3</field>
        <field name="book_vendor_date">2006-01-06</field>
  </row>
  ...

Next, fun with XMLStarlet:

$ xml sel -t -m "//field[@name='isbn13']" -v '.' -n books_in_2006.xml
9780596101619
9780596008796
9780596101732
9780596009441
...
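
If XMLStarlet isn’t handy, the same extraction can be sketched in Ruby with REXML. This is an illustration rather than what I ran: `isbns_from_resultset` is a made-up helper name, and the inline sample just copies two rows from the mysql -X output above.

```ruby
require 'rexml/document'

# Hypothetical helper: return the text of every <field name="isbn13">
# in a `mysql -X` result set, in document order.
def isbns_from_resultset(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//field[@name='isbn13']").map { |field| field.text }
end

# Two rows copied from the result set shown earlier.
sample = <<~XML
  <resultset statement="select...">
    <row>
      <field name="isbn13">9780596101619</field>
      <field name="title">Google Maps Hacks</field>
    </row>
    <row>
      <field name="isbn13">9780596008796</field>
      <field name="title">Excel Scientific and Engineering Cookbook</field>
    </row>
  </resultset>
XML

puts isbns_from_resultset(sample)
```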

Now, pull the content down from our Atom Publishing Protocol repository and make a big document with XIncludes:

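#!/usr/bin/env ruby
# Download each Atom entry named on the command line, then write an
# aggregate document that XIncludes every downloaded file.
require 'cgi'            # for CGI.escape below
require 'kurt'           # supplies Kurt::PROD_RESOURCES (and, presumably, Atom::Entry)
require 'rexml/document'
OUTFILE = "aggregate.xml"
files_downloaded = []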
ARGV.each {|atom_id|
  entry = Atom::Entry.get_entry("#{Kurt::PROD_RESOURCES}/#{CGI.escape(atom_id)}")
  filename = atom_id.gsub(/\W/, '') + ".xml"
  File.open(filename, "w") {|f|
    f.print entry.content
  }
  files_downloaded << filename
}

agg = REXML::Document.new
agg.add_element("books")
agg.root.add_namespace("xi", "http://www.w3.org/2001/XInclude")
files_downloaded.each {|file|
  xi = agg.root.add_element("xi:include")
  xi.add_attribute("href", file)
}
File.open(OUTFILE, "w") {|f|
  agg.write(f, 2)
}

Resolve all of the XIncludes into one big file:

$ xmllint --xinclude -o aggregate.xml aggregate.xml 

It’s now pretty huge (well, huge in my world):

$ du -h aggregate.xml
102M    aggregate.xml

At this point, we’re ready to do the real counting of the elements (slow REXML solution commented out in favor of a libxml-based solution):

#!/usr/bin/env ruby
require 'rexml/parsers/pullparser'
require 'rubygems'
require 'xml/libxml'
start = Time.now
ARGV.each {|filename|
  # Default every count to 0 so a first sighting needs no special case.
  counts = Hash.new(0)
#  parser = REXML::Parsers::PullParser.new(File.new(filename))
#  while parser.has_next?
#    el = parser.pull
#    if el.start_element?
#      counts[el[0]] += 1
#    end
#  end
  # Stream the file through libxml's SAX parser, tallying each start tag.
  parser = XML::SaxParser.new
  parser.filename = filename
  parser.on_start_element {|element_name, _|
    counts[element_name] += 1
  }
  parser.parse

  File.open(filename + ".count.csv", "w") {|f|
    counts.each {|element_name, count|
      f.puts "\"#{element_name}\",#{count}"
    }
  }
}

(Hooray for stream parsing, as this 100MB file was cranked through in 27 seconds on a 700MHz box!)
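
If libxml isn’t installed, the same event-driven counting can be sketched with REXML’s stream listener. Treat this as a rough equivalent under assumptions, not the script I used: `ElementCounter` and `count_elements` are names made up for the example, the sample document is invented, and I’d expect it to be slower than the libxml version on a file this size.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Listener that tallies every start tag the stream parser reports.
class ElementCounter
  include REXML::StreamListener
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Called once per start tag; attributes are ignored here.
  def tag_start(name, _attributes)
    @counts[name] += 1
  end
end

# Hypothetical helper: count element names in any IO-like XML source
# without building a document tree in memory.
def count_elements(io)
  counter = ElementCounter.new
  REXML::Document.parse_stream(io, counter)
  counter.counts
end

# Invented sample; for a real run, pass File.open("aggregate.xml") instead.
sample = "<book><chapter><para>One</para><para>Two</para></chapter></book>"
counts = count_elements(StringIO.new(sample))
counts.each { |name, count| puts "\"#{name}\",#{count}" }
```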

Finally, we’ve got CSV and we can do some graphing. Here’s the full CSV and the categorized CSV. Rather than working on a code-based graphing solution, I just messed with Excel. The result:

DocBook Elements from 49 Books

Here’s my favorite, a drill-down based on a categorization I just made up (click through for the drill-down):

DocBook Elements from 49 Books, Categorized
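
I did the categorized roll-up by hand in Excel, but the idea is simple enough to sketch in Ruby. Everything specific here is an assumption for illustration: the `CATEGORIES` map is invented (it is not my actual categorization), the counts in the sample are made up, and `rollup` is a hypothetical helper that reads lines in the "name",count format the counting script writes.

```ruby
# Hypothetical category map; the real categorization lives in the
# categorized CSV and is not reproduced here.
CATEGORIES = {
  "para"     => "block",
  "title"    => "block",
  "emphasis" => "inline",
  "literal"  => "inline"
}

# Sum per-element counts into per-category totals. Elements missing
# from the map land in "uncategorized".
def rollup(csv_text, categories)
  totals = Hash.new(0)
  csv_text.each_line do |line|
    name, count = line.strip.split(",", 2)
    next if name.nil? || count.nil?   # skip blank or malformed lines
    name = name.delete('"')
    totals[categories.fetch(name, "uncategorized")] += count.to_i
  end
  totals
end

# Made-up counts, only to exercise the helper.
sample = <<~CSV
  "para",10
  "title",3
  "emphasis",4
  "indexterm",7
CSV

rollup(sample, CATEGORIES).each { |category, total| puts "#{category},#{total}" }
```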

Books used:

Exploiting FrameMaker MIF as XML, Introduction

Saturday, February 3rd, 2007

My O’Reilly colleague Andy Bruno has just written a pair of posts on converting FrameMaker’s MIF format (link may be out of date) into XML (henceforth ‘MX’). I’ll be writing a few posts outlining the ways in which we’ve leveraged MX at O’Reilly.

[Update: Series continues here with getting back into MIF, and reading bookfiles.]


DocBook Dinner

Friday, December 8th, 2006



Cheers!

Originally uploaded by psd.

The highlight of XML Conference 2006 (more coverage here, here, and here.