Archive for the ‘Work’ Category

O’Reilly Releases ePubs

Tuesday, July 15th, 2008

As of today, 30 O’Reilly titles are available as Ebook bundles and many will be in the Kindle Store later today:

As promised last month, O’Reilly has released 30 titles as DRM-free downloadable ebook bundles. The bundles include three ebook formats (EPUB, PDF, and Kindle-compatible Mobipocket) for a single price — at or below the book’s cover price.

I’ve spent a reasonable chunk of my year helping make this happen, both on the O’Reilly side and by adding .epub support to the DocBook-XSL stylesheets with Paul Norton of Adobe. Hopefully, our customers will be happy with the new formats.

Never Have I Felt More Famous

Monday, July 14th, 2008

Ah, the day my tent showed up in TechCrunch:

My Tent on Tech Crunch

For heaven’s sake, please just install hoe!

Thursday, February 21st, 2008

This is why people get annoyed about the silly gem install dependency mess (esp WRT hoe):

$ gem install heckle
Need to update 31 gems from http://gems.rubyforge.org   # fair enough, you're allowed
...............................
complete
Install required dependency ruby2ruby? [Yn]  Y    # Yeah, I know you need these
Install required dependency ParseTree? [Yn]  Y     # oh, and this one too
Select which gem to install for your platform (i686-linux)
 1. ParseTree 2.1.1 (ruby)                                   # yeah, just a vanilla ruby here
 2. ParseTree 2.1.1 (i386-mswin32)
 3. ParseTree 2.1.0 (ruby)
 4. ParseTree 2.0.2 (ruby)
 5. Skip this gem
 6. Cancel installation
> 1
Install required dependency RubyInline? [Yn]  Y   # This is also needed, I gather
Install required dependency hoe? [Yn]  Y             # Ha, hoe again, OK
Install required dependency rubyforge? [Yn] Y     # Don't care, don't understand why both
Install required dependency rake? [Yn]  Y            # There's no rake on this box, really?
Install required dependency hoe? [Yn]  Y             # WTF, yeah, I just said that
Install required dependency hoe? [Yn]  Y             # .. now you're just fucking with me
Install required dependency ZenTest? [Yn] Y        # Huh? I like ZenTest, but there's no reason...
Install required dependency hoe? [Yn]  Y             # **** YOU, hoe!
Successfully installed heckle-1.4.1
Successfully installed ruby2ruby-1.1.8
....

The fourth time it asked me, it decided to trust me and actually started installing the gems….

RescueTime Is Da Bomb!

Thursday, February 14th, 2008

I’ve been using RescueTime since the fall, after hearing about it from some YCombinator-related person. It’s an absolutely spectacular application, and has really changed the way I understand my work and computer use. They also just released a cool widget:




EDIT: I can’t get WordPress to not screw up the widget markup. THEN: Raw-HTML to the rescue!

XML for Publishers at TOC

Thursday, February 14th, 2008



Keith Fahlgren at his TOC Tutorial, XML for Publishers

Originally uploaded by duncandavidson

I just got back from New York and the second annual O’Reilly Tools of Change for Publishing (TOC) Conference. It has become a very impressive conference in just two years, with strong attendance and speakers this year. There’s good blog coverage from George Walkley and pointers to more from the new TOC blog.

I had the honor of doing a tutorial on the last day and had a great time talking with and teaching an energized, question-happy audience about XML in the publishing industry. If you weren’t able to make it to TOC this year, you can pre-order the DVDs of four of the eight tutorials, including mine, and get 30% off with discount code TOCD3. Here’s the link: XML for Publishers.

ISBN10 to ISBN13

Thursday, January 24th, 2008

As of the beginning of 2007, ISBN10 is dead. Now we’re in a world that allows “979” prefixes, though the following code doesn’t expect them yet…

Here’s some stuff to turn your 10-digit ISBNs into 13-digit ISBNs, naively assuming “978”, following the API post doing the same from LibraryThing. There’s another one at isbn.org for humans.

Code from O’Reilly’s internal stuff:

module Isbn
  def self.dash_isbn(isbn)
    raise ArgumentError.new("ISBN argument must be string") unless isbn.is_a?(String)
    if isbn.length == 10
      return isbn[/^./] + "-" + isbn[1..3] + "-" + isbn[4..8] + "-" + isbn[/.$/]
    elsif isbn.length == 13
      return isbn[0..2] + "-" + isbn[3].chr + "-" + isbn[4..6] + "-" + isbn[7..11] + "-" + isbn[/.$/]
    else
      raise ArgumentError.new("ISBN must be 10 or 13 characters")
    end
  end

  def self.isbn10toisbn13(isbn)
    raise ArgumentError.new("ISBN argument must be string") unless isbn.is_a?(String)
    raise ArgumentError.new("ISBN must be of length 10") unless isbn.length == 10
    prefix = "978"
    isbn12 = prefix + isbn[0...-1]
    return isbn12 + check_digit_13(isbn12).to_s
  end  

  def self.check_digit_13(isbn_12)
    # http://www.barcodeisland.com/ean13.phtml
    # need to subtract remainder from 10
    # and do exemption for zero LMS 08.29.2006
    raise ArgumentError.new("ISBN must be of length 12") unless isbn_12.length == 12
    sum = 0
    odds = 0
    evens = 0
    isbn_12.scan(/\d/).each_with_index {|d, i|
      if (i % 2) == 0
        evens = evens + (d.to_i * 1)
      else
        odds = odds + (d.to_i * 3)
      end
    }
    sum = evens + odds
    digit = sum % 10
    if digit.zero?
      return 0
    else
      return 10 - digit
    end
  end
end # of module Isbn
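
For anyone who only wants the arithmetic, here is a minimal, self-contained sketch of the same check digit calculation. The method names `ean13_check_digit` and `isbn10_to_isbn13` are illustrative and are not part of the module above, and like the module it naively assumes the “978” prefix.

```ruby
# Illustrative sketch only; these method names are not from the module above.

# Weights alternate 1, 3, 1, 3, ... across the first 12 digits.
def ean13_check_digit(first12)
  unless first12.is_a?(String) && first12.match?(/\A\d{12}\z/)
    raise ArgumentError, "expected exactly 12 digits"
  end
  sum = first12.chars.each_with_index.sum { |d, i| d.to_i * (i.even? ? 1 : 3) }
  (10 - sum % 10) % 10
end

# Drop the ISBN-10 check character, prepend "978", and recompute the check digit.
def isbn10_to_isbn13(isbn10)
  unless isbn10.is_a?(String) && isbn10.length == 10
    raise ArgumentError, "expected a 10-character ISBN"
  end
  body = "978" + isbn10[0, 9]
  body + ean13_check_digit(body).to_s
end

puts isbn10_to_isbn13("059610123X")
```

For 978059610123 the weighted sum is 30 + 3 × 21 = 93, so the check digit is 10 − 3 = 7 and the result is 9780596101237.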

The last test method relies on my database, which you won’t have. Replace it with something else you trust or drop it.

#!/usr/bin/env ruby

require 'test/unit'
require 'pdb'
require 'isbn'

class IsbnTest < Test::Unit::TestCase
  def setup
    @isbn10 = "059610123X"
    @isbn13 = "978059610123X"
  end
  def test_dash_isbn
    # must be a String
    assert_raise ArgumentError do Isbn.dash_isbn(123) end
    # must be 10 or 13 characters
    assert_raise ArgumentError do Isbn.dash_isbn("123") end
    assert_equal("0-596-10123-X", Isbn.dash_isbn(@isbn10))
    assert_equal("978-0-596-10123-X", Isbn.dash_isbn(@isbn13))
  end
  def test_isbn10toisbn13
    # must be a String
    assert_raise ArgumentError do Isbn.isbn10toisbn13(123) end
    # must be 10 characters
    assert_raise ArgumentError do Isbn.isbn10toisbn13(@isbn13) end
    assert_equal("9780596101237", Isbn.isbn10toisbn13(@isbn10))
  end
  def test_check_digit_13
    # must be 12 characters
    assert_raise ArgumentError do Isbn.check_digit_13(@isbn13) end
    assert_equal(7, Isbn.check_digit_13(@isbn13[0..-2]))
    assert_equal(0, Isbn.check_digit_13("123456789018"))
  end
  def test_isbn13
    ["0596008627", "1565929470", "0596002734", "0596004001", "0596527357",
     "0596101635", "0596005059", "0596527063", "1565926374", "0596526946",
     "1565926420", "0596101805", "059652742X", "156592455X", "0596006446",
     "0596008473", "0596009607", "0596100582", "0596100493", "0596004427",
     "1565925890", "1565927141", "059651610X"].each {|isbn|
      puts "Testing #{isbn}"
      assert_equal(PDB::ProdDB.new(isbn).isbn13, Isbn.isbn10toisbn13(isbn), "Bad checksum!")
    }
  end
end

How to Present Well (like Joe Gregorio)

Wednesday, July 25th, 2007

A nice tidbit from Joe’s talk at Oscon 2007:

Exposition: I can lie if you can learn

DocBook-XSL Stylesheets have >600 Parameters

Wednesday, June 13th, 2007

Norm Walsh writes:

Stylesheets can have literally hundreds of parameters. The DocBook XSL Stylesheets have more than six hundred.

All I can say at this point is: wow. Grepping the core of our own customization shows 121 <xsl:param>s (about 20 of which we introduced) and 52 <xsl:attribute-set>s (20, again). Thinking about it now (as I haven’t before), we’ve probably minimized that number by completely overriding 13 of the “regular” fo/ stylesheets directly (rather than using params or smaller, single-template overrides). The DocBook-XSL stylesheets are a truly impressive, complex project.

Their complexity brings me to the other DocBook-related news item from today, in which Bob DuCharme argues that XHTML 2:

will hit a sweet spot between the richness of DocBook and the simplicity of XHTML 1

I’m certainly hopeful that our work in the DocBook SubCommittee for Publishers will move a subset of DocBook closer to that “sweet spot”.

Partial Updates: A Simpler Strawman?

Sunday, June 10th, 2007

James Snell has been working on some interesting things as the work on the Atom Publishing Protocol spec winds down. Most recently, he posted some thoughts on how to effectively communicate partial updates to APP servers using HTTP PATCH.

[UPDATE: James points out the obvious drawback to this approach in his response.]

One of the things that surprised me when I met other APP implementors at the interop was the relative lack of concern they seemed to have about the actual content inside their <atom:entry>s. This may have simply been a simplification on their part for the sake of testing (“if it can accept a single line of XHTML div it can accept anything, essentially”) rather than their real views, but to someone very concerned about perfect content fidelity, it sorta scared me. These tiny <atom:entry>s might hide some of the problems that APP will face in the wild, particularly for document repositories.

Long before the interop, we’d decided internally at O’Reilly to use the Media Resources rather than the <atom:entry> container (in large part because of the size of our DocBook documents, often over 2MB) for our document repository implementation. Because of the larger size of our content blocks, the sort of partial updates that James is thinking about might be quite cool.

The core of James’ strawman is an XML delta syntax (with credit due to Andy Roberts’ work on the same) for HTTP PATCH with 8 operations: insert-before, insert-after, insert-child, replace, remove, remove-all, set-attribute and remove-attribute. Coming at this problem with my experience in document transformation and XSLT, I saw 7 of those operations (everything but ‘replace’) as unnecessary. The basic inspiration is thinking about each operation as an XSLT template. Mentally translate the d:replace/@path into xsl:template/@match and swap the bodies and you’ll be with me (with luck!).

Here’s the specific rundown of the 7 operations other than ‘replace’ working with James’ simple example <atom:entry>:

<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.org/foo/boo</id>
  <title>Test</title>
  <updated>2007-12-12T12:12:12Z</updated>
  <summary>Test summary</summary>
  <author>
    <name>James</name>
  </author>
  <link href="http://example.org"/>
</entry>

Note: You’ll have to imagine these working on a much larger XML document than my examples to understand the importance.

insert-before

PATCH /collection/entry/1 HTTP/1.1
Host: example.org
Content-Type: application/delta+xml
Content-Length: nnnn

<d:delta
  xmlns:d="http://purl.org/atompub/delta"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:b="http://example.org/foo">

  <!-- substitute for insert-before
       /atom:entry/atom:author/atom:name
       an atom:email -->
  <d:replace path="/atom:entry/atom:author">
    <atom:author>
      <atom:email>james@example.org</atom:email>
      <atom:name>James</atom:name>
    </atom:author>
  </d:replace>
</d:delta>

insert-after

PATCH /collection/entry/1 HTTP/1.1
Host: example.org
Content-Type: application/delta+xml
Content-Length: nnnn

<d:delta
  xmlns:d="http://purl.org/atompub/delta"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:b="http://example.org/foo">

  <!-- substitute for insert-after
       /atom:entry/atom:author/atom:name
       an atom:uri -->
  <d:replace path="/atom:entry/atom:author">
    <atom:author>
      <atom:name>James</atom:name>
      <atom:uri>http://example.org/blogs/james</atom:uri>
    </atom:author>
  </d:replace>
</d:delta>

insert-child

PATCH /collection/entry/1 HTTP/1.1
Host: example.org
Content-Type: application/delta+xml
Content-Length: nnnn

<d:delta
  xmlns:d="http://purl.org/atompub/delta"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:b="http://example.org/foo">

  <!-- substitute for insert-child
       /atom:entry/atom:author
       an atom:uri -->
  <d:replace path="/atom:entry/atom:author">
    <atom:author>
      <atom:name>James</atom:name>
      <atom:uri>http://example.org/blogs/james</atom:uri>
    </atom:author>
  </d:replace>
</d:delta>

remove

PATCH /collection/entry/1 HTTP/1.1
Host: example.org
Content-Type: application/delta+xml
Content-Length: nnnn

<d:delta
  xmlns:d="http://purl.org/atompub/delta"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:b="http://example.org/foo">

  <!-- substitute for remove
       /atom:entry/atom:author/atom:name -->
  <d:replace path="/atom:entry/atom:author/atom:name">
  </d:replace>
  <!-- yeah, this atom:author is no longer valid ..-->
</d:delta>

remove-all

PATCH /collection/entry/1 HTTP/1.1
Host: example.org
Content-Type: application/delta+xml
Content-Length: nnnn

<d:delta
  xmlns:d="http://purl.org/atompub/delta"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:b="http://example.org/foo">

  <!-- substitute for remove-all
       /atom:entry/* -->
  <d:replace path="/atom:entry/*">
  </d:replace>
  <!-- yeah, this atom:entry is no longer valid ..-->
</d:delta>

set-attribute

PATCH /collection/entry/1 HTTP/1.1
Host: example.org
Content-Type: application/delta+xml
Content-Length: nnnn

<d:delta
  xmlns:d="http://purl.org/atompub/delta"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:b="http://example.org/foo">

  <!-- substitute for set-attribute
       /atom:entry/atom:link/@href
       to http://not-example.org -->
  <d:replace path="/atom:entry/atom:link/@href">http://not-example.org</d:replace>
</d:delta>

remove-attribute

PATCH /collection/entry/1 HTTP/1.1
Host: example.org
Content-Type: application/delta+xml
Content-Length: nnnn

<d:delta
  xmlns:d="http://purl.org/atompub/delta"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:b="http://example.org/foo">

  <!-- substitute for remove-attribute
       /atom:entry/atom:link/@href -->
  <d:replace path="/atom:entry/atom:link">
    <atom:link/>
  </d:replace>
  <!-- you can't take the easy way and match
       the attribute, because an empty attribute
       (@attr="") means something different than
       the absence of @attr -->
  <!-- and this atom:link is no longer valid ..-->
</d:delta>

I think the above could be fairly easily implemented as a transformation into either XQuery or XSLT, but I’d imagine that it could be implemented using streaming techniques as well. Thoughts?
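
To make the replace-only idea a little more concrete, here is a rough Ruby sketch using REXML rather than XQuery or XSLT. It is not a real implementation of James’ strawman: the helper name `apply_replacements` is made up, it assumes every `path` is an XPath that REXML can evaluate with the `atom` prefix bound, and it only carries over the namespace declaration of each replacement element itself.

```ruby
require 'rexml/document'

# Prefix-to-URI map handed to REXML's XPath engine.
NS = {
  "atom" => "http://www.w3.org/2005/Atom",
  "d"    => "http://purl.org/atompub/delta"
}

# Hypothetical helper: apply every d:replace in delta_xml to entry_xml
# and return the patched REXML::Document.
def apply_replacements(entry_xml, delta_xml)
  entry = REXML::Document.new(entry_xml)
  delta = REXML::Document.new(delta_xml)

  REXML::XPath.match(delta, "/d:delta/d:replace", NS).each do |op|
    REXML::XPath.match(entry, op.attributes["path"], NS).each do |target|
      if target.is_a?(REXML::Attribute)
        # Attribute target: the text of d:replace becomes the new value.
        target.element.add_attribute(target.expanded_name, op.text.to_s)
      else
        # Element target: splice in copies of the d:replace children...
        op.elements.each do |replacement|
          copy = replacement.deep_clone
          # ...keeping the namespace binding that was declared up on d:delta.
          if replacement.prefix.empty?
            copy.add_namespace(replacement.namespace)
          else
            copy.add_namespace(replacement.prefix, replacement.namespace)
          end
          target.parent.insert_before(target, copy)
        end
        # ...then drop the node that was matched (an empty d:replace removes it).
        target.remove
      end
    end
  end
  entry
end
```

An empty d:replace falls out of this as a plain removal, and an attribute path swaps in the text content, which is all the seven substitutions above rely on.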

The Code Behind DocBook Elements in the Wild

Tuesday, May 1st, 2007

[UPDATE: Added a link to the categorized CSV file below]

Here’s some of the nitty-gritty behind DocBook Elements in the Wild. We’re trying to get a count of all of the element names in a set of 49 DocBook 4.4 <book>s.

First, go ask the O’Reilly product database for all the books that were sent to the printer in 2006. Because I’m better at XML than Unix text tools, ask for mysql -X. Now we’ve got something like:

<resultset statement="select...">
 <row>
        <field name="isbn13">9780596101619</field>
        <field name="title">Google Maps Hacks</field>
        <field name="edition">1</field>
        <field name="book_vendor_date">2006-01-05</field>
  </row>
  <row>
        <field name="isbn13">9780596008796</field>
        <field name="title">Excel Scientific and Engineering Cookbook</field>
        <field name="edition">1</field>
        <field name="book_vendor_date">2006-01-06</field>
  </row>
  <row>
        <field name="isbn13">9780596101732</field>
        <field name="title">Active Directory</field>
        <field name="edition">3</field>
        <field name="book_vendor_date">2006-01-06</field>
  </row>
  ...

Next, fun with XMLStarlet:

$ xml sel -t -m "//field[@name='isbn13']" -v '.' -n books_in_2006.xml
9780596101619
9780596008796
9780596101732
9780596009441
...
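
If XMLStarlet isn’t handy, roughly the same extraction can be sketched in Ruby with REXML. The method name `isbn13s` and the trimmed sample document are mine, made up for illustration; the XPath is the same one passed to `xml sel` above.

```ruby
require 'rexml/document'

# Return the text of every <field name="isbn13"> in `mysql -X` output.
def isbn13s(mysql_xml)
  doc = REXML::Document.new(mysql_xml)
  REXML::XPath.match(doc, "//field[@name='isbn13']").map(&:text)
end

# Trimmed sample in the same shape as the resultset shown earlier.
SAMPLE_RESULTSET = <<~XML
  <resultset statement="select...">
    <row>
      <field name="isbn13">9780596101619</field>
      <field name="title">Google Maps Hacks</field>
    </row>
    <row>
      <field name="isbn13">9780596008796</field>
      <field name="title">Excel Scientific and Engineering Cookbook</field>
    </row>
  </resultset>
XML

puts isbn13s(SAMPLE_RESULTSET)
```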

Now, pull the content down from our Atom Publishing Protocol repository and make a big document with XIncludes:

#!/usr/bin/env ruby
require 'cgi'   # CGI.escape is used below
require 'kurt'  # O'Reilly-internal library
require 'rexml/document'
OUTFILE = "aggregate.xml"
files_downloaded = []
ARGV.each {|atom_id|
  entry = Atom::Entry.get_entry("#{Kurt::PROD_RESOURCES}/#{CGI.escape(atom_id)}")
  filename = atom_id.gsub(/\W/, '') + ".xml"
  File.open(filename, "w") {|f|
    f.print entry.content
  }
  files_downloaded << filename
}

agg = REXML::Document.new
agg.add_element("books")
agg.root.add_namespace("xi", "http://www.w3.org/2001/XInclude")
files_downloaded.each {|file|
  xi = agg.root.add_element("xi:include")
  xi.add_attribute("href", file)
}
File.open(OUTFILE, "w") {|f|
  agg.write(f, 2)
}

Resolve all of the XIncludes into one big file:

$ xmllint --xinclude -o aggregate.xml aggregate.xml 

It’s now pretty huge (well, huge in my world):

$ du -h aggregate.xml
102M    aggregate.xml

At this point, we’re ready to do the real counting of the elements (slow REXML solution commented out in favor of a libxml-based solution):

#!/usr/bin/env ruby
require 'rexml/parsers/pullparser'
require 'rubygems'
require 'xml/libxml'
start = Time.now
ARGV.each {|filename|
  counts = Hash.new
#  parser = REXML::Parsers::PullParser.new(File.new(filename))
#  while parser.has_next?
#    el = parser.pull
#    if el.start_element?
#      element_name = el[0]
#      if counts[element_name]
#        counts[element_name] += 1
#      else
#        counts[element_name] = 1
#      end
#    end
#  end
  parser = XML::SaxParser.new
  parser.filename = filename
  parser.on_start_element {|element_name, _|
    if counts[element_name]
      counts[element_name] += 1
    else
      counts[element_name] = 1
    end
  }
  parser.parse

  File.open(filename + ".count.csv", "w") {|f|
    counts.each {|element_name, count|
      f.puts "\"#{element_name}\",#{count}"
    }
  }
}

(Hooray for stream parsing, as this 100MB file was cranked through in 27 seconds on a 700MHz box!)
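
For anyone without libxml installed, the same counting idea can be sketched with REXML’s stream listener from the standard library. This is not the script I ran: the `ElementCounter` class and `count_elements` method are illustrative names, and REXML will very likely be slower than libxml on a file this size.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Counts start tags as the parser streams through the document.
class ElementCounter
  include REXML::StreamListener
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)  # missing keys start at zero
  end

  # Called by REXML once per opening tag.
  def tag_start(name, _attrs)
    @counts[name] += 1
  end
end

# Accepts an XML String or an open IO (e.g. File.new("aggregate.xml")).
def count_elements(source)
  counter = ElementCounter.new
  REXML::Document.parse_stream(source, counter)
  counter.counts
end

counts = count_elements("<book><chapter><para/><para/></chapter></book>")
counts.each { |element_name, count| puts "\"#{element_name}\",#{count}" }
```

Using `Hash.new(0)` removes the need for the if/else around the increment.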

Finally, we’ve got CSV and we can do some graphing. Here’s the full CSV and the categorized CSV. Rather than working on a code-based graphing solution, I just messed with Excel. The result:

DocBook Elements from 49 Books

Here’s my favorite, a drill-down based on a categorization I just made up (click through for the drill-down):

DocBook Elements from 49 Books, Categorized

Books used: