Archive for the ‘AtomPub’ Category
Talk: AtomPub Makes You Cool at Code4Lib2008
Tuesday, February 26th, 2008
I just gave a lightning talk at the fabulous code4lib conference. My boss at Atom Publishing Protocol was cool and how it could teach people about RESTful Web Services and HTTP (again). Here are the slides: http://kfahlgren.com/talks/code4lib2008/atompub_teaches_rest_http.pdf.
How to Present Well (like Joe Gregorio)
Wednesday, July 25th, 2007
A nice tidbit from Joe’s talk at OSCON 2007:
Exposition: I can lie if you can learn
Partial Updates: A Simpler Strawman?
Sunday, June 10th, 2007
James Snell has been working on some interesting things as the work on the Atom Publishing Protocol spec winds down. Most recently, he posted some thoughts on how to effectively communicate partial updates to APP servers using HTTP PATCH.
[UPDATE: James points out the obvious drawback to this approach in his response.]
One of the things that surprised me when I met other APP implementors at the interop was the relative lack of concern they seemed to have about the actual content inside their <atom:entry>s. This may have simply been a simplification on their part for the sake of testing (“if it can accept a single line of XHTML div it can accept anything”, essentially) rather than their real views, but to someone very concerned about perfect content fidelity, it sorta scared me. These tiny <atom:entry>s might hide some of the problems that APP will face in the wild, particularly for document repositories.
Long before the interop, we’d decided internally at O’Reilly to use the Media Resources rather than the <atom:entry> container (in large part because of the size of our DocBook documents, often over 2MB) for our document repository implementation. Because of the larger size of our content blocks, the sort of partial updates that James is thinking about might be quite cool.
The core of James’ strawman is an XML delta syntax (with credit due to Andy Roberts’ work on the same) for HTTP PATCH with 8 operations: insert-before, insert-after, insert-child, replace, remove, remove-all, set-attribute and remove-attribute. Coming at this problem with my experience in document transformation and XSLT, I saw 7 of those operations (everything but ‘replace’) as unnecessary. The basic inspiration is thinking about each operation as an XSLT template. Mentally translate the d:replace/@path into xsl:template/@match and swap the bodies and you’ll be with me (with luck!).
Here’s the specific rundown of the 7 operations other than ‘replace’ working with James’ simple example <atom:entry>:
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.org/foo/boo</id>
  <title>Test</title>
  <updated>2007-12-12T12:12:12Z</updated>
  <summary>Test summary</summary>
  <author>
    <name>James</name>
  </author>
  <link href="http://example.org"/>
</entry>
Note: You’ll have to imagine these working on a much larger XML document than my examples to understand the importance.
insert-before
PATCH /collection/entry/1 HTTP/1.1
Host: example.org
Content-Type: application/delta+xml
Content-Length: nnnn

<d:delta
  xmlns:d="http://purl.org/atompub/delta"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:b="http://example.org/foo">

  <!-- substitute for insert-before
       /atom:entry/atom:author/atom:name
       an atom:email -->
  <d:replace path="/atom:entry/atom:author">
    <atom:author>
      <atom:email>james@example.org</atom:email>
      <atom:name>James</atom:name>
    </atom:author>
  </d:replace>
</d:delta>
insert-after
PATCH /collection/entry/1 HTTP/1.1
Host: example.org
Content-Type: application/delta+xml
Content-Length: nnnn

<d:delta
  xmlns:d="http://purl.org/atompub/delta"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:b="http://example.org/foo">

  <!-- substitute for insert-after
       /atom:entry/atom:author/atom:name
       an atom:uri -->
  <d:replace path="/atom:entry/atom:author">
    <atom:author>
      <atom:name>James</atom:name>
      <atom:uri>http://example.org/blogs/james</atom:uri>
    </atom:author>
  </d:replace>
</d:delta>
insert-child
PATCH /collection/entry/1 HTTP/1.1
Host: example.org
Content-Type: application/delta+xml
Content-Length: nnnn

<d:delta
  xmlns:d="http://purl.org/atompub/delta"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:b="http://example.org/foo">

  <!-- substitute for insert-child
       /atom:entry/atom:author
       an atom:uri -->
  <d:replace path="/atom:entry/atom:author">
    <atom:author>
      <atom:name>James</atom:name>
      <atom:uri>http://example.org/blogs/james</atom:uri>
    </atom:author>
  </d:replace>
</d:delta>
remove
PATCH /collection/entry/1 HTTP/1.1
Host: example.org
Content-Type: application/delta+xml
Content-Length: nnnn

<d:delta
  xmlns:d="http://purl.org/atompub/delta"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:b="http://example.org/foo">

  <!-- substitute for remove
       /atom:entry/atom:author/atom:name -->
  <d:replace path="/atom:entry/atom:author/atom:name">
  </d:replace>
  <!-- yeah, this atom:author is no longer valid ..-->
</d:delta>
remove-all
PATCH /collection/entry/1 HTTP/1.1
Host: example.org
Content-Type: application/delta+xml
Content-Length: nnnn

<d:delta
  xmlns:d="http://purl.org/atompub/delta"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:b="http://example.org/foo">

  <!-- substitute for remove-all
       /atom:entry/* -->
  <d:replace path="/atom:entry/*">
  </d:replace>
  <!-- yeah, this atom:entry is no longer valid ..-->
</d:delta>
set-attribute
PATCH /collection/entry/1 HTTP/1.1
Host: example.org
Content-Type: application/delta+xml
Content-Length: nnnn

<d:delta
  xmlns:d="http://purl.org/atompub/delta"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:b="http://example.org/foo">

  <!-- substitute for set-attribute
       /atom:entry/atom:link/@href
       to http://not-example.org -->
  <d:replace path="/atom:entry/atom:link/@href">http://not-example.org</d:replace>
</d:delta>
remove-attribute
PATCH /collection/entry/1 HTTP/1.1
Host: example.org
Content-Type: application/delta+xml
Content-Length: nnnn

<d:delta
  xmlns:d="http://purl.org/atompub/delta"
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:b="http://example.org/foo">

  <!-- substitute for remove-attribute
       /atom:entry/atom:link/@href -->
  <d:replace path="/atom:entry/atom:link">
    <atom:link/>
  </d:replace>
  <!-- you can't take the easy way and match
       the attribute, because an empty attribute
       (@attr="") means something different than
       the absence of @attr -->
  <!-- and this atom:link is no longer valid ..-->
</d:delta>
I think the above could be fairly easily implemented as a transformation into either XQuery or XSLT, but I’d imagine that it could be implemented using streaming techniques as well. Thoughts?
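To make the "replace is enough" idea concrete, here is a rough Ruby sketch of a server-side applier built on REXML. Everything in it is an assumption for illustration: apply_delta is a made-up helper name, it is not part of James’ strawman or of any APP library, and it takes the shortcut of declaring the atom: prefix on the entry root so that copied replacement elements keep their namespace. It buffers the whole document, so it says nothing about the streaming question.

```ruby
require 'rexml/document'

ATOM_NS    = "http://www.w3.org/2005/Atom"
DELTA_NS   = "http://purl.org/atompub/delta"
NAMESPACES = { "atom" => ATOM_NS, "d" => DELTA_NS }

# Hypothetical helper: apply every d:replace in delta_xml to entry_xml
# and return the patched REXML::Document.
def apply_delta(entry_xml, delta_xml)
  entry = REXML::Document.new(entry_xml)
  delta = REXML::Document.new(delta_xml)
  # Simplification (my assumption): declare atom: on the root so that
  # replacement elements written as <atom:...> stay in the Atom namespace.
  entry.root.add_namespace("atom", ATOM_NS)

  REXML::XPath.each(delta, "/d:delta/d:replace", NAMESPACES) do |op|
    # Collect the targets first, then modify the tree.
    targets = REXML::XPath.match(entry, op.attributes["path"], NAMESPACES)
    targets.each do |target|
      if target.is_a?(REXML::Attribute)
        # Attribute target: the text body of d:replace becomes the new value.
        # An empty body yields attr="", which is why remove-attribute has to
        # replace the parent element instead.
        target.element.add_attribute(target.expanded_name, op.text.to_s)
      else
        # Element target: splice in copies of the replacement children,
        # then drop the original. An empty body is a plain removal.
        op.elements.each { |child| target.parent.insert_before(target, child.deep_clone) }
        target.remove
      end
    end
  end
  entry
end

ENTRY = <<~XML
  <entry xmlns="http://www.w3.org/2005/Atom">
    <title>Test</title>
    <author>
      <name>James</name>
    </author>
    <link href="http://example.org"/>
  </entry>
XML

INSERT_BEFORE = <<~XML
  <d:delta xmlns:d="http://purl.org/atompub/delta"
           xmlns:atom="http://www.w3.org/2005/Atom">
    <d:replace path="/atom:entry/atom:author">
      <atom:author>
        <atom:email>james@example.org</atom:email>
        <atom:name>James</atom:name>
      </atom:author>
    </d:replace>
  </d:delta>
XML

puts apply_delta(ENTRY, INSERT_BEFORE)
```

The two branches are the whole trick: every operation in the rundown above is either "swap this element for these children" or "swap this attribute value for this text".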
The Code Behind DocBook Elements in the Wild
Tuesday, May 1st, 2007
[UPDATE: Added a link to the categorized CSV file below]
Here’s some of the nitty-gritty behind DocBook Elements in the Wild. We’re trying to get a count of all of the element names in a set of 49 DocBook 4.4 <book>s.
First, go ask the O’Reilly product database for all the books that were sent to the printer in 2006. Because I’m better at XML than Unix text tools, ask for mysql -X. Now we’ve got something like:
<resultset statement="select...">
  <row>
    <field name="isbn13">9780596101619</field>
    <field name="title">Google Maps Hacks</field>
    <field name="edition">1</field>
    <field name="book_vendor_date">2006-01-05</field>
  </row>
  <row>
    <field name="isbn13">9780596008796</field>
    <field name="title">Excel Scientific and Engineering Cookbook</field>
    <field name="edition">1</field>
    <field name="book_vendor_date">2006-01-06</field>
  </row>
  <row>
    <field name="isbn13">9780596101732</field>
    <field name="title">Active Directory</field>
    <field name="edition">3</field>
    <field name="book_vendor_date">2006-01-06</field>
  </row>
  ...
Next, fun with XMLStarlet:
$ xml sel -t -m "//field[@name='isbn13']" -v '.' -n books_in_2006.xml
9780596101619
9780596008796
9780596101732
9780596009441
...
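If XMLStarlet isn’t handy, roughly the same extraction can be done in Ruby with REXML and the same XPath expression. This is only a sketch: isbn13s is a name I made up, and the sample input is a cut-down copy of the mysql -X output above.

```ruby
require 'rexml/document'

# Hypothetical helper: return the text of every <field name="isbn13">
# in a chunk of mysql -X output.
def isbn13s(mysql_xml)
  doc = REXML::Document.new(mysql_xml)
  REXML::XPath.match(doc, "//field[@name='isbn13']").map { |field| field.text }
end

SAMPLE = <<~XML
  <resultset statement="select...">
    <row>
      <field name="isbn13">9780596101619</field>
      <field name="title">Google Maps Hacks</field>
    </row>
    <row>
      <field name="isbn13">9780596008796</field>
      <field name="title">Excel Scientific and Engineering Cookbook</field>
    </row>
  </resultset>
XML

# One ISBN per line, like the xml sel output.
puts isbn13s(SAMPLE)
```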
Now, pull the content down from our Atom Publishing Protocol repository and make a big document with XIncludes:
#!/usr/bin/env ruby
require 'cgi'  # for CGI.escape
require 'kurt' # defines Kurt::PROD_RESOURCES, used below
require 'rexml/document'

OUTFILE = "aggregate.xml"
files_downloaded = []

# Each command-line argument is an Atom ID; save that entry's content locally
ARGV.each {|atom_id|
  entry = Atom::Entry.get_entry("#{Kurt::PROD_RESOURCES}/#{CGI.escape(atom_id)}")
  filename = atom_id.gsub(/\W/, '') + ".xml"
  File.open(filename, "w") {|f|
    f.print entry.content
  }
  files_downloaded << filename
}

# Build a <books> wrapper with one xi:include per downloaded file
agg = REXML::Document.new
agg.add_element("books")
agg.root.add_namespace("xi", "http://www.w3.org/2001/XInclude")
files_downloaded.each {|file|
  xi = agg.root.add_element("xi:include")
  xi.add_attribute("href", file)
}

File.open(OUTFILE, "w") {|f|
  agg.write(f, 2)
}
Resolve all of the XIncludes into one big file:
$ xmllint --xinclude -o aggregate.xml aggregate.xml
It’s now pretty huge (well, huge in my world):
$ du -h aggregate.xml
102M    aggregate.xml
At this point, we’re ready to do the real counting of the elements (slow REXML solution commented out in favor of a libxml-based solution):
#!/usr/bin/env ruby
require 'rexml/parsers/pullparser'
require 'rubygems'
require 'xml/libxml'

start = Time.now
ARGV.each {|filename|
  counts = Hash.new
  # parser = REXML::Parsers::PullParser.new(File.new(filename))
  # while parser.has_next?
  #   el = parser.pull
  #   if el.start_element?
  #     element_name = el[0]
  #     if counts[element_name]
  #       counts[element_name] += 1
  #     else
  #       counts[element_name] = 1
  #     end
  #   end
  # end
  parser = XML::SaxParser.new
  parser.filename = filename
  parser.on_start_element {|element_name, _|
    if counts[element_name]
      counts[element_name] += 1
    else
      counts[element_name] = 1
    end
  }
  parser.parse
  File.open(filename + ".count.csv", "w") {|f|
    counts.each {|element_name, count|
      f.puts "\"#{element_name}\",#{count}"
    }
  }
}
(Hooray for stream parsing, as this 100MB file was cranked through in 27 seconds on a 700MHz box!)
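For anyone who wants to try the counting step without libxml installed, here is a small self-contained sketch that uses only REXML’s pull parser (the slower route mentioned above). count_elements is a hypothetical helper name, and Hash.new(0) stands in for the if/else bookkeeping; it is meant for experimenting on small inputs, not as a replacement for the script that produced the numbers.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical helper: count start tags by element name in an XML string or IO.
def count_elements(xml)
  counts = Hash.new(0) # missing names start at zero
  parser = REXML::Parsers::PullParser.new(xml)
  while parser.has_next?
    event = parser.pull
    # For a start_element event, event[0] is the element name.
    counts[event[0]] += 1 if event.start_element?
  end
  counts
end

SAMPLE_BOOK = "<book><chapter><para>a</para><para>b</para></chapter></book>"

# Emit the same "name",count CSV rows as the script above, busiest element first.
count_elements(SAMPLE_BOOK).sort_by { |_, count| -count }.each do |name, count|
  puts "\"#{name}\",#{count}"
end
```

Because the parser only ever holds the current event, memory stays flat no matter how large the aggregate file gets.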
Finally, we’ve got CSV and we can do some graphing. Here’s the full CSV and the categorized CSV. Rather than working on a code-based graphing solution, I just messed with Excel. The result:
Here’s my favorite, a drill-down based on a categorization I just made up (click through for the drill-down):
Books used:
- Google Maps Hacks, 1e
- Excel Scientific and Engineering Cookbook, 1e
- Active Directory, 3e
- RFID Essentials, 1e
- Visual Basic 2005 in a Nutshell, 3e
- PSP Hacks, 1e
- Baseball Hacks, 1e
- Mind Performance Hacks, 1e
- Repairing and Upgrading Your PC, 1e
- Web Site Cookbook, 1e
- Flickr Hacks, 1e
- Fixing Access Annoyances, 1e
- Fixing PowerPoint Annoyances, 1e
- Programming SQL Server 2005, 1e
- Learning C# 2005, 2e
- Photoshop CS2 RAW, 1e
- Web Design in a Nutshell, 3e
- Google: The Missing Manual, 2e
- Don’t Get Burned on eBay, 1e
- The Art of SQL, 1e
- Fixing Windows XP Annoyances, 1e
- iPhoto 6: The Missing Manual, 1e
- iPod & iTunes: The Missing Manual, 4e
- Ajax Hacks, 1e
- Flash 8: The Missing Manual, 1e
- MySQL Stored Procedure Programming, 1e
- Flash 8: Projects for Learning Animation and Interactivity, 1e
- XAML in a Nutshell, 1e
- Linux Annoyances for Geeks, 1e
- Programming PHP, 2e
- Flash 8 Cookbook, 1e
- Learning SQL on SQL Server 2005, 1e
- Programming Excel with VBA and .NET, 1e
- iMovie 6 & iDVD: The Missing Manual, 1e
- Enterprise SOA, 1e
- Perl Hacks, 1e
- Java I/O, 2e
- Enterprise JavaBeans 3.0, 5e
- Building Scalable Web Sites, 1e
- MCSE Core Required Exams in a Nutshell, 3e
- DNS and BIND, 5e
- Learning PHP and MySQL, 1e
- Computer Security Basics, 2e
- Active Directory Cookbook, 2e
- Ubuntu Hacks, 1e
- Unicode Explained, 1e
- Digital Photography: The Missing Manual, 1e
- Ajax Design Patterns, 1e
- Python in a Nutshell, 2e
APP Interop Pictures
Tuesday, April 17th, 2007
I took a couple of quick shots of the group:
and the grid:
More details on my xml.com blog post.
Tim Bray has better photos here.
Ruby and the Atom Publishing Protocol
Saturday, February 24th, 2007
I gave a short talk at the first North Bay Ruby Users Group last Thursday (Feb 15, 2007) about my recent work implementing an Atom Publishing Protocol library in Ruby. Here’s the presentation: