Wednesday, March 25, 2009

On XML Parsers.

Ruby has recently matured in the world of XML processing, with libxml-ruby finally hitting 1.x. Of course, we already had quite a few parsers, including Nokogiri, which is also built on libxml2. Then there are the old standbys: Hpricot, and REXML, which typically ships with Ruby.

That's awesome news, right? Of course it is. And the benchmarks for libxml-ruby and Nokogiri are astounding for XML documents. Hpricot will always hold a place in my heart for HTML documents, as it's especially good at picking up silly HTML/CSS markup.

However, there's one concern that's bugged me: gems that depend on just one of these parsers.

First of all, each one changes all the time, and not all are compatible with Ruby 1.9.x; one may even stop being supported at some point in the future. Plus, some Ruby hosts (particularly the inexpensive ones) don't let you install C-based gems willy-nilly. You have to go through the sysadmin, who has to make sure the code is safe, and it's a big mess.

So why not support all of them? That's what I suggest.

In a soon-to-be-released little toy project of mine, a wrapper for the fmylife.com API, I needed to parse XML. So I wrote a tiny module called CanParse:

    module CanParse
      # Parses an XML string into a document using the configured parser.
      def xml_doc(body)
        case FMyLife.parser
        when :nokogiri
          Nokogiri::XML(body)
        when :hpricot
          Hpricot(body)
        when :rexml
          REXML::Document.new(body)
        when :libxml
          LibXML::XML::Parser.string(body).parse
        end
      end

      # Returns the nodes under element that match the given path.
      def xpath(element, path)
        case FMyLife.parser
        when :nokogiri
          element.xpath(path)
        when :hpricot
          element / path
        when :rexml
          REXML::XPath.match(element, path)
        when :libxml
          element.find(path)
        end
      end

      # Gets the text content of a node.
      def xml_content(element)
        case FMyLife.parser
        when :nokogiri
          element.content
        when :hpricot
          element.inner_text
        when :rexml
          element.text
        when :libxml
          element.content
        end
      end

      # Gets the value of the named attribute on a node.
      def xml_attribute(element, attribute)
        case FMyLife.parser
        when :nokogiri
          element[attribute]
        when :hpricot
          element.get_attribute(attribute)
        when :rexml
          element.attributes[attribute]
        when :libxml
          element.attributes[attribute]
        end
      end
    end

I include that module in my classes that require parsing, and bam - I can use any of the big 4 XML parsers. If you need different functions, just add a new method and pull up the RDocs of the respective parsers; it takes about 10 minutes for each one. And you don't need to change any other code, anywhere.
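
The module leans on FMyLife.parser to pick a branch, but this post doesn't show how that setting gets its value. Here is a rough sketch of one way it might work, assuming a hypothetical detect_parser helper and CANDIDATES list (neither is from the actual project): try each library in order of preference and fall back to REXML, which usually comes with Ruby.

```ruby
# Hypothetical stand-in for the real FMyLife module. The CANDIDATES list and
# detect_parser helper are assumptions made for this sketch.
module FMyLife
  # Parser symbol and the library to require for it, in order of preference.
  CANDIDATES = [
    [:nokogiri, 'nokogiri'],
    [:libxml,   'libxml'],
    [:hpricot,  'hpricot'],
    [:rexml,    'rexml/document']
  ].freeze

  class << self
    # Lets the user force a specific parser, e.g. FMyLife.parser = :rexml
    attr_writer :parser

    # Returns the chosen parser, detecting one on first use if none was set.
    def parser
      @parser ||= detect_parser
    end

    # Requires each candidate library in turn and returns the symbol of the
    # first one that loads.
    def detect_parser
      CANDIDATES.each do |name, lib|
        begin
          require lib
          return name
        rescue LoadError
          next # not installed on this host, try the next one
        end
      end
      raise LoadError, 'no supported XML parser found'
    end
  end
end

puts "Using #{FMyLife.parser} for XML parsing"
```

On a host that won't let you install C-based gems, this would quietly land on :rexml, and anyone who prefers a particular parser can still set FMyLife.parser themselves before making a request.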

Note: the funny gsub I use for Hpricot's XPath is because it will assume it is a regular tag, and that can cause a little fruitiness with FMyLife's XML documents. Feel free to tweak it as necessary.
