before_filter :fix_bugs: On XML Parsers.

Ruby has recently become mature in the world of XML processing, with libxml-ruby finally hitting 1.x. Of course, we already had quite a few parsers, including Nokogiri, which is also based on libxml2. Then there are the old standbys: Hpricot, and REXML, which is typically installed with Ruby.

That's awesome news, right? Of course it is. And the benchmarks for libxml-ruby and Nokogiri are astounding for XML documents. Hpricot will always hold a place in my heart for HTML documents, as it's especially good at picking up silly HTML/CSS markup.

However, one concern has bugged me: gems that rely on just one of these parsers.

First of all, each one is changing all the time, and not all are compatible with Ruby 1.9.x; one may even end up unsupported at some point in the future. Plus, some Ruby hosts (particularly the inexpensive ones) don't let you install any C-based gem willy-nilly. You have to go through the sysadmin, who has to make sure the code is safe, and it's a big mess.

So why not support all of them? That's what I suggest.

In a soon-to-be-released little toy project of mine, a wrapper for the fmylife.com API, I needed to parse XML. So I wrote a tiny module called CanParse:

module CanParse
  # Parses an XML string into a document using the configured parser.
  # The chosen parser's library must already be required by the caller.
  def xml_doc(body)
    case FMyLife.parser
    when :nokogiri
      Nokogiri::XML(body)
    when :hpricot
      Hpricot(body)
    when :rexml
      REXML::Document.new(body)
    when :libxml
      LibXML::XML::Parser.string(body).parse
    end
  end

  # Returns the nodes under element that match path.
  def xpath(element, path)
    case FMyLife.parser
    when :nokogiri
      element.xpath(path)
    when :hpricot
      element/path
    when :rexml
      REXML::XPath.match(element, path)
    when :libxml
      element.find(path)
    end
  end

  # Gets the text content of a node.
  def xml_content(element)
    case FMyLife.parser
    when :nokogiri
      element.content
    when :hpricot
      element.inner_text
    when :rexml
      element.text
    when :libxml
      element.content
    end
  end

  # Gets the value of the named attribute on a node.
  def xml_attribute(element, attribute)
    case FMyLife.parser
    when :nokogiri
      element[attribute]
    when :hpricot
      element.get_attribute(attribute)
    when :rexml
      element.attributes[attribute]
    when :libxml
      element.attributes[attribute]
    end
  end
end

I include that module in my classes that require parsing, and bam: I can use any of the big 4 XML parsers. If you need different functions, just add a new method and pull up the RDocs of the respective parsers; it literally takes about 10 minutes for each one. And you don't need to change any code anywhere else.
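To make the "include it and go" part concrete, here is a minimal sketch of how a class might use the module. It is illustrative only: the FMyLife stub, the StoryReader class, and the sample XML layout are all made up for this example and are not the real wrapper or the real fmylife.com response format. To keep it runnable without any C-based gems, the copy of CanParse below carries only the REXML branches.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the real FMyLife module: it only remembers
# which parser the wrapper should use.
module FMyLife
  class << self
    attr_accessor :parser
  end
  self.parser = :rexml
end

# Trimmed copy of CanParse with only the REXML branches, so this sketch
# runs on a stock Ruby install.
module CanParse
  def xml_doc(body)
    case FMyLife.parser
    when :rexml then REXML::Document.new(body)
    end
  end

  def xpath(element, path)
    case FMyLife.parser
    when :rexml then REXML::XPath.match(element, path)
    end
  end

  def xml_content(element)
    case FMyLife.parser
    when :rexml then element.text
    end
  end

  def xml_attribute(element, attribute)
    case FMyLife.parser
    when :rexml then element.attributes[attribute]
    end
  end
end

# Hypothetical class that needs parsing. It never names a parser itself;
# everything goes through the CanParse helpers.
class StoryReader
  include CanParse

  def stories(body)
    xpath(xml_doc(body), '//item').map do |item|
      { id: xml_attribute(item, 'id'), text: xml_content(item) }
    end
  end
end

# Made-up sample document, not the real API response.
SAMPLE_XML = '<root><items>' \
             '<item id="1">Today, I tripped.</item>' \
             '<item id="2">Today, it rained.</item>' \
             '</items></root>'

RESULT = StoryReader.new.stories(SAMPLE_XML)
```

Because StoryReader only talks to the four helper methods, switching parsers should come down to changing FMyLife.parser (and requiring the matching library), assuming the full module with all four branches is in place.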

Note: the funny gsub I use for Hpricot's XPath is because it will assume it is a regular tag, and that can cause a little fruitiness with FMyLife's XML documents. Feel free to tweak it as necessary.

Wednesday, March 25, 2009