It’s just data

Last Line of Defense

GIGO.  It is easier to produce correct XML output if you start with correct XML input.  One way to achieve this is to ensure that data that is not well-formed XML can never be stored.  With Ruby on Rails, this can be enforced with validation rules that invoke a parser and record a validation error if parsing fails, thus:

require 'xml/parser'
 
class Entry < ActiveRecord::Base

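  # Check each field for well-formedness before it can be saved.
  validates_each :title, :summary, :content do |model, attr, value|
    # One parser, created lazily and reused across validations.
    @@xmlparser ||= XML::Parser.new
    begin
      # Wrap the value in a <div> so that a fragment with mixed content
      # parses as a single document.
      @@xmlparser.parse "<div>#{value}</div>" if value
    rescue
      model.errors.add attr, 'is not well formed XML'
    ensure
      # Reset whether or not parsing succeeded, so the parser can be reused.
      @@xmlparser.reset
    end
  end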

end

And tests such as these can verify the correct operation:

class EntryTest < Test::Unit::TestCase
  fixtures :entries

  def setup
    @entry = Entry.find(:first)
  end

  def test_title_not_wellformed
    @entry.title = "AT&amp;T"
    assert @entry.save, "well formed title can't be saved"

    @entry.title = "AT&T"
    assert !@entry.save, "not well formed title saved"
    assert_equal "is not well formed XML", @entry.errors.on(:title)
  end

end

As a footnote, the verification logic took three attempts to get right.  My first attempt was to use REXML.  While it is certainly the most elegant Ruby XML API, it seems to accept a variety of ill-formed XML fragments.  For example, the following produces no error:

require 'rexml/document'
REXML::Document.new("<div>at&t")
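
For anyone who wants to repeat the experiment across several fragments, here is a minimal sketch.  The helper name rexml_accepts? is made up for illustration, and what it reports depends on the REXML release installed, so no expected output is shown.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of the original validation code): wraps a
# fragment in a <div>, as the validation rule does, and reports whether the
# installed REXML parses it without raising.
def rexml_accepts?(fragment)
  REXML::Document.new("<div>#{fragment}</div>")
  true
rescue StandardError
  # Any exception raised while parsing counts as a rejection.
  false
end

# Print what this REXML version does with one good and two bad fragments.
['AT&amp;T', 'AT&T', '<b>unclosed'].each do |fragment|
  verdict = rexml_accepts?(fragment) ? 'accepted' : 'rejected'
  puts "#{fragment.inspect}: #{verdict}"
end
```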

Next, I tried libxml2.  While the following correctly reported the errors, it also did so on STDERR.

require 'xml/libxml'
p = XML::Parser.new
p.string = "<div>at&t"
p.parse
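
One blunt way to keep such output off the console is to point file descriptor 2 at the null device while the parse runs.  This is a sketch under assumptions, not something tested against the libxml binding, and the helper name with_silenced_stderr is hypothetical.  Reassigning $stderr alone would not be enough, since a C library writes to the descriptor directly.

```ruby
# Hypothetical helper: runs the block with the process-wide standard error
# descriptor redirected to the null device, then restores it.
def with_silenced_stderr
  saved = STDERR.dup                  # keep a copy of the original descriptor
  STDERR.reopen(File::NULL, 'w')      # fd 2 now discards everything written
  yield
ensure
  if saved
    STDERR.reopen(saved)              # put the original descriptor back
    saved.close
  end
end

result = with_silenced_stderr do
  STDERR.puts 'this line is discarded'
  :done
end
STDERR.puts "stderr restored, block returned #{result.inspect}"
```

The redirection affects the whole process, so anything else written to STDERR during the block, from any thread, is lost as well.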

My third attempt uses Expat and serves my needs just fine.


Peeking into the implementation of REXML, I see that it is riddled with regular expressions.  Having a parser that doesn’t detect errors properly is one thing, but having a parser that incorrectly parses valid input is quite another.  I’ve opened a ticket on one such problem.  Depending on how it is received, I may open others.

Posted by Sam Ruby at

What’s the rubylang solution to detect an Atom entry with two atom:id elements?

Posted by Robert Sayre at

validates_uniqueness_of :atomid

Posted by Sam Ruby at

While the following correctly reported the errors, it also did so on STDERR

I had the same problem. After lots of trawling through undocumented spaghetti code, I found it can be solved (in C) with a simple xmlSetGenericErrorFunc(NULL,xmlErrorHandler), where xmlErrorHandler is a dummy function that does nothing. Don’t know about Ruby.

Posted by Graham Parks at

Sam Ruby: Last Line of Defense

Neat technique for validating XML input in Rails...

Excerpt from del.icio.us/tag/rails at

Today's XML WTF

via Sam Ruby : While [ REXML ] is certainly the most elegant Ruby XML API, it seems to accept a variety of ill-formed XML fragments, for example the following produces no error: [ <div>at&t ] F’real? That is, not only missing end tag, but...

Excerpt from Planet XML at

Uche and Chimezie Ogbuji: Today's XML WTF

via Sam Ruby: While [ REXML] is certainly the most elegant Ruby XML API, it seems to accept a variety of ill-formed XML fragments, for example the following produces no error: [<div>at&t] F’real? That is, not only missing end tag, but...

Excerpt from Planet Swhack at

Patch for libxml2’s Ruby binding

I mentioned previously that libxml2 had a habit of writing to STDERR.  With the Python bindings, this can be mitigated by the use of an error handler global to the library.  The steps below describe how to add equivalent functionality to Ruby’s... [more]

Trackback from Sam Ruby at

Crawling LJ

For the final project for my web architecture class, I can choose what I want to do as long as it’s sufficiently webby. I have a lot of ideas saved, but I’ll probably work alone and the project is due in less than a month; the proposal is due...

Excerpt from Thought Torrent at

On Ruby

I have previously admired the Ruby language, albeit from a distance, and been impressed by the vigor of the Rails community. In the last week I have written a few hundred lines of Ruby code that actually do something useful and I’ll probably release...

Excerpt from ongoing at

Sam Ruby: Last Line of Defense

REXML apparently has some serious correctness issues. Too bad; it’s an excellent API and quite nice to work with....

Excerpt from del.icio.us/ekidd/ruby+xml at
