It’s just data

XML Cleansing

If you accept data from various sources, and want to produce XML that can be consumed, one thing you need to be careful about is character set issues.

On the input side, people often lie or make mistakes.  Many don’t specify an encoding, and while XML’s default is utf-8, it is common to find iso-8859-1 or even win-1252 data.
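A minimal sketch of that fallback order, assuming the input arrives as raw bytes with no trustworthy label. The helper name `guess_utf8` is hypothetical and is not part of xchar.rb; it only illustrates trying utf-8 first and treating everything else as win-1252.

```ruby
# Hypothetical helper, for illustration only.
def guess_utf8(bytes)
  utf8 = bytes.dup.force_encoding('UTF-8')
  return utf8 if utf8.valid_encoding?   # well-formed utf-8: accept as is

  # Otherwise assume win-1252, which covers the printable range of
  # iso-8859-1. Any byte the transcoder cannot map becomes U+FFFD.
  bytes.dup.force_encoding('Windows-1252')
       .encode('UTF-8', invalid: :replace, undef: :replace, replace: "\uFFFD")
end

puts guess_utf8("caf\xC3\xA9".b)   # already utf-8
puts guess_utf8("caf\xE9".b)       # iso-8859-1 / win-1252 byte
```

The guess can still be wrong (other single-byte encodings also fail the utf-8 check), so this is a heuristic rather than detection.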

On the output side, if you want to produce something that can be consumed, then it behooves you to be aware that the quality of XML parsers out there varies widely.  Many of the initial feed aggregators were no better than regular expressions, simply ignoring character set issues and slapping descriptions into HTML.  While there has been much improvement on this front, many still fall back to such behaviors when they encounter other, unrelated, problems.

Carrying forward the experience I gained with my existing Python implementation of my weblog, I’ve come up with xchar.rb: some data, two small methods, and six tests.

One potential use of this would be in Ruby’s XML Builder:

# Route all text that Builder escapes through to_xs (defined in xchar.rb).
class Builder::XmlMarkup < Builder::XmlBase
  def _escape(text)
    text.to_xs
  end
end
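For readers without xchar.rb at hand, here is a rough sketch of the kind of escaping such a method might perform. It is not the contents of xchar.rb: the names `xs_sketch`, `escape_codepoint`, and `xml_char?` are hypothetical, the win-1252 mapping is omitted, and replacing forbidden characters with `*` is just one possible choice.

```ruby
# Hypothetical sketch; names are illustrative only.
PREDEFINED = { 38 => '&amp;', 60 => '&lt;', 62 => '&gt;' }.freeze

# Code points permitted by the XML 1.0 Char production.
def xml_char?(n)
  [0x9, 0xA, 0xD].include?(n) ||
    (0x20..0xD7FF).cover?(n) ||
    (0xE000..0xFFFD).cover?(n) ||
    (0x10000..0x10FFFF).cover?(n)
end

def escape_codepoint(n)
  return '*' unless xml_char?(n)   # stand-in for characters XML forbids
  PREDEFINED[n] || (n < 128 ? n.chr : "&##{n};")
end

def xs_sketch(text)
  codepoints = begin
    text.unpack('U*')              # treat the input as utf-8
  rescue ArgumentError
    text.unpack('C*')              # malformed utf-8: read bytes as iso-8859-1
  end
  codepoints.map { |n| escape_codepoint(n) }.join
end

puts xs_sketch("\xC2\xA9 2006")
puts xs_sketch("a < b & c")
```

The output is pure ASCII, which is why it survives consumers that ignore the declared encoding, at the cost of bulkier markup.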

“May don’t specify an encoding”

That’s ‘Many’, not ‘May’, right?

Posted by Dilip at

Dilip: fixed.  Thanks!

Posted by Sam Ruby at

one thing you need to be careful about is character set issues.

That should be character encoding issues, right?

Posted by Jim at

Sam Ruby: XML Cleansing

[link]...

Excerpt from del.icio.us/tag/ruby at

Python Web goodies

There has been lots of progress in the python world recently. I keep opening posts but not getting time to write them. So this is more a list than comment (well that is how it started, now grown a bit......

Excerpt from 42 at

RDF as XML

Over the last week, Planet RDF has seen more than a few posts and comments on the RDF/XML serialisation syntax, most of them looking into its (almost not enumerable) possible variations. Danny Ayers has a great overview with reference to the...

Excerpt from Planet RDF at

Dragons be gone

Luckily, I’m outside of arms reach.  You see, my weblog is 100% valid XHTML 1.1, encoded as utf-8. Truth be told, however, it also would be considered as 100% valid XHTML 1.1, encoded as iso-8859-1 (roman), iso-8859-5 (cyrillic), win-1252 (Micro... [more]

Trackback from Sam Ruby at

Sam Ruby: XML Cleansing

Someone at Smarking has bookmarked your post.... [more]

Trackback from Smarking at

Sooo, this is why all my new Rails apps produce unreadable XML with the new Builder...

Sam, can you guess in three steps why this was a very bad solution (as I figure you coined it)?

Posted by Julik at

Sam, can you guess in three steps why this was a very bad solution

Encoding other than utf-8?

I can point to Rails apps that produced unreadable XML with the old builder.

If you provide more details, I can construct a test case and a fix.

Posted by Sam Ruby at

Here’s a patch which assumes that people who specify an encoding other than utf-8 know what they are doing.

Posted by Sam Ruby at

See the mail. Case in point - in my system all that goes out is raw, bona fide UTF-8. I would like to have it unescaped in my XML output as well. Right now every Russian letter I output via Builder gets escaped.

Posted by Julik at

Just to be clear, it is not exactly raw.  There are a number of characters that must be escaped.  < and &, to name but two.

I just want to make sure that your issue is a cosmetic one, not a functional one.  I guessed before what your issue might be, and apparently I guessed wrong.  I’d like to not guess any more.

If it turns out that your issue is just the cosmetic one, I am quite prepared to make a patch that assumes that people that require 'jcode' and set $KCODE='u' are prepared to handle utf-8.  Everyone else will get a bulkier, but safer, result.

Posted by Sam Ruby at

Well, for me the issue is mostly cosmetic, yes - the feed becomes basically impossible to read in plain-text. Enable escaping on all characters and try to examine your feed - that’s what I see. Besides, there are problems with your “escape-everything” approach:

1. I can’t handle feeds I might download using any simple text search tools (we had a discussion on that already)
2. I can’t grab these feeds and save them elsewhere without firing up a parser
3. Browsers have problems with this kind of escaping when it is used in the form context (for pre-filled values) when Builder is used to output HTML.
4. Every feed I now generate has grown in raw size, which means longer download times and (if escalated) a bandwidth bill. Many of the people who have to read my feeds are on dialup.
5. RSS readers get confused when fed this stuff inside an HTML containing block within an entry; they don’t know what to display.

In short - unless you have an XML parser on the receiving side (which might not be the case) the “escape-all” approach is a no-no.

I would love to keep the encoding of basic entities like ampersands, but to be exempt from this... uhm... precaution. An xml.escape_utf = false or such would be perfect. Don’t expect UTF8 users in Ruby to be running with jcode all the time (I am not, for some time).

Posted by Julik at

Besides, you have a relatively foolproof way to verify whether a passed string is UTF-8; if the encoding of the builder is set to UTF-8, you can just leave it as is.

Posted by Julik at

Browsers have problems with this kind of escaping when it is used in the form context (for pre-filled values) when Builder is used to output HTML.

Oh, really?  What browser do you use?  Can it handle this?

The easiest way to verify a string is UTF-8 is to send it unpack("U*")
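A small sketch of that check, relying on `unpack('U*')` raising `ArgumentError` when it meets malformed utf-8. The method name `utf8?` is hypothetical.

```ruby
# Hypothetical helper name, for illustration only.
def utf8?(bytes)
  bytes.unpack('U*')   # raises ArgumentError on malformed utf-8
  true
rescue ArgumentError
  false
end

p utf8?("\xC2\xA9 2006")   # well-formed two-byte sequence
p utf8?("caf\xE9")         # lone iso-8859-1 byte
```

On Ruby 1.9 and later, `String#valid_encoding?` on a string tagged as UTF-8 is an alternative way to ask the same question.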

If you were to check the documentation, you would see that Jim has already provided means of inserting strings verbatim via the shift and symbol operators.  Those that use these operators need to take extra care.  I’m interested in helping everybody else out.

I said I would write a patch to allow those that use jcode and pass in correct utf-8 to not have high bit characters escaped.  I have now done so.

I’m merely a person submitting patches.  It is Jim that you need to convince.

Posted by Sam Ruby at

Pronto. Will try to create a patch that leaves bona-fide UTF-8 intact if the builder is instructed for utf-8.

Posted by Julik at

Julik, something you might want to try first:

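require 'builder'
b = Builder::XmlMarkup.new
b.rights "\xC2\xA9 2006"   # utf-8 bytes for the copyright sign

puts "before: " + b.target!

# Reopen the class so that target! converts decimal character
# references back into utf-8 characters.
class Builder::XmlMarkup
  def target!
    @target.gsub(/&#(\d+);/) {[$1.to_i].pack('U*')}
  end
end

puts "after:  " + b.target!

Posted by Sam Ruby at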

ASCII, ISO-8859-1, UCS, and Erlang

Tony Garnock-Jones:  It is important to realize that Erlang was invented (in 1987) before utf-8 was (in 1992). Now, let’s explore the relationship between ASCII, ISO-8859-1, and UCS (a.k.a. Unicode), by way of example. ... [more]

Trackback from Sam Ruby at

I’ve written a C implementation of your code here:

[link]

It extends the String class by providing the fast_xs
method to it (equivalent to to_xs) and is roughly
70 times faster.  Hooked into Builder::XmlMarkup,
this provides roughly a ten-fold increase on some
RSS feeds I’m testing with Rails (1.2.3).

I’ll tell Jim and _why (Hpricot) about it, too

Posted by Eric Wong at

Hi Sam,

I’m using Builder and attempting to send serialized ruby into an xml node (Marshal.dump) — this doesn’t seem to be working though when I call Marshal.load after parsing the xml. Is this ruby’s Marshal object or a limitation of the encoding scheme?

Thanks,
Matt

Posted by Matt Mitchell at

Escaping XML in Ruby

Looked around and found this post from Sam Ruby, who wrote the code to escape XML that was included in builder. Here is a short class I wrote to abstract out the XML escaping functionality, and be sure it is a string before calling to_xs on it....

Excerpt from Stuff to Help You Out at
