If you accept data from various sources, and want to produce XML
that can be consumed, one thing you need to be careful about is
character set issues.
On the input side, people often lie or make mistakes. Many
don’t specify an encoding, and while XML’s default is
utf-8, it is common to find iso-8859-1 or even win-1252 data.
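A minimal sketch of how the input side might be handled, assuming modern Ruby string encodings: trust the bytes if they are valid utf-8, and otherwise guess win-1252, which agrees with iso-8859-1 everywhere except bytes 0x80-0x9F. The helper name `to_utf8` is hypothetical and is not part of xchar.rb.

```ruby
# Hypothetical helper, for illustration only.
# Returns a UTF-8 string for bytes of unknown or mislabeled encoding.
def to_utf8(bytes)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Not valid utf-8: assume win-1252. The five byte values that
  # win-1252 leaves undefined are replaced with "?".
  bytes.dup.force_encoding(Encoding::Windows_1252)
       .encode(Encoding::UTF_8, undef: :replace, replace: "?")
end
```

The guess can be wrong, of course: any byte sequence decodes as win-1252, so this only recovers text that really was in a Western European encoding.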
On the output side, if you want to produce something that can be
consumed, then it behooves you to be aware that the quality of XML
parsers out there varies widely. Many of the initial feed
aggregators were no better than regular expressions, simply
ignoring character set issues and slapping descriptions into
HTML. While there has been much improvement on this front,
many still fall back to such behaviors when they encounter other,
unrelated, problems.
Carrying forward the experience I gained with my existing Python
implementation of my weblog, I’ve come up with
xchar.rb: some data, two small methods, and six tests.
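To give a feel for the output side, here is a rough sketch of that kind of escaping written for current Ruby. It is not the contents of xchar.rb: the names `XML_PREDEFINED`, `xml_char?` and `xml_escape` are made up for illustration, and the input is assumed to be valid utf-8 already. Markup characters become entities, anything XML 1.0 does not allow becomes `*`, and everything outside ASCII becomes a numeric character reference, so the result survives consumers that ignore the declared encoding.

```ruby
# Illustrative names only; this is a sketch, not xchar.rb itself.
XML_PREDEFINED = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# True for code points that XML 1.0 permits in character data.
def xml_char?(cp)
  cp == 0x9 || cp == 0xA || cp == 0xD ||
    (0x20..0xD7FF).cover?(cp) ||
    (0xE000..0xFFFD).cover?(cp) ||
    (0x10000..0x10FFFF).cover?(cp)
end

# Escapes a valid UTF-8 string so the result is pure ASCII.
def xml_escape(text)
  text.each_char.map do |ch|
    cp = ch.ord
    if XML_PREDEFINED.key?(ch) then XML_PREDEFINED[ch]
    elsif !xml_char?(cp)       then "*"          # not allowed in XML at all
    elsif cp < 0x80            then ch           # plain ASCII passes through
    else                            "&##{cp};"   # numeric character reference
    end
  end.join
end
```

For example, `xml_escape("caf\u00e9 < th\u00e9")` returns `caf&#233; &lt; th&#233;`.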
There has been lots of progress in the python world recently. I keep opening posts but not getting time to write them. So this is more a list than comment (well that is how it started, now grown a bit......
Over the last week, Planet RDF has seen more than a few posts and comments on the RDF/XML serialisation syntax, most of them looking into its (almost not enumerable) possible variations. Danny Ayers has a great overview with reference to the...
Luckily, I’m outside of arm’s reach. You see, my weblog is 100% valid XHTML 1.1, encoded as utf-8. Truth be told, however, it also would be considered as 100% valid XHTML 1.1, encoded as iso-8859-1 (roman), iso-8859-5 (cyrillic), win-1252 (Micro...
See the mail. Case in point - in my system all that goes out is raw, bona fide UTF-8. I would like to have it unescaped in my XML output as well. Right now every Russian letter I output via Builder gets escaped.
Just to be clear, it is not exactly raw. There are a number of characters that must be escaped. < and &, to name but two.
I just want to make sure that your issue is a cosmetic one, not a functional one. I guessed before what your issue might be, and apparently I guessed wrong. I’d like to not guess any more.
If it turns out that your issue is just the cosmetic one, I am quite prepared to make a patch that assumes that people that require 'jcode' and set $KCODE='u' are prepared to handle utf-8. Everyone else will get a bulkier, but safer, result.
Well, for me the issue is mostly cosmetic, yes - the feed becomes basically impossible to read as plain text. Enable escaping on all characters and try to examine your feed - that’s what I see. Besides, there are problems with your “escape-everything” approach:
1. I can’t handle feeds I might download using any simple text search tools (we had a discussion on that already)
2. I can’t grab these feeds and save them elsewhere without firing up a parser
3. Browsers have problems with this kind of escaping when it is used in the form context (for pre-filled values) when Builder is used to output HTML.
4. Every feed I now generate has grown in raw size, which means longer download times and (if escalated) a bandwidth bill. Many of the people who have to read my feeds are on dialup.
5. RSS readers get confused when fed this stuff inside an HTML containing block within an entry; they don’t know what to display.
In short - unless you have an XML parser on the receiving side (which might not be the case) the “escape-all” approach is a no-no.
I would love to keep the encoding of basic entities like ampersands, but to be exempt from this... uhm... precaution. An xml.escape_utf = false or such would be perfect. Don’t expect UTF8 users in Ruby to be running with jcode all the time (I am not, for some time).
Besides, you have a relatively foolproof way to verify whether a passed string is UTF-8; if the encoding of the builder is set to UTF-8, you can just leave it as is.
Browsers have problems with this kind of escaping when it is used in the form context (for pre-filled values) when Builder is used to output HTML.
Oh, really? What browser do you use? Can it handle this?
The easiest way to verify a string is UTF-8 is to send it unpack("U*").
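As a sketch of that check: `unpack("U*")` raises an ArgumentError when it hits a malformed UTF-8 sequence, so a small wrapper can turn it into a yes/no answer. The method name `utf8?` is made up for this example.

```ruby
# Hypothetical wrapper around the unpack("U*") trick.
# Returns true when the bytes decode cleanly as UTF-8.
def utf8?(bytes)
  bytes.unpack("U*")   # raises ArgumentError on malformed UTF-8
  true
rescue ArgumentError
  false
end
```

On Ruby 1.9 and later, `bytes.dup.force_encoding("UTF-8").valid_encoding?` is another way to ask the same question without decoding every code point into an array.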
If you were to check the documentation, you would see that Jim has already provided a means of inserting strings verbatim via the shift and symbol operators. Those that use these operators need to take extra care. I’m interested in helping everybody else out.
I said I would write a patch to allow those that use jcode and pass in correct utf-8 to not have high bit characters escaped. I have now done so.
I’m merely a person submitting patches. It is Jim that you need to convince.
Tony Garnock-Jones: It is important to realize that Erlang was invented (in 1987) before utf-8 was (in 1992). Now, let’s explore the relationship between ASCII, ISO-8859-1, and UCS (a.k.a. Unicode), by way of example. ...
It extends the String class by providing the fast_xs
method to it (equivalent to to_xs) and is roughly
70 times faster. Hooked into Builder::XmlMarkup,
this provides roughly a ten-fold increase on some
RSS feeds I’m testing with Rails (1.2.3).
I’m using Builder and attempting to send serialized ruby into an xml node (Marshal.dump) — this doesn’t seem to be working though when I call Marshal.load after parsing the xml. Is this ruby’s Marshal object or a limitation of the encoding scheme?
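One likely explanation, offered as an assumption rather than a diagnosis of this particular setup: Marshal.dump produces binary data, including control bytes that XML 1.0 does not allow, so an escaper that replaces invalid characters cannot hand the same bytes back. A common workaround is to base64-encode the dump before placing it in the node. The sketch below uses plain string handling in place of Builder and a parser, and the helper names are hypothetical.

```ruby
# Hypothetical helpers: wrap a Marshal dump in base64 so that only
# XML-safe ASCII characters end up inside the element.
def dump_for_xml(object)
  [Marshal.dump(object)].pack("m0")   # "m0" = base64 without line breaks
end

def load_from_xml(text)
  Marshal.load(text.unpack1("m0"))    # only do this with trusted data
end

payload = { "title" => "caf\u00e9", "count" => 3 }
xml = "<data>#{dump_for_xml(payload)}</data>"

# Stand-in for a real XML parser: pull the element text back out.
restored = load_from_xml(xml[%r{<data>(.*)</data>}m, 1])
```

Because the base64 alphabet contains no `<` or `&`, the encoded text passes through XML escaping unchanged.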
Looked around and found this post from Sam Ruby, who wrote the XML-escaping code that was included in Builder. Here is a short class I wrote to abstract out the XML escaping functionality, and be sure it is a string before calling to_xs on it....