It’s just data

Attractive Nuisance

Slides from my Sells’ DevCon presentation.

Note: the correct display of some characters in the presentation may depend on what fonts you have installed.  Some pages may display better on Mozilla on large screens, but everything should mostly work cross browser.


RE: Attractive Nuisance

I didn't understand what you meant in the slide "Escaping in XML is broken". Can you expand on that a little bit?

Message from Dare Obasanjo at

Dare: two additional links to provide context: Keith Richards and Pointy end.

The fact of the matter is that people will create (and have created!) content in Atom that they declare to be well formed when it is not, and that they declare to be properly double escaped when, again, it is not.

Being a text format, XML is an "attractive nuisance" in that it encourages people to create documents with technologies such as simple text-based templates.  My weblog is a prime example.  I've gone to extraordinary lengths in an attempt to compensate, but things still slip through momentarily.

While I wish XML had picked a less error prone syntax, at this point I don't have a suggestion for solving this problem (frankly, the thought of binary XML gives me pause); but the point is (1) this is something that people need to be aware of, and (2) not all the blame can, or should, go to the people who have fallen into this trap - the authors of the spec need to shoulder some of the responsibility.

Posted by Sam Ruby at

RE: Attractive Nuisance

Sam,
  So it seems you aren't saying anything is broken per se. Just that merely by being text based and having a similar syntax to HTML it encourages the ViewSourceClan to make mistakes. I buy that; that has been my experience working with XML over the past few years.

Message from Dare Obasanjo at

Dare: let me make an analogy: the WSDOT periodically identifies some intersections as HALs.  It then treats this list as bug reports, and schedules projects to correct these problems.

Posted by Sam Ruby at

RE: Attractive Nuisance

Sam,
  This goes back to my original question then. What exactly is broken in XML that you think needs to be fixed to correct these problems? So far it seems the main issue you've pointed out is that it looks deceptively simple to the average web developer. However I don't think coming up with a more complex looking replacement for XML is a feasible suggestion. So what do you think can and should be done?

Message from Dare Obasanjo at

Great summary of issues, Sam. The QNames slide is the ticket!

(It's not fair, since I'm seeing them out of context, but reading the slides online reminded me of Tufte's powerpoint essay:  [link] )

Posted by Jay Fienberg at

Klystkeliai

Via the Attractive Nuisance slides I arrived at a good site, where a lot of useful things are written about eXP and about programming and design in general. Besides that, a very cool (fun, essential, ++, as my colleagues would say) project is Yellow bike. People similar to Mr. Zuokas (but, as it turns out, thriftier and more careful) found old bicycles, repaired them, found someone to paint them, found someone to mark them, and left the bicycles on the street. Having started with 10, at the time the story was written they already had 100 and were preparing another 100. Such stories....... [more]

Trackback from loading at

Great stuff. I think I like your postulate...

One question: given all the breakage/leakage/confusage you've identified, how could Planet RDF do things better?

Posted by Danny at

Sam Ruby’s slides on the pitfalls around Unicode, XML, HTTP etc. The analogy is a box of matches to an 8 year old. Ruby postulates: The accuracy of metadata is inversely proportional to the square of the distance between the data and the...

Excerpt from Planet RDF at

how could Planet RDF do things better?

Danny, you can see the problem above.

Is the Planet RDF code available?  I would gladly provide a patch.

The problem is not in your input feed.  Your feed contains the following code: &#8217;.  Planet RDF converts that into binary, which I will express in hex: xC3A2C280C299.  The correct formulation would be xE28099.  I've got a good idea about what is going on under the covers as

In other words, for some reason, Planet RDF is effectively doing an iso-8859-1 to utf-8 conversion on utf-8 data.
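The byte sequences quoted above can be reproduced with a few lines of Python. This is only a sketch in present-day Python 3 (where `str` holds characters and `bytes` holds encoded data), not the code Planet RDF actually runs; the variable names are made up for illustration.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK: the character that &#8217; refers to.
apostrophe = "\u2019"

# Correct: encode the character to utf-8 exactly once.
correct = apostrophe.encode("utf-8")

# The bug: treat those utf-8 bytes as iso-8859-1 characters,
# then encode the resulting three characters to utf-8 again.
mangled = correct.decode("iso-8859-1").encode("utf-8")

print(correct.hex().upper())  # E28099
print(mangled.hex().upper())  # C3A2C280C299

# Running the same steps in reverse recovers the original bytes.
repaired = mangled.decode("utf-8").encode("iso-8859-1")
```

Three bytes become six because each byte of the original sequence is above 0x7F, and every such character takes two bytes in utf-8.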

Posted by Sam Ruby at

I had a funny experience reading these slides, since my browser window isn't that big and I'm using Konqueror; the title was half-hidden, the >> sign was outside the visible area, and the << sign was not to be seen, so I didn't find anything to click on.

That provoked me to think about the only text that I actually saw -- "How did you learn to read HTML?" -- and take it as the first step in a quiz.

And, though it turned out that that was not what you'd intended, "View Source" did eventually help me to figure out how I was supposed to navigate the slides ;-)

Posted by Benja Fallenstein at

Thanks Sam. Dave Beckett has given a detailed response:

[link]

It suggests the conversion is taking place in a certain liberal parser...

Posted by Danny at

tidy emits UTF-8 encoded bytes and python attempts to read them as ASCII

If the code is reading a stream of utf-8 encoded bytes as if it were a stream of characters, and then one attempts to output those characters as utf-8, you would effectively end up with an iso-8859-1 to utf-8 conversion being done on utf-8 encoded characters.

Is the Planet RDF code available?  I would gladly provide a patch.

Posted by Sam Ruby at

OK, I found Chumpologica.  Here's the patch:

--- chumpologica.original       2004-01-24 11:34:47.000000000 -0500
+++ chumpologica.py     2004-10-22 09:05:25.891708800 -0400
@@ -118,7 +118,7 @@

         file(tmp_file, "w").write(body)
         result=os.popen("tidy -config "+self.config['config_dir']+"/tidy.cfg "+tmp_file+" 2>/dev/null", 'r')
-        out=result.read()
+        out=unicode(result.read(),'utf-8')
         os.unlink(tmp_file)

         title=re.compile(r"<title>(.+?)</title>", re.DOTALL).search(out)

Posted by Sam Ruby at

Your patch does utf-8 encoding earlier, making a Python Unicode string (u'foo').  I had to remove a later bit of code that tried to utf-8 encode things again that writes the RSS content body.  I'm not sure if that's fixed what you thought was broke.  There is still a mess in encoding titles from rss:title and my attempt to fix that just gives a rather unhelpful python error that it cannot write Unicode to a file, only allowing ascii. Python's Unicode support remains user hostile at every turn.

Posted by Dave Beckett at

My patch does utf-8 de-coding earlier, making a Python Unicode string.  Note: the latest feedparser uses a real XML parser whenever possible, so you would start out with a unicode string for this feed if a current version of the feedparser were used.

Python has two string-like data types: str and unicode.  Python's support for unicode data in str objects is very user hostile.  The reverse is not true.  Therefore, IMHO, the strategy should be to cleanse the data as early as possible, and to encode to utf-8 as late as possible - thereby keeping the data in unicode as long as possible.
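A minimal sketch of that strategy, written for present-day Python 3 rather than the Python 2 of this thread (in Python 3 the old `unicode` type is called `str`, and the old byte-oriented `str` is `bytes`). The helper names `read_feed` and `write_feed` are hypothetical and are not taken from chumpologica or the feedparser.

```python
import io


def read_feed(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode once, at the input boundary."""
    return raw.decode(encoding)


def write_feed(text: str, stream) -> None:
    """Encode once, at the output boundary."""
    stream.write(text.encode("utf-8"))


# Bytes as they might arrive from tidy or the network (utf-8 encoded).
raw = b"It\xe2\x80\x99s just data"

text = read_feed(raw)   # characters from here on
title = text.upper()    # all processing happens on characters, not bytes

out = io.BytesIO()      # stands in for a file opened in binary mode
write_feed(title, out)
result = out.getvalue()
```

Because decoding and encoding each happen exactly once, there is no intermediate step at which encoded bytes can be mistaken for characters and encoded a second time.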

I'll look into making a more comprehensive patch.

Posted by Sam Ruby at

HCI

See also: Attractive Nuisance, The Legend of View Source...

Excerpt from Raw at

XML future: evolution or revolution?

The biggest dilemma I see for XML and the related technologies is that their reason for existence is to provide an agreed-upon mechanism for interoperability across platforms and applications, but those agreements are hard to update without...

Excerpt from mikechampion's weblog at

XML encoding problems

Attractive Nuisance contains a link to an interesting slideshow about XML encoding problems, starting from charsets going over to attribute order, whitespaces, entities, double escaping etc. The slide titled “QNames” just reads "don’t even get me...

Excerpt from Martins Notepad at

HTTP GET is an Attractive Nuisance

Paraphrasing Sam Ruby: Jon Udell: HTTP toolkits make it easy to do the wrong thing, hard to do the right thing. Dare Obasanjo: del.icio.us, flickr, and Bloglines use GET for edit resources. Sam Ruby: AJAX toolkits must beware of how they use GET....

Excerpt from More Like This WebLog at

Frisson de Folksonomie

Oh, look at the date. It’s been a while, hasn’t it? Indulge me if you will in part 6 of the Things Fall Apart series. A slight detour perhaps, or rather a journey into the heart of dar- ... technology, and matrimony. A Social Bookmarking Affair...

Excerpt from BlogAfrica at

XML Character Encoding on the Web

How should we be specifying the character encoding of XML documents on the web?...

Excerpt from XML Character Encoding on the Web [dionidium.com] at

A long, slightly rambling, but deeply technical, entry follows, so if you aren't interested in software internationalization, character sets and variable type systems, you might want to skip this entry entirely; you have been warned.

how could Planet RDF do things better? Danny, you can see the problem above. Is the Planet RDF code available? I would gladly provide a patch. The problem is not in your input feed. You contain the following code: &#8217;. Planet RDF converts...

Excerpt from The Boston Diaries at

Ruby's postulate

[link]...

Excerpt from del.icio.us/david.schreiber at

Data on the web

Salient excerpts: Content-Type: text/html;charset=US-ASCII <meta http-equiv="content-type" content="text/html; charset=utf-8" /> More supporting evidence for Ruby’s postulate, I guess. P.S. Amusingly, the page...

Excerpt from Comments on: Data on the web at

DevCon: Fundamentalism

DevCon: Fundamentalism - “The accuracy of metadata is inversely proportional to the square of the distance between the data and the metadata.” ([link])...

Excerpt from Inelegant at

Udell vrs Traoré?

So Jon Udell, in the midst of demonstrating the increasing maturity of query languages (XPath and XQuery) on xml databases, generates statistics of bloggers he reads who most frequently cite books on Amazon.com. In analyzing the data, it is clear...

Excerpt from Koranteng's Toli at

Get On The Bus

An open note to some of my favourite loosely-coupled people: Phil Wainewright, Jon Udell, Sam Ruby, Tessa Lau and Monsieur Feinberg amongst others (connecting once again). Glue Layer People | Technology Adoption and Systems Design | A lighter...

Excerpt from Koranteng's Toli at

REST - The Web Style

I give you the slightly updated slides to a talk I gave on Friday to the Lotus Workplace Architecture Board. The topic was REST - The web style An argument about an outlook on technology, on complexity, layering and leverage. (It’s also available a...

Excerpt from Koranteng's Toli at

Ruby's postulate: The accuracy of metadata

[link]...

Excerpt from del.icio.us/xagronaut at

Ruby's Postulate

Ruby’s postulate : The accuracy of metadata is inversely proportional to the square of the distance between the data and the metadata. Alternatively: Keep your friends close but your meta-data closer...

Excerpt from Weirdest Undreamt Use Case at

HOWTO Avoid Being Called a Bozo When Producing XML

Dos and don’ts about producing XML programmatically....

Excerpt from Henri Sivonen’s pages at

DevCon: Fundamentalism

The accuracy of metadata is inversely proportional to the square of the distance between the data and the metadata....

Excerpt from Delicious/blip2 at
