intertwingly

It’s just data

Sgmllib patch


The last place I figured I would be patching when I saw this bug was the Python runtime library.

The problem is markup in alt attributes.  No, not those titles, or these titles, rather these alt attributes.

The way it all started was that Georg von Hippel used TeX2PNG to convert a mathematical equation into an image.  That software puts the original LaTeX source into the alt attribute.

Georg then used Konqueror to copy and paste the result into Blogger’s web interface.  Apparently Konqueror inserted a new line into the attribute value.  This arguably is suboptimal, but legal.  Blogger then proceeded to convert the new line (even though it was in an attribute value) to a <br /> sequence.  While this is not what was intended, existing browsers (I’ve tested it on Firefox, IE, and Opera) took this all in stride, as the sequence was within a quoted string.

However, programs based on Python’s sgmllib, like the Universal Feed Parser and BeautifulSoup, throw up a hairball.  If you look in the source, you will see:

# XXX The following should skip matching quotes (' or ")

Ouch.  Test case and patch submitted.  For those who can’t wait, here is a workaround that will work with existing versions of Python:

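import re, sgmllib

# Apply the workaround only when the existing pattern finds the '<' in ' <'
# at a nonzero offset, which is what the unpatched pattern does.
if sgmllib.endbracket.search(' <').start(0):
    class EndBracketMatch:
        # Matches a tag name and its attributes, skipping over quoted values.
        endbracket = re.compile(r'/?[a-zA-Z][-_.:a-zA-Z0-9]*\s*('
            r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*'
            r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]'
            r'[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*(?=[\s>/<])))?'
            r')*\s*/?\s*(?=[<>])')
        def search(self, string, index=0):
            self.match = self.endbracket.match(string, index)
            if self.match: return self
        def start(self, n):
            # sgmllib calls start(); the tag ends where this match ends.
            return self.match.end(n)
    sgmllib.endbracket = EndBracketMatch()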
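
To see why the quote-aware pattern matters, here is a small standalone sketch that needs only the re module (sgmllib itself is not imported, so it should run on Python 3 as well). It assumes the stock endbracket pattern is equivalent to `[<>]`, and the sample tag is made up for illustration:

```python
import re

# Assumption: the unpatched sgmllib pattern behaves like this one, stopping
# at the first '<' or '>' regardless of quoting.
naive_endbracket = re.compile(r'[<>]')

# The quote-aware pattern, copied from the workaround above.
quote_aware = re.compile(r'/?[a-zA-Z][-_.:a-zA-Z0-9]*\s*('
    r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]'
    r'[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*(?=[\s>/<])))?'
    r')*\s*/?\s*(?=[<>])')

# Hypothetical tag with a <br /> inside a quoted attribute value.
tag = '<img alt="a<br />b" src="eq.png" />'

# Both searches begin just after the opening '<' at offset 0.
naive_end = naive_endbracket.search(tag, 1).start(0)
aware_end = quote_aware.match(tag, 1).end(0)

print(repr(tag[1:naive_end]))   # tag body as the naive pattern sees it
print(repr(tag[1:aware_end]))   # tag body as the quote-aware pattern sees it
```

The naive pattern cuts the tag off at the `<` of the embedded `<br />`, while the quote-aware pattern runs through to the real closing bracket.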

While testing this patch, I noticed a change in the Python SVN Head: character references will be substituted in attribute values.  While this is the way it always should have been, this will come as a surprise to many.  I’ve committed a few changes to the Universal Feed Parser so that it will accommodate both current releases and the SVN Head version.

Particularly problematic is the substitution of character references.  Substituting &lt;, &gt;, and &amp; will cause naïve programs (or programs explicitly coded to the current behavior of sgmllib) which consume and produce HTML to no longer be able to round trip their results.  But worse is the handling of numeric character references.  Decimal (but not hexadecimal) character references which are expressible in iso-8859-1 are converted to strings (not Unicode, but strings).  If the enclosing data is in another encoding (such as utf-8), this creates a problem.
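
The encoding problem can be illustrated without sgmllib at all. The following is only a sketch of the effect described above, written with Python 3 bytes for clarity; the sample text and the choice of &#233; are mine, and the substitution step is simulated rather than performed by the parser:

```python
# utf-8 encoded text surrounding the attribute, as a byte-oriented parser sees it.
utf8_text = "na\u00efve ".encode("utf-8")

# Simulated substitution (an assumption for illustration): the decimal
# reference &#233; becomes a single iso-8859-1 byte, not a Unicode character.
substituted = chr(233).encode("iso-8859-1")

# The result mixes two encodings in one byte string.
mixed = utf8_text + substituted

try:
    mixed.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

# What a consistent result would look like: the same character in utf-8.
consistent = utf8_text + chr(233).encode("utf-8")

print(decodes_cleanly)
print(consistent.decode("utf-8"))
```

The lone iso-8859-1 byte is not valid utf-8 on its own, so the mixed string can no longer be decoded as the encoding the rest of the document uses.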