It’s just data

Engineer for Serendipity

Rob Sayre: Here’s how my last post ended up looking on Planet Intertwingly.

$ python
Python 2.5.2 (r252:60911, Oct  5 2008, 19:29:17) 
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from feedparser import parse
>>> feed=parse('http://blog.mozilla.com/rob-sayre/feed/atom/')
>>> feed.entries[2].content[0].value
u'<p><a href="http://visitmix.com/Articles/Web-Standards-Where-the-ROI-is#200901200614472">Joshua Allen</a>: <i>&#8220;While proper use of web standards is about a lot more than validation, and we purposely stay away from the topic of validation in these articles, the page validates XHTML strict with 0 errors for me, using W3C validator. What are you using to validate?&#8221;</i></p>\n<p>The source for that page contains the string</p>\n<p><code>&lt;br></code>&lt;br></code></p>\n<p>I wonder what Joshua thinks that means.</p>'

Rob may have missed a name.

“Rob may have missed a name.” - Sam Ruby: Engineer for Serendipity. LOL...

Excerpt from Brandt's Tumbling Log at

hahaha!

Posted by Rob Sayre at

r292

That problem was very subtle. A regular expression was too greedy: its character class allowed “>”, so it matched “code>&lt;br ” as if it were the name of a short tag. The fix below still won’t handle unescaped greater-than signs in quoted attribute values.

Index: feedparser.py
===================================================================
--- feedparser.py	(revision 291)
+++ feedparser.py	(working copy)
@@ -1654,7 +1654,7 @@
     def feed(self, data):
         data = re.compile(r'<!((?!DOCTYPE|--|\[))', re.IGNORECASE).sub(r'&lt;!\1', data)
         #data = re.sub(r'<(\S+?)\s*?/>', self._shorttag_replace, data) # bug [ 1399464 ] Bad regexp for _shorttag_replace
-        data = re.sub(r'<([^<\s]+?)\s*/>', self._shorttag_replace, data) 
+        data = re.sub(r'<([^<>\s]+?)\s*/>', self._shorttag_replace, data) 
         data = data.replace('&#39;', "'")
         data = data.replace('&#34;', '"')
         if self.encoding and type(data) == type(u''):
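
To see the difference between the two patterns in isolation, here is a minimal sketch. The input string is an assumption, reconstructed from the mangled output in the post, and `expand_short_tag` is a hypothetical, simplified stand-in for `_shorttag_replace` (it always emits an open/close pair and ignores any special handling of void elements).

```python
import re

# The two patterns from the diff above.
OLD_PATTERN = re.compile(r'<([^<\s]+?)\s*/>')   # revision 291
NEW_PATTERN = re.compile(r'<([^<>\s]+?)\s*/>')  # working copy


def expand_short_tag(match):
    """Hypothetical stand-in for _shorttag_replace: <x/> becomes <x></x>."""
    tag = match.group(1)
    return '<' + tag + '></' + tag + '>'


# Assumed input: an escaped "<br />" whose closing ">" was left unescaped.
source = '<p><code>&lt;br /></code></p>'

# The old pattern lets ">" into the tag name, so it swallows "code>&lt;br".
old_match = OLD_PATTERN.search(source)
old_result = OLD_PATTERN.sub(expand_short_tag, source)

# The new pattern excludes ">" from the tag name and leaves the text alone.
new_result = NEW_PATTERN.sub(expand_short_tag, source)

print(old_match.group(1))
print(old_result)
print(new_result)
```

Under these assumptions the old pattern reproduces the doubled `</code>` seen in the parsed entry, while the new one passes the string through untouched.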

Posted by Sam Ruby at
