It’s just data

WordPress + Flash + RSS

influxproject: Please, my feed doesn’t appear as validated. Can somebody help me? This is the URL: [link]

I’m at a loss what to recommend.  Can any reader handle this feed?  Is there something this user can configure or install to fix this problem?

A quick scan indicates that this defect may be related.


It doesn’t render in firefox and seems to break the feed preview; the select box of readers is incomplete and the text next to the checkbox is missing.

Google Reader manages to pull out the feed title and shows one entry with its title, author and url seemingly extracted correctly. Doesn’t seem to render anything where content would go though.

Posted by Iain at

It doesn’t render in firefox and seems to break the feed preview

Confirmed.  That seems bad; feed content shouldn’t be able to break the surrounding XUL content.  But then, feed preview is quite fragile (at least in Firefox 2), and it’s at least partially my fault.  (IIRC, there was a late-breaking must-fix accessibility bug in feed preview that necessitated some nasty hacks.  It’s probably not related to the current problem, but I’ve always felt a little bit guilty about it.)

Posted by Mark at

For the time being he should just edit the CDATA sections out of the post bodies.

They’re very nearly useless anyway – you need them only if you want to serve content as both text/html and application/xhtml+xml and you need to have a literal angle bracket or ampersand in the Javascript code. But as anyone reading this weblog knows, wanting to serve Javascript unchanged in both tagsoup and XHTML environments is a recipe for a rude awakening anyway.

Whoever put those CDATA markers in there was cargo-culting.

Longer-term, WordPress should just filter away that sort of bogosity at the time of posting.

Posted by Aristoteles Pagaltzis at

On the trunk, the UI chrome doesn’t get eaten, at least (and amusingly enough, the RSS1 feed actually gets a preview, though as a Live Bookmark it fails as a matter of policy rather than code necessity).

As to what to do... “I want to use RSS to transport content with absolutely no text, content which only exists in the DOM of browsers with JavaScript enabled” sounds like a very hard problem to me, considerably harder than the escaping problem. I think I’d probably solve it by hacking wp-rss2.php to just skip <description> entirely, though there’s also a WordPress option (in Options / Reading) to use an excerpt rather than the post content in RSS, and the excerpt can be written by hand (I presume the auto-excerpting isn’t sharp enough to turn some CDATA-escaped JavaScript into something useful, though it might possibly at least turn it into something utterly useless but well-formed).

Posted by Phil Ringnalda at

The problem seems to be that we expect <![CDATA[content]]> to escape content always, but this isn’t true if content contains a literal ]]>. e.g. if content is “foo]]>bar” then we get <![CDATA[foo]]>bar]]>, which then evaluates to “foobar]]>”. If bar was markup, it’d be included literally.
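As a rough illustration of the point above, here is a minimal Ruby sketch of where a CDATA section actually ends. The helper name split_first_cdata is hypothetical, and it only does plain string searching; it is not how any particular parser is implemented.

```ruby
# Hypothetical helper: given a string that starts with a CDATA section,
# return the section's content and whatever follows the first "]]>".
# Returns nil if the string does not start with a complete CDATA section.
def split_first_cdata(source)
  open_marker = "<![CDATA["
  close_marker = "]]>"
  return nil unless source.start_with?(open_marker)

  # The first "]]>" after the opening marker always closes the section.
  close_at = source.index(close_marker, open_marker.length)
  return nil if close_at.nil?

  content = source[open_marker.length...close_at]
  rest = source[(close_at + close_marker.length)..-1]
  [content, rest]
end

content, rest = split_first_cdata("<![CDATA[foo]]>bar]]>")
puts content  # foo
puts rest     # bar]]>
```

Everything in rest sits outside the section, so any markup in it would be read as markup rather than as text.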

Entity-escaping the > in ]]> may seem to work, but it’s not valid:

Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using “&lt;” and “&amp;”. CDATA sections cannot nest.

So CDATA isn’t a general-purpose escape mechanism. To escape “foo]]>bar” (assuming we want the benefit of CDATA for the rest of the text) we’d have to jump out of the section to include the ]]>, and escape it. For example: <![CDATA[foo]]>]]&gt;<![CDATA[bar]]>.

# Wrap text in CDATA; move any literal "]]>" outside the section and entity-escape its ">".
def cdata_escape(text)
   "<![CDATA[" +
      text.gsub("]]>", "]]>]]&gt;<![CDATA[") +
   "]]>"
end

cdata_escape("hello")      #=> <![CDATA[hello]]>
cdata_escape("foo]]>bar")  #=> <![CDATA[foo]]>]]&gt;<![CDATA[bar]]>

cdata_escape("<script>" + cdata_escape("1 & 2;") + "</script>")
#=> <![CDATA[<script><![CDATA[1 & 2;]]>]]&gt;<![CDATA[</script>]]>

Parsed once by the XML parser of the RSS reader, this last becomes
<script><![CDATA[1 & 2;]]></script>
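
The round trip can be checked with a small sketch. It uses a copy of cdata_escape with the reopening marker spelled <![CDATA[ (so the snippet stands alone), plus a hypothetical xml_text_decode helper that mimics how an XML parser reads character data. It assumes only CDATA sections and the five predefined entities appear, and it ignores numeric character references.

```ruby
# Copy of the escaping helper, included so this snippet is self-contained.
def cdata_escape(text)
  "<![CDATA[" + text.gsub("]]>", "]]>]]&gt;<![CDATA[") + "]]>"
end

# Hypothetical helper: decode character data roughly the way an XML parser
# would, handling only CDATA sections and the five predefined entities.
def xml_text_decode(chardata)
  entities = {
    "&lt;" => "<", "&gt;" => ">", "&amp;" => "&",
    "&quot;" => "\"", "&apos;" => "'"
  }
  pattern = /<!\[CDATA\[(.*?)\]\]>|&(?:lt|gt|amp|quot|apos);/m
  chardata.gsub(pattern) do |match|
    cdata = Regexp.last_match(1)
    # A CDATA match yields its raw content; otherwise the match is an entity.
    cdata.nil? ? entities.fetch(match) : cdata
  end
end

inner = cdata_escape("1 & 2;")
outer = cdata_escape("<script>" + inner + "</script>")
puts xml_text_decode(outer)                        # <script><![CDATA[1 & 2;]]></script>
puts xml_text_decode(cdata_escape("foo]]>bar"))    # foo]]>bar
```

One pass of decoding removes exactly one layer of escaping, which is why the inner CDATA section survives intact.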

Is this sufficiently non-obvious to make CDATA evil?

Posted by Sam McCall at

https://bugzilla.mozilla.org/show_bug.cgi?id=388275

Looks like my fault, not Mark’s. A case of an exception caused by a well-formedness error propagating a bit too far. Not an instance of bogus content actually corrupting XUL or running scripts or anything like that.

Posted by Robert Sayre at

This is a great example of why software should always do escaping and never use CDATA blocks. CDATA’s just about okay for humans to edit (as long as you’re careful!), but automated processes can’t guarantee that ]]> isn’t going to crop up in their (human-generated) input, and there’s no way to “fix” that without altering the content or switching to escaping.
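
For comparison, a minimal sketch of the escaping approach in Ruby. The helper name xml_escape is hypothetical, and it covers only element content (attribute values would also need their quotes escaped).

```ruby
# Hypothetical helper: entity-escape text for use as XML element content.
# Every "&", "<" and ">" is replaced, so no input can terminate anything early.
def xml_escape(text)
  text.gsub(/[&<>]/, "&" => "&amp;", "<" => "&lt;", ">" => "&gt;")
end

puts xml_escape("foo]]>bar")                  # foo]]&gt;bar
puts xml_escape("<script>1 & 2;</script>")    # &lt;script&gt;1 &amp; 2;&lt;/script&gt;
```

Unlike CDATA, this has no special cases: escaping already-escaped text simply adds another layer, and one pass of parsing removes exactly one.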

Posted by Martin Atkins at

My Eddie Java feed parser manages to extract two entries before giving up. Of course it also removes the javascript, so you just end up with two empty <div>s.

Posted by JD at

Snarfer handles the feed ok, but only because we can parse a certain amount of broken XML (not nested CDATA sections). As has been mentioned already, nesting of CDATA sections is just plain wrong, and WordPress should know not to do that.

Posted by James Holderness at

Snarfer handles the feed ok, but only because we can parse a certain amount of broken XML (not nested CDATA sections).

To illustrate how Venus/Universal Feed Parser “sees” this feed, take a look at this reconstituted feed (essentially an Atom feed produced as the result of a "one feed planet").

The root problem is that broken XML is treated as SGML by the UFP.  In most cases, this approach localizes any problems.  And, in fact, two of the six entries do survive this process intact.

Posted by Sam Ruby at

To illustrate how Venus/Universal Feed Parser “sees” this feed, take a look at this reconstituted feed

It looks to me like four of the six entries have been parsed ok. The two entries showing a bunch of css are actually correct - that’s what is in the feed. Or did you mean something else?

The only real problem I can see with the first two entries is that the summary element is using a type of text (implicit) rather than html. So I have to ask why are you using the text type? I know RSS permits a description to be either text or html, but I figured the standard practice was to just assume html all the time. Is the UFP doing some kind of HTML guessing like it does for titles?

FYI, the way Snarfer parses the feed, the first CDEnd string is considered the end of the CDATA section. The </script> that follows is considered an unmatched close element and is thus ignored. The second CDEnd string is considered to be more text content for the description element, however it never actually shows up in the final entry since it’s stripped out by the HTML cleanser when removing the script.

Posted by James Holderness at

The only real problem I can see with the first two entries is that the summary element is using a type of text (implicit) rather than html. So I have to ask why are you using the text type? I know RSS permits a description to be either text or html, but I figured the standard practice was to just assume html all the time. Is the UFP doing some kind of HTML guessing like it does for titles?

That’s not exactly what is going on.  It is relatively rare, but independent of any specification, some people put HTML directly in titles and descriptions, i.e., without any escaping; and occasionally such results even happen to be well formed XML.  After determining that the feed overall isn’t well formed, the feed is parsed with an SGML parser which doesn’t understand XML’s <![CDATA[ ]]> construct, and the result in two of these entries falls into this case.

If it helps, here is how the feed parser itself sees this feed.

Posted by Sam Ruby at

If it helps, here is how the feed parser itself sees this feed.

Remember how I once said that any feed that feedparser couldn’t parse should be treated as a bug?  i.e. I had no limits in my “race to the bottom” of parsing ill-formed, invalid, or otherwise semantically-challenged feeds?  Well, I lied.  It took you several years, but you have finally found the counterexample.  I give up.

Of course, I’m very sick today, so maybe tomorrow I’ll feel differently and commit a magical change that teases meaning out of this steaming pile of angle brackets.  But don’t hold your breath.

Posted by Mark at

Looks like my fault, not Mark’s.

Always nice to hear. ;-)

OK, going back to bed now.

Posted by Mark at

If it helps, here is how the feed parser itself sees this feed.

Yeah. I think I get it now.

Posted by James Holderness at

I was about to show the firefox breakage to a colleague, but the feed seems to have been “fixed”, at least to the point where ff can pull out 6 entries.

There’s not much sanity within those entries, of course...

And wow, putting an iframe tag after the rss tag. I can’t even see what that would ever be supposed to mean?

Posted by Iain at

And wow, putting an iframe tag after the rss tag. I can’t even see what that would ever be supposed to mean?

A search for the URL that the iframe points at indicates it’s added at the end of PHP files on exploited hosts. [link]

Posted by Sam McCall at
