intertwingly

It’s just data

Making encoding explicit


From the current iteration of EchoExample:

<content type="application/xhtml+xml" xml:lang="en-us">
  <p xmlns="...">Hello, <em>weblog</em> world! 2 &lt; 4!</p>
</content>

<content type="text/html" xml:lang="en-us">
  <![CDATA[<p>Hello, <em>weblog</em> world! 2 &lt; 4!</p>]]>
</content>

<content type="text/plain" xml:lang="en-us">
  <![CDATA[ Hello, _weblog_ world! 2 < 4! ]]>
</content>

Looking at this, I am troubled by the implicit knowledge of encoding that is required.  Less-than signs in XHTML are encoded once.  In HTML, the same character needs to be encoded twice (or encoded once and then wrapped in CDATA).
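To make the difference concrete, here is a minimal sketch in Python using the standard library's html.escape. The variable names are made up for illustration, and the sample text is a shortened version of the one in the examples above.

```python
from html import escape

text = "2 < 4"

# Inline XHTML: the text is escaped once, for the XML parser.
xhtml_payload = escape(text, quote=False)

# Escaped HTML: the text is first escaped to become HTML markup,
# then that markup is escaped again so it can travel inside XML.
html_markup = "<p>" + escape(text, quote=False) + "</p>"
html_payload = escape(html_markup, quote=False)

print(xhtml_payload)  # 2 &lt; 4
print(html_payload)   # &lt;p&gt;2 &amp;lt; 4&lt;/p&gt;
```

Wrapping html_markup in CDATA instead of escaping it a second time carries the same information; either way the consumer has to know that one more round of decoding is expected.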

How many times should a title be decoded?  Aggregators today generally have to guess, and such guesses have caused problems in the past.
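The guessing problem can be sketched as follows. This is an illustration only: decode is a hypothetical helper, and title_from_feed stands for whatever string the XML parser hands back for a title.

```python
from html import unescape

def decode(value: str, times: int) -> str:
    """Remove the given number of escaping layers from value."""
    for _ in range(times):
        value = unescape(value)
    return value

# The string an aggregator sees after the XML parser has done its work.
title_from_feed = "2 &lt; 4"

as_plain_text = decode(title_from_feed, 0)    # a post about the entity itself
as_escaped_html = decode(title_from_feed, 1)  # a post about arithmetic

print(as_plain_text)    # 2 &lt; 4
print(as_escaped_html)  # 2 < 4
```

Both readings are plausible, and nothing in the feed says which one the author meant.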

We can't simply rely on <![CDATA[ ]]> as the signal, since it may already be required just to encode the data once as XML to begin with.

I'd like to see a single element, say <encoded>, introduced that makes such encoding explicit.  We can decide where such elements are allowed.  We might even want to explore alternate encodings, such as base64Binary, for handling things like pictures in archives.

Example:

<content type="application/xhtml+xml" xml:lang="en-us">
  <encoded>Hello, &lt;em&gt;weblog&lt;/em&gt; world!</encoded>
</content>
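Here is a rough sketch of how a consumer might read such an element, using Python's xml.etree.ElementTree. The function read_content and its return shape are assumptions for illustration, not part of any format; the only rule it encodes is that an <encoded> child means "the text here is markup with one extra layer of escaping, which the XML parser removes".

```python
import xml.etree.ElementTree as ET

doc = """<content type="application/xhtml+xml" xml:lang="en-us">
  <encoded>Hello, &lt;em&gt;weblog&lt;/em&gt; world!</encoded>
</content>"""

def read_content(xml_text: str) -> tuple:
    """Return (media type, markup) for a content element (hypothetical helper)."""
    content = ET.fromstring(xml_text)
    media_type = content.get("type")
    encoded = content.find("encoded")
    if encoded is not None:
        # Explicit case: the parser has already undone the one extra
        # layer of escaping, so the text is the markup itself.
        return media_type, encoded.text
    # Inline case: the markup is the serialized child elements.
    inline = (content.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in content
    )
    return media_type, inline

print(read_content(doc))
# ('application/xhtml+xml', 'Hello, <em>weblog</em> world!')
```

With the element present, the consumer never has to guess how many times to decode: the answer is exactly once, and it is stated in the document rather than implied by the media type.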