intertwingly

It’s just data

Mining Content For Value


James Snell: I need to get a really solid answer to a really simple question: do I parse out the (X)HTML into a hash or leave it as a String. Both are useful in different contexts although the String form is obviously more generic and results in a less complicated JSON serialization. Answer that question and I think this serialization will fall into the “Not terrible” category.

First, James: what JSON parser do you use?  I can’t parse that JSON with either the Python or JS JSON parsers.  Update: Fixed.

Second: here’s a proof of concept of a hash2value function in JS.  I’ll leave the creation of the inverse as an exercise for the student.  Bonus points for not using properties like innerHTML and for creating a solution that can be implemented in a number of dynamic languages.
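For readers without the proof of concept handy, here is a minimal sketch of what such a function might look like.  The node shape (name, attributes, children) is purely an assumption for illustration, not James’s actual serialization, and escapeXml is a hypothetical helper.  It sticks to plain string building, with no DOM or innerHTML, so it should port to most dynamic languages.

```javascript
// Assumed (hypothetical) node shape, NOT the real serialization format:
//   { name: "p", attributes: { class: "x" }, children: ["text", { name: "b" }] }
// A child is either a plain string (text) or another node.

// Escape characters that are special in XML text and attribute values.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Turn a hash (or a bare text string) back into an (X)HTML string value.
function hash2value(node) {
  if (typeof node === "string") return escapeXml(node);

  const attrs = Object.entries(node.attributes || {})
    .map(([key, value]) => ` ${key}="${escapeXml(value)}"`)
    .join("");
  const children = (node.children || []).map(hash2value).join("");

  // Use the XHTML empty-element form when nothing was serialized inside.
  return children === ""
    ? `<${node.name}${attrs} />`
    : `<${node.name}${attrs}>${children}</${node.name}>`;
}

// Usage: a small fragment with text, a nested element, and an empty element.
const fragment = {
  name: "div",
  attributes: { xmlns: "http://www.w3.org/1999/xhtml" },
  children: ["Hello ", { name: "b", children: ["world"] }, { name: "br" }],
};
console.log(hash2value(fragment));
```

A real version would also have to care about namespaces and mixed content edge cases; this sketch ignores both.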

In any case, I’m a strong believer in Darwin in these matters.  I believe that the most interesting content is in the, well, content element.  If you guys want to store content as a blob, why not go all the way and store the whole Atom element as a string?  Meanwhile, I’ll continue to pursue data structures that make access to this data drop-dead easy.

I’ll leave with a parting thought: if Prototype can add CSS selectors for a document, why couldn’t CSS selectors be defined on a JSON object?  Imagine how easy it would be to mine Atom documents (and not just for predefined Atom elements, but also for extensions and even content) in that way.
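To make the idea a bit more concrete, here is a rough sketch of a descendant-only “selector” over plain JSON data.  Everything in it is an assumption: the select and collect functions are hypothetical names, the feed layout is made up for the example, and it covers only space-separated key names, which is a tiny slice of what real CSS selectors offer.

```javascript
// Depth-first walk: push every value stored under `key`, at any depth.
function collect(value, key, found) {
  if (value === null || typeof value !== "object") return;
  if (Array.isArray(value)) {
    for (const item of value) collect(item, key, found);
    return;
  }
  for (const [name, child] of Object.entries(value)) {
    if (name === key) found.push(child);
    collect(child, key, found);
  }
}

// Apply a space-separated list of key names as descendant steps,
// much like the CSS selector "entry title" matches titles inside entries.
function select(root, selector) {
  let current = [root];
  for (const step of selector.trim().split(/\s+/)) {
    const next = [];
    for (const value of current) collect(value, step, next);
    current = next;
  }
  return current;
}

// Made-up Atom-like structure, including an extension element and content.
const doc = {
  feed: {
    title: "Example Feed",
    entry: [
      { title: "First", content: { div: { p: "Hello" } } },
      { title: "Second", "ex:rating": 5 },
    ],
  },
};

console.log(select(doc, "entry title"));     // [ 'First', 'Second' ]
console.log(select(doc, "content p"));       // [ 'Hello' ]
console.log(select(doc, "entry ex:rating")); // [ 5 ]
```

Note that the sketch makes no attempt to remove duplicates, so a key nested inside another key of the same name can yield the same value more than once.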