It’s just data

HTMLDiff

I’ve wanted to add an HTML Diff to Planet for some time, and the notion itself raises a number of interesting questions.

In my investigation, I took a peek at Aaron Swartz’s HTMLDiff, which turns out to be a thin wrapper around Python’s difflib. My first test was to take a single word and make it bold: before, after.

Easy enough?  See for yourself.


Looking at the result, I think it would do better with a root element.

Posted by Sjoerd Visscher at

I’m guessing that’s a bug in Aaron’s presentation. It works just fine - as far as I can tell - if I try running the same thing on my box.

Posted by Filip Salomonsson at

Trying Aaron’s code on my machine produces the same results. Dropping down and calling difflib directly also produces the same results: essentially, all the effort to split the text into lists is ignored by difflib, and the offsets it returns are based on a concatenation of the strings, with each character treated separately.

My guess is that the semantics of difflib changed in some version of Python.  I’m running Python 2.4.3.
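
One way to tell which of the two behaviours a given setup produces is to run difflib on a token list and on the joined string, then compare the offsets. This is a minimal sketch, not Aaron's code: the token lists and the `opcodes` helper are made up for illustration, and it does not explain why the two machines in this thread disagree.

```python
import difflib

# Hypothetical tokenisation of the "make one word bold" test.
before = ["a", " ", "single", " ", "word"]
after = ["a", " ", "<b>", "single", "</b>", " ", "word"]


def opcodes(a, b):
    """Return difflib's edit operations for two sequences."""
    return difflib.SequenceMatcher(None, a, b).get_opcodes()


# Given lists, the offsets index tokens.
token_ops = opcodes(before, after)

# Given strings, the offsets index individual characters.
char_ops = opcodes("".join(before), "".join(after))

for tag, i1, i2, j1, j2 in token_ops:
    print("tokens", tag, before[i1:i2], after[j1:j2])

for tag, i1, i2, j1, j2 in char_ops:
    print("chars ", tag, (i1, i2), (j1, j2))
```

If the offsets in the last operation run up to the number of tokens, difflib was given lists; if they run up to the length of the joined string, it was given strings.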

Posted by Sam Ruby at

That’s strange. I’m on 2.4.2, but I get the results I expect on a freshly compiled 2.4.3 as well (and I don’t see any changes in difflib between the two).

Posted by Filip Salomonsson at

I use a modified version of [link] which works better, IMHO, and it’s in JS.

Posted by Adriaan Tijsseling at

[from miyagawa] Sam Ruby: HTMLDiff

Hm, synchronicity. We plan to add this to Plagger Planet as well...

Excerpt from del.icio.us/network/saltyduck at

There are many more questions. The core reason diff works for text documents is that they are made of lines, and a line is small enough to be treated as an atomic change yet big enough that two lines in the same text are unlikely to be identical. If we choose to diff words instead of lines, the algorithm does not work as well: it will spot the same word in two completely different sentences and match them. So you need to choose what counts as a line in the HTML (some pages use \n rather sparingly). Another question is which HTML constructs you filter out, and how you treat differences in the tags alone.

I’ve tried to answer those questions when writing: Yet another HTML diff
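
The granularity problem is easy to see with Python's difflib. This is only a sketch with made-up sentences and a hypothetical `shared_runs` helper, not the approach taken in the diff linked above:

```python
import difflib


def shared_runs(old, new):
    """Return the runs of items that SequenceMatcher treats as unchanged."""
    matcher = difflib.SequenceMatcher(None, old, new)
    return [old[m.a:m.a + m.size]
            for m in matcher.get_matching_blocks() if m.size]


# Two unrelated sentences that happen to share a common word.
old_line = "the cat sat on the mat"
new_line = "we walked to the shop"

# Line granularity: the sentences differ, so nothing is matched.
line_matches = shared_runs([old_line], [new_line])

# Word granularity: "the" is matched even though the sentences are unrelated.
word_matches = shared_runs(old_line.split(), new_line.split())

print(line_matches)  # []
print(word_matches)  # [['the']]
```

The smaller the unit, the more of these accidental matches turn up, which is why the choice of unit matters so much for HTML.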

Posted by Zbigniew Lukasiak at
