I’ve wanted to add an HTML Diff to Planet for some time, and the notion itself raises a number of interesting questions.
What should the latest version be compared against, the previously fetched version or the original version? I’d like to keep the interface as stateless as possible.
Does it make sense to ‘diff’ MathML? SVG?
In my investigation, I took a peek at Aaron Swartz’s HTMLDiff, which turns out to be a thin wrapper around Python’s difflib. My first test was to take a single word and make it bold: before, after.
Trying Aaron’s code on my machine produces the same results. Calling difflib directly also produces the same results: essentially, all the effort to split the text into lists is ignored by difflib, and the offsets it returns are based on a concatenation of the strings, with each character treated separately.
My guess is that the semantics of difflib changed in some version of Python. I’m running Python 2.4.3.
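To make the comparison concrete, here is a minimal sketch of the two calls. The `tokenize` helper and the sample strings are hypothetical stand-ins, not Aaron’s actual splitting code or the before/after pages. It relies only on `difflib.SequenceMatcher`, whose opcodes index into whatever sequences it is handed: token positions for lists, character positions for strings.

```python
import difflib
import re

def tokenize(html):
    # Hypothetical splitter: each tag or run of non-space text is one token.
    return re.findall(r"<[^>]*>|[^<\s]+", html)

# Made-up sample input: one word wrapped in <b>.
before = "My first test was simple"
after = "My <b>first</b> test was simple"

a = tokenize(before)
b = tokenize(after)

# Lists in: the offsets in each opcode are indices into the token lists.
token_ops = difflib.SequenceMatcher(None, a, b).get_opcodes()

# Strings in: the offsets are character positions.
char_ops = difflib.SequenceMatcher(None, before, after).get_opcodes()

# Collect the tokens that difflib reports as inserted.
inserted = [tok for tag, i1, i2, j1, j2 in token_ops
            if tag == "insert" for tok in b[j1:j2]]

print(token_ops)
print(inserted)
print(char_ops)
```

If the token-level opcodes ever come back with offsets larger than the list lengths, that would suggest the lists were joined into a string somewhere before reaching difflib.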
That’s strange. I’m on 2.4.2, but I get the results I expect on a freshly compiled 2.4.3 as well (and I don’t see any changes in difflib between the two).
There are many more questions. The core reason diff works for text documents is that lines are small enough to be treated as atomic changes and big enough that two lines in the same text are unlikely to be identical. If we choose to diff words instead of lines, the algorithm does not work as well: it would spot the same word in two completely different sentences and match them. So you need to choose what your line would be in the HTML (there are pages that use \n rather sparingly). Another question is which HTML constructs you filter out and how you treat differences in the tags alone.
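A rough sketch of the granularity point, using made-up sentences. `equal_count` and `split_blocks` are hypothetical helpers, and the set of tags treated as block boundaries is an assumption, not a recommendation. Two unrelated sentences share nothing as whole lines, but at word level difflib still pairs up the common word.

```python
import difflib
import re

def equal_count(a, b):
    """Number of elements difflib reports as unchanged between a and b."""
    ops = difflib.SequenceMatcher(None, a, b).get_opcodes()
    return sum(i2 - i1 for tag, i1, i2, j1, j2 in ops if tag == "equal")

old_line = "The cat sat on the mat."
new_line = "A dog ran across the yard."

# As whole lines, the two sentences have nothing in common.
line_matches = equal_count([old_line], [new_line])

# As words, the lowercase "the" is matched although the sentences are unrelated.
word_matches = equal_count(old_line.split(), new_line.split())

def split_blocks(html):
    # One possible choice of "line" for HTML: break after closing
    # block-level tags (this tag list is an assumption for illustration).
    marked = re.sub(r"(</(?:p|li|div|h[1-6]|tr)>)", r"\1\n", html)
    return [chunk.strip() for chunk in marked.split("\n") if chunk.strip()]

blocks = split_blocks("<p>one</p><p>two</p>")

print(line_matches, word_matches)
print(blocks)
```

Splitting on block-level tags gives units that behave more like lines of text even when the page itself contains few newlines. How to treat changes that touch only the tags is left open here.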