It’s just data

APP Level Patch

Joe Gregorio: At Google we are considering using PATCH. One of the big open questions surrounding that decision is XML patch formats. What have you found for patch formats and associated libraries?

I believe that looking for an XML patch format is looking for a solution at the wrong meta level.  Two examples, using AtomPub:

My recommendations on how to proceed

Initially severely limit the set of vocabularies that you are considering.  In this sense “XML” is simultaneously too broad, and too limiting.

Yet another example of the problem with coming up with solutions that are too generalized.  Focusing either on a specific subset of the xml-diff problem, or as Joe suggests, moving to an approach that effectively translates the document into a stream of sax-like events would greatly simplify the problem.  I do think, however, that in order to keep the solution sufficiently flexible enough to properly handle Atom extensions, the solution needs to be kept at the XML Infoset level as opposed to something that works on the Atom level.

I’m not convinced that using a standard diff format will work, however, tho it’s definitely something worth considering.

Posted by James Snell at

Clearly the serialization format would need to be able to express the infoset of extensions.  That does not mean that known Atom elements need to be expressed at the infoset level.  There may be bandwidth savings there.

A serialization of sax-like events is very verbose.  Enough so that the serialization of diffs may often exceed the size of the data being diff’ed.  Languages like Python and formats like YAML can improve upon this by leveraging indentation to convey information.

Posted by Sam Ruby at

Does your sax-like event criticism extend to my general-purpose JSON solution? I mean, I didn’t even try to make it small.

Posted by Robert Sayre at

Not so much.

With Unix diffs, adding a few lines to a file requires a constant overhead per block of change plus a single byte per line.  Actually, the overhead per block isn’t necessarily constant, as the overhead varies by the number of digits in the line number, but that’s relatively noise, and only increases logarithmically with respect to file size.

With your JSON diff (and for that matter, James’s XQuery inspired diff format), the overhead is constant per individual change plus the size of the path to get to that change.  If the use cases tend towards to changes being clustered, you might want to flip either proposal so that the path is determined by nesting.  And as Roy points out, YAML is ideal for such things.

But like the interesting questions that everybody ignored about cardinality; I still think that there is an interesting question here that makes the above question of a seconary importance.  If a typical resource contains 35% metadata and 65 data, the latter being typically in a markup language that is neither JSON nor XML, do changes to the data portion of the resource require complete replacements, or can data be patched with the same ease as metadata?

Posted by Sam Ruby at

That point about path overhead is a good one, though perhaps near irrelevant given some compression.

My reply to Roy mentioned one way of updating large text blocks, and finding changes in large text blocks has been done in many different ways. Or are you suggesting that HTML fragments be directly addressable?

Posted by Robert Sayre at

First a sobering question: what percentage of internet traffic today is PUT?

Now make a leap of faith, and assume that Joe’s question is likely to be something he would want to apply to AtomPub.  All in all, that doesn’t seem all that unlikely given his history.

What’s the likelihood that a hybrid JSON/text approach (and all of the attendant mapping requirements that implies) plus a new patch format plus JPath would achieve more than token interoperability and wide enough deployment that even a small fraction of today’s PUT traffic would be replaced by PATCH?  If you assume that clients will have to continue to support non-PATCH enabled servers, I believe that we are so far into YAGNI territory that the entire question is moot.

On the other hand, create a serialization format that intelligently employs white space (by intelligent, I don’t mean that one where startElement updated, a timestamp, and endElement updated takes three lines) and it will attract its own advocates and unintended applications.

This format could a constrained subset of XML.  Or JSON.  Or YAML.  Or perhaps even a combination.  Simply make sure that it can be parsed with either a “real” parser or with RegExs, register a mime type (e.g. application/atom+json) and step back.

Then apply a text based diff to that format, and you have effectively untangled this intractable problem into two simpler ones.

Posted by Sam Ruby at

Well, I am interested in messier things, rather than AtomPub. I worry about round trips induced by Etag mismatches on PUT and conflict avoidance at all costs, more than I worry about the size of the request. Maybe Joe’s needs require a different format.

I think a YAML approach like the one you’re exploring is a great idea, and experience from each effort will be valuable. And, just maybe, they will converge. A requirement that the format be RegExable would seem to rule out arbitrary “infoset” formats on JSON or XML or YAML, since nesting will get tricky. That might be OK.

I also don’t mind POST. :)

Posted by Robert Sayre at

Well, I am interested in messier things, rather than AtomPub.

Me too.  A canonical JSON+whitespace representation for Atom would enable a lot of serendipity.

Posted by Sam Ruby at

Perhaps something like this?

Posted by James Snell at

More on XML Diff

I was thinking some more about the XML Diff problems being posed lately, and recalled that Zhang and Shasha published some results on calculating differences for trees. I ran into Daniel Ehrenberg’s discussion of the general problem. Daniel’s post...

Excerpt from The Eighty-Twenty Solution at

PATCH is useless until there is a usable accepted patch document standard.  I can understand practical considerations, wanting to start by scoping down, KISS, but ultimately there needs to be a good solution for XML diff work, and the solutions needs to be in XML.

JSPON has always been my ideal model for updating javascript objects, in that it allows referential integrity for javascript objects.  XML is inherently a lot more complex than JSON, but I like Joe’s starting place of using XML-ID for references.

Posted by rektide at

PATCH is useless until there is a usable accepted patch document standard.

There are usable patch document standards for document formats other than XML.

...and the solutions needs to be in XML.


Posted by James Snell at

More Thoughts on an HTTP PATCH and AtomPub

... [more]

Trackback from Dare Obasanjo aka Carnage4Life


More on XML Patch formats

With regards to comments here and here… Joe suggested the possibility of using a serialized form of simplified sax-like events, then using a standard diff to compare two different serialized streams. He’s kicking around his own prototype, but I...

Excerpt from at

PATCH motivating examples

Sam Ruby : Spend some time up front specifying the behaviors that you want to address.  In the case of Atom, adding an entry, deleting an entry, adding a category to an entry, fixing a typo in the content are examples of common scenarios.  Feel...

Excerpt from BitWorking | Joe Gregorio at

Having read matching diffing and merging xml I can see it’s a hard problem.  If you go for a pure XML diff, then you are going to need to sort the ordering problem, perhaps via a schema specific pre-process step.

If you go for pre-processing (like ordering the elements, then putting each on it’s own line), then using gnu diff, you’ll hit another problem.  The ordering of attributes in XML is explicitly unordered.  In some libraries serialising the same document to disk several times will result in a different attribute order, which of course will fox gnu diff.

XML Dig Sig gets around this by using XML Cannonicalisation to specify an unabigous order for attributes.  The problem is that even once you’ve done the cannonicalisation, you’ve lost it once you write to disk.

This suggests that a non-XML approach has merit.

Posted by David Roussel at

Add your comment