It’s just data

Detecting Not Modified Reliably

Yesterday, I more fully integrated Joe’s threading work into Venus.  From an end user’s perspective, one benefit is that the first time you specify spider_threads, you will see an immediate payoff, as the Last-Modified and ETag header values previously captured and stored in the Venus cache will be used.

With this change, the HttpLib2 cache becomes optional, but may soon provide additional benefits.

In debugging this, I took a look at ETag and Last-Modified usage, and found a few surprises.  Sure, I found a few sites that provided neither, yet would return the same data, byte for byte, again and again.  Most of these sites appeared to compute their feeds dynamically on each request; this includes sites such as IBM Developer Works: example.

As I said, this wasn’t a surprise.

Some sites would provide both ETag and Last-Modified headers, yet would always send the full content if an If-None-Match header was provided.  They would respect If-Modified-Since, but only if If-None-Match was not provided.  These sites are typically ones powered by WordPress: example.  Anne’s feed also falls into this category, but I can’t determine how it was produced.

While that was surprising, even more puzzling is the fact that there are some feeds out there that only intermittently support ETags.  And by intermittently, I do mean occasionally, like anywhere from one time in two to about one time in four or less.  All such feeds that I could find come from blogs.msdn.com: example.

You can verify this yourself with the following Python 2.4 script.  Simply pass one or more URIs as command line parameters:

import urllib2, sys

for uri in sys.argv[1:]:
  # skip the -e and -m options themselves
  if uri.startswith('-'): continue
  # first request: unconditional, just to capture the validators
  headers = urllib2.urlopen(uri).headers
  # second request: echo the validators back as conditional headers
  request = urllib2.Request(uri)
  if headers.has_key('etag') and '-e' not in sys.argv:
    request.add_header('If-None-Match', headers.get('etag'))
  if headers.has_key('last-modified') and '-m' not in sys.argv:
    request.add_header('If-Modified-Since', headers.get('last-modified'))
  try:
    print uri, urllib2.urlopen(request).code
  except urllib2.HTTPError, e:
    # urllib2 raises on a 304, so Not Modified shows up here
    print e

Additionally, you can specify either or both of -e and -m to cause the associated header to be omitted.

What you want to see is HTTP Error 304: Not Modified.  If, instead, you simply see 200, then the full content was sent both times.
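
For readers on Python 3, where urllib2 no longer exists, the same check can be sketched roughly as follows.  This is an approximation of the script above rather than a drop-in replacement: the helper names conditional_headers and check are made up for illustration, and the -e and -m flags are replaced by keyword arguments.

```python
import urllib.error
import urllib.request


def conditional_headers(previous, use_etag=True, use_last_modified=True):
    """Build validator headers from the headers of an earlier response.

    `previous` is any mapping of header names to values; names are
    matched case-insensitively.
    """
    lowered = {name.lower(): value for name, value in previous.items()}
    headers = {}
    if use_etag and 'etag' in lowered:
        headers['If-None-Match'] = lowered['etag']
    if use_last_modified and 'last-modified' in lowered:
        headers['If-Modified-Since'] = lowered['last-modified']
    return headers


def check(uri, **options):
    """Fetch `uri` twice and return the status code of the second fetch."""
    with urllib.request.urlopen(uri) as first:
        validators = conditional_headers(first.headers, **options)
    request = urllib.request.Request(uri, headers=validators)
    try:
        with urllib.request.urlopen(request) as second:
            return second.status
    except urllib.error.HTTPError as e:
        # urllib reports 304 Not Modified as an HTTPError
        return e.code
```

A return value of 304 from check(uri) corresponds to the Not Modified case; 200 means the full content was sent both times.  Passing use_etag=False or use_last_modified=False omits the associated header.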

Conclusions

Recommendation to feed producers: don’t send Etag and Last-Modified headers unless you really mean it.  But if you can support it, please do.  It will save you some bandwidth and your readers some processing.

And to feed consumers, while supporting these headers can save you bandwidth, computing a hash on the content may save you processing time.  I’ve now implemented this for Venus.
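
As a rough illustration of the hashing idea (this is not the actual Venus code), a consumer might keep a digest of the last message body it saw for each URI and skip parsing when the digest is unchanged.  The names body_changed and seen, and the choice of SHA-1, are assumptions made for this sketch.

```python
import hashlib

# Hypothetical in-memory store: URI -> hex digest of the last body seen.
seen = {}


def body_changed(uri, body, store=seen):
    """Return True if `body` (bytes) differs from the last body recorded for `uri`."""
    digest = hashlib.sha1(body).hexdigest()
    if store.get(uri) == digest:
        return False
    store[uri] = digest
    return True
```

This does not save any bandwidth, since the body has already been downloaded by the time it is hashed; the saving is in not re-parsing a feed that is byte-for-byte identical to the previous fetch.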

Update: WordPress ETag bug [via David Terrell]


It’s a bit strong to tell people to only use ETags and Last-Modified when they support validation; they can be used for other things as well.

LM is useful for calculating heuristic freshness in caches, if the server hasn’t bothered to make it explicit.
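
To make the heuristic concrete: HTTP/1.1 (RFC 2616, section 13.2.4) suggests that a cache lacking explicit expiry information may treat a response as fresh for some fraction of the time since Last-Modified, with 10% given as a typical setting.  A minimal sketch, where the function name and the example dates are made up for illustration:

```python
from email.utils import parsedate_to_datetime


def heuristic_freshness(date_header, last_modified_header, fraction=0.1):
    """Return a heuristic freshness lifetime as a timedelta.

    Both arguments are HTTP date strings such as
    'Fri, 24 Nov 2006 12:00:00 GMT'.
    """
    date = parsedate_to_datetime(date_header)
    last_modified = parsedate_to_datetime(last_modified_header)
    return (date - last_modified) * fraction


# A response dated ten days after its Last-Modified time would be
# considered fresh for about one day under the 10% rule.
lifetime = heuristic_freshness('Fri, 24 Nov 2006 12:00:00 GMT',
                               'Tue, 14 Nov 2006 12:00:00 GMT')
```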

ETags can be used for optimistic concurrency on writes; see [link]

Besides, validation is an optimisation; it doesn’t cost the cache any more than an extra header to try it out, and the Web still works correctly without it. Of course, it’s very helpful when it is supported.

Not sure what’s happening with Anne’s feed, but FWIW the Last-Modified date I see is malformed (it’s missing GMT). That will mess up some clients...

Posted by Mark Nottingham at

And to feed consumers, while supporting these headers can save you bandwidth, computing a hash on the content may save you processing time. 

Are you hashing the entire stream returned by the server or are you hashing the content of individual nodes in the XML returned?

Posted by Dare Obasanjo at

It’s a bit strong to tell people to only use ETags and Last-Modified when they support validation

I’m confused.  Where did I say that?  If by supporting validation you mean supporting headers like If-None-Match, then your reaction surprises me.

Let’s look at a scenario: I fetch a feed.  In the headers, I get both an ETag and a Last-Modified header.  As a respectful consumer, what headers should I send on my next request?

Posted by Sam Ruby at

Are you hashing the entire stream returned by the server or are you hashing the content of individual nodes in the XML returned?

The HTTP Message body, i.e., the stuff that would be passed as input to the feed parser.  Alternate suggestions welcome.

Posted by Sam Ruby at

I believe the intermittent etag support will be from the ASP.NET caching infrastructure, where the etag support only works while it has a cached copy of the response in memory, as soon as that cache is flushed (typically on time) it’ll regenerate the feed from the original ASP.NET code, and recache, so you’ll see new etags etc. even though the content hasn’t changed. It’s kinda dumb, which is why my .NET blogging engine generates a real file for the feed and relies on IIS’s etag support.

Posted by Simon Fell at

[from gregorrothfuss] Sam Ruby: Detecting Not Modified Reliably

“And to feed consumers, while supporting these headers can save you bandwidth, computing a hash on the content may save you processing time.” i can confirm that ;)...

Excerpt from del.icio.us/network/manuel at

Yeah, that seems to be the case for Mike’s feed

GET /mikechampion/atom.xml HTTP/1.1
User-Agent: curl/7.15.3 (i586-pc-mingw32msvc) libcurl/7.15.3 OpenSSL/0.9.7d zlib/1.2.2
Host: blogs.msdn.com

HTTP/1.1 200 OK
...
X-Powered-By: ASP.NET

Posted by Simon Fell at

[link] contains information on the code I used. PHP trickery.

Posted by Anne van Kesteren at

I was actually looking at exactly this problem in wordpress when this post showed up in my aggregator.  Some lameness in that code.

Posted by dbt at

Bloglines does ETags, Last-Modified, and multiple levels of hashing.

For hashing:
1) Hash the whole feed/HTTP body; if different than last, continue.
2) Parse the feed into objects and hash the contents of the objects; if different from last, continue.
3) Some detection of bad Content Producers who modify every item every time you fetch it (such as including a timestamp in an <!-- escaped area)
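
A minimal sketch of the first two levels described above, assuming entries have already been parsed into (id, content) pairs.  This is not Bloglines code; the function names and the layout of the state dict are hypothetical.

```python
import hashlib


def digest(text):
    """Hex SHA-1 digest of a text string."""
    return hashlib.sha1(text.encode('utf-8')).hexdigest()


def changed_entries(body, entries, state):
    """Return the ids of entries that are new or changed.

    body:    the raw feed text
    entries: iterable of (entry_id, content) pairs parsed from the feed
    state:   dict carrying hashes between calls
             ('body' -> hash, 'entries' -> {entry_id: hash})
    """
    # Level 1: whole-body hash; if unchanged, there is nothing to do.
    body_hash = digest(body)
    if state.get('body') == body_hash:
        return []
    state['body'] = body_hash

    # Level 2: per-entry hashes; report only entries whose hash differs.
    known = state.setdefault('entries', {})
    changed = []
    for entry_id, content in entries:
        entry_hash = digest(content)
        if known.get(entry_id) != entry_hash:
            known[entry_id] = entry_hash
            changed.append(entry_id)
    return changed
```

In a real consumer the entries would only be parsed once the level 1 check has failed; they are passed in up front here just to keep the sketch short.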

Posted by Paul Querna at

Ok, the wordpress code is seriously lame.  The ETag itself is just a hash of the last-modified and it’s getting corrupted by their retarded string escaping.  I just removed the header("ETag...") from line 1637 of classes.php (as of wordpress 2.0.5) and I’ll just let last-modified do its thing.

Posted by dbt at

Hey Sam,

You said: “Recommendation to feed producers: don’t send Etag and Last-Modified headers unless you really mean it.  But if you can support it, please do.  It will save you some bandwidth and your readers some processing.”

Some people will read that as “don’t send ETag and Last-Modified unless you support validation on them.” Note that I was talking about server, not client, behaviour; I was just pointing out that these headers can be used for other things too.

BTW, a lot of the time you’ll see validation not seeming to work (especially with ETags) because the server is actually a farm, and they’re not syncing their metadata. In the case of Last-Modified, this can happen when there are clock sync problems; for ETags, it’s often because Apache uses the inode to calculate the ETag’s value, by default, and it’s different across the farm. See: [link]

Cheers,

Posted by Mark Nottingham at

I can’t reproduce the reported results for WordPress with several of the latest versions: [link]

Posted by Morten Frederiksen at

[from kellan] Sam Ruby: Detecting Not Modified Reliably

[link]...

Excerpt from del.icio.us/network/rabble at

Sam Ruby: Detecting Not Modified Reliably

Recommendation to feed producers: don’t send Etag and Last-Modified headers unless you really mean it. But if you can support it, please do. It will save you some bandwidth and your readers some processing....

Excerpt from Public marks with search pim ruby at

Just curious (with a cowardly cop-out that it’s late and I’m well into a few bottles of Timothy Taylor Landlord) but what’s so bad about using the value of the Last-Modified header to generate an ETag value? That seems reasonably sensible in my current state of thought...

Posted by James Abley at

Sam Ruby: Detecting Not Modified Reliably

save processing time and bandwidth for unmodified resources with ETags and Last-Modified support...

Excerpt from del.icio.us/jedws at

RSS/Atom feeds, Last Modified and Etags

Sometime last week I read this piece by Sam Ruby, which summarized says this:

…don’t send Etag and Last-Modified headers unless you really mean it.  But if you can support it, please do.  It will save you some bandwidth and your readers some p...

... [more]

Trackback from Aaron Johnson at

links for 2006-11-24

From the blogroll… Bellevue Schools wikify their entire curriculum Syndication to mobile devices Practical JavaScript lecture Consumers yawning over HDTV? Detecting Not Modified Reliably...

Excerpt from The Robinson House at

Four Verbs Good, Two Verbs Better?

Much as I enjoyed Why PUT and DELETE, I have to question Eliotte’s advice. When crafting a Web API, it’s worth knowing when to use GET over POST, and understanding the value of eTag is going to reap rewards, but why would a publisher...

Excerpt from Paul Downey at

RSS/Atom feeds, Last Modified and Etags

Sometime last week I read this piece by Sam Ruby , which summarized says this: …don’t send Etag and Last-Modified headers unless you really mean it. But if you can support it, please do. It will save you some bandwidth and your readers some...

Excerpt from Struts on SWiK.net at
