Tom Pike: Question: Does Google support pages sent as application/xhtml+xml?
No.
Call this anecdotal if you like, but a few days ago I mentioned two products which have been widely reviewed, and yet my entry appears fairly high in the search results. Pleasingly, traffic is starting to flow to that post, particularly when these product names are combined with the word “Ubuntu”. I hope that these people found something that they consider useful.
The day after that, I referenced a new tool created by Morten Frederiksen. Echoes of my post were picked up by Google, but the original was not. Joe’s post (served as faux XHTML to Google), however, does rank highly — what’s up with that?
But mostly what will drive the transition to Python 3K is that people will start writing code that only works with Python 3.0. By analogy, I don’t begin to presume that I have enough clout to move the mighty Google to action, but perhaps if the folks at the W3C who authored or supported the XHTML standard had created enough meaningful content and served it with the proper media type, Google (and Microsoft!) might have put XHTML support a bit higher on the priority list.
Update: now my post is top of the search results. Manual intervention? Google dance? I’ll probably never know...
[from wearehugh] Sam Ruby: Confirmed: Google Hates XHTML
It’s not surprising that they ignore XHTML. Google is an advertising company! There’s currently zero benefit* to companies switching to XHTML. Those that have are practically guaranteed to be serving as text/html anyway, so why should Google grok application/xhtml+xml? (Of course that answer is “it’s danged easy to parse, so why not?!?”)
For most intents and purposes, XHTML has failed, and it’s a damned shame it has. Until Microsoft produces a user agent that can accept application/xhtml+xml (and preferably lists application/xhtml+xml in its HTTP_ACCEPT header) the incentive for supporting it remains low.
* please read as “practically zero perceived benefit”
Until Microsoft produces a user agent that can accept application/xhtml+xml (and preferably lists application/xhtml+xml in its HTTP_ACCEPT header) the incentive for supporting it remains low.
I don’t believe that it is fair to pin this on Microsoft. I believe that if those that had created XHTML had the courage of their convictions, both Google and Microsoft would have had no choice.
I also believe that there should have been a maintenance release or two of HTML4. In HTML5, the root element MAY have an xmlns attribute, but only if it matches the one defined by XHTML; and void elements may have terminating slash characters in their start tags.
It is these small touches that make transition easier.
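For illustration only (an example I am supplying, not one from the original post), a fragment like the following exercises both allowances: the XHTML namespace declared on the root element, and a terminating slash on a void element.

    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>Small touches</title>
      </head>
      <body>
        <p>A void element with a terminating slash:<br/></p>
      </body>
    </html>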
(Of course that answer is “it’s danged easy to parse, so why not?!?”)
Things like the internal subset, namespaces, et cetera make XML not that much easier to parse than HTML, I think. I suppose the tree construction phase is slightly less involved, but I doubt there’s much difference overall now that we have a specification for parsing text/html.
Things like the internal subset, namespaces, et cetera make XML not that much easier to parse than HTML, I think. I suppose the tree construction phase is slightly less involved, but I doubt there’s much difference overall now that we have a specification for parsing text/html.
Anne, why would they parse it as XML? Just tokenize it like I imagine they must currently do for HTML. Google doesn’t care about the structure of the document, just the words contained within.
Things like the internal subset, namespaces, et cetera make XML not that much easier to parse than HTML, I think. I suppose the tree construction phase is slightly less involved, but I doubt there’s much difference overall now that we have a specification for parsing text/html.
Google is perfectly happy indexing Sam’s atom feed (with its <content type="xhtml">). If they can handle his XHTML content, when sent as application/atom+xml, there’s no reason they couldn’t handle the same content sent as application/xhtml+xml.
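For reference, such an entry embeds the XHTML inside the Atom envelope roughly like this (a simplified illustration, not an excerpt from Sam’s actual feed):

    <entry>
      <title>Example entry</title>
      <content type="xhtml">
        <div xmlns="http://www.w3.org/1999/xhtml">
          <p>Inline <em>XHTML</em> markup, carried inside the feed.</p>
        </div>
      </content>
    </entry>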
Bill, if you did it that way you wouldn’t be able to properly handle tags such as foo:html where foo is bound to “http://www.w3.org/1999/xhtml”.
Pardon me if I’m being dense, but I’d still assume that they don’t parse at that level. I’m under the impression that they’d just parse the page for words, and put the page into their huge word<->document matrix. They don’t need to know what <foo:html> means, just store the important tokens and throw away the ones in your stoplist.
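A minimal sketch of the kind of markup-agnostic word extraction being described here (purely illustrative; the stoplist and regular expressions are my own, and nothing below reflects how Google actually indexes):

    import re

    STOPLIST = {"the", "a", "an", "and", "of", "to"}  # illustrative stoplist

    def extract_terms(markup):
        """Strip anything that looks like a tag, then keep the non-stoplist words."""
        text = re.sub(r"<[^>]*>", " ", markup)          # drop tags without parsing them
        words = re.findall(r"[a-z0-9]+", text.lower())  # crude word tokenizer
        return [w for w in words if w not in STOPLIST]

    print(extract_terms("<foo:html><p>Google hates XHTML</p></foo:html>"))
    # ['google', 'hates', 'xhtml']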
I assert that Google does care about the structure of the document, at least to the point that <a href="http://example.com/"> is relevant to their PageRank algorithm.
Beyond that, the assertion that a draft WHATWG document, which concerns itself with minutiae such as the content type of titles, somehow retroactively makes Google’s job easier, is somewhat absurd. What it hopefully will do, years down the road, is make more people apply HTML consistently; and that will make Google’s job easier.
Joe’s post, however, does rank highly — what’s up with that?
Could it be that Google is DOCTYPE sniffing, and prefers Joe’s “XHTML 1.0 Transitional” DOCTYPE over your “html”?
Pardon me if I’m being dense, but I’d still assume that they don’t parse at that level. I’m under the impression that they’d just parse the page for words, and put the page into their huge word<->document matrix.
I’ve always heard that Google puts more weight on words in <h1> and <title> tags, and that, all other things being equal, it prefers semantic documents to non-semantic ones, so I think they are doing more than just pulling out the words and tossing the tags.
I’m not sure Google actually looks at links. There is at least one search bot out there that follows everything that looks like a URL: the logs of xopus.com are full of XPath expressions.
Hi, I’m an engineer at Google. I meant to stop by several days ago (sorry that it took me a while to get here). I chatted with a crawl person, and he said that Google should handle xhtml+xml pretty well.
It can take a few days for Google to crawl/index/rank individual pages well. I really don’t think there was any manual intervention in this case at all, but feel free to drop me an email if you’d like to discuss it more. Jacques Distler said a few days ago that this post didn’t rank for “google hates xhtml”, but now it does, for example (and Google didn’t do anything special for this post). I think sometimes search engines just need a short time to find/crawl/index/rank a page well.
Sam, I’m checking into the specifics of why we show that. IE6 doesn’t seem to handle that file type very well (it offers to download the page as a file), but I think Google could still present the link better (e.g. that “Unrecognized” label is unfortunate).
Out of curiosity, how would you want that snippet to look in your ideal world? Maybe for browsers like Firefox that handle the page fine, just not even tell the user that it’s a different type of file?
Out of curiosity, how would you want that snippet to look in your ideal world? Maybe for browsers like Firefox that handle the page fine, just not even tell the user that it’s a different type of file?
Let’s make it interesting. I’ve restored content negotiation and enhanced my regular expression. A few examples:
Net result: the same content (with negotiated metadata) is sent to every browser from Lynx to Firefox 2.0, and each displays the content to the best of its abilities. Lynx won’t understand any of the images; IE6 won’t understand some of the CSS or any of the SVG but will display ads, as Google AdSense requires document.write; and browsers that support SVG will get the MIME type they need to trigger SVG support but, as a byproduct, they won’t see ads (document.write doesn’t work in pages parsed as XML).
Based on this, I would suggest that Google not tell any user that it is a different type of file.
Question: what Accept header is sent by the Googlebot crawler?
I just tried it on MSIE 7.0, and got similar results.
Since my wife and I don’t use IE except when we have to, I don’t believe either of us have intentionally installed any plugins.
As to the purported uselessness of the Accept header, do you have any alternate suggestions on how to serve web pages which contain inline SVG to a variety of browsers in a way that gracefully degrades?
Checking whether application/xhtml+xml occurs in the Accept header with a q value other than 0 should probably work fine in the majority of cases. Plus maybe specifically sniffing for Safari. Or you could use your script to inject the SVG dynamically and just serve everything up as text/html.
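A minimal sketch of that check (my own illustration; the function name is made up and the header parsing is deliberately simplistic):

    def accepts_xhtml(accept_header):
        """Return True if application/xhtml+xml appears with a nonzero q value."""
        for part in accept_header.split(","):
            fields = part.strip().split(";")
            if fields[0].strip() != "application/xhtml+xml":
                continue
            q = 1.0  # q defaults to 1 when omitted
            for param in fields[1:]:
                name, _, value = param.strip().partition("=")
                if name.strip() == "q":
                    try:
                        q = float(value)
                    except ValueError:
                        pass
            return q > 0
        return False

    # A Firefox-style header accepts it; a header that omits it does not.
    print(accepts_xhtml("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))  # True
    print(accepts_xhtml("text/html,*/*;q=0.5"))                                              # False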
The only reason I can see is to disable sending application/xhtml+xml to that particular XHTML-UA.
I disable sending application/xhtml+xml to Safari for S5 slideshows (no inline SVG for Safari users!) because its support for JavaScript in XHTML is completely broken.
Otherwise, the usual Accept header logic works just fine for Safari.
because its support for JavaScript in XHTML is completely broken
If a Safari user out there could confirm the value of HTTP_ACCEPT on this page, I would appreciate it.
Also, if anybody notes ways in which my weblog does not gracefully degrade for Safari users please let me know, as I include both SVG and JavaScript in my pages, but both are done in ways that are intended to gracefully degrade.
The bug comments seem to indicate that it works on trunk. Can you confirm? And if so, who cares about Firefox 2? No one of importance runs release builds of a browser.
No one of importance runs release builds of a browser.
I did consider giving it the full CADT treatment, and resolving it worksforme, but then I realized that just kicking it out of my product would work as well, while giving me a thin veneer of productivity.
Until today, I had been following the recommendation of the W3C Validator and serving the XHTML pages of this blog as application/xhtml+xml (to all clients except Internet Explorer). Unfortunately, it appears that Google just doesn’t like to index...