It’s just data
Interesting. It strips out everything not in the XHTML namespace, so it would pull out any SVG or MathML, for example, embedded on a page. I'm not sure, from a practical point of view, whether it's better to have content missing entirely where it can't be expressed in HTML, or to have junk that strongly indicates that something is missing.
Obviously the ideal solution is to replace the missing content in the HTML version, but this isn't always possible. I have yet to find a good way of substituting for MathML, for example (a fact which I announce at regular intervals in the hope that one day someone will announce they have just written a MathML to PNG converter).
Posted by jgraham
jgraham, you could convert your MathML to TeX (using MathParser, TeXmml, XSLT, or whatever), and thence to PNG with WikiTeX.
Joseph's attitude appears to be "HTML is for people who can't guarantee well-formedness, XHTML is for those who can". I think that's rather futile: All other things being equal, a text-based format's insistence on well-formedness will be inversely proportional to its popularity. (If that means few people ever use XHTML 2.0, that's fine by me. As for Atom, on the other hand...)
Posted by mpt
Experience has shown me that no one can guarantee XML well-formedness. Either it's too hard, or I've surrounded myself with incompetent fools. Or both.
Posted by Mark
Experience has shown me that no one can guarantee XML well-formedness. Either it's too hard, or I've surrounded myself with incompetent fools. Or both.
Depends what you mean by "guarantee."
I would never want to guarantee well-formedness. But the folks at The String Coffee Table, for instance, get along OK: dozens of posts, hundreds of comments, all well-formed (even valid!) XHTML+MathML.
And not a single one of them could tell you what the phrase "well-formed XML" means.
Posted by Jacques Distler
Jacques; the main distinguishing feature of the String Coffee Table is that it was set up by a professor of theoretical physics, who by all available evidence has a better grasp of how to ensure that his markup remains well formed than most web geeks and has a real need to be well-formed so that the content actually displays. Since neither of these things is true in general, I tend to agree with mpt that requiring well-formed markup will be bypassed by people who perceive it as providing, of itself, no benefits whilst introducing substantial risk.
And I like XHTML 2
mpt; I'm aware of LaTeX to gif/png tools, but it would be really nice to avoid converting from IteX to MathML, back to LaTeX and then finally to gif. I suppose in simple cases the LaTeX to gif tools would work with the IteX, but in general I would expect it to bork on the syntax differences.
Posted by jgraham
I'm not sure what Mark means by guarantee, but as near as I can tell any web site that uses the proper xhtml mime type is making some form of assertion on this matter.
By that criteria, jgraham, Jacques Distler, and this blog qualify.
I even take rather extreme measures in an attempt to ensure that my weblog remains valid.
Posted by Sam Ruby
I'm not sure what Mark means by guarantee, but as near as I can tell any web site that uses the proper xhtml mime type is making some form of assertion on this matter.
Yes. That was the first part of what I was getting at.
By that criteria, jgraham, Jacques Distler, and this blog qualify.
But we, as James would have it, are on the outliers of web markup geekery.
My other point was that, with decent software, you don't need to be a markup geek to have a reasonable assurance (if not a guarantee) of producing well-formed XML (XHTML+MathML).
That's what the folks at the String Coffee Table are doing and, yes, while it was set up by me, it doesn't require any intervention on my part to keep it running (which, by definition, means keep it well-formed).
Posted by Jacques Distler
"...any web site that uses the proper xhtml mime type is making some form of assertion on this matter." Sam is dead-on right. Although as the X-Philes experiment proves, this assertion is far from a guarantee.
I'm wondering whether XHTML 2.0 is going to allow for MIME-type negotiation, or simply require application/xhtml+xml in all circumstances. I can't seem to find any clear statements from the W3C on this, and it would be very interesting to know the answer.
Posted by Evan Goer
Evan, I'm pretty sure that there was a discussion on www-html about mime types that started from the premise that all XHTML2 documents were going to have an XML mime type. The discussion itself was whether this should be application/xhtml+xml or something else. The WG seemed quite in favour of the former and the people who wanted to content-type-negotiate XHTML 1.x and XHTML 2 the latter, as I recall. It's a while since I read any of www-html so I could be entirely wrong of course. I'd be very surprised if anything other than a pure XML MIME type was allowed, given that XHTML 1.1 SHOULD NOT be served as text/html and SHOULD be served as application/xhtml+xml and XHTML1 is seen as the 'easy' backward-compatible precursor to XHTML2.
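For readers unfamiliar with the content-type negotiation being discussed, here is a minimal Python sketch. It assumes the server has the raw Accept header as a string; the function name `choose_content_type` is made up for illustration, and the logic is deliberately simplified (it only checks whether application/xhtml+xml is listed with a non-zero q value, and does not rank it against other types).

```python
def choose_content_type(accept_header: str) -> str:
    """Pick application/xhtml+xml only if the client explicitly lists it.

    Simplified on purpose: wildcards such as */* are ignored, and q values
    are only used to detect an explicit refusal (q=0).
    """
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        if fields[0].lower() != "application/xhtml+xml":
            continue
        q = 1.0  # a media range without a q parameter defaults to 1
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q as a refusal
        if q > 0:
            return "application/xhtml+xml"
    # Fall back to the tag-soup-tolerant type for everyone else.
    return "text/html"
```

A browser that sends `text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8` would get the XML type; one that sends only `text/html, */*` would get text/html.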
Posted by jgraham
Thanks James, I suspected as much.
I must say, XHTML 2.0 must be solving some kind of super important problem if it's worth forcing every designer to be, err, Jacques or James or Sam.
On a related note... I, for one, am keenly interested in what the browsers of the future will do with the following code:
<HTML>
<title>My Web Pgage!
<BLOCKCODE>
echo 'J00 has b33n 0wned~! LOLOLOL!'
</BLOCCKCODE>
</html>
Nah, just kidding about being keenly interested. We all already know the answer.
Posted by Evan Goer
OK, so we have established that somebody who has neither contributed to this thread nor even been referenced by it has an archive from two months ago that is invalid. News at 11.
I'll note that there is a difference between a statement to the fact that incompetent fools exist and a statement that you are surrounded by them.
Posted by Sam Ruby
Sam, according to your own flamage rules, you should strike out or delete your own comment.
The invalid page in question was the subject of my "Thought experiment", which jgraham referenced.
PS - Your moderation system thinks I'm a spammer for some reason.
PPS - my site is back up. It was down due to an obscure conflict in last night's kernel upgrade mandated by the hosting provider of my UML-virtual-hosted system. Totally beyond my control, other than switching hosting providers. What does that have to do with anything?
Posted by Mark
Actually, my flamage rules can be paraphrased as "if you want to flame somebody, do it on your own weblog".
I find that most invalid XML is due to obscure conflicts that most users find to be totally outside of their control, other than switching weblogging software. I see an analogy here, but perhaps this is just me.
And the "some reason" is clearly specified in the warning message: rapid succession of posts from the same source. Don't worry about it.
I will note, however, that you are correct, and that jgraham did reference your thought experiment, which did reference that archive page. I will also note that Nick's current page is well formed, if not valid.
Posted by Sam Ruby
my site is back up. It was down due to an obscure conflict in last night's kernel upgrade mandated by the hosting provider of my UML-virtual-hosted system. Totally beyond my control, other than switching hosting providers. What does that have to do with anything?
Presumably, Sam was trying to point out that one's site might be unavailable for any number of reasons. Ill-formed XML, or a kernel upgrade, ... unavailable is still unavailable.
By unhappy coincidence, while you were having your unscheduled downtime, the front page of my blog was briefly unavailable due to ill-formed XHTML. I'd neglected to sufficiently bullet-proof the content I receive from the Technorati API, and the URL of a new member of my Technorati Link Cosmos contained an ampersand ...
I'd venture that this was easier to fix than your problem.
Posted by Jacques Distler
the front page of my blog was briefly unavailable due to ill-formed XHTML
This obviously would never happen if you used a real XML serializer to create your XHTML pages.
Due to a combination of my bad planning and lack of Perl skills, it is possible for my b-links feed to become ill-formed if I cut-and-paste a URL with an unescaped ampersand. This happened recently and a kind reader suggested that I should use a real XML serializer. I briefly toyed with the idea of running a combination of my feedparser and libxml2 on a cron job to reserialize my feed into UTF-32, but thought better of it and wandered off.
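To make the "real XML serializer" point concrete, here is a small Python sketch using the standard library's xml.etree.ElementTree. The URL and the helper name `link_element` are invented for illustration; this is not the code behind any of the weblogs mentioned here.

```python
import xml.etree.ElementTree as ET

def link_element(url: str, text: str) -> str:
    """Build an <a> element and let the serializer do the escaping."""
    a = ET.Element("a", {"href": url})
    a.text = text
    return ET.tostring(a, encoding="unicode")

# A URL pasted straight from the address bar (hypothetical example).
pasted = "http://example.com/search?q=xhtml&lang=en"

# String concatenation copies the bare ampersand into the markup,
# which an XML parser will refuse.
by_hand = '<a href="%s">a link</a>' % pasted

# The serializer escapes the ampersand in the attribute value.
serialized = link_element(pasted, "a link")
# serialized == '<a href="http://example.com/search?q=xhtml&amp;lang=en">a link</a>'
```

Parsing `serialized` gives back the original URL unchanged, while `ET.fromstring(by_hand)` raises `ET.ParseError`.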
Meanwhile...
Experience has shown me that no one can guarantee XML well-formedness.
I stand by this statement.
Posted by Mark
the front page of my blog was briefly unavailable due to ill-formed XHTML
This obviously would never happen if you used a real XML serializer to create your XHTML pages.
I use Adam Kalsey's Technorati plugin. Does your Python implementation do things better?
In the case of Adam's plugin, it was a matter of changing several instances of <MTTechnoratiLinkURL> in my templates to <MTTechnoratiLinkURL safe_url="1">.
The problem will not recur.
Due to a combination of my bad planning and lack of Perl skills, it is possible for my b-links feed to become ill-formed if I cut-and-paste a URL with an unescaped ampersand.
Anything that involves hand-coding is prone to failure. Of course, if you are doing things by hand, and serving application/xhtml+xml, you will notice immediately that you have erred. The reason The String Coffee Table works is that all hand-entered data (Entries and Comments) are run through the Validator before being posted.
For data that is machine-processed (Trackbacks, Technorati API output, syndicated RSS feeds, ...), one needs more robust bullet-proofing, because there's not going to be a human around to fix it if it's ill-formed.
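As a rough illustration of checking hand-entered data before it is posted, here is a Python sketch of a well-formedness gate. It is an assumption-laden stand-in, not the Validator itself: it checks well-formedness only (not validity), the function names are hypothetical, and because no DTD is loaded, named entities such as &nbsp; would be rejected too.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def well_formedness_error(fragment: str):
    """Return None if the fragment parses as XML, else the parser's message."""
    # Wrap the fragment so that several sibling elements still parse
    # as a single document.
    wrapped = '<div xmlns="%s">%s</div>' % (XHTML_NS, fragment)
    try:
        ET.fromstring(wrapped)
    except ET.ParseError as err:
        return str(err)
    return None

def accept_comment(fragment: str) -> bool:
    """Refuse to post anything that would break the page."""
    error = well_formedness_error(fragment)
    if error is not None:
        print("Rejected, please fix and resubmit:", error)
        return False
    return True
```

For hand-entered data the rejection message goes back to the author, who can fix it on the spot. For machine-processed data there is nobody to read the message, so the caller would need to escape or drop the offending item instead.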
I stand by this statement.
Who said they could guarantee well-formedness 100% of the time? That's as unlikely a claim as your guaranteeing 100% uptime for your site.
Making one's XHTML setup more robust is the name of the game. In composing this comment, I ran into an instance of ill-formedness in Sam's preview function (when I tried to nest blockquotes using ">>"). Doubtless, Sam will fix this ...
Posted by Jacques Distler
Evan: Browsers of the future will reject that document for poor spelling. Accepting documents with poor spelling could lead to accidental misunderstanding between communicating parties. Since correct spelling is so easy to achieve, an instance of misspelling may point to a bug in the content producing entity. In this case, it is clear that the content should not be trusted until the bug is fixed and it can be resent with the correct spellings. Additionally, accepting content with poor spelling places an unacceptable burden on clients to learn the liberal word parsing algorithms which are needed to understand misspelled words. Client software will optionally be allowed to reject text which uses words in contexts not allowed by the rules of grammar.
Jacques: My problem with a strict XML future is that I don't see where the benefit is (unless you're doing something like MathML which, at present at least, only works in a strict XML context), but I see the real possibility of XML errors taking out sites.
You were able to fix your problem because you are using a visible-source CMS which you have the necessary skills and experience to hack on. If some company decides to go XHTML on a large site where they're using some closed-source CMS and they find their visitors are getting parsing errors rather than the content, there's often not a lot they can do about it until the vendor sends an engineer to fix the problem. In some cases they can switch the content type back to text/html. But if you're serving stuff that works as text/html, what was the benefit of sending XML to the client in the first place?
I suppose it could work if everyone started using bug-free XML tools. However the number of XML based CMSs available at present seems to be small. So people will have to shift tools in the future. What's not clear to me is why anyone would change to a tool that offers more risk with (typically) no benefit.
Of course, this doesn't preclude people using XML internally for data storage and transformation. Doing so, but sending text/html over the web would give you all the benefits that XML is supposed to offer without the trauma associated with using it as a distribution format.
Posted by jgraham
I am too lazy and forgetful to make sure the URLs I copy and paste are properly escaped.
So I wrote a MT-plugin to do it.
[link]
Apply safe_urls = 1 to a MT tag and all A tags inside get their HREF properly escaped.
It's all about being lazy :)
Posted by Gavin
Gavin, that's very cool. I will definitely look into that. Does the plugin only handle a href, or can it be expanded to also handle
- area href
- link href
- img src
- img longdesc
- img usemap
- object classid
- object codebase
- object data
- object usemap
- q cite
- blockquote cite
- ins cite
- del cite
- form action
- input src
- input usemap
- head profile
- base href
- script src
- frame src
- frame longdesc
- iframe src
- iframe longdesc
- applet codebase
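A list like the one above lends itself to a lookup table. The Python sketch below shows one way the idea could work; it is not the MT plugin's actual code. The table holds only a hypothetical subset of the pairs, the names are invented, and the regular expressions handle only double-quoted attribute values with no ">" inside them.

```python
import re

# Hypothetical subset of the tag/attribute pairs listed above.
URI_ATTRS = {
    "a": {"href"},
    "img": {"src", "longdesc", "usemap"},
    "blockquote": {"cite"},
    "form": {"action"},
}

# An ampersand that does not already start an entity or character reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")
# A start tag: the element name, then everything up to the closing ">".
TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b([^<>]*)>")
# A double-quoted attribute: name="value".
ATTR = re.compile(r'([A-Za-z][-A-Za-z0-9:]*)\s*=\s*"([^"]*)"')

def escape_uri_attrs(markup: str) -> str:
    """Escape bare ampersands in the URI attributes named in URI_ATTRS."""
    def fix_tag(tag_match):
        tag = tag_match.group(1)
        wanted = URI_ATTRS.get(tag.lower())
        if not wanted:
            return tag_match.group(0)  # not a tag we care about

        def fix_attr(attr_match):
            name, value = attr_match.group(1), attr_match.group(2)
            if name.lower() not in wanted:
                return attr_match.group(0)  # leave other attributes alone
            return '%s="%s"' % (name, BARE_AMP.sub("&amp;", value))

        return "<%s%s>" % (tag, ATTR.sub(fix_attr, tag_match.group(2)))

    return TAG.sub(fix_tag, markup)
```

Because already-escaped ampersands are skipped, running the function twice gives the same result as running it once. Attributes outside the table (a title containing "R&D", say) are left untouched, so this alone does not guarantee a well-formed page.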
You were able to fix your problem because you are using a visible-source CMS which you have the necessary skills and experience to hack on. If some company decides to go XHTML on a large site where they're using some closed-source CMS and they find their visitors are getting parsing errors rather than the content, there's often not a lot they can do about it until the vendor sends an engineer to fix the problem.
The Coffee Table guys have neither the experience, the interest, nor the Unix permissions to hack on their CMS.
And yet, they haven't (so far) needed any interventions from me to bail them out.
If BigCo cannot supply their clients with a CMS as robust as the hacked-up copy of MovableType that the Coffee Table uses, then they deserve to have their clients flee to some superior open-source solution.
In some cases they can switch the content type back to text/html. But if you're serving stuff that works as text/html, what was the benefit of sending XML to the client in the first place?
I agree entirely. I've said many times that if you're not using XHTML for something you couldn't do more simply in HTML, you are probably wasting your time.
Of course, this doesn't preclude people using XML internally for data storage and transformation.
Your internal XML-handling tools will break just as surely if the data they are handling is ill-formed.
Posted by Jacques Distler
If BigCo cannot supply their clients with a CMS as robust as the hacked-up copy of MovableType that the Coffee Table uses, then they deserve to have their clients flee to some superior open-source solution.
The difficulty arises if there is a single CMS that fulfills all the other needs of the company but has hidden bugs in the XML support. Sadly, there's not always an open-source solution that does everything (otherwise you wouldn't be using MovableType and we'd all have bulletproofed XHTML weblogs by now)
Your internal XML-handling tools will break just as surely if the data they are handling is ill-formed.
Sure, but the whole world won't know about the problem :) I think a lot of places would consider that to be enough of a reason to use XML in one domain but avoid it in the other.
Posted by jgraham
Mark, Gavin: give this a try.
Posted by Sam Ruby
Sadly, there's not always an open-source solution that does everything
I don't think there are a lot of solutions --- closed- or open-source --- for producing bulletproof XHTML. Presumably, that explains the continued cachet of getting onto Evan's list.
(otherwise you wouldn't be using MovableType and we'd all have bulletproofed XHTML weblogs by now)
I am using MovableType because:
1) I know Perl and so can hack on it.
2) MT was the best, most extensible and feature-rich weblogging tool available a year and a half ago, when I started.
There are various people playing around with bringing mathematical authoring (MathML) to open-source CMSs. But they're just starting out, and they've a long way to go before they're ready for "production use."
Which is a long-winded way of saying that I'm not about to switch anytime soon.
Posted by Jacques Distler
Works beautifully for me, thanks. I almost never use things like base @href in a post, so I'll probably strip a few out, but it'll be nice to cover a few other things where I might forget, like forms.
Posted by Phil Ringnalda
Atom, Apache, PHP, MT, you name it, Sam Ruby will code for it....
Excerpt from phil ringnalda dot com
Which is a long-winded way of saying that I'm not about to switch anytime soon.
I'm not suggesting that you should. I'm merely saying that people may try to use XML in a system that has imperfect XML support because they need the other features that system has to offer. You are a good example of this. If, in this situation, the user could not hack the source when confronted with an XML parsing problem, it would become a difficult-to-fix issue.
Posted by jgraham
There was a stray print statement on line 80 in the original version I posted. "Some guy" named Jacques Distler brought this to my attention. ;-) Just to be clear on what I did: I took a list that Mark Pilgrim posted, and converted it to a hash...
Excerpt from phil ringnalda dot com: New MT plugin author: Comments
Thanks for the input, I've updated the plugin to be able to look for all the link types HTML::Tagset knows about, or just A HREFs, or whatever you specify:
Posted by Gavin
Hey, my name’s Mark also.
Posted by Mark