Unicode and weblogs
Hossein Derakhshan: We should promote Unicode standard among English speaking programmers. Many tools do not work well with Unicode and this sucks.
I'm doing my part. It took only a few lines of code for me to convert my weblog over to utf-8 (plus changing the content type in a few templates and a configuration file... bah). Jacques Distler updated MTStripControlChars.
Character sets seem to be a classic leaky abstraction. Java has excellent Unicode support on the inside, but you still need to worry about the last mile problem. David Czarnecki has his blojsom weblog working but apparently had to tweak his feed. Similar story for Simon Brown, but java.blogs garbles the post in translation. It took a few lines of change to get roller working, and it looks like a few more lines will be required.
How do some of the .Net and PHP based weblogs fare?
Full-blown utf-8 support has been part of my home-brew weblogging system from day 1 thanks to libxml2.
Posted by Thijs van der Vossen at
I know PHP by default outputs ISO-8859-1, and ASP.NET defaults to UTF-8. Both can be overridden (although far more easily in ASP.NET than in PHP, at least the last time I had a look at it), but I will be pretty surprised if you see any large pack of developers doing this in either framework.
Posted by Asbjørn Ulsberg at
I'm having problems with your commenting system, btw. I can't have 'ø' in my name (I have to HTML encode it), and when I [Preview], the <textarea> is empty, so I have to copy/paste the stuff I've written into it to submit the comment. The error message I receive when posting with 'ø' in my name:
CGI Failure
traceback: Traceback (most recent call last):
  File "gateway.cgi", line 39, in ?
    post()
  File "/home/rubys/mombo/post.py", line 237, in post
    print template(searchList=[data, config])
  File "/home/rubys/mombo/template/comment.py", line 276, in respond
    write(filter(VFN(VFS(SL + [globals(), _builtin_],"parent",1),"get",0)('name',''), rawExpr="$parent.get('name','')")) # from line 152, col 64.
  File "Cheetah/Filters.py", line 106, in filter
UnicodeError: ASCII encoding error: ordinal not in range(128)
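The traceback above is the classic symptom of a non-ASCII character being pushed through an implicit ASCII conversion. The following is only a minimal sketch in present-day Python 3, not the Python 2 / Cheetah code that actually failed; the function names `to_ascii_bytes` and `to_utf8_bytes` are made up for illustration. It reproduces the same class of error and shows the explicit UTF-8 alternative:

```python
def to_ascii_bytes(name: str) -> bytes:
    """Encode with the ASCII codec, as an implicit conversion would."""
    return name.encode("ascii")


def to_utf8_bytes(name: str) -> bytes:
    """Encode explicitly as UTF-8, which can represent any code point."""
    return name.encode("utf-8")


# "Asbjørn": U+00F8 lies outside ASCII's 0..127 range.
name = "Asbj\u00f8rn"

try:
    to_ascii_bytes(name)
    ascii_failed = False
except UnicodeError:  # UnicodeEncodeError is a subclass of UnicodeError
    ascii_failed = True

# With UTF-8 the 'ø' becomes the two bytes 0xC3 0xB8.
utf8 = to_utf8_bytes(name)
```

The usual remedy is to decode input once at the boundary, keep text as Unicode internally, and encode explicitly on output.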
Outputting utf-8 in PHP doesn't strike me as all that hard:
header( 'Content-type: text/html; charset=utf-8' );
(Well, other than the slight challenge of remembering to do that before anything that triggers PHP to start sending output.)
Posted by Phil Ringnalda at
Iñtërnâtiônàlizætiøn category
Iñtërnâtiônàlizætiøn All internationalization tests pass. Entry title, text (includes filename on disk): Check Category name and description (includes directory on disk): Check Comments: Check (includes comment e-mail) Trackbacks: Check (includes...Excerpt from Bedeviled Mojo Slop (Reloaded) at
これは日本語のテキストです。読めますか?
Let's see how Unicode and weblogs does with Japanese :) これは日本語のテキストです。読めますか?...Excerpt from Bedeviled Mojo Slop (Reloaded) at
Asbjørn: the problem should be fixed. You should be able to put your name in directly (in other words, without having to HTML encode it first).
Posted by Sam Ruby at
I was sure I'd created the database with "-E Unicode". Oh well. Javablogs should be serving UTF-8 properly now, although old data will still be broken unless/until it can be refetched from the original RSS feeds.
Posted by Charles Miller at
Javablogs Now Serving UTF-8
Sam Ruby pointed out (http://www.intertwingly.net/blog/1763.html) that Javablogs was garbling posts that contain highbit characters. This was mainly due to the database having been created in ASCII mode. Javablogs is now serving pages in full UTF8,...... [more]Trackback from Confluence: Javablogs News and Updates at
Let us test some Hindi Text
देखें हिन्दी कैसी नजर आती है। अरे वाह ये तो नजर आती है।
BTW I have a MT blog in Hindi at [link].
Pankaj
Posted by Pankaj Narula at
Phil, stating that you output in UTF-8 isn't the same as actually outputting in UTF-8. The last time I checked, PHP didn't do any magic when setting the 'charset' with the header() function -- the characters will still be encoded with ISO-8859-1. In ASP.NET, though, this magic does happen, just by changing two values in web.config.
Posted by Asbjørn Ulsberg at
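The gap between declaring a charset and actually emitting it can be illustrated outside PHP. This is a rough Python sketch under stated assumptions: `declared_charset` stands in for the value in the Content-Type header, and `decodes_as` is a hypothetical helper, not part of any framework mentioned in this thread.

```python
def decodes_as(body: bytes, charset: str) -> bool:
    """Return True if the bytes are at least well-formed in the given charset."""
    try:
        body.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


text = "Asbj\u00f8rn"  # "Asbjørn"

# What the header claims versus what the script actually sends.
declared_charset = "utf-8"
body_latin1 = text.encode("iso-8859-1")  # 'ø' is the single byte 0xF8

# 0xF8 is never valid in UTF-8, so the mismatch is detectable here.
# A successful decode would not prove the label correct, only plausible.
mismatch = not decodes_as(body_latin1, declared_charset)

# Making the bytes match the label: decode from the real encoding,
# then encode to the declared one.
body_utf8 = body_latin1.decode("iso-8859-1").encode(declared_charset)
```

Either the label or the bytes has to change; the header alone does not convert anything.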
I think I'm pretty much there now. Everything internally is being represented okay but I was having some problems actually getting the XML feed to be streamed out as UTF-8 (rather than ISO-8859-1), even though the rest of my site was working fine. TrackBacks were also causing me some problems but this was as simple as changing the "Content-Type" HTTP header.
Posted by Simon Brown at
What else needs to be done to a PHP-based website, besides using the header() function, to deliver UTF-8? I would love to expand the article I wrote on serving XHTML properly to incorporate this.
Posted by Simon Jessey at
Simon, if I'm not mistaken (I rarely program any PHP anymore), you also need to use the utf8_encode() function to actually serve the bytes as UTF-8. If you just set the 'Content-Type' header with the header() function, the script only declares that it uses UTF-8 -- the bytes are still served as ISO-8859-1.
I'm not sure whether this has changed in later versions of PHP, though.
Posted by Asbjørn Ulsberg at
Thank you for that information, Asbjørn. I will investigate further.
It is cool that some of the blogging tools are already being readied for advanced character sets. As someone with a home-brewed system, I am always interested to learn about any techniques that can make websites more accessible, and I consider character sets to be an important accessibility consideration.
Posted by Simon Jessey at
Simon Jessey, you might also want to cleanse your output of windows-1252 characters. Here's an example (source)
Posted by Sam Ruby at
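The example linked in the comment above is not reproduced here. As a separate, hedged illustration of what such cleansing might look like: text labelled ISO-8859-1 often really contains windows-1252 bytes in the range 0x80 to 0x9F (curly quotes, ellipses), which turn into invisible C1 control characters when decoded as ISO-8859-1. The function name `fix_cp1252_controls` is hypothetical, and this sketch assumes that any C1 control character in the input is a misread windows-1252 byte.

```python
def fix_cp1252_controls(text: str) -> str:
    """Map C1 control characters back to the windows-1252 characters
    they most likely started out as."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in
                # Python's cp1252 codec; leave those untouched.
                pass
        out.append(ch)
    return "".join(out)


# Curly quotes and an ellipsis as windows-1252 bytes, misread as ISO-8859-1.
raw = b"\x93quoted\x94 text\x85"
misread = raw.decode("iso-8859-1")
cleaned = fix_cp1252_controls(misread)
```

After cleansing, the string holds real Unicode punctuation (U+201C, U+201D, U+2026) and can be encoded as UTF-8 like any other text.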
Well, depends on whether you actually want to do things or not. I was thinking in terms of passthrough, just reading a utf-8 text file and echoing it out. If you want to use string functions in between, then I think you need to either compile in the multibyte string extension or use brilliantly sick workarounds.
Posted by Phil Ringnalda at
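Why byte-oriented string functions get in the way can be shown with a small sketch. It is written in Python rather than PHP, so it only illustrates the underlying issue, not the behaviour of any particular PHP function; `truncate_utf8` is a hypothetical helper and assumes its input is valid UTF-8.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut UTF-8 data to at most `limit` bytes without splitting a character.

    Assumes `data` is valid UTF-8; errors="ignore" then only drops the
    incomplete sequence left at the cut point.
    """
    return data[:limit].decode("utf-8", errors="ignore").encode("utf-8")


# "Iñtërnâtiônàlizætiøn", written with escapes to keep the source ASCII-only.
text = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n"
data = text.encode("utf-8")

char_count = len(text)  # counts code points
byte_count = len(data)  # counts bytes; each accented letter takes two

# A naive byte slice can stop halfway through a character.
naive = data[:2]                 # b"I\xc3": 'I' plus half of 'ñ'
safe = truncate_utf8(data, 2)    # b"I"
```

Passing bytes through untouched is safe; counting, slicing, or changing case needs code that knows where the character boundaries are.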
The multi-byte string extension looked like a nice solution, but having to compile stuff to get it to work is IMHO a bit of overkill. My thoughts go to all the thousands and thousands of developers who don't have control over their web servers, and don't have understanding administrators who can do this for them. Many of these don't even pay for their hosting service, and thus can't really demand anything either.
It's sad that PHP doesn't support Unicode natively, and by default. I hope this will be corrected in future versions of the framework. Maybe it's already corrected in PHP 5.x(?).
Posted by Asbjørn Ulsberg at
Sam, when I press [Preview], everything previews fine, but the <textarea> is empty. So when I then press [Post], I get to a completely blank and empty page. A fix seems to be to copy the text I write in the primary comment form, paste this into the preview form, and then submit. Can you please make sure the comment gets its way into the <textarea> in the preview form as well?
Posted by Asbjørn Ulsberg at
Asbjørn, I see that you are using Opera. Unfortunately, I see no problems using IE or Mozilla. The last change I made which might have affected this area was the utf-8 change, but you have posted since then, so I am at a loss as to what might have caused it.
Can you "view source" on the page that results after you push [Preview]? Do you see something like:
<textarea cols="59" name="comment" rows="12">Asbjørn, I see that you are using Opera. Unfortunately, I see no problems using IE or Mozilla. The last change I made which might have affected this area was the utf-8 change, but you have posted since then, so I am at a loss as to what might have caused it. Can you "view source" on the page that results after you push [Preview]? Do you see something like:</textarea>
Posted by Sam Ruby at
I ate tin
Sam Ruby has kick started a wonderful meme of I18N awareness. I'm admittedly a late-comer to understanding the gory details of all of it. I've recently, for Lucene in Action, dug deeper into how to "analyze" text of other languages. Sam's blog...Excerpt from Erik Hatcher - Blog at
Sam, the <textarea> actually has the content, but it isn't visible. I don't understand why. I'll look into it -- it might be a bug in Opera (I'm using 7.5 Beta 1). Something else: why not smack some <label>s into the comment form, at least around the «Remember info?» text?
Posted by Asbjørn Ulsberg at
I don't use Haskell because last time I checked, Unicode wasn't a first class concept in an otherwise very cool language.
Posted by doug ransom at
Unicode
Sam recommends to spread the gospel of the Unicode. I say amen to that. I already wrote an extensive...... [more]Trackback from chaotic intransient prose bursts at
All Unicode-enabled websites can have this icon (there's a French one as well) on them to declare to the world that they support the standard. The icon should of course link to the Unicode website.
Posted by Asbjørn Ulsberg at
Iñtërnâtiônàlizætiøn
I just stole this from Anne's weblog. I wanted to test this stuff anyways. How does my weblog perform using unicode. See also: Survival guide to i18n. Some tests: これは日本語のテキストです。読めますか Let’s see how Unicode...Excerpt from Russell Beattie Notebook at
UTF-8 and Web Site Development
This post will contain some tips on how to set up your web development process to use UTF-8 end to end. What happened was, I saw a pair of posts by Sam Ruby (Unicode and weblogs, Aggregator i18n tests). I can be a bit of a careful (read: slow)...Excerpt from Robert Hahn: inspired by integration at
Anne
Monday 31 May 2004 00:14 No, not very much. utf-8 is just much more universal and nicer imo. See also: [link] and [link] and...Excerpt from GoT at
Hi, I am new to web design and have just started learning Unicode and how to make a UTF-8 website. I would like to offer my web design service to bilingual users, namely English and Chinese, and believe UTF-8 is the way to go. Am I being naive to think that setting <'Content-type: text/html; charset=utf-8'> is the answer for UTF-8 website development, or is it more complicated than I thought?
And how do I convert my Chinese to Unicode?
大家好!希望能在这里学到点东西。
tee 郑玉萍
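On the conversion question, one possible approach is sketched below in Python. It assumes the existing Chinese text is stored in GB2312, which is only a guess about the source files; other legacy encodings such as GBK or Big5 would need their own codec name. The variable names are illustrative.

```python
# "大家好", written with escapes to keep the source ASCII-only.
greeting = "\u5927\u5bb6\u597d"

# Simulate text saved by a legacy tool: GB2312 uses two bytes per character.
legacy_bytes = greeting.encode("gb2312")

# Conversion is always two steps: decode from the encoding the bytes
# are really in, then encode to UTF-8.
as_unicode = legacy_bytes.decode("gb2312")
utf8_bytes = as_unicode.encode("utf-8")  # three bytes per character here
```

The charset declaration then has to match what is actually served, so the page should be saved and sent as these UTF-8 bytes, not just labelled that way.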
Posted by Tee Peng at
Let’s try some archaic stuff. First, some runes, which should be pretty easy--they’re on the Basic Multilingual Plane. Follow the link for free fonts.
ᚠᚢᚦᚬᚱᚴ ᚼᚾᛁᛅᛋ ᛏᛒᛘᛚᛦ - Runic, BMP ("fuþork hnias tbmlR")
Something a bit harder: The first two words of the Lord’s Prayer in Gothic, which is on the Supplementary Multilingual Plane.
𐌰𐍄𐍄𐌰 𐌿𐌽𐍃𐌰𐍂 - Gothic, SMP ("atta unsar")
Well, they both look good in the preview.
Posted by Eric at
Cool! Both worked correctly using MacOS X Firefox 1.5. You even handle the SMP characters (>= U+10000) correctly.
FWIW, Google can search for runes, but not for Gothic.
Posted by Eric (ᛂᚱᛁᚴ) at
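Why the Gothic test is harder than the runic one shows up in the encoded forms. This is a minimal Python sketch with illustrative helper names (`utf8_length`, `utf16_code_units`); it is not a description of how this weblog or any browser handles these characters.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of the text's UTF-16 encoding."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


rune_fehu = "\u16a0"         # first rune of the sample, on the BMP
gothic_ahsa = "\U00010330"   # first Gothic letter of the sample, on the SMP

# BMP characters fit in one UTF-16 code unit and at most three UTF-8 bytes.
bmp_units = utf16_code_units(rune_fehu)      # [0x16A0]

# Characters at or above U+10000 need four UTF-8 bytes and a surrogate
# pair in UTF-16, which is where software that assumes "one character is
# one 16-bit unit" tends to break.
smp_units = utf16_code_units(gothic_ahsa)    # [0xD800, 0xDF30]
```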
blojsom never had a problem with encoding and always had a valid feed. The issue was on Simon's end but all seems to be better on his end now :)
Posted by David Czarnecki at