It’s just data

Competent Language Designers

Rick Jelliffe: if you make up or maintain a public text format, and you don’t provide a mechanism for clearly stating the encoding, then, on the face of it, you are incompetent. If you make up or maintain a public text format, it is not someone else’s job to figure out the messy encoding details, it is your job.

I guess it would follow that Python and Perl are competent programming languages.


“Incompetent?”  What is it with XML gurus and the name-calling?

Seriously though, Korean text is usually in EUC-KR, not UTF-8, and it’s one of the easiest languages/encodings to auto-detect accurately, using a multibyte prober and some frequency distribution analysis.  And Korean (or anything else) in UTF-8 can be detected with a regular expression.  I’m not sure exactly what the chances are that a random binary file would match, but they certainly go down as the file size goes up.
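To make the regex idea concrete — this is a sketch, not necessarily the expression Mark had in mind — the well-known pattern for well-formed UTF-8 byte sequences (per RFC 3629) can be written as a bytes regex in Python. EUC-KR Korean fails it because its second bytes fall outside the UTF-8 continuation range:

```python
import re

# Well-formed UTF-8 byte sequences (per RFC 3629), as a bytes regex.
# Matches only if the ENTIRE input is valid UTF-8.
UTF8_RE = re.compile(rb"""\A(?:
      [\x00-\x7F]                          # ASCII
    | [\xC2-\xDF][\x80-\xBF]               # 2-byte sequence
    | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3-byte, straight
    | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4-byte, planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}            # 4-byte, planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4-byte, plane 16
)*\Z""", re.VERBOSE)

def looks_like_utf8(data: bytes) -> bool:
    return UTF8_RE.match(data) is not None

print(looks_like_utf8("한국어".encode("utf-8")))   # Korean as UTF-8: matches
print(looks_like_utf8("한국어".encode("euc-kr")))  # same text as EUC-KR: doesn't
```

(In modern Python you’d more likely just try `data.decode("utf-8")` and catch the exception, but the regex makes the detection logic visible.)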

(Yah, those quotes were intentional.  Taken from here.  Savor the irony.)

Posted by Mark at

Random Thoughts, in no particular order:

Posted by Sam Ruby at

Encoded Python

PEP 263: This PEP proposes to introduce a syntax to declare the encoding of a Python source file. The encoding information is then used by the Python parser to interpret the file using the given encoding. I wonder if this will work: #!/usr/bin/env...

Excerpt from Lenny Domnitser’s domnit.org at
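It does work, for what it’s worth: PEP 263 allows the encoding declaration on the first *or second* line, precisely so it can coexist with a shebang. A quick sketch, using a simplified version of the pattern the PEP itself specifies:

```python
import re

# Simplified version of the declaration pattern given in PEP 263.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def find_coding(source: bytes):
    """Return the declared encoding, scanning only the first two lines
    as PEP 263 prescribes, or None if no declaration is found."""
    for line in source.splitlines()[:2]:
        m = CODING_RE.search(line.decode("latin-1"))
        if m:
            return m.group(1)
    return None

# The shebang goes on line one, the declaration on line two.
print(find_coding(b"#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n"))  # -> utf-8
```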

python source encoding and unicode

A post from Sam Ruby on encoding issues, quoting a post by Rick Jelliffe on the same issues, prompted an interesting response by Lenny Domnitser. For those who didn’t follow the link, Lenny’s post includes an excellent example of the use of rot13...

Excerpt from Boxes and Glue at
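For readers who don’t chase the link: the rot13 trick works because rot13 is a registered codec, so (as I recall) a rot13-scrambled source file declaring `# coding: rot13` was directly executable in Python 2. The codec survives in Python 3 as a text-to-text transform:

```python
import codecs

# rot13 is still a registered text-to-text codec in Python 3.
print(codecs.encode("coding tricks", "rot13"))  # -> pbqvat gevpxf
```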
