ASCII, ISO-8859-1, UCS, and Erlang
Tony Garnock-Jones: Erlang represents strings as lists of (ASCII, or possibly iso8859-1) codepoints. In this regard, it’s weakly typed - there’s no hard distinction between a string, “ABC”, and a list of small integers, [65,66,67].
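You can see this in the Erlang shell, which prints any list of small printable integers as a string:

    1> "ABC".
    "ABC"
    2> [65,66,67].
    "ABC"
    3> "ABC" =:= [65,66,67].
    true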
It is important to realize that Erlang was invented (in 1987) before UTF-8 was (in 1992).
Codepoints
Let’s explore the relationship between ASCII, ISO-8859-1, and UCS (a.k.a. Unicode), by way of example.
First, let’s look at U+0043: Latin capital letter C. The codepoint for this character in UCS is 67 decimal. The codepoint for this character in ISO-8859-1 is 67 decimal. The codepoint for this character in ASCII is 67 decimal.
Next, let’s take a look at U+00C7: Latin capital letter C with cedilla. The codepoint for this character in UCS is 199 decimal. The codepoint for this character in ISO-8859-1 is 199 decimal. This character can’t be represented in ASCII.
Finally, let’s look at U+0421: Cyrillic capital letter Es. The codepoint for this character is 1057 decimal. This character can’t be represented in either ASCII or ISO-8859-1.
Given no other information, I would suggest that a string in Erlang be treated as a list of UCS codepoints, where UCS is a proper superset of ISO-8859-1, which in turn is a proper superset of ASCII.
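Under that suggestion, the three example characters above form one ordinary list of integers:

    %% C, Ç, and Cyrillic Es as a single UCS string
    S = [16#43, 16#C7, 16#421].   % i.e., [67, 199, 1057]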
Binary
As of Unicode 5.0.0, 102,012 code points are defined. This number is a bit larger than 256, which is the number of possible values that can be stored in a byte. So, in general, UCS codepoints will require more than one byte to be represented.
ASCII is simple. Everything is stored in one byte. A bit incomplete, but simple.
ISO-8859-1 is simple. Everything is stored in one byte. A bit incomplete (though not as incomplete as ASCII), but still simple.
UTF-32 is simple. Everything gets 32 bits. A bit wasteful, but simple.
UTF-16 is nearly as simple. Code points less than 65,536 are stored as two bytes; everything else is stored as four. This works because the range of UCS codepoints isn’t contiguous: in particular, the range U+D800 to U+DFFF is reserved for “surrogates”, which UTF-16 uses in pairs to represent codepoints above U+FFFF.
Be forewarned that there actually are two versions each of UTF-32 and UTF-16, depending on whether your machine is big-endian or little-endian.
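To make the surrogate mechanics concrete, here is a minimal sketch of encoding one codepoint as big-endian UTF-16 (the function name is mine, not from any library):

    %% Sketch: UTF-16BE encoding of one codepoint, assuming CP is
    %% valid and not itself in the surrogate range U+D800..U+DFFF.
    utf16be(CP) when CP < 16#10000 ->
        <<CP:16/big>>;
    utf16be(CP) when CP =< 16#10FFFF ->
        V  = CP - 16#10000,                 % 20 bits remain
        Hi = 16#D800 bor (V bsr 10),        % high surrogate
        Lo = 16#DC00 bor (V band 16#3FF),   % low surrogate
        <<Hi:16/big, Lo:16/big>>.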
UTF-8 is simple for those characters which it shares with ASCII. Those characters require only one byte. Everything else requires more than one byte. So a Latin capital letter C is 0x43 in UTF-8. A Latin capital letter C with cedilla is 0xC3 0x87. A Cyrillic capital letter Es is 0xD0 0xA1. One important property of UTF-8 is that it is rare for a byte sequence containing at least one non-ASCII character to also happen to form a valid UTF-8 sequence.
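Those byte values fall directly out of the UTF-8 bit layout. Here is a minimal sketch of a single-codepoint encoder (the function name is illustrative; a real library routine is shown later in this post):

    %% Sketch: UTF-8 encoding of one codepoint as a list of bytes.
    to_utf8(CP) when CP < 16#80 ->            % one byte, same as ASCII
        [CP];
    to_utf8(CP) when CP < 16#800 ->           % two bytes
        [16#C0 bor (CP bsr 6),
         16#80 bor (CP band 16#3F)];
    to_utf8(CP) when CP < 16#10000 ->         % three bytes
        [16#E0 bor (CP bsr 12),
         16#80 bor ((CP bsr 6) band 16#3F),
         16#80 bor (CP band 16#3F)];
    to_utf8(CP) when CP =< 16#10FFFF ->       % four bytes
        [16#F0 bor (CP bsr 18),
         16#80 bor ((CP bsr 12) band 16#3F),
         16#80 bor ((CP bsr 6) band 16#3F),
         16#80 bor (CP band 16#3F)].

With this, to_utf8(16#43) gives [16#43], to_utf8(16#C7) gives [16#C3, 16#87], and to_utf8(16#421) gives [16#D0, 16#A1], matching the examples above.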
For this reason, I would suggest that an RFC 4627 JSON codec for Erlang that chooses to represent strings as binaries assume that binary sequences are UTF-8, and, furthermore, that those bytes which cannot be interpreted as UTF-8 be treated as ISO-8859-1. That sounds complicated, but it’s exactly what this patch does: if the next two, three, or four bytes match one of the UTF-8 patterns, those bytes are treated as a single character; otherwise, that one byte is treated as a single character.
If this approach is taken, all ASCII binary streams will be interpreted correctly, as will all UTF-8 binary streams. As a bonus, nearly all ISO-8859-1 binary streams will be too.
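Here is a minimal sketch of that fallback strategy; this is the shape of the idea, not the actual rfc4627.erl patch:

    %% Sketch: decode a binary to a list of codepoints, trying the
    %% UTF-8 multibyte patterns first and falling back to treating
    %% a lone byte as ISO-8859-1.
    %% (A real decoder would also reject overlong forms.)
    decode(<<>>) ->
        [];
    decode(<<C, Rest/binary>>) when C < 16#80 ->            % ASCII
        [C | decode(Rest)];
    decode(<<2#110:3, A:5, 2#10:2, B:6, Rest/binary>>) ->   % two bytes
        [(A bsl 6) bor B | decode(Rest)];
    decode(<<2#1110:4, A:4, 2#10:2, B:6, 2#10:2, C:6, Rest/binary>>) ->
        [(A bsl 12) bor (B bsl 6) bor C | decode(Rest)];
    decode(<<2#11110:5, A:3, 2#10:2, B:6, 2#10:2, C:6, 2#10:2, D:6,
             Rest/binary>>) ->
        [(A bsl 18) bor (B bsl 12) bor (C bsl 6) bor D | decode(Rest)];
    decode(<<C, Rest/binary>>) ->                           % fallback
        [C | decode(Rest)].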
Converting a string to binary in Erlang
Converting a UCS string to binary can be done with list_to_binary(xmerl_ucs:to_utf8(Value)). This pair of function calls will work for all positive integers which represent valid Unicode codepoints, including all codepoints that may be defined in the foreseeable future (and, yes, from time to time, new characters are added).
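For example, using the three codepoints from earlier:

    1> list_to_binary(xmerl_ucs:to_utf8([67, 199, 1057])).
    <<67,195,135,208,161>>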
Converting an ISO-8859-1 string to binary can be done with list_to_binary(Value). This function call will fail if it encounters an element of the list which is greater than 255. It will produce the same binary representation as the previous call for all codepoints less than 128, and a different binary representation for all codepoints greater than 127.
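The corresponding shell session (exact error formatting varies by Erlang release):

    2> list_to_binary([67, 199]).
    <<67,199>>
    3> list_to_binary([67, 199, 1057]).
    ** exception error: bad argument
         in function  list_to_binary/1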
Converting an ASCII string to binary can be done with either of the above two methods.
Footnotes
For completeness, there are two other things that may be worth exploring. Neither requires much in the way of code, merely a few additional patterns to be matched.
Unicode reserves a single character (zero-width no-break space) as a Byte Order Mark. By matching a specific value against the first two, three, or four bytes of a binary stream, you can determine if the rest of the stream is UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
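A sketch of the patterns involved (note that the UTF-32LE mark must be checked before UTF-16LE, since the former begins with the latter):

    %% Sketch: sniff a Byte Order Mark (U+FEFF in each encoding).
    bom(<<16#EF,16#BB,16#BF, Rest/binary>>)       -> {utf8,    Rest};
    bom(<<16#00,16#00,16#FE,16#FF, Rest/binary>>) -> {utf32be, Rest};
    bom(<<16#FF,16#FE,16#00,16#00, Rest/binary>>) -> {utf32le, Rest};
    bom(<<16#FE,16#FF, Rest/binary>>)             -> {utf16be, Rest};
    bom(<<16#FF,16#FE, Rest/binary>>)             -> {utf16le, Rest};
    bom(Bin)                                      -> {unknown, Bin}.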
There are 27 characters which are often mis-encoded on Windows machines. These patterns are easy to look for and correct. I previously provided a similar function for Ruby which Jim Weirich picked up and included in Ruby’s XML Builder.
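For illustration, a few entries of the sort of mapping involved: Windows-1252 bytes that ISO-8859-1 leaves as control characters, translated to the codepoints usually intended. The full table has 27 entries; this sketch shows five.

    %% Sketch: partial Windows-1252 repair table.
    cp1252(16#85) -> 16#2026;  % horizontal ellipsis
    cp1252(16#91) -> 16#2018;  % left single quotation mark
    cp1252(16#92) -> 16#2019;  % right single quotation mark
    cp1252(16#93) -> 16#201C;  % left double quotation mark
    cp1252(16#94) -> 16#201D;  % right double quotation mark
    cp1252(C)     -> C.        % everything else unchanged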
Tony, if you are interested in pursuing any of these ideas in rfc4627.erl, I can provide test cases and/or patches. Let me know.