By: Olaf Kolkman
My first computer was a Sinclair ZX81, a tiny little black box that, as Wikipedia explains, did not use ASCII but had its own character set. Character code 0 was space; codes 1-10 were used for block graphics; and codes 11-63 corresponded to punctuation, numbers, and uppercase characters. Character codes 128-191 were reverse video versions of the first 64 characters. Other codes represented BASIC keywords and control codes such as NEWLINE. There were no lowercase characters.1 At the time (I was 15 or 16 years old), I did not give that limited character set a lot of thought. And I never considered how its character codes might need to be encoded and mapped if the machine were ever to communicate with the outside world.
Today, billions of users and even more machines are able to communicate over the Internet, and limited character sets like ASCII are perfectly fine protocol elements for intermachine communications not designed for human consumption. When it comes to application content intended for human consumption, however, the IETF, through RFC 2277, required protocols to support UTF-8 encoding of the Unicode character set. The applications and operating systems are expected to translate UTF-8 to and from the user’s interfaces.
It is at the intersection where pure machine-to-machine and pure user-to-user communications meet that a lot of the challenges remain. When users need mnemonics to pass to their computers and they want those mnemonics to be written in their own languages and scripts, we will need to answer one particular question: How do we represent what users consider a valid and useful identifier in such a way that comparison, delivery, and matching can be done in an unambiguous and standardized manner? That underlying question impacts a broad range of work being done within the IETF, and it has been the motivating force behind specific work being done in the two Internationalizing Domain Names in Applications (IDNA) and IDNA-bis working groups, the Email Address Internationalization working group, and the recent Internationalized Resource Identifier BoF. These groups focus on identifiers that are in wide use and based on legacy representation.
At IETF 76 in Hiroshima, Japan, the Internet Architecture Board (IAB) provided an update for the community with regard to internationalization issues, including work being done by the IAB on a document that describes the various encoding schemes that exist for domain names, as well as the problems that occur when assumptions are made about the context in which those names are used.2 For example, resolver libraries will need to know whether the resolution happens via the DNS, where the encoding of the names is in Punycode, or via alternative encoding schemes, such as UTF-8 DNS name encoding, which is in use in some enterprises. Moreover, within applications, the use of internationalized identifiers can give rise to confusion, as is demonstrated when e-mail addresses (in a user’s native script) are included in the header and body of e-mail messages. It is clear that solutions to these problems are not trivial, which is why internationalization of identifiers and names in Internet protocols has been an ongoing area of interest for the IAB and, thus, why the IETF 76 plenary served as a continuation of the technical plenary presented at IETF 66.
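To make the difference between these encodings concrete, here is a minimal Python sketch. It assumes Python's built-in `idna` codec, which implements the original IDNA 2003 rules rather than IDNA-bis, and it uses a hypothetical name, `bücher.example`. The helper `encodings_for` is invented for illustration and is not part of any resolver library.

```python
def encodings_for(name: str) -> dict:
    """Return two byte-level encodings of the same domain name."""
    return {
        # ASCII-compatible form: each non-ASCII label becomes "xn--" + Punycode.
        "punycode": name.encode("idna"),
        # Raw UTF-8 bytes, the alternative scheme mentioned in the text.
        "utf-8": name.encode("utf-8"),
    }


# Hypothetical example name, not taken from the article.
forms = encodings_for("bücher.example")
print(forms["punycode"])  # b'xn--bcher-kva.example'
print(forms["utf-8"])     # b'b\xc3\xbccher.example'
```

The two byte strings name the same thing but share no common representation of the non-ASCII label, which is why a resolver library has to know which scheme the lookup path expects.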
That the problem of encoding (and in particular, the mapping of one encoding into another) can bite you at unexpected moments was demonstrated in the e-mail that was sent to announce the IETF 76 plenary.3 The e-mail was composed in an editor and then copied and pasted into a Web form that sends announcements to the IETF list. If you look at the announcement, you can see a few occurrences of &#8232;, which is the XHTML character escape for the Unicode LINE SEPARATOR. Somewhere in the process of cutting, pasting, and CGI script handling, the original encoding present in the editor got translated into the XHTML escaped character, which is what ended up in people’s mailboxes.
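The Unicode LINE SEPARATOR is code point U+2028, whose decimal character reference is `&#8232;`. The following Python sketch shows one plausible way such an escape can appear. It is an assumption for illustration, not a reconstruction of what the announcement tool actually did, and the sample text and the helper `escape_non_ascii` are hypothetical.

```python
import html
import unicodedata

LINE_SEPARATOR = "\u2028"


def escape_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Hypothetical pasted text containing an invisible line separator.
pasted = "Technical plenary" + LINE_SEPARATOR + "Internationalization"
mailed = escape_non_ascii(pasted)

print(mailed)                                       # Technical plenary&#8232;Internationalization
print(unicodedata.name(html.unescape("&#8232;")))   # LINE SEPARATOR
```

Unescaping `mailed` with `html.unescape` gives back the pasted string, so nothing is lost in the translation. The trouble is only that the escaped form was delivered to readers as literal text.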
While most casual readers will ignore occurrences of &#8232; in an e-mail text, those occurrences become relevant in a comparison of strings (identifiers and names in particular). In other words, the question becomes: Is a line of text that is typed in reverse video on the Sinclair ZX81 actually the same as the one typed using the first 64 characters?
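A small Python sketch illustrates why this matters for comparison. The `canonical` function below is a hypothetical matching policy chosen for illustration (unescape character references, then treat U+2028 as an ordinary space); it is not a rule taken from any IETF specification, and the sample strings are invented.

```python
import html

LINE_SEPARATOR = "\u2028"


def canonical(text: str) -> str:
    """Hypothetical comparison form: unescape, then treat U+2028 as a space."""
    unescaped = html.unescape(text)
    # Collapse runs of whitespace so spacing differences do not affect matching.
    return " ".join(unescaped.replace(LINE_SEPARATOR, " ").split())


typed = "IETF 76 plenary"
received = "IETF 76&#8232;plenary"

print(typed == received)                        # False
print(canonical(typed) == canonical(received))  # True
```

Whether the two strings "should" match depends entirely on the policy encoded in `canonical`, which is exactly the kind of decision that has to be standardized before identifiers can be compared reliably.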
If you want to know more, please read the slide set,4 the transcript,5 and the draft,2 on which the IAB welcomes your feedback.
This article was posted on 24 January 2010.