Networking Research

IDNA Revisited

By: Wendy Rickard

Date: July 7, 2008

This article is based on an IDNA BoF held at IETF 71 in Philadelphia.

In the five years since the original protocol for Internationalized Domain Names (IDNs) was completed and deployed, a handful of new developments, together with a number of real or perceived defects or inadequacies in the original protocol, have led to a desire to revise and upgrade the standard. Those efforts, which are now being undertaken by the new IDNAbis working group (WG), were discussed during a BoF (birds of a feather) meeting in March at IETF 71 in Philadelphia.

IDN Controversies

Adopted in 2003, Internationalized Domain Names in Applications (RFC 3490) and its associated documents, commonly known as IDNA2003, were the first attempt at creating a truly multilingual Domain Name System (DNS) by making it possible to read and write domain names with characters that fall outside the ASCII repertoire. The protocol was based on the then-current version of Unicode (version 3.2), and it was designed to achieve maximum backward compatibility with the existing DNS. While the IDNA2003 initiative was by and large successful, newer versions of Unicode have since been released, and a number of concerns about the protocol’s potential limitations and defects have been raised. Some of those concerns may have been the result of unrealistic expectations: domain names, for example, don’t generally map well into languages, and Unicode presents its own set of constraints. One problem that is fundamentally unsolvable in the general case is that there are characters in several scripts that simply look too much like other characters. That inevitably leads to confusion. Some of that confusion exists even among basic Latin characters. For example, unless fonts are chosen carefully, the zero and the letter O or the number 1 and the letter I may look too much alike for users to tell them apart.

In response to growing concerns about the ability of IDNA2003 to lead to a truly international DNS, the Internet Architecture Board issued RFC 4690 in September 2006. That document attempted to summarize the general problems and issues that were being discussed. It was also intended to present a framework for future development work, including the need to migrate to newer versions of Unicode. As IDN expert and IDNA WG participant John Klensin frequently points out, there are numerous issues connected to IDNs, in addition to the one mentioned earlier, that may not be resolvable. Many of those may have more to do with the culture and traditions of language and writing systems than with technical limitations. As John warns, “We shouldn’t expect to write literature in domain names.”

Regardless of the validity of the complaints, dealing with newer versions of Unicode was generally recognized as an imperative, along with a handful of other issues that could contribute to broader applicability of IDNs. Around the time that RFC 4690 was published, a design team began working on a set of proposed revisions to the protocol. The revision was strongly influenced by discussions on an open mailing list, which generated thousands of messages containing recommendations for revisions and adjustments. At present, the IDNAbis WG is reviewing the design team’s documents and deciding whether to pursue the recommendations offered by the team or to choose another path.

Key Issues

What are some of the key issues with regard to IDNA? At the IDNA BoF session in Philadelphia, John presented three. The first has to do with important characters and scripts that were excluded from the original IDN standard largely because they did not appear in Unicode 3.2. The proposed new model generalizes from the original LDH (letter-digit-hyphen) rule that was established in the first version of the DNS. That rule allows only letters, digits, and embedded hyphens but no punctuation or symbols. So far, the WG remains committed to retaining that rule, even though speakers of a particular language may regard the entire orthography of that language as critical if effective communication is to be achieved. The DNS, as John points out, is about mnemonics, “not about writing novels,” which means that some compromises should be expected.
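The classic LDH rule for a single ASCII label can be sketched in a few lines of Python. This is only an illustration of the rule as described above (letters, digits, and hyphens that are embedded rather than leading or trailing); the function name `is_ldh_label` is invented for this example, and the 63-octet cap is the general DNS limit on label length rather than something specific to IDNA.

```python
import re

# One ASCII label under the LDH rule: letters, digits, and hyphens only,
# with hyphens permitted in the interior but not at either end.
# The pattern also enforces the DNS limit of 63 octets per label.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if `label` satisfies the letter-digit-hyphen rule."""
    return _LDH_LABEL.fullmatch(label) is not None


for candidate in ("example", "my-host1", "-bad", "bad-", "under_score"):
    print(candidate, is_ldh_label(candidate))
```

The proposed model generalizes this idea to non-ASCII scripts: letters and digits of other writing systems are candidates, while symbols and punctuation are not.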

As part of the first issue, while the IDNA2003 working group made an effort to include as many Unicode characters as possible, doing so may have resulted in a handful of problems. Those problems are real to users of a language, even if the language has only a small number of speakers. Again, compromises are necessary. As John said, it may be equally important to “avoid the trap of thinking everything can fit into the DNS.” In particular, the IETF “does not have a consensus mechanism for solving orthographic or linguistic disputes,” he said.

The second set of issues involves scripts or individual characters that may have been inadvertently mishandled in IDNA2003. One example is the final form Sigma in Greek, which is not only a distinct character but also one that has significance for those who read and write in Greek. Unfortunately, the final form Sigma is not represented in IDNA2003. There are those who argue passionately that the omission should be corrected. To that, John suggests that while the IETF does not have a way to resolve such disputes, “we should listen and try to encourage people to find a way to resolve the complaints.”
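The final Sigma case can be observed with Python's standard library, whose `encodings.idna` module implements the IDNA2003 preparation step (Nameprep). The sketch below is illustrative only: the sample word is chosen for this example, and the behavior shown is that of Python's implementation, which maps the final form to the ordinary small Sigma so that the distinction is lost before a name ever reaches the DNS.

```python
from encodings.idna import nameprep

FINAL_SIGMA = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\u03c3"        # GREEK SMALL LETTER SIGMA

# A Greek word that ends in the final form of Sigma (sample input only).
word = "\u03bb\u03cc\u03b3\u03bf" + FINAL_SIGMA

# Apply the IDNA2003 preparation step to the label.
prepared = nameprep(word)

print(FINAL_SIGMA in prepared)   # is the final form still present?
print(prepared.endswith(SIGMA))  # was it folded to the ordinary Sigma?
```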

The third set of issues involves the actual structure of IDNA2003, which is Unicode-version dependent. Unfortunately, applications can’t recognize which version of Unicode is being used, and as a result, code points are being looked up that aren’t defined. “It isn’t easy to understand what is permitted and what isn’t, which makes extensibility and forward compatibility poor,” said John.
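The version dependence is easy to see by comparing the Unicode 3.2 character database with a newer one. Python happens to ship both: `unicodedata.ucd_3_2_0` is a frozen Unicode 3.2 database kept for IDNA2003 support, while the top-level functions use whatever Unicode version the interpreter was built with. The helper names and the sample character below are chosen for this sketch.

```python
import unicodedata


def assigned_in_unicode_3_2(ch: str) -> bool:
    """True if `ch` had an assigned general category in Unicode 3.2."""
    return unicodedata.ucd_3_2_0.category(ch) != "Cn"


def assigned_in_runtime_unicode(ch: str) -> bool:
    """True if `ch` is assigned in the interpreter's own Unicode version."""
    return unicodedata.category(ch) != "Cn"


# U+20B9 INDIAN RUPEE SIGN was added to Unicode well after version 3.2.
sample = "\u20b9"
print("Unicode 3.2:", assigned_in_unicode_3_2(sample))
print("Unicode", unicodedata.unidata_version + ":", assigned_in_runtime_unicode(sample))
```

An application that looks up a name containing such a code point under IDNA2003 rules is working with a character that the protocol's tables simply do not define.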

Concern has also been expressed that IDNA2003 is confusing with regard to terminology. For example, the standard is applied to labels, not to fully qualified domain names (FQDNs). Questions also remain concerning right-to-left (bidi) scripts and label separators. “There is a difference between mapping label separators and other parts of the FQDN,” said John. “Label separator mappings, if any, may need to be understood by even non-IDNA applications.” Compatibility between the two versions is a matter of perspective, said John, because what goes on the wire doesn’t change that much between IDNA2003 and the new IDNA2008, but what is permitted to go into the IDNA system does change. New terms, such as U-labels, A-labels, and LDH labels, are introduced to reduce confusion in other areas.

BoF chair Harald Alvestrand agreed that separators are key issues for IDNA, but he also said that in a more general sense, the problems are related to protocol issues. The most important differences, said John, lie in trying to define rules and mechanisms to which one can conform rather than an algorithm one can implement. “In some ways, the new approach is simpler, because as compared to IDNA2003, the mappings are gone and conversion to and from the punycoded version is symmetric and information preserving,” he said.
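The symmetric conversion John describes can be illustrated with Python's built-in `punycode` codec. This is a minimal sketch, not the protocol itself: the helper names `to_a_label` and `to_u_label` are invented here, and the functions assume the input is already a valid label containing at least one non-ASCII character, so none of the validity checks the protocol requires are performed.

```python
ACE_PREFIX = "xn--"  # prefix that marks an ASCII-encoded (A-label) form


def to_a_label(u_label: str) -> str:
    """Encode a Unicode label as its ASCII form. No mapping is applied."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Decode an ASCII-encoded label back to its Unicode form."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("not an ASCII-encoded label: " + repr(a_label))
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


u_label = "b\u00fccher"
a_label = to_a_label(u_label)
print(a_label)
print(to_u_label(a_label) == u_label)
```

Because nothing is case-folded or otherwise mapped along the way, decoding returns exactly the string that was encoded. Under IDNA2003, by contrast, a label such as "Bücher" is lowercased during preparation, so the round trip does not return the original input.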

Proposed Changes to IDNA

In April 2008, the Network Working Group released an Internet-Draft with proposed high-level changes from IDNA2003 to IDNA200x based on the IDNA design team documents. Those changes include:

  1. Update the base character set from Unicode 3.2 to a Unicode version-agnostic definition
  2. Separate the definitions for the “registration” and “lookup” activities
  3. Disallow symbol and punctuation characters except where special exceptions are necessary
  4. Remove the mapping and normalization steps from the protocol and have them instead done by the applications themselves, possibly in a local fashion, before invoking the protocol
  5. Change the way that the protocol specifies which characters are allowed in labels from “humans decide what the table of code points contains” to “decisions about code points are based on Unicode properties plus a small exclusion list created by humans”
  6. Introduce the new concept of characters that can be used only in specific contexts
  7. Allow typical words and names in languages such as Dhivehi and Yiddish to be expressed
  8. Make bidirectional domain names (delimited strings of labels, not just labels standing on their own) display in a nonsurprising fashion
  9. Make bidirectional domain names in a paragraph display in a nonsurprising fashion
  10. Remove the dot separator from the mandatory part of the protocol
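
Items 3 and 5 in the list above can be pictured with a deliberately simplified Python sketch. Everything in it is an assumption made for illustration: the set of general categories, the tiny exception table, and the function name are hypothetical stand-ins and are not the working group's actual rules or tables. The point is only the shape of the approach, in which a code point's status is derived from Unicode properties and a short human-maintained list overrides the derived answer.

```python
import unicodedata

# Hypothetical simplification: lowercase and caseless letters, combining
# marks, and decimal digits are candidates; everything else is not.
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

# Hypothetical human-maintained overrides, consulted before the properties.
# The hyphen is punctuation by category, but the LDH rule depends on it.
EXCEPTIONS = {"-": True}


def is_candidate_code_point(ch: str) -> bool:
    """Derive a code point's status from its general category."""
    if ch in EXCEPTIONS:
        return EXCEPTIONS[ch]
    return unicodedata.category(ch) in ALLOWED_CATEGORIES


for ch in ("a", "3", "-", "!", "\u05e9", "\u2665"):
    print("U+%04X" % ord(ch), is_candidate_code_point(ch))
```

Because the answer is computed from properties rather than read from a fixed table, a character added in a later Unicode version receives a status without the protocol itself having to be revised.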

Changes Proposed by Individuals in the Working Group

In addition to the changes above, individuals in the working group have proposed high-level changes. These include:

  • Add conversion between traditional and simplified Chinese characters
  • Add guidelines or requirements for registration of character variants, along the lines of RFC 3743

This work is being discussed on the mailing list at [email protected]. Also see draft-hoffman-idna200x-topics-03.

The issue of mappings appears to be critical in a number of respects. As explained in draft-ietf-idnabis-rationale-00.txt, which was posted in May 2008, issues in domain name identification and processing arise because IDNA2003 specified that several characters be treated as equivalent to the ASCII period (dot, full stop) character used as a label separator. As the draft states, “If a domain name appears in an arbitrary context (such as running text), one may be faced with the requirement to know that a string is a domain name in order to adjust for the different forms of dots but also to have traditional dots to recognize that a string is a domain name – an obvious contradiction.”

The IDNA2008 model removes all of these mappings and interpretations, including the equivalence of different forms of dots, from the protocol, leaving such mappings to local processing. “This should not be taken to imply that local processing is optional or can be avoided entirely. Instead, unless the programme context is such that it is known that any IDNs that appear will be either U-labels or A-labels (representation in Unicode or encoded in ASCII using Punycode encoding), some local processing of apparent domain name strings will be required both to maintain compatibility with IDNA2003 and to prevent user astonishment.”
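As an example of the kind of local processing the draft has in mind, an application that wants to stay compatible with IDNA2003 input could fold the alternative dots into the ASCII full stop before splitting a name into labels. The sketch below is one possible approach, with function names invented for this example; the three extra characters are the ones RFC 3490 lists as label separators.

```python
# Characters that IDNA2003 (RFC 3490) treats as label separators in
# addition to U+002E FULL STOP.
IDNA2003_DOTS = {
    "\u3002": ".",  # IDEOGRAPHIC FULL STOP
    "\uff0e": ".",  # FULLWIDTH FULL STOP
    "\uff61": ".",  # HALFWIDTH IDEOGRAPHIC FULL STOP
}
_DOT_TABLE = str.maketrans(IDNA2003_DOTS)


def normalize_separators(name: str) -> str:
    """Replace the alternative dot forms with the ASCII full stop."""
    return name.translate(_DOT_TABLE)


def split_labels(name: str) -> list:
    """Split an apparent domain name into labels after local dot mapping."""
    return normalize_separators(name).split(".")


print(split_labels("example\u3002com"))
```

This only works once the application already believes the string is a domain name, which is exactly the contradiction the rationale draft points out for names that appear in running text.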

Bidi

One of the great challenges of IDNs is the ability to represent domain names in languages that are written from right to left (RTL), such as Arabic and Hebrew, and to have those names behave consistently and make sense in context. In IDNA2003, the rule is that a label can contain RTL characters only if its first and last characters are both RTL. The problem, according to Harald, arises when confronted with nonspacing (or combining) marks. Some languages, including Yiddish and Dhivehi (the official language of the Maldives), have words that end with a combining mark that has no direction. Under IDNA2003 it is not possible to use such words as labels, which makes those languages effectively unusable. In response, Harald and Cary Karp proposed a new set of rules (in draft-ietf-idnabis-bidi-00.txt), which permits nonspacing marks at the end of a label and makes other changes. It was discovered, however, that some ASCII labels that appear next to some RTL labels could break.
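The effect of the IDNA2003 rule on such words can be sketched with Python's `unicodedata` module. The first function below approximates the IDNA2003 bidi requirement (no left-to-right characters mixed in, and an RTL character at both ends). The second is only a loose illustration of the idea of permitting trailing nonspacing marks; it is not the rule set in the bidi draft, which makes other changes as well. The function names and sample strings are chosen for this example.

```python
import unicodedata


def _is_rtl(ch: str) -> bool:
    """True for characters with bidi class R or AL."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def idna2003_bidi_ok(label: str) -> bool:
    """Approximate the IDNA2003 bidi requirement for a single label."""
    if not any(_is_rtl(ch) for ch in label):
        return True  # no RTL characters, so the rule does not apply
    if any(unicodedata.bidirectional(ch) == "L" for ch in label):
        return False  # RTL and strongly left-to-right characters mixed
    return _is_rtl(label[0]) and _is_rtl(label[-1])


def relaxed_bidi_ok(label: str) -> bool:
    """Illustrative relaxation: ignore nonspacing marks at the end."""
    stripped = label
    while stripped and unicodedata.bidirectional(stripped[-1]) == "NSM":
        stripped = stripped[:-1]
    return bool(stripped) and idna2003_bidi_ok(stripped)


hebrew = "\u05e9\u05dc\u05d5\u05dd"               # ends in a letter
dhivehi = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"  # ends in a vowel sign
print(idna2003_bidi_ok(hebrew), idna2003_bidi_ok(dhivehi), relaxed_bidi_ok(dhivehi))
```

The Thaana sample ends in a vowel sign, which is a nonspacing mark, so the stricter check rejects it even though every letter in it is right to left.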

Recommendations

Drawing on RFC 4690 and the WG drafts, John suggested several goals for the next version of IDNA, which was referred to as IDNA200x prior to IETF 71 in Philadelphia and as IDNA2008 since then. Those goals include a version of IDNA that is:

  • Unicode-version agnostic
  • Easier to understand
  • More predictable with regard to what happens when languages and scripts are applied
  • More adaptable to local conditions (realistic interoperability)

Those who are following IDN, and especially those who are working with it, are encouraged to understand that domain name internationalization is not just about the IDNA protocol and character rules. There are, in fact, many areas of responsibility that must be covered for the system to work well, including the standard protocol, the cooperation of registries and zone administrators at all levels of the DNS (as well as the need for registry restrictions), the need to educate registrants in order to minimize confusion, and the need to engage look-up implementers (and the developers of related applications). “The common sense that users need to possess to make IDNA a functional reality may require some education in order to develop,” said John. “You can’t solve confusion, but you can provide better tools.”

Summary

The work of the current IDN WG, called the IDNAbis WG, is to ensure the practical stability of the algorithms that determine the validity of IDNs. Its work is currently organized as four Standards Track documents. The charter of the WG is meant to untie IDNA from specific versions of Unicode using algorithms that define validity based on Unicode properties. Other goals include separating requirements for valid IDNs at time of registration versus at resolution time, revising bidirectional algorithms to produce a deterministic answer as to whether or not a label is allowed, determining whether bidirectional algorithms should allow additional mnemonic labels, and permitting effective use of some scripts that were inadvertently excluded by the original protocols. The IDNAbis WG remains committed to preserving and using the current Domain Name System, and no substantially new protocols or mechanisms are expected.