Re: [gnso-idn-wg] Re: Banning CCHH anywhere in a label
Avri Doria wrote:
On 9 mar 2007, at 13.37, Tan Tin Wee wrote:
citibankäå
Yes, this is a mix of Latin (ASCII subset) and Han scripts. However, as many on the list have said, for many scripts, ASCII can be mixed in without causing much confusion. It is allowed at the second level in most (careful) IDN implementations, as demonstrated by the language tables published in the IANA IDN Language Table Registry. This has also been explained in the ICANN IDN Guidelines:
3. (a) In implementing the IDN standards, top-level domain registries will *associate each label* in a registered internationalized domain name, as it appears in their registry *with a single script*. This restriction is intended to limit the set of permitted characters within a label. If greater specificity is needed, the association may be made by combining descriptors for both language and script. Alternatively, a label may be associated with a set of languages, or with more than one designator under the conditions described below.
(b) A registry will publish the aggregate set of code points that it makes available in clearly identified IDN-specific character tables, and will define equivalent character variants if registration policies are established on their basis. Any such table will be designated in a manner that indicates the script(s) and/or language(s) it is intended to support.
(c) All code points in a single label will be taken from the same script as determined by the Unicode Standard Annex #24: Script Names at http://www.unicode.org/reports/tr24. *Exception to this is permissible for languages with established orthographies and conventions that require the commingled use of multiple scripts*. In such cases, visually confusable characters from different scripts will not be allowed to co-exist in a single set of permissible codepoints unless a corresponding policy and character table is clearly defined.
(d) All registry policies based on these considerations will be documented and publicly available, including a character table for each permissible set of code points, before the registration of any IDN associated with such an aggregate may be accepted.
A well-known example of a permissible language is Japanese, where one may combine Han, Hiragana, Katakana, and the ASCII subset of Latin.
A well-known example of bad practice would be to allow Cyrillic and Latin to be combined within a single label.
I think there are two issues whenever we discuss the topic of "single-script adherence" and I asked for clarification on the last teleconference. However, I suspect we still have not grounded the discussions on one or the other. To be clear, there are two possible ways one can interpret "single-script adherence across all labels":
1. Every label in a domain name string is composed of characters from a single script. However, one label may belong to a different script than another. E.g. ããã.españa - there are two labels, with one containing only Katakana and the other containing only Latin.
2. All labels in a domain name string are composed of characters from the same single script. The example above, ããã.españa, would violate this policy. OTOH, ããã.ããããã would be ok since both labels are Katakana.
We need to make it clear in our recommendations whether we mean 1 or 2 above.
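The difference between the two readings can be sketched in a few lines. This is an illustration only, with assumptions: scripts are guessed from the first word of the Unicode character name (the standard library has no UAX #24 Script lookup), digits and hyphens are ignored, and the Katakana labels "テスト" and "ドメイン" are stand-ins I picked for the example, not the labels used in the examples above.

```python
import unicodedata


def label_scripts(label):
    """Approximate set of scripts in one label (name-prefix heuristic)."""
    scripts = set()
    for ch in label:
        first = unicodedata.name(ch).split()[0]
        if first in ("DIGIT", "HYPHEN-MINUS"):
            continue  # assumption: treated as script-neutral
        scripts.add("HAN" if first == "CJK" else first)
    return scripts


def satisfies_interpretation_1(domain):
    """Each label is single-script; labels may differ from one another."""
    return all(len(label_scripts(lab)) <= 1 for lab in domain.split("."))


def satisfies_interpretation_2(domain):
    """Each label is single-script AND all labels share the same script."""
    labels = domain.split(".")
    all_scripts = set().union(*(label_scripts(lab) for lab in labels))
    return satisfies_interpretation_1(domain) and len(all_scripts) <= 1


mixed = "テスト.espa\u00f1a"   # Katakana label under a Latin TLD
same = "テスト.ドメイン"        # Katakana label under a Katakana TLD

print(satisfies_interpretation_1(mixed), satisfies_interpretation_2(mixed))
print(satisfies_interpretation_1(same), satisfies_interpretation_2(same))
```

Under these assumptions the mixed name passes reading 1 but fails reading 2, while the all-Katakana name passes both.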
#1 above has already been somewhat covered by the ICANN IDN Guidelines. I don't think anyone would argue against this. Whether it could/should be enforced as a contractual requirement for new TLDs is up for discussion.
#2 is what I believe we have been discussing on the call and the list. I am of the view that restrictions should be applied using "SHOULD" language, so as to discourage abuse. I'm sitting on the fence as to whether we should enforce it.