Definition of "variant" and "whole string variants"

Dear colleagues,

I would like to highlight two areas in the report:

1. The definition of "variant"
The report uses a working definition of variant: "exhibiting some relationship 
such that one of the variants may, under some circumstances, somehow be 
conflated with another". In other words, at this stage a variant is not yet 
defined as one or more of the scenarios that the report lists in its 
classification of variants as discovered. One reason for this may be that the 
six case studies, although "an important milestone in the work in this area", 
do not cover all of the world's scripts. For example, Hebrew, Thai and many 
Indian scripts, some of them major, are missing. Perhaps linguists with 
experience in scripts not covered could be asked to read the integrated report 
and look for any script-specific issues. Only then will all variant scenarios 
become clear, and only then will it be possible to decide which are to be 
implemented in the root zone.

2. Whole string variants
Whether or not whole string variants ("linkage of elements of different 
dialects (or registers) of a single language") are to be recognized is a key 
issue. The example that came up in the Case Studies was the Dimotiki and 
Katharevousa dialects of Greek, although the Case Study did not recommend that 
dialectal variants be placed in the root. One could imagine similar cases in 
languages such as:
1. Chinese. Some Chinese words are used more often in mainland China, others 
outside it. See 
http://en.wikipedia.org/wiki/Taiwanese_Mandarin#Different_preferred_usage for a 
list.
2. German. Pre-Neue Rechtschreibung German (There was a major spelling reform 
in German in 1996, but many do not use the new orthography.)
3. Norwegian. Bokmål and Nynorsk (the current two Norwegian orthographies)
4. Bahasa Indonesia. Pre-Perfected Spelling System Indonesian (There was a 
major spelling reform in 1972, but old spellings are still common.)

In theory, there could also be a general case for Latin-script IDNs (e.g. 
.düßeldorf) to have ASCII variants (e.g. .duesseldorf).
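To make the ASCII-variant idea concrete, here is a minimal sketch of how such a 
variant could be derived by character substitution. The table and the function 
name are assumptions made for illustration; they are not taken from the report 
or from any registry policy.

```python
# Hypothetical fallback table for German; illustrative only.
ASCII_FALLBACKS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def ascii_variant(label: str) -> str:
    """Return an ASCII spelling of a label using the fallback table."""
    # Characters without an entry in the table are passed through unchanged.
    return "".join(ASCII_FALLBACKS.get(ch, ch) for ch in label.lower())


print(ascii_variant("düßeldorf"))  # duesseldorf
```

A table like this is script- and language-specific: the same substitutions 
would not be appropriate for other languages written in Latin script.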

On linguistic variants of this kind, the report says:
"As noted above, linguistic variants are not amenable to algorithmic treatment, 
because the linguistic principles that cause them do not usually exhibit 
complete regularity. To produce a general solution to any general case of 
linguistic variants (even in a single language), it would be necessary to 
generate (in advance) a dictionary or set of dictionaries that would govern the 
handling of submitted strings; this would require the engagement of relevant 
expertise for each language with linguistic variants."
With the possible exception of the Chinese example in 1. above, where the 
number of words differing in and outside mainland China seems to be small, the 
creation and maintenance of such dictionaries is likely to be a Herculean task. 
(It is interesting that there are, however, Chinese variant character-level 
tables at www.iana.org/domains/idn-tables which do map Traditional Characters 
to Simplified Characters.)
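The contrast between a character-level table and a whole-string dictionary can 
be sketched as follows. All entries and names are assumptions chosen for 
illustration; real tables would be far larger and would need to be maintained 
by experts in each language.

```python
from typing import Optional

# Illustrative entries only.
CHAR_TABLE = {"國": "国", "學": "学"}  # Traditional -> Simplified characters
WORD_TABLE = {"colour": "color"}      # whole-string (dictionary) variants


def char_level_variant(label: str) -> str:
    """Map each character independently; works for any label."""
    return "".join(CHAR_TABLE.get(ch, ch) for ch in label)


def whole_string_variant(label: str) -> Optional[str]:
    """Look up the entire label; only listed words have a variant."""
    return WORD_TABLE.get(label)


print(char_level_variant("中國"))      # 中国
print(whole_string_variant("colour"))  # color
print(whole_string_variant("four"))    # None
```

The character table applies to labels nobody has listed in advance, whereas 
the dictionary covers only the words entered into it. A blanket rule such as 
"ou" to "o" would wrongly turn "four" into "for", which is why the dictionary 
approach grows with the vocabulary of each language.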
In the words of the report:
"Recognizing the need for usability of multiple scripts in the DNS, and the 
desirability of reasonable approximations of natural language usage, it is also 
assumed that users are not dependent on the ability to use the full natural 
language without restrictions, and will be able to accommodate certain 
limitations to the full natural language where necessary."
"The goal should be to maximize efforts toward the prevention of future 
problems, and to minimize active entries in the DNS to those where an explicit 
need has been established, the user experience implications have been fully 
studied, and no negative impacts to security or stability have been identified."
Surely, if www.colour.com and www.color.com have always been considered two 
different websites, none of the five possible justifications for whole string 
variants given above (Chinese, German, Greek, Indonesian and Norwegian) should 
be allowed.


Chris Dillon.
Research Associate in Linguistic Computing
Department of Information Studies
University College London, Foster Court
Gower Street, London WC1E 6BT
Tel +44 20 7679 1599 (inside UCL: 31599) 

