Definition of "variant" and "whole string variants"
- To: "idn-vip-integrated-issues@xxxxxxxxx" <idn-vip-integrated-issues@xxxxxxxxx>
- Subject: Definition of "variant" and "whole string variants"
- From: "Dillon, Chris" <c.dillon@xxxxxxxxx>
- Date: Tue, 17 Jan 2012 09:40:14 +0000
Dear colleagues,
I would like to highlight two areas in the report:
1. The definition of "variant"
The report uses a working definition of variant: "exhibiting some relationship
such that one of the variants may, under some circumstances, somehow be
conflated with another". In other words, at this stage, we are not talking
about variants as being one or more of the scenarios that the report lists in
its classification of variants as discovered. One reason for this may be that
the six case studies, although "an important milestone in the work in this
area", do not cover all of the world's scripts. For example, Hebrew, Thai and
many Indian scripts, some major, are missing. Perhaps linguists with experience
in scripts not covered could be asked to read the integrated report and look
for any script-specific issues. Only then will all variant scenarios become
clear and it will be possible to decide which are implemented in the root zone.
2. Whole string variants
Whether or not whole string variants ("linkage of elements of different
dialects (or registers) of a single language") are to be recognized is a key
issue. The example that came up in the Case Studies was the Dimotiki and
Katharevousa dialects of Greek, although the Case Study did not recommend that
dialectal variants be placed in the root. One could imagine similar cases in
languages such as:
1. Chinese. Some Chinese words are used more often in mainland China, others
outside it. See
http://en.wikipedia.org/wiki/Taiwanese_Mandarin#Different_preferred_usage for a
list.
2. German. Pre-Neue Rechtschreibung German (There was a major spelling reform
in German in 1996, but many do not use the new orthography.)
3. Norwegian. Bokmål and Nynorsk (the current two Norwegian orthographies)
4. Bahasa Indonesia. Pre-Perfected Spelling System Indonesian (There was a
major spelling reform in 1972, but old spellings are still common.)
There could also, in theory, be a general case for Latin-script IDNs, e.g.
.düsseldorf, to have ASCII variants, e.g. .duesseldorf.
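To make the Latin-script case concrete, here is a minimal sketch in Python. The
mapping table and the function name are my own assumptions for illustration;
they are not taken from the report or from any registry policy. It derives a
candidate ASCII variant by applying the conventional German fallbacks, and then
uses Python's built-in "idna" codec to show that the IDN label has its own,
separate ASCII-compatible form in the DNS.

```python
# Hypothetical fallback table (an assumption for illustration only):
# the conventional ASCII respellings of German umlauts and sharp s.
GERMAN_ASCII_FALLBACKS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def ascii_variant(label: str) -> str:
    """Return a candidate ASCII variant of a label, lower-cased."""
    return "".join(GERMAN_ASCII_FALLBACKS.get(ch, ch) for ch in label.lower())


if __name__ == "__main__":
    idn_label = "düsseldorf"
    # Candidate ASCII variant produced by the fallback table.
    print(ascii_variant(idn_label))
    # ASCII-compatible encoding of the IDN label itself (an "xn--" label),
    # which is a different DNS label from the candidate variant above.
    print(idn_label.encode("idna"))
```

Only this direction is mechanical. Going from ASCII back to the IDN form is
not, because "ue" in a word such as "aktuell" is not a respelling of "ü".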
"As noted above, linguistic variants are not amenable to algorithmic treatment,
because the linguistic principles that cause them do not usually exhibit
complete regularity. To produce a general solution to any general case of
linguistic variants (even in a single language), it would be necessary to
generate (in advance) a dictionary or set of dictionaries that would govern the
handling of submitted strings; this would require the engagement of relevant
expertise for each language with linguistic variants."
With the possible exception of the Chinese example in 1. above, where the
number of words differing in and outside mainland China seems to be small, the
creation and maintenance of such dictionaries is likely to be a Herculean task.
(It is interesting that there are, however, Chinese variant character-level
tables at www.iana.org/domains/idn-tables which do map Traditional Characters
to Simplified Characters.)
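As a rough illustration of what such a dictionary-driven approach would
involve, here is a minimal sketch in Python. The variant sets, names and
function are hypothetical and chosen only for illustration; a real table would
need the language expertise the quoted passage describes, which is exactly
where the maintenance burden lies.

```python
# Hypothetical, hand-maintained variant sets (assumptions for illustration).
VARIANT_SETS = [
    {"colour", "color"},  # British and American English spellings
    {"dass", "daß"},      # German, post- and pre-1996 spellings
]


def build_index(variant_sets):
    """Map every listed word to the full set of variants it belongs to."""
    index = {}
    for group in variant_sets:
        frozen = frozenset(group)
        for word in group:
            index[word] = frozen
    return index


INDEX = build_index(VARIANT_SETS)


def whole_string_variants(label):
    """Return the other members of the label's variant set, sorted.

    A label that is not listed in any set has no variants.
    """
    return sorted(INDEX.get(label, frozenset()) - {label})


if __name__ == "__main__":
    print(whole_string_variants("colour"))
    print(whole_string_variants("example"))
```

Nothing here is derived algorithmically: every pair has to be entered and kept
up to date by hand, for every language concerned.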
In the words of the report:
"Recognizing the need for usability of multiple scripts in the DNS, and the
desirability of reasonable approximations of natural language usage, it is also
assumed that users are not dependent on the ability to use the full natural
language without restrictions, and will be able to accommodate certain
limitations to the full natural language where necessary."
and:
"The goal should be to maximize efforts toward the prevention of future
problems, and to minimize active entries in the DNS to those where an explicit
need has been established, the user experience implications have been fully
studied, and no negative impacts to security or stability have been identified."
Surely if www.colour.com and www.color.com have always been considered two
different websites, then none of the five possible justifications above
(Chinese, German, Greek, Indonesian and Norwegian) for whole string variants
should be allowed.
Regards,
Chris Dillon.
==
Research Associate in Linguistic Computing
Department of Information Studies
University College London, Foster Court
Gower Street, London WC1E 6BT
Tel +44 20 7679 1599 (inside UCL: 31599)
www.ucl.ac.uk/dis/people/chrisdillon