Definition of "variant" and "whole string variants"
Dear colleagues,

I would like to highlight two areas in the report:

1. The definition of "variant"

The report uses a working definition of variant: "exhibiting some relationship such that one of the variants may, under some circumstances, somehow be conflated with another". In other words, at this stage a variant is not yet defined as one or more of the scenarios that the report lists in its classification of variants as discovered.

One reason for this may be that the six case studies, although "an important milestone in the work in this area", do not cover all of the world's scripts. For example, Hebrew, Thai and many Indian scripts, some of them major, are missing. Perhaps linguists with experience in the scripts not covered could be asked to read the integrated report and look for any script-specific issues. Only then will all variant scenarios become clear, and only then will it be possible to decide which should be implemented in the root zone.

2. Whole string variants

Whether or not whole string variants ("linkage of elements of different dialects (or registers) of a single language") are to be recognized is a key issue. The example that came up in the Case Studies was the Dimotiki and Katharevousa dialects of Greek, although the Case Study did not recommend that dialectal variants be placed in the root. One could imagine similar cases in languages such as:

   1. Chinese. Some Chinese words are used more often in mainland China, others outside it. See http://en.wikipedia.org/wiki/Taiwanese_Mandarin#Different_preferred_usage for a list.
   2. German. Pre-Neue Rechtschreibung German. (There was a major spelling reform in 1996, but many do not use the new orthography.)
   3. Norwegian. Bokmål and Nynorsk (the two current Norwegian orthographies).
   4. Bahasa Indonesia. Pre-Perfected Spelling System Indonesian. (There was a major spelling reform in 1972, but old spellings are still common.)

There could also theoretically be a general case for Latin script IDNs, e.g. .düßeldorf, to have ASCII variants, e.g. .duesseldorf.

"As noted above, linguistic variants are not amenable to algorithmic treatment, because the linguistic principles that cause them do not usually exhibit complete regularity. To produce a general solution to any general case of linguistic variants (even in a single language), it would be necessary to generate (in advance) a dictionary or set of dictionaries that would govern the handling of submitted strings; this would require the engagement of relevant expertise for each language with linguistic variants."

With the possible exception of the Chinese example in 1. above, where the number of words that differ inside and outside mainland China seems to be small, the creation and maintenance of such dictionaries is likely to be a Herculean task. (It is interesting, however, that there are Chinese variant character-level tables at www.iana.org/domains/idn-tables which do map Traditional Characters to Simplified Characters.)

In the words of the report:

"Recognizing the need for usability of multiple scripts in the DNS, and the desirability of reasonable approximations of natural language usage, it is also assumed that users are not dependent on the ability to use the full natural language without restrictions, and will be able to accommodate certain limitations to the full natural language where necessary."

and:

"The goal should be to maximize efforts toward the prevention of future problems, and to minimize active entries in the DNS to those where an explicit need has been established, the user experience implications have been fully studied, and no negative impacts to security or stability have been identified."

Surely, if examples such as www.colour.com and www.color.com have always been considered two different websites, none of the five possible justifications for whole string variants above (Chinese, German, Greek, Indonesian and Norwegian) should be accepted.

Regards,

Chris Dillon.
==
Research Associate in Linguistic Computing
Department of Information Studies
University College London, Foster Court
Gower Street, London WC1E 6BT
Tel +44 20 7679 1599 (inside UCL: 31599)
www.ucl.ac.uk/dis/people/chrisdillon