ICANN Email List Archives

[ssac-gnso-irdwg]



Re: [ssac-gnso-irdwg] Mixed scripts in contact information (subject line changed)

  • To: Dave Piscitello <dave.piscitello@xxxxxxxxx>, "Robert C. Hutchinson" <rchutch@xxxxxxxxx>
  • Subject: Re: [ssac-gnso-irdwg] Mixed scripts in contact information (subject line changed)
  • From: Steve Sheng <steve.sheng@xxxxxxxxx>
  • Date: Fri, 11 Feb 2011 15:04:53 -0800

Hello all,

I just read through Unicode Technical Report #36 (
http://www.unicode.org/reports/tr36/ ) on Unicode security considerations.
It recommends that, for IDNs, the restriction level set by the registry be
between 2 and 4 (level 1 means no IDN). This categorization may be useful when
we consider the mixing of scripts in contact information.

Here are the levels of script mixing that the report defines:


 1.  ASCII-Only
     *   All characters in each identifier must be ASCII
 2.  Highly Restrictive
     *   All characters in each identifier must be from a single script, or
         from one of the combinations:
         -   ASCII + Han + Hiragana + Katakana;
         -   ASCII + Han + Bopomofo; or
         -   ASCII + Han + Hangul
     *   No characters in the identifier can be outside of the Identifier
         Profile
     (Note that this level will satisfy the vast majority of Latin-script
     users.)
 3.  Moderately Restrictive
     *   Allow Latin with other scripts except Cyrillic, Greek, Cherokee
     *   Otherwise, the same as Highly Restrictive
 4.  Minimally Restrictive
     *   Allow arbitrary mixtures of scripts, such as Ωmega, Teχ, HλLF-LIFE,
         Toys-Я-Us.
     *   Otherwise, the same as Moderately Restrictive
 5.  Unrestricted
     *   Any valid identifiers, including characters outside of the Identifier
         Profile, such as I♥NY.org
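
To make the levels concrete, here is a rough Python sketch that assigns one
of the five levels to a string. It is only an approximation under stated
assumptions: the Python standard library does not expose the Unicode Script
property, so the hypothetical helper script_of() guesses the script from the
first word of each character's Unicode name, and any symbol is treated as
being outside the Identifier Profile. It is not the algorithm from TR36.

```python
import unicodedata

# Scripts that level 3 does not allow to be mixed with Latin.
RESTRICTED_WITH_LATIN = {"Cyrillic", "Greek", "Cherokee"}

# The three ASCII + CJK combinations permitted at level 2
# ("Latin" here stands for the ASCII letters; see latin_is_ascii below).
CJK_COMBOS = [
    {"Latin", "Han", "Hiragana", "Katakana"},
    {"Latin", "Han", "Bopomofo"},
    {"Latin", "Han", "Hangul"},
]


def script_of(ch):
    """Approximate the script of one character (assumption: name-based guess)."""
    category = unicodedata.category(ch)
    if category[0] in "MNPZ":       # marks, numbers, punctuation, separators
        return "Common"
    if category[0] != "L":          # symbols etc.: treated as outside the profile
        return "Other"
    first_word = unicodedata.name(ch, "").split(" ")[0]
    return "Han" if first_word == "CJK" else first_word.capitalize()


def restriction_level(identifier):
    """Return the lowest level (1-5) that the identifier satisfies."""
    if identifier.isascii():
        return 1
    labelled = [(ch, script_of(ch)) for ch in identifier]
    scripts = {script for _, script in labelled} - {"Common"}
    if "Other" in scripts:
        return 5
    latin_is_ascii = all(ch.isascii() for ch, s in labelled if s == "Latin")
    in_cjk_combo = latin_is_ascii and any(scripts <= combo for combo in CJK_COMBOS)
    if len(scripts) <= 1 or in_cjk_combo:
        return 2
    if "Latin" in scripts and not scripts & RESTRICTED_WITH_LATIN:
        return 3
    return 4


if __name__ == "__main__":
    for sample in ["munoz.com", "mu\u00f1oz", "Te\u03c7", "I\u2665NY.org"]:
        print(sample, restriction_level(sample))
```

A production check would use real script data (for example from the Unicode
Character Database) rather than character names.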


Warm regards,
Steve


On 1/26/11 7:14 AM, "Dave Piscitello" <dave.piscitello@xxxxxxxxx> wrote:



This is what I suspected. Thanks for your research!

If "what I have on my keyboard" is the limiting factor, then do we have
something similar to variants w/r/t contact names?

Let me explain.

Where we allow people to register munoz.com with a tilde over the "n", we
ought to allow the registrant to use the tilde in the contact information.
That's an example of a rule of the kind "if you can use the character in the
IDN, you can use it in the contact information".

Suppose Mr. Munoz's given name is Jorgen, with a diacritical mark over the
"o". Do we allow an IDN jorgenmunoz.com with both diacritical mark and
tilde? I think these characters come from different IDN tables. If the
answer is no, then the implication is that there are three different strings
that Mr. Munoz could register in COM:

jorgenmunoz.com, where "n" and "o" are present (no marks)
jorgenmunoz.com, where "n with tilde" and "o" are present
jorgenmunoz.com, where "n" and "o with diacritical mark" are present

What do we allow as contact information in each of these cases? Suppose the
same Mr. Munoz is the registrant for all three? Do we really want 3
different registrant contact names?
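
One way to see that the three spellings belong together is to fold away the
marks and compare what is left. The Python sketch below is illustrative
only: the archive has lost the original diacritics, so the "o with
diacritical mark" is assumed here to be "ö", and strip_marks() is a
hypothetical helper, not a registry rule.

```python
import unicodedata


def strip_marks(text):
    """Decompose to NFD and drop combining marks, e.g. 'ñ' becomes 'n'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The three candidate registrations (the second and third are assumed spellings).
variants = [
    "jorgenmunoz",          # no marks
    "jorgenmu\u00f1oz",     # n with tilde
    "j\u00f6rgenmunoz",     # o with diaeresis (assumption)
]

# All three fold to the same mark-free skeleton.
skeletons = {strip_marks(v) for v in variants}

if __name__ == "__main__":
    print(skeletons)
```

Folding like this only helps for letters that decompose into a base letter
plus a combining mark; a letter such as "ø" has no such decomposition and
would need an explicit mapping.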

Again, I'm not an expert here and more informed people than I perhaps know
how this is best managed.


On 1/26/11 3:21 AM, "Robert C. Hutchinson" <rchutch@xxxxxxxxx> wrote:

> Hi Dave et al,
> Thanks for the excellent comments/observations from everyone.
> A business friend of mine is the head of elections and Clerk Recorder [birth
> certificates] for Santa Cruz County, so I called her to glean the wisdom of
> how this is handled in US/California law.
> According to her office manager, basically there are no formal rules! The
> systems at the DMV [used for motor voter registrations and voter verification]
> allow only A-Z in names, and A-Z plus numbers and some special characters
> in addresses.  This is entirely by convention.  The rule is you can have
> anything the system will handle - and is on my keyboard - but don't ask for
> anything extra.  Hey, it's the government... If you ask for a Spanish surname
> [on a birth certificate] spelled with a tilde over the N, you get an N.
> Also, what goes in Santa Cruz County may not be true in Santa Clara County, so
> it is entirely possible that other counties operate by different rules.
> Nothing prevents us from adopting less restrictive character rules - but
> because, I believe, early IBM punch-card systems only did uppercase letters
> and numbers, this convention persists.
> I tried calling the California Secretary of State, but the person I reached
> was no help at all...
>
>
> On Tue, Jan 25, 2011 at 4:20 AM, Dave Piscitello
> <dave.piscitello@xxxxxxxxx> wrote:
> Hi all,
>
> Again, apologies for missing yesterday's call.
>
> I have a question related to this discussion. In composing language tables
> with "legitimate" characters for a language, I began to wonder whether there
> are real world constraints on mixed scripts in the composition of names.
>
> For example, can a US citizen have a birth certificate where the given or
> surname contains letters other than A-Z? I believe a US citizen can have a
> name containing characters from extended ASCII sets (umlauts, tildes, etc.).
> People often name their children unconventionally: could someone compose a
> name for a child that contained both an umlaut and a tilde, and would this
> be accepted as a legal name in the US (or another country)? Would a "yes"
> answer to these questions influence this discussion?
>
> Can a Chinese citizen have a surname that is composed of characters from one
> accepted Chinese script and a given name composed using characters from a
> second?
>
> Apologies if this is off topic. Feel free to send me away for more coffee.
>
> On 1/25/11 4:12 AM, "Robert C. Hutchinson"
> <rchutch@xxxxxxxxx> wrote:
>
>> Hello WhoIs IRD WG,
>> Here are my suggested questions for discussion between the Whois IRD WG and
>> ICANN IDN Staff / Tina Dam.
>> Reply with your clarifications and suggestions.
>> Thanks,
>> Bob Hutchinson
>>
>>
>> The WhoIs IRD WG is requesting expertise/assistance from the IDN team.
>> The WhoIs IRD WG is considering recommending that WhoIs Internationalized
>> Domain name registrant data [name and address] for owner and contact be
>> tagged with language.  Furthermore, it would be advantageous to constrain the
>> content of language-tagged fields to only the legitimate characters of the
>> tagged language.  Ideally we would like to locate existing UTF-8 language
>> tables and reference them, rather than creating "ICANN WHOIS language
>> tables".
>>
>> Judging from the IDN ccTLD Fast-Track Workshop slides,
>> http://sel.icann.org/node/6740/, the IDN team addressed similar issues
>> surrounding the use of scripts, languages and character sets.
>> Apparently the IDN team decided that each TLD/registry would define the
>> language character sets acceptable for 2nd-level domain names.  Those files
>> are stored at IANA:  http://www.iana.org/domains/idn-tables/  and reference
>> linked character code pages.  This system provides the flexibility for each
>> TLD to define each language, but has the disadvantage [for example] of
>> defining the Swedish character set three different ways.
>>
>> We would like to invite members of the IDN team to discuss the following
>> questions with the Whois IRD WG:
>> 1) Given the current state of IDN language definitions - are there
>> ways/suggestions that the existing IANA-IDN language definitions could be
>> leveraged to help with WhoIs  IRD?
>> 2) Did the IDN team explore or select a suitable established “standard”
>> for language tags/codes, such as ISO 639-3
>> http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes , for designating which
>> language a domain name [TLD or second-level] is encoded in?
>> 3)  Are there other [ISO{8859/2022}/HTML?] language code page standards which
>> are UTF-8 based, which could be used/leveraged to easily define WhoIs IRD
>> language character sets?
>> 4) Help?  Any suggestions are greatly appreciated.
>
>
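
The language-tagging proposal in the earliest message above (tag each contact
field with a language, then constrain it to that language's characters) could
be sketched roughly as follows. Everything here is an assumption made for
illustration: LANGUAGE_TABLES is a hypothetical stand-in for real per-language
tables such as those published at IANA, its contents are deliberately small,
and the keys are ISO 639-1 style tags chosen only as an example.

```python
import unicodedata

BASIC_LATIN = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical per-language repertoires (lowercase letters only).
LANGUAGE_TABLES = {
    "en": set(BASIC_LATIN),
    # a-z plus á é í ó ú ü ñ
    "es": set(BASIC_LATIN + "\u00e1\u00e9\u00ed\u00f3\u00fa\u00fc\u00f1"),
    # a-z plus å ä ö
    "sv": set(BASIC_LATIN + "\u00e5\u00e4\u00f6"),
}


def invalid_characters(value, language_tag):
    """Return the letters in value that are not in the tagged language's table."""
    table = LANGUAGE_TABLES[language_tag]
    # Normalize to NFC so 'n' + combining tilde compares equal to 'ñ'.
    normalized = unicodedata.normalize("NFC", value).lower()
    return sorted({ch for ch in normalized if ch.isalpha() and ch not in table})


if __name__ == "__main__":
    print(invalid_characters("Mu\u00f1oz", "es"))   # accepted under the "es" table
    print(invalid_characters("Mu\u00f1oz", "en"))   # the tilde-n is rejected
```

Because each registry may define a language's table differently, the result
depends on which table is chosen, which is the Swedish-character-set problem
noted in the message.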




