ICANN Email List Archives

[ssac-gnso-irdwg]



Re: [ssac-gnso-irdwg] Mixed scripts in contact information (subject line changed)

  • To: Dave Piscitello <dave.piscitello@xxxxxxxxx>, "Robert C. Hutchinson" <rchutch@xxxxxxxxx>
  • Subject: Re: [ssac-gnso-irdwg] Mixed scripts in contact information (subject line changed)
  • From: James M Galvin <jgalvin@xxxxxxxxxxxx>
  • Date: Tue, 01 Feb 2011 15:41:43 -0500


From a technical point of view, I don't think there's an issue. My
non-expert perspective on this issue is as follows.

Presuming that each language is defined by an appropriate set of Unicode code points and the keyboard is capable of generating an appropriate internal representation of the necessary code points, all that needs to happen is for the name to be converted by the client to a sequence of said Unicode code points. This is then encoded using UTF-8 and that is what is used when moving the data around.

To display the name, a client simply reverses the process and renders the result. If it doesn't understand a code point, then it does something appropriate.
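
As a rough illustration of the round trip described above, here is a minimal Python sketch. The function names are hypothetical, and "something appropriate" is assumed here to mean substituting U+FFFD for bytes that cannot be decoded; how a client renders a code point it has no glyph for is left to the display layer.

```python
def code_points(name: str) -> list[str]:
    """List the Unicode code points of a name in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in name]


def encode_contact_name(name: str) -> bytes:
    """Client side: turn the sequence of code points into UTF-8 for transport."""
    return name.encode("utf-8")


def decode_contact_name(data: bytes) -> str:
    """Display side: reverse the process.

    Assumption: undecodable bytes are replaced with U+FFFD rather than
    raising an error.
    """
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    name = "J\u00f6rgen Mu\u00f1oz"  # o with diaeresis, n with tilde
    wire = encode_contact_name(name)
    print(code_points(name))
    print(wire)
    print(decode_contact_name(wire) == name)
```

Nothing in this round trip depends on which scripts the name mixes; any restriction on mixed scripts would have to be a policy check layered on top.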

Sounds almost too easy, doesn't it?!

Jim




-- On January 26, 2011 7:14:36 AM -0800 Dave Piscitello <dave.piscitello@xxxxxxxxx> wrote regarding [ssac-gnso-irdwg] Mixed scripts in contact information (subject line changed) --


This is what I suspected. Thanks for your research!

If "what I have on my keyboard" is the limiting factor, then do we
have something similar to variants w/r/t contact names?

Let me explain.

Where we allow people to register munoz.com with a tilde over the
"n", we ought to allow the registrant to use the tilde in the contact
information. That's an example of a rule of the kind "if you can use
the character in the IDN, you can use it in the contact information".

Suppose Mr. Munoz's given name is Jorgen, with a diacritical mark
over the "o". Do we allow an IDN jorgenmunoz.com with both
diacritical mark and tilde? I think these characters come from
different IDN tables. If the answer is no, then the implications are
that we have three different strings that Mr. Munoz could register in
COM:

jorgenmunoz.com, where "n" and "o" are present (no marks)
jorgenmunoz.com, where "n with tilde" and "o" are present
jorgenmunoz.com, where "n" and "o with diacritical" are present

What do we allow as contact information in each of these cases?
Suppose the same Mr. Munoz is the registrant for all three? Do we
really want 3 different registrant contact names?

Again, I'm not an expert here and more informed people than I perhaps
know how this is best managed.
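
To make the three-strings question concrete, here is a small Python sketch. It assumes the diacritical mark over the "o" is a diaeresis, uses Python's built-in `idna` codec (which implements the older IDNA 2003 rules) purely for illustration, and introduces a hypothetical `same_contact_name` helper. It is not a statement of how any registry handles variants.

```python
import unicodedata

# The three candidate registrations from the message above.
PLAIN = "jorgenmunoz.com"
TILDE_ONLY = "jorgenmu\u00f1oz.com"       # n with tilde, plain o
DIAERESIS_ONLY = "j\u00f6rgenmunoz.com"   # plain n, o with diaeresis (assumed mark)


def to_a_label(domain: str) -> bytes:
    """Convert a domain name to its ASCII form with Python's built-in codec."""
    return domain.encode("idna")


def same_contact_name(a: str, b: str) -> bool:
    """Hypothetical comparison: equal after NFC normalization.

    This treats "n" followed by a combining tilde as equal to the precomposed
    n with tilde, but keeps a plain "n" distinct from both.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    for domain in (PLAIN, TILDE_ONLY, DIAERESIS_ONLY):
        print(domain, to_a_label(domain))
    print(same_contact_name("Mun\u0303oz", "Mu\u00f1oz"))
    print(same_contact_name("Munoz", "Mu\u00f1oz"))
```

At the code point level the three names are simply three different strings, so whether they should share one registrant contact name is a policy decision rather than an encoding one.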


On 1/26/11 3:21 AM, "Robert C. Hutchinson" <rchutch@xxxxxxxxx> wrote:

> Hi Dave et al,
> Thanks for the excellent comments/observations from everyone.
> A business friend of mine is the head of elections and Clerk
> Recorder [birth certificates] for Santa Cruz County, so I called
> her to glean the wisdom of how this is handled in US/California law.
> According to her office manager, basically there are no formal
> rules! The systems at the DMV [used for motor voter
> registrations and voter verification] allow only A-Z in names
> and A-Z plus numbers and some special characters in addresses.
> This is entirely by convention. The rule is you can have anything
> the system will handle - and is on my keyboard - but don't ask for
> anything extra. Hey, it's the government... If you ask for a
> Spanish surname [on a birth certificate] spelled with a tilde over
> the N, you get an N. Also, what goes in Santa Cruz County may not
> be true in Santa Clara County, so it is entirely possible that
> other counties operate by different rules. Nothing prevents us from
> adopting less restrictive character rules - but because, I believe,
> early IBM punch-card systems only did uppercase letters and
> numbers, this convention persists.
> I tried calling the California Secretary of State, but the person I
> reached was no help at all...
>
>
> On Tue, Jan 25, 2011 at 4:20 AM, Dave Piscitello
> <dave.piscitello@xxxxxxxxx> wrote:
> Hi all,
>
> Again, apologies for missing yesterday's call.
>
> I have a question related to this discussion. In composing language
> tables with "legitimate" characters for a language, I began to
> wonder whether there are real world constraints on mixed scripts in
> the composition of names.
>
> For example, can a US citizen have a birth certificate where the
> given name or surname contains letters other than A-Z? I believe a
> US citizen can have a name containing characters from extended
> ASCII sets (umlauts, tildes, etc.). People often name their
> children unconventionally: could someone compose a name for a
> child that contained both an umlaut and a tilde, and would this be
> accepted as a legal name in the US (or another country)? Would a
> "yes" answer to these questions influence this discussion?
>
> Can a Chinese citizen have a surname that is composed of characters
> from one accepted Chinese script and a given name composed using
> characters from a second?
>
> Apologies if this is off topic. Feel free to send me away for more
> coffee.
>
> On 1/25/11 4:12 AM, "Robert C. Hutchinson"
> <rchutch@xxxxxxxxx> wrote:
>
>> Hello WhoIs IRD WG,
>> Here are my suggested questions for discussion between the Whois
>> IRD WG and ICANN IDN Staff / Tina Dam.
>> Reply with your clarifications and suggestions.
>> Thanks,
>> Bob Hutchinson
>>
>>
>> The WhoIs IRD WG is requesting expertise/assistance from the IDN
>> team. The WhoIs IRD WG is considering recommending that WhoIs
>> Internationalized Domain name registrant data [name and address]
>> for owner and contact be tagged with language. Furthermore, it
>> would be advantageous to constrain the content of language tagged
>> fields to only the legitimate characters of the tagged language.
>> Ideally we would like to locate existing UTF-8 language tables and
>> reference them, rather than creating "ICANN WHOIS language tables".
>>
>> Based on reviewing the IDN ccTLD Fast-Track Workshop slides,
>> http://sel.icann.org/node/6740/, the IDN team addressed similar
>> issues surrounding the use of scripts, languages and character
>> sets. Apparently the IDN team decided that each TLD/registry would
>> define the language character sets acceptable for 2nd-level domain
>> names. Those files are stored at IANA:
>> http://www.iana.org/domains/idn-tables/ and reference linked
>> character code pages. This system provides the flexibility for
>> each TLD to define each language, but has the disadvantage [for
>> example] of defining the Swedish character set three different
>> ways.
>>
>> We would like to invite members of the IDN team to discuss the
>> following questions with the Whois IRD WG:
>> 1) Given the current state of IDN language definitions - are there
>>    ways/suggestions that the existing IANA-IDN language definitions
>>    could be leveraged to help with WhoIs IRD?
>> 2) Did the IDN team explore or select a suitable established
>>    "standard" language tags/code, like ISO 639-3
>>    http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes, for
>>    designating which language a domain name [TLD or second-level] is
>>    encoded in?
>> 3) Are there other [ISO{8859/2022}/HTML?] language
>>    code page standards which are UTF-8 based, which could be
>>    used/leveraged to easily define WhoIs IRD language character sets?
>> 4) Help? Any suggestions are greatly appreciated.
>
>
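
The idea in the original proposal of constraining a language-tagged field to that language's characters could be sketched roughly as follows. The tables here are small hypothetical stand-ins, not the IANA IDN tables or any agreed WHOIS table, and the function name is invented for illustration.

```python
import unicodedata

# Hypothetical, deliberately incomplete per-language letter tables.
HYPOTHETICAL_TABLES = {
    "sv": set("abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"),
    "es": set("abcdefghijklmnopqrstuvwxyz\u00e1\u00e9\u00ed\u00f3\u00fa\u00fc\u00f1"),
}

# Assumption: a few separators are allowed regardless of the language tag.
COMMON = set(" -'.")


def invalid_characters(value: str, lang: str) -> list[str]:
    """Return the characters of value that fall outside the table for lang."""
    allowed = HYPOTHETICAL_TABLES[lang] | COMMON
    # Normalize first so precomposed and combining forms compare the same way.
    normalized = unicodedata.normalize("NFC", value).lower()
    return [ch for ch in normalized if ch not in allowed]


if __name__ == "__main__":
    print(invalid_characters("Mu\u00f1oz", "es"))
    print(invalid_characters("Mu\u00f1oz", "sv"))
    print(invalid_characters("J\u00f6rgen Mu\u00f1oz", "es"))
```

A name such as the "Jorgen Munoz" example with both marks fails under either single tag in this sketch, which is the mixed-script question raised earlier in the thread.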







