Re: [gnso-idn-wg] Re: Banning CCHH anywhere in a label
I read your email twice, and on the second pass I managed to grok it. You have certainly presented some very interesting examples of how badly written code can be exploited to cause confusion or spoofing. This is definitely beneficial, as it shows the intricacies of the algorithm that IDN depends on so much.
Before we go on hypothesizing what a bad program can do, is there any empirical evidence of sloppy IDN programming practice in the past or present?
Tan Tin Wee wrote:
This is of course a bad case of misreading the IDNA RFC. There are dozens of other programming errors that could occur and we couldn't possibly prevent them all with policy. So if we took something like an intended label citibankäå, just to use Ram's example, and stuck xn- into it, like citibankxn-äå, a properly written program will pick up the whole label, put on the prefix xn--, pull out the ASCII characters citibankxn- (including the dash), e.g. "xn--citibankxn-", then stick in a hyphen to separate the ASCII part from the Unicode encoding part, and then append the Unicode encoding of äå, which in this context is "b28qq03g", and form the ACE label:
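The steps described above can be sketched with Python's built-in "punycode" codec. This is only a rough illustration, not a full IDNA ToASCII: it skips nameprep and the length checks, the helper names to_ace and from_ace are made up for this example, and the sample label uses the characters exactly as they appear in this message.

```python
ACE_PREFIX = "xn--"  # the IDNA ACE prefix discussed in this thread


def to_ace(label: str) -> str:
    """Hypothetical helper: Punycode-encode one label and add the ACE prefix.

    Assumption: nameprep and length limits are deliberately left out.
    """
    if label.isascii():
        return label  # pure ASCII labels are left untouched
    # The codec copies the ASCII characters first, adds one "-" delimiter,
    # then appends the encoded non-ASCII characters.
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ace(ace: str) -> str:
    """Hypothetical helper: reverse to_ace, checking the prefix at the start only."""
    if not ace.lower().startswith(ACE_PREFIX):
        return ace
    return ace[len(ACE_PREFIX):].encode("ascii").decode("punycode")


label = "citibankxn-äå"
ace = to_ace(label)
print(ace)                     # begins with "xn--citibankxn--"
print(from_ace(ace) == label)  # round trip recovers the original label
```

The point of the sketch is that the embedded "xn-" is carried along as ordinary ASCII; only the leading prefix marks the label as an ACE label.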
Perhaps it would be more politically correct to say that the .CHINA IDN TLD is currently in a test bed phase, and is resolvable by using client plug-ins that tack on the ".cn" suffix. It has been deployed for the past three years within China and resolves cleanly for a whole bunch of domains with .<CHINA> in IDN TLD.
This would be another case of something else that could go awry in the interpretation of the specs. So suppose an application detects Punycode by searching for xn-- anywhere in a string and wrongly picks out the second xn-- instead of the first, as in the following example
Reserving the CCHH prefix is really done to protect us from a) protocol changes requiring a prefix change; and b) registries that do not offer IDNs, so that IDNs cannot be inadvertently registered in an otherwise ASCII-only space.
I doubt the RNWG had the intention of reserving names for the sake of protecting users from sloppy programming. If that were the case, TLDs with more than 3 characters would never have been created, as so much legacy software hardcoded the list of TLDs or applied length restrictions to the TLD.
Now, I followed the Punycode development half a decade ago, but without pondering over it carefully once again.... wow, my head hurts right now, just thinking about the work to do.... Can someone help me here?
If a professor's head hurts, imagine the rest of us! ;-)
I don't recall such discussions; if there were any, I may have missed them. You demonstrated that in some scenarios, if an application incorrectly picks the second xn-- sequence and calls ToUnicode on it, the result is some sequence of Unicode characters that may not make sense. However, I'm not sure what difference it makes whether the conversion fails or succeeds with meaningful or meaningless Unicode. Will a Unicode string prefixed by an ASCII string ending with xn- ever generate a shorter xn--PUNYCODE string which, when ToUnicode'd, produces another legitimate Unicode string, or will it simply trigger an error which can be error-trapped? So far, in the simple example with citibank-<CHINA>, it looks like it is error-trapped in Verisign's Punycode converter. Edmon, Will, do you remember whether any of the millions of exchanges about IDNA five or so years ago contained this kind of scenario?
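To make the question concrete, here is a rough sketch contrasting a prefix-only check with the kind of sloppy "search anywhere" detection discussed above, with the decode step error-trapped. All function names are hypothetical, the sloppy detector is an invented example of the bug rather than code from any real product, and Python's "punycode" codec stands in for a real ToUnicode (no nameprep, no re-encoding check).

```python
ACE_PREFIX = "xn--"


def strict_payload(label):
    """Return the Punycode payload only if the label STARTS with xn--."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):]
    return None


def sloppy_payload(label):
    """Hypothetical buggy detector: takes whatever follows the LAST xn--."""
    pos = label.lower().rfind(ACE_PREFIX)
    return None if pos == -1 else label[pos + len(ACE_PREFIX):]


def try_to_unicode(payload):
    """Decode a payload, trapping decode failures instead of propagating them."""
    try:
        return payload.encode("ascii").decode("punycode")
    except ValueError:  # UnicodeError is a subclass of ValueError
        return None


# Build an ACE label from the sample string used earlier in the thread.
label = ACE_PREFIX + "citibankxn-äå".encode("punycode").decode("ascii")

good = try_to_unicode(strict_payload(label))
bad = try_to_unicode(sloppy_payload(label))
print(repr(good))  # the original label
print(repr(bad))   # None if decoding failed, otherwise unrelated characters
```

Either outcome of the sloppy path can be handled: a failure is trapped, and a "successful" decode yields a string with none of the original ASCII in it, which an application could catch by re-encoding the result and comparing it with the input.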