General comments on IRD-WG report ("Final Interim Report of the ICANN Internationalized Registration Data Working Group")
- To: ird-wg-report@xxxxxxxxx
- Subject: General comments on IRD-WG report ("Final Interim Report of the ICANN Internationalized Registration Data Working Group")
- From: John C Klensin <klensin@xxxxxxx>
- Date: Fri, 25 Feb 2011 15:33:08 -0500
Hi.
It wasn't clear where and how these comments were to be posted
when I responded last week, so I entered them on the WG's web
page. The notes are repeated below for completeness of the
comment archive.
--John Klensin
-----------
This report was recently called to my attention and I've done a
careful reading. First of all, I'm extremely pleased that
there is now a serious attempt to address the issues of
internationalized data in registration databases. Parts of the
community argued that they should have been addressed as a
condition for IDN deployment. Some of the divergent practices
mentioned in the report are the consequence of a variety of
"Whois committees" ignoring these issues, so I hope those
problems can now be corrected.
The report is, unfortunately, somewhat buried and hard to find.
Worse, it asks for comments, but there is no comment forum
identified in either the report itself or on the WG web page
(https://st.icann.org/int-reg-data-wg/index.cgi?internationalized_registration_data_working_group).
There is provision for comments on that page, but, as of 1600
UT on 17 February, there are only three comments, none about
this report and two clearly spam.
Unfortunately, several elements of the report are, from my
perspective and experience, disappointing. I describe those
below, in three categories (comments within categories are in
no particular order).
Note that I have used the terms "registration database" or
"registration database services" (terminology consistent with
the report's Section 3.2(3)) below where the report uses
"Whois". I find the distinction between "WHOIS" and "Whois" a
little too subtle (and note that the report was unable to be
completely consistent about the distinction). For anyone
trying to access the report via a screen reader or other
text-to-speech capability, a distinction based on
capitalization is pathological.
I. Showstopper Issues
These issues are, in my opinion, sufficient to discredit the
entire report and the groups that produced or signed off on it:
I.1. Confusion about Basic Terminology
The document uses DNS terminology in a way that is likely to
introduce or exacerbate confusion. The problem is aggravated
by mixing terminology and references to what IETF calls
IDNA2003 and IDNA2008. Given that IDNA2008 was adopted by the
IETF over a year ago and that ICANN recommendations and
guidelines reflect IDNA2008 usage, having definitions based on
IDNA2003 is confusing at best and arguably inappropriate. As
an example of the more general terminology problem, there has
been extensive confusion in the community as to whether "an
IDN" refers to a single label, a fully-qualified domain name
containing at least one U-label, or a fully-qualified domain
name consisting entirely of U-labels. The IDNA2008
specifications go to some length to point out that distinction
and to confine themselves exclusively to labels. This report
reintroduces the confusion, both in the definition and in the
occasional use of "IDN label" (see, e.g., Section 4.4). A
similar problem exists with the report's "Internationalized
Registration Data (IRD)" definition. For example, if a
database holder decides to include both the U-label and A-label
forms of a registered name in its database, but all other data
are in ASCII, does that make the data record an IRD record or
not?
The report itself illustrates other examples of the problems
associated with confusion about whether an IDN is a label or
part of a fully-qualified domain name. For example, while the
report's IDN definition says "...domain names", the "IDN
variant" definition says "the label".
I.2. Protocol Development Responsibility
The second paragraph of Section 3.3 is very interesting. It
begins (in boldface italics) with the statement "The second
objective of the IRD-WG is how to specify how to
internationalize the WHOIS protocol.". Never before has an
ICANN Working Group or committee asserted that its mandate
includes protocol design for the Internet. An assortment of
formal and informal agreements, contracts, and assertions by
ICANN leadership in various "governance" and "cooperation"
forums suggest that this is the province of the IETF and
perhaps other technical bodies and that ICANN will never
attempt to move into this space. Yet this report -- from a
joint GNSO / SSAC working group -- asserts that protocol
development is one of two key objectives and part of its
mandate.
This is particularly problematic because the report goes on to
assume that the WG, or some other ICANN entity, can change the
WHOIS protocol (see, e.g., "Preliminary Recommendation (1)" in
Section 5). The IETF approved a protocol (IRIS, mentioned
briefly in the report) that would have addressed many of the
data query and retrieval issues that the report discusses
including the absence of a standardized way to structure and
identify registration data. IRIS was never adopted by the
ICANN registry, registrar, or other registration data
communities. Whether the reasons for that were good or bad, an
adequate report on this subject should start with either an
analysis of why IRIS was inappropriate or a more extensive
analysis of features that would be needed in a different
protocol to replace or supplement WHOIS.
Section 4.3 ("What Capabilities Are Needed for Directory
Services in the IDN Environment?") would appear from its title
to address this group of questions, but, as described there,
the WG decided to address a question about user experience
instead and did not address the issue of capabilities of a
directory service except vaguely and by indirection.
I.3. Display Failures in Text and Examples
The report's example of a U-label (from the "A-label | U-label"
definition in Section 3.2) doesn't show a U-label. Instead, it
shows the universally-confusable "string of little boxes"
(repetitions of U+25A1 or equivalent). Little boxes also occur
in the examples used in Section 4.2. The apparent inability of
ICANN and its Working Groups to manage scripts and fonts
sufficiently well to make reports readable by the intended
audiences should serve as a warning to every reader of the
report about issues with IDNs and retrieval and display of
information in other than ASCII form. If ICANN cannot manage
to keep these no-information strings out of its own reports,
the community that is concerned that non-ASCII information
retrieval and display retain information could legitimately
conclude that goal is impossible.
I.4. Analysis and Recommendations -- Contact Information
The analysis in Section 4.5 is useful, as far as it goes.
However, it omits any analysis as to whether the Universal
Postal Union (UPU) recommendations (particularly the S42
templates) or the ISO Standards for transliteration of
characters from various scripts [ISO011] could be of use here.
It seems to me that those are important omissions, sufficiently
important to call into question the appropriateness of the
composition of the Working Group for this task if its members
were not aware of them.
Equally important, many of the components of that analysis fall
into the trap of assuming that the "user of Internationalized
Registration Data" will be able to understand all such data.
Once one permits data outside the ASCII repertoire in contact
records, the issues aren't "ASCII versus everything else" but
are instead "is the script (and perhaps the language) used in
the registration record useful locally". For someone in India
trying to access registration data, data in Cyrillic or Arabic
characters are likely to be at least as problematic as data in
ASCII (in practice, actually more so). This problem appears
particularly egregiously in Table 6, but is present in many of
the comments in the analysis.
I note that, despite the analysis, no recommendation has been
supplied on these subjects, even as a basis for discussion.
Section 6 poses the topic, in the most general way possible,
for community discussion. But no call for community comment
has been issued, perhaps because of experience with how
unfocused community discussions tend to be when presented with
questions as open-ended as those posed in the report.
I.5. Analysis and Recommendations -- Non-contact Registration
Information
Section 4.4 appears to address specific recommendations for
submission and registration data in local scripts. This
section has many issues that make the recommendations either
incomplete, unimplementable, or just plain wrong. For example:
(i) What does it mean when the report says "The IRD-WG members
did not discuss the internationalization of this field"? At
least for Dates, there is an established and widely-used
international standard for the representation of dates (ISO
8601). That standard is usually considered sufficient but it
uses European digits only. Use of digits other than European
ones could cause as much confusion and inaccessibility to
information (including the possibility of little boxes (see I.3
above)) as anything else. So why wasn't this discussed?
(ii) Under "Email address", the section refers to an "IETF
standard for internationalized mail headers". There is no such
standard; RFC 5335 is an experimental specification that
certainly should not be cited in ICANN requirements
(especially since there is IETF consensus that part of it is
inappropriate).
(iii) For "Registration Status", the section recommends
publishing the EPP status code so that localization is
possible. The principle is probably correct except that there
is no requirement that ccTLDs use EPP. At least some readings
of the plans for new gTLDs and recent decisions about relaxing
requirements for registrar-registry separation may further
narrow the use of that protocol. Is the WG recommending that
all registries, registrars, and other entities who might supply
registration data adopt those codes and the model implied by
them so that they can be returned? Such a recommendation would have
implications well beyond any consideration of IDN
registrations.
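To make point (i) concrete, here is a minimal sketch of how a
consumer of registration data might detect and normalize dates
written with digits other than European ones. The function names
and the sample value are hypothetical, not taken from the report
or from ISO 8601 itself, and the check covers only the
YYYY-MM-DD shape, not whether the date is a valid calendar date.

```python
import re
import unicodedata

# ASCII-only pattern for the YYYY-MM-DD calendar date form.
# In Python, \d would also match non-European decimal digits,
# so the character class is spelled out explicitly.
ISO_DATE_ASCII = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_ascii_iso_date(text: str) -> bool:
    """True if text has the shape YYYY-MM-DD using only the digits 0-9."""
    return ISO_DATE_ASCII.fullmatch(text) is not None


def to_european_digits(text: str) -> str:
    """Map every Unicode decimal digit to its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        out.append(str(value) if value is not None else ch)
    return "".join(out)


# Hypothetical sample: 25 February 2011 written with Arabic-Indic digits.
sample = "\u0662\u0660\u0661\u0661-\u0660\u0662-\u0662\u0665"
```

A client that assumes European digits would reject the sample as
it stands and accept it only after normalization, which is
exactly the kind of behavior the report would need to specify.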
II. Other Important Substantive Issues
Some of these issues are problems with the text of the report;
others merely risk reader confusion or undermine the report's
credibility.
II.1 Variant Characters, Variant Labels, and Definitions
The report's definition of "Variant character" is completely
consistent with the JET report that is cited for that
definition. Unfortunately, it is not consistent with other
ICANN terminology, including the ccTLD Fast Track report and
procedures. Those procedures associate the term "variant"
with just about anything other than a "single conceptual
character [that] has two or more graphic representations".
Indeed, some interpretations of the term in the Fast Track
context require that "variants" (its definition) be associated
with similar-looking (visually confusing) characters, i.e., the
JET [RFC3743] definition is just about excluded. Those issues
have been the source of large controversies in ICANN (some of
which are hinted at in Section 4.2 of the report); I would
expect the committee that prepared this report to be familiar
with them and to be sure that the report does not add to the
confusion.
By introducing the JET report's terms "activated variant" and
"reserved variant" (terms that are not generally agreed upon
outside that community either) as if they clarified the meaning
of "variant" (they are largely orthogonal to the definitional
problem) in Section 4.2, the report only makes the confusion
worse.
II.2. Preferred Variant Labels
Separately from the above, although it avoids saying it
explicitly by circling the issue with "return the domain of
which it is a variant..." language, the third and fourth
bullets of Section 4.2 imply that there is always a "primary
name" or, in language closer to that of the JET specification,
a "preferred variant label". I don't believe there is any
consensus in the broader community on that issue. At a
minimum, the question needs to be asked, not presented as an
"observation" from the WG membership.
II.3. The Multilingual Internet
Members of the broader community have explained to ICANN
multiple times that the use of IDNs has little to do with a
"multilingual Internet" even though they make a slight
contribution to enabling use by people who use different
languages and writing systems (see, for example, my comments to
the ITU on the subject [Klen06]). While the organization,
composition, and presentation of registry databases may
actually be more relevant to that topic, tying this work to the
"demand for a multilingual Internet" mostly just distracts
people from the real issues.
II.4. The WHOIS Protocol and "International" Characters.
While RFC 3912 sadly added to the confusion, the (original)
WHOIS protocol (see below) does specify NVT ASCII (sic, see
below). It does so by requiring the use of a Telnet [Post83
and predecessors] connection without providing for option
negotiation. It is worth noting that Telnet uses octets with
the high bit set for other purposes than international
character sets. There is no question that there are a large
number of WHOIS protocol services on the network that allow
UTF-8, ISO 8859-x, or ISO 2022-based encodings on input,
output, or both, but none of them except
possibly the last are conformant to the WHOIS specification.
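As a rough illustration of the 7-bit constraint discussed in
II.4, the sketch below checks whether a query or response
contains any octet with the high bit set. The function name and
sample queries are hypothetical, and the check is deliberately
narrow: it tests only the high bit and does not verify the other
NVT conventions (such as line endings).

```python
def is_seven_bit(data: bytes) -> bool:
    """True if no octet has the high bit set (all values below 0x80)."""
    return all(octet < 0x80 for octet in data)


# Hypothetical queries: one pure ASCII, one containing U+00FC
# encoded as UTF-8 (which produces octets above 0x7F).
ascii_query = b"example.com\r\n"
utf8_query = "b\u00fccher.example\r\n".encode("utf-8")
```

Here `is_seven_bit(ascii_query)` holds and
`is_seven_bit(utf8_query)` does not, since every non-ASCII
character in UTF-8 is encoded using octets with the high bit set.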
III. Editorial and Terminological Issues
A number of other aspects of the report aren't problematic if
the reader applies effort to determine what was probably meant.
But doing that may lead to either misunderstandings or the
conviction that the report is incompetent. For example, the
definition of an "Internationalized Domain Name (IDN)" in
Section 3.2 would define "abc123" as an IDN label. That would
be a unique usage; certainly RFC 5890 doesn't consider it a
U-label.
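The "abc123" point can be sketched in a few lines. This is only
a coarse classifier under stated assumptions: it relies on the
RFC 5890 requirement that a U-label contain at least one
non-ASCII character and on the "xn--" prefix used by A-labels,
and it does not perform any IDNA2008 validity checking. The
function name and category strings are hypothetical.

```python
def classify_label(label: str) -> str:
    """Coarsely sort a single label into one of three buckets.

    This does not validate the label against IDNA2008; it only
    separates all-ASCII labels from those with non-ASCII characters.
    """
    if label.isascii():
        if label.lower().startswith("xn--"):
            return "A-label candidate"
        return "ASCII label"
    return "U-label candidate"
```

Under this sketch, `classify_label("abc123")` returns
"ASCII label", which is the distinction the report's definition
fails to make.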
III.1. Small Technical Errors
The report contains a number of small, but problematic,
technical errors. For example, RFC 3912 is not a description
of "the original WHOIS protocol" (italics in the last paragraph
of Section 3.1). The original protocol was described in RFC
812 and 954 and there is reason to believe that many conforming
implementations of the WHOIS protocol conform to the RFC 954
model. RFC 3912 is an attempt to discuss and update the
protocol to more contemporary times but, as a quotation from
that RFC in the report notes, under the assumption that IRIS
would rapidly become available and heavily used.
III.2. Citations of IDNA2003
The first paragraph of Section 3.3 cites a pair of
IDNA2003-related specifications (footnotes 15 and 16; RFCs 3490
and 3454). Neither should be relevant to any plans going
forward (see comment I.1 above). Much more important, they are
cited as "IDN guidelines [that] define how IDNs will be
composed and displayed". While that is true if one confuses
"standard" with "guideline" and those standards with "identify
how to internationalize domain registration data" (from the
first sentence of that paragraph in the report), the IDNA2003
specifications have absolutely nothing to say about
registration databases, their content, or their display. The
paragraph essentially goes on to say that, but it is misleading
at best.
III.3. US-ASCII7
The report refers, without any reference or definition, to
"US-ASCII7". There is no such creature, certainly not one
recognized by ANSI (or its predecessors under other names).
Worse, all use of ASCII on the contemporary Internet is in
"NVT" [Klen08 and various predecessor documents] form, i.e.,
ASCII (by definition, 7-bit) right-justified in an 8-bit field
with a leading zero. Introduction of a term like "US-ASCII7"
might cause a reasonable reader to wonder whether one
of the historical 7-bit transmission or storage forms was
anticipated.
III.4. ASCII and Language Support
This point has been made many times before, but it is worth
noting that one of the languages that cannot be completely
supported by ASCII characters is called "English". The
problems associated with not having character support beyond
ASCII become worse as one moves from the (extremely few)
languages that can be written entirely in ASCII, to those
(including English) that require a few non-ASCII characters, to
those that are easily mapped into Basic Latin characters using
well-known, recognized, and standardized transliterations, to
those for which transliteration works less well and that use
characters from scripts that are more distantly related to
Basic Latin than most European ones. I don't think the report
needs to dwell on this, but the text reads as if the WG itself
may be confused.
III.5. ITU, UPU, and Other Bodies
In Section 4.4, reference is made to a "UPU E.123 standard..."
with a footnote pointing to an ITU Recommendation. E.123 is,
indeed, an ITU Recommendation. It has little or nothing to do
with the UPU. As with several other items listed here, this
isn't a big deal because the reader who is at all familiar with
the subject matter will figure out what was intended. But it
makes the report look even more like a sloppy piece of work
that was not carefully reviewed by anyone before being
published. There actually are UPU standards; see comment I.4
above.
-----------------
References
[ISO011] There are many of these standards, providing
nationally and internationally approved, unambiguous, methods
for transliterating between various scripts and "Latin
characters". A complete and up-to-date list can be obtained
by going to the search page at
http://www.iso.org/iso/search/extendedsearchstandards.htm ,
entering "transliteration" under "Keyword or phrase", checking
"Titles" and "Abstracts", setting "Document type" to "All by
default", and initiating a search.
[Klen06] "Issues in Building a Multilingual Internet",
http://www.isoc.org/internet/issues/docs/multilingual-internet-issues_20080408.pdf
[Klen08] RFC 5198, "Unicode Format for Network Interchange". J.
Klensin, M. Padlipsky. March 2008.
[Post83] RFC 854, "Telnet Protocol Specification". J. Postel,
J.K. Reynolds. May 1983.