Comments on whois-accuracy-study-17jan10-en.pdf
- To: whois-accuracy-study@xxxxxxxxx
- Subject: Comments on whois-accuracy-study-17jan10-en.pdf
- From: "Joe St Sauver" <joe@xxxxxxxxxxxxxxxxxx>
- Date: Mon, 5 Apr 2010 17:19:54 -0700 (PDT)
Thank you for the opportunity to comment on the "Draft Report for the
Study of the Accuracy of WHOIS Registrant Contact Information,"
I.1 Whois data accuracy is a critical issue for the Internet as a whole
because the ability to register domain names with incomplete or inaccurate
whois data facilitates spam, scams, phishing, distribution of malware and
numerous other forms of Internet abuse.
I.2 Ensuring that domain names are registered with accurate data, and
ensuring that that data remains current, is of pivotal importance to
efforts to counter Internet abuse. It is thus encouraging to see ICANN
strive to understand the magnitude and characteristics of whois data
issues, including incomplete and inaccurate data.
I.3 Unfortunately the current whois study has critical deficiencies which
obscure both the true extent of the problem we collectively face and the
steps that are needed to correct that problem.
II.1 The NORC sampling methodology described in Appendix 1 ("Sample Design
in Detail") describes an equal probability sample clustered by country of
II.2 NORC's choice to cluster by country is regrettable in that it fails to
fully capture the underlying problem that abuse analysts and researchers
routinely see with the gTLD domain whois data: whois inaccuracy is a
phenomenon which varies dramatically *by registrar*.
II.3 There are some registrars (including some which are large and
others which are small in size) which do a very professional job of
validating and vetting customer identities and their customers' whois
point of contact (POC) information. It is rare to see any domains which
have been registered via those registrars appearing in abusive contexts,
and if abusers do slip by, they are quickly identified and quashed.
II.4 There are other registrars, however, who apparently fail to do even
a cursory job of validating and vetting their customer registration
information. Not surprisingly, those registrars are disproportionately
encountered when investigating incidents of Internet abuse, apparently
because those registrars serve as "registrars of choice" for cyber
miscreants. Moreover, when complaints are made to those registrars about
abusive customers with bogus or incomplete data, those reports often
appear to fall on deaf ears.
II.5 Had the NORC whois study's sampling design employed a weighted
stratified sample (numerically "over sampling" small registrar data
and numerically "under sampling" large registrar data) it would have
been possible to use the resulting data *both* to depict critical
inter-registrar differences accurately *and* to produce an
accurate overall estimate for the domain name population as a whole
(that is, an estimate for the population as a whole could have been
obtained by proportionately weighting the estimates obtained for each
of the registrar strata relative to their overall market share).
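As a rough illustration of the weighting step described in II.5, the sketch below combines per-stratum accuracy estimates into one population-wide figure. The three strata, their market shares and their accuracy rates are hypothetical values chosen for illustration only; none of them come from the NORC study.

```python
# Hypothetical strata (NOT from the NORC study): each entry is
# (label, market share, accuracy rate measured in that stratum's own sample).
strata = [
    ("large registrars", 0.55, 0.80),
    ("mid-size registrars", 0.35, 0.70),
    ("small registrars", 0.10, 0.40),
]


def overall_accuracy(strata):
    """Weight each stratum's estimate by its market share."""
    total_share = sum(share for _, share, _ in strata)
    weighted = sum(share * rate for _, share, rate in strata)
    return weighted / total_share


estimate = overall_accuracy(strata)
print(f"overall estimated accuracy: {estimate:.1%}")
```

Because each stratum is sampled separately, the per-stratum rates remain available for registrar-level comparison, while the weighted sum still yields the overall estimate.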
II.6 Why is it important to use a weighted sample, stratified by registrar,
rather than the alternative sampling plan actually employed?
II.7 While there are over 850 ICANN-accredited registrars, the domain
registration ecosystem is dominated by literally just a handful of large
registrars. In fact, the top *five* registrars account for over half of
all domain registrations. For example,
http://www.webhosting.info/registrars/top-registrars/global/ quotes the
Rank  Registrar           Market Share
1     Go Daddy            30.002%
2     Enom                 8.299%
3     Tucows               6.749%
4     Network Solutions    5.835%
5     Schlund+Partner      4.317%  ==> 55.202% total
II.8 Because those five registrars are as large and influential as they
are, we obviously need to understand the accuracy of their customers'
registrations -- these are key players in the domain registration
ecosystem, no argument there.
II.9 However, if we imprudently devote a full 55.202% of our total sample
to *just* those five registrars, we are likely to draw a sample that's far
*larger than needed* to assess the accuracy or inaccuracy associated with
each of those five registrars. Specifically, if we were to draw 2400
domains, we might expect to see:
Go Daddy           30.002% * 2400 ==> ~720 Go Daddy-registered domains
Enom                8.299% * 2400 ==> ~199 Enom-registered domains
Tucows              6.749% * 2400 ==> ~162 Tucows-registered domains
Network Solutions   5.835% * 2400 ==> ~140 NS-registered domains
Schlund+Partner     4.317% * 2400 ==> ~104 Schlund-registered domains
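The arithmetic above can be reproduced with a few lines of Python. This is only a convenience sketch: the market shares are the ones quoted in II.7, and the 2400-domain sample size is the working figure used in this comment.

```python
SAMPLE_SIZE = 2400  # working sample size used in this comment

# Market shares (percent) as quoted in II.7.
market_share_pct = {
    "Go Daddy": 30.002,
    "Enom": 8.299,
    "Tucows": 6.749,
    "Network Solutions": 5.835,
    "Schlund+Partner": 4.317,
}

# Expected number of sampled domains per registrar under proportionate sampling.
expected = {
    name: round(pct / 100 * SAMPLE_SIZE)
    for name, pct in market_share_pct.items()
}

for name, count in expected.items():
    print(f"{name:<18} ~{count} domains")
print(f"{'Top five combined':<18} ~{sum(expected.values())} of {SAMPLE_SIZE}")
```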
II.10 Given that number of observations, we would likely have exceedingly
tight bounds on estimated characteristics for domains from those five
registrars, bounds far tighter than are needed for operationally meaningful
II.11 But what of all the other accredited registrars, except for those top
five? By the time we get down to the 75th most popular registrar out of our
population of over 850 registrars, the 75th most popular registrar's market
share is less than 1/10th of 1%. If we were drawing a proportionate sample,
we might expect to see (on average) only 2.4 domains from each such registrar:
0.1% * 2400 ==> 2.4 domains
II.12 Because we can't take "four tenths" of a domain, many times we'd see
two or three domains in our sample from such a registrar (although obviously
in some cases we might see only one (or none) for a particular registrar,
and other times we might see four or more, it's just the "luck of the draw").
II.13 If we have a sample for a registrar that consists of only two or three
domains, there are only three or four computationally possible outcomes
for each such registrar:
Sample size -->        2                3
Possible outcomes -->  100% accurate    100% accurate
                       50% accurate     66% accurate
                       0% accurate      33% accurate
                                        0% accurate
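The same enumeration can be generated for any sample size. In this small sketch the function name is my own, and percentages are truncated to whole numbers to match the figures above.

```python
def possible_accuracy_rates(sample_size):
    """Every accuracy percentage a sample of this size can produce,
    truncated to a whole percent, from all-accurate down to none-accurate."""
    return [100 * k // sample_size for k in range(sample_size, -1, -1)]


for n in (2, 3):
    print(f"sample size {n}: {possible_accuracy_rates(n)}")
```

A sample of n domains can only ever yield n + 1 distinct accuracy values, which is why such small per-registrar samples say so little.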
II.14 Intuitively even a non-statistician can understand that such
estimates are unlikely to be very helpful when it comes to understanding
the extent to which there is (or isn't) an accuracy problem *at that
particular registrar* -- you simply need more data (e.g., a larger
sample for that registrar) if you want to be able to draw any meaningful
inferences about the level of whois inaccuracy for that particular
registrar.
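To put rough numbers on "more data", the sketch below uses the textbook normal-approximation margin of error for a proportion at a 95% confidence level. It assumes a simple random sample within each registrar and the worst case p = 0.5; the approximation is poor for very small samples, so the n = 3 figure is indicative only.

```python
from math import ceil, sqrt

Z_95 = 1.96  # normal critical value for a 95% confidence level


def margin_of_error(n, p=0.5):
    """Approximate 95% margin of error for a proportion estimated from n cases."""
    return Z_95 * sqrt(p * (1 - p) / n)


def required_sample_size(margin, p=0.5):
    """Smallest n whose approximate 95% margin of error is at most `margin`."""
    return ceil(Z_95**2 * p * (1 - p) / margin**2)


for n in (3, 104, 720):
    print(f"n = {n:>3}: +/- {margin_of_error(n):.1%}")
print(f"cases needed for +/- 10%: {required_sample_size(0.10)}")
```

On these assumptions, about a hundred domains per registrar would give a margin of roughly ten percentage points, whereas a sample of three supports no useful bound at all.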
II.15 The problem becomes even more acute when you recognize that by the
time you drop down to the 210th registrar by market share, that 210th
registrar's domain name market share amounts to less than 0.01% (i.e.,
less than 1/100th of 1%). At that point, on average, we'd expect to see
only a quarter of one domain per registrar (in practical terms, we'd
probably see only one domain (at most), and often no domains at all
from such registrars). Is that "okay?" If we care about registrar-by-
registrar differences in whois data accuracy, no. Even "small" registrars
can represent a material source of malicious domains with bad whois data.
II.16 That is, even a registrar that has only 1/100th of 1% of all domains
still represents over 10,000 registered domains. That's a LOT of domains if
they're all being actively abused for malicious purposes and they all have
bogus or incomplete whois POC data. Even small registrars with 1/100th of
1% of all domains may thus still play a critical role when it comes to the
safety, security and stability of the Internet, yet with the current
sampling design, there's virtually no chance that we will see/detect even
blatant hypothetical levels of abuse or neglect by some small registrars.
The design of the current study simply a priori precludes it.
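A back-of-the-envelope check shows how likely it is that a small registrar is missed entirely. The sketch assumes independent draws with probability equal to market share, which simplifies the actual clustered design.

```python
def chance_of_no_domains(share, sample_size=2400):
    """Probability that none of the sampled domains belong to a registrar
    holding `share` of all registrations (independent-draw assumption)."""
    return (1 - share) ** sample_size


for label, share in (("0.1% share", 0.001), ("0.01% share", 0.0001)):
    expected = share * 2400
    missed = chance_of_no_domains(share)
    print(f"{label}: expect {expected:.2f} domains, "
          f"{missed:.0%} chance of seeing none")
```

For a registrar with a 0.01% share, the chance of contributing no domains to a 2400-domain sample is close to 79%, no matter how bad that registrar's whois data may be.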
II.17 Some registrars also employ agents, or registration service providers,
who semi-independently register domains through an upstream registrar.
There are often significant differences within a given registrar between
domains registered by one registration service provider and the domains
registered by another registration service provider, even though both
registration service providers employ the service of the same upstream
registrar. At a minimum, the study should identify registrars which
use the registration service provider model, and attempt to quantify the
number of registration service providers each such registrar may support.
Ideally, large registrars employing the registration service provider
model should have individual registration service providers independently
assessed on a registration service provider by registration service provider
basis.
II.18 Related to the registration service provider model, some registration
service providers are, themselves, effectively cloaked, offering at most
a web site or a phone number as a point of contact. This lack of ground
truth identity makes it difficult or impossible for the upstream registrar
(or any other party, including ICANN itself) to hold the registration
service provider accountable for the potential inaccuracy of domains they
may register. At most, the registration service provider may lose the
ability to register domains, but there is virtually nothing to prevent
them from morphing their virtually anonymous online identity to reappear
as a "new" registration service provider. ICANN needs to better understand
the role of registration service providers in addressing whois inaccuracy
and domain abuse.
III. Improving Whois Accuracy
III.1 The real key to identifying and *improving* whois accuracy is letting
the data tell its story. In the current study, ICANN and NORC blindly probed
the population of all domains in an effort to identify those which might
have bad whois data. In reality, ICANN already *knows* about many domains
which likely have bad whois data, and they can readily obtain additional
information about many more domains which likely have bad whois data.
III.2 Let's begin with what ICANN itself knows. ICANN routinely receives
reports about alleged whois inaccuracy via the Whois Data Problem Reporting
System (WDPRS); see http://wdprs.internic.net/ . In addition to explicitly
flagging alleged inaccuracies, this data collection channel serves a
potentially important "finger pointing" role: if we consider the set
of all domains reported via WDPRS, are there registrars which are
disproportionately overrepresented in that data? For example, are
there registrars with 1% of the overall domain market, but 10% or
even 15% of all WDPRS complaints filed? If so, I believe domains
registered via those registrars should be subject to a targeted review
for whois accuracy problems.
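One simple way to express this kind of over-representation is the ratio of a registrar's share of complaints to its share of the market. In the sketch below the registrar names, the shares and the review threshold are all hypothetical; they are not WDPRS figures.

```python
def overrepresentation(market_share, complaint_share):
    """Ratio of complaint share to market share; 1.0 means proportionate."""
    return complaint_share / market_share


# Hypothetical data: name -> (market share, share of all WDPRS complaints).
registrars = {
    "Registrar A": (0.30, 0.20),
    "Registrar B": (0.01, 0.10),
    "Registrar C": (0.01, 0.15),
}

REVIEW_THRESHOLD = 5.0  # hypothetical cut-off for a targeted review

flagged = sorted(
    name
    for name, (market, complaints) in registrars.items()
    if overrepresentation(market, complaints) >= REVIEW_THRESHOLD
)
print("candidates for targeted review:", flagged)
```

The same ratio could be computed from the domain name block lists discussed below, using each registrar's share of listed domains in place of its share of complaints.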
III.3 Similarly, WDPRS also collects information via a "followup
survey" for each domain reported via WDPRS. Which registrars clean up
(or "HOLD") the problem domains that are reported to them? Which
registrars appear to ignore WDPRS reports? Again, the source of the
whois inaccuracy problems should quickly become clear.
III.4 If ICANN is willing to reach out to the community for data, a
variety of "domain name block lists" enumerate domains which have
been seen in the body of spam or used in other malicious contexts.
Those domains routinely have bogus whois point of contact information,
with missing or misleading information provided in an effort to hinder
investigation, prosecution and civil litigation.
III.5 Examples of these domain name block lists include the SURBL
(see http://www.surbl.org/ ), the URIBL (see http://www.uribl.com/ )
and the Spamhaus DBL (see http://www.spamhaus.org/dbl/ ). If copies
of those lists are obtained by ICANN and reviewed, what registrars
are disproportionately represented? If the whois data for those
domains is checked, is it correct?
III.6 Using these approaches -- both in-house and community data --
I believe ICANN can quickly identify the problematic registrars most
favored by online miscreants, and once that's been done, ICANN could
then focus its efforts where whois-related problems are most acute.
III.7 Yet another approach that ICANN could employ would be to
focus on so-called "bullet proof domain name registration" providers.
These registrars and/or registration service providers are quite candid
in advertising their willingness to allow registrants to register
domains while providing arbitrary/inaccurate point of contact data, or
no point of contact data whatsoever. Providers which appear to offer
this service can be readily found using common search engines. Once
such providers are identified by ICANN, ICANN itself can attempt to
register test domains with bogus or missing data, and then map the
resulting registrations to the registrars servicing that market:
registrars which are intentionally tolerating inaccurate or missing
POC data in exchange for substantially above-market registration
fees.
III.8 Even if ICANN didn't want to pursue any of these alternative
analyses, there are still simple but important steps that ICANN could
take to improve the quality of whois data today.
III.9 For example, automated address verification products exist for
postal addresses in many countries today, including for postal
addresses in the United States. This is acknowledged in the current
whois study: the study's authors used the Smartmailer software to check
whois addresses against USPS records of deliverable addresses (see the
current whois study at PDF page 9). Why is Smartmailer (or a similar
product) used only in extraordinary cases, such as during the conduct
of this whois study? Why isn't this address verification technology
in routine production use for ALL domain registrations at the time
those domains are first registered -- and at any point when they are
III.10 I recognize and acknowledge that address verification data may not
be available for all countries; however, the current study indicates that
US addresses account for 59% of all domains (the current whois study
at PDF page 24). Even if mechanical address validation can "only" be done
for 59% of domains, e.g., only for domains registered to U.S. parties,
that is still nothing to be scoffed at, and it is a concrete step that
ICANN can and should undertake immediately even if it takes no other
step to improve whois accuracy. (Obviously automatic validation for
addresses in other countries should be added to the extent that it is
also readily available.)
III.11 Another idea for ICANN's consideration: ICANN should consider
shifting its sampling frame from domain *names* to domain *registrants.*
A relatively small number of domain *registrants* register a
disproportionately large number of domain *names* for a variety of
purposes (including domaineering/speculative purposes). Identify those
highest volume registrants, carefully validate their points of contact,
and then, having done so, exclude all the domains associated with that
comparatively small number of registrants now that you've successfully
verified that their whois data is accurate.
III.12 Having done so, you can then continue to work your way down an
ordered list of the highest volume registrants until you're left with
only registrants that have a comparative handful of registrations per
registrant name/email address. By focussing on the registrants who have
the largest number of registered domain names first, you can maximize
the number of domains whose addresses have been verified while
minimizing the number of registrant POCs which need to be scrutinized.
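The ordering described in III.11 and III.12 amounts to a greedy pass over registrants sorted by portfolio size. The sketch below uses invented registrant names and domain counts, and a hypothetical 85% coverage target, purely to show the idea.

```python
def verification_plan(domains_per_registrant, target_coverage):
    """Pick registrants to verify, largest portfolio first, until the
    verified registrants hold at least `target_coverage` of all domains."""
    total = sum(domains_per_registrant.values())
    covered = 0
    plan = []
    ranked = sorted(
        domains_per_registrant.items(), key=lambda item: item[1], reverse=True
    )
    for registrant, count in ranked:
        if covered / total >= target_coverage:
            break
        plan.append(registrant)
        covered += count
    return plan, covered / total


# Hypothetical portfolio sizes (domains per registrant).
portfolio = {
    "bulk-holder-1": 50000,
    "bulk-holder-2": 30000,
    "bulk-holder-3": 10000,
    "small-1": 5,
    "small-2": 3,
    "small-3": 2,
}

plan, coverage = verification_plan(portfolio, 0.85)
print(f"verify {len(plan)} registrants to cover {coverage:.1%} of domains")
```

In this made-up example, checking two registrants' points of contact covers nearly 89% of the domains.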
IV. Private/Proxy Registration Services
IV.1 Appendix 3 of the current whois study explicitly addresses the
existence of private/proxy registration services in the study's data,
and as such raises the question of the validity of that data for the
purposes of the Registrar Accreditation Agreement.
IV.2 The boxed description of what constitutes acceptable registrant
point of contact data, as shown on PDF page 8 of the current study, is
"Under Registrar Accreditation Agreement Section 18.104.22.168, an accurate
name and postal address of the registered name holder means that there
is reasonable evidence that the registrant data consists of the
correct name and a valid postal mailing address for the current
registered name holder."
IV.3 Given that definition, I find myself puzzled by the existence of
private/proxy registration services which explicitly state that they will,
as a matter of policy, discard or reject all letters they may receive
addressed to the registrant. Examples of this sort of registrar policy
can be found by Googling for
"private whois" "DO NOT SEND LETTERS"
IV.4 How can any such registration be deemed to provide acceptable
registration point of contact data given that the private/proxy
registration service providers have declared that as a matter
of policy they will discard or reject all letters they receive for
the registrant? How can an "automatically discarding" delivery address
possibly be considered to be a "valid address" for the purpose of
the registrar accreditation agreement? Attempting to do so simply
defies common sense.
IV.5 Arguing in the alternative, at least some proxy/private registration
service providers *do* forward at least some classes of letters they
receive on behalf of private/proxy registration service customers, a
service that appears to be functionally indistinguishable from the
service of a Commercial Mail Receiving Agency ("CMRA") as described in
the U.S. Postal Service's Domestic Mail Manual (DMM) (see
http://pe.usps.com/text/dmm300/508.htm at section 1.9 et seq.).
IV.6 In the United States, CMRA businesses are subject to regulation by
the U.S.P.S. Regulations applicable to CMRAs include (among other
things):
a) requirements for the owner/manager of the CMRA to register with the
Post Office on Form 1583-A (see the DMM at 1.9.1 b. and c.),
b) requirements for each addressee customer to register on Form 1583
(see the DMM at 1.9.2. a. et seq.),
c) requirements for the content/format of the CMRA's postal delivery
addresses (see the DMM at 1.9.2. e. through h.),
d) quarterly reporting requirements (see the DMM at 1.9.3 d.), and
e) limiting the ability of a CMRA to refuse delivery of mail if the
mail is for an addressee who is a current customer or a former
customer (within the past 6 months) (see the DMM at 1.9.3 e.), etc.
IV.7 To the best of my knowledge, U.S. private/proxy domain registration
services do NOT comply with U.S.P.S. CMRA regulations (presumably because
they do not consider themselves to be CMRAs).
IV.8 If private/proxy whois services *are* determined to be acting as
CMRAs in the U.S. while *NOT* meeting U.S. Postal Service requirements,
that behavior would call into question the validity of those private/proxy
registration service postal addresses for the purpose of compliance with
Registrar Accreditation Agreement, Section 22.214.171.124.
IV.9 I did not see this issue addressed in the current study, despite the
fact that 364 cases (out of the 1419 sampled cases) are private/proxy
registrations. At a minimum, I believe private/proxy registrations in the
study sample should be reviewed to segregate potentially invalid U.S.-based
proxy/privacy address registrations (e.g., proxy/private registrations
leveraging unregistered potential CMRAs) from offshore proxy/privacy
registration services which may be subject to alternative regulation (or
no regulation whatsoever).
V. POC Email Addresses
V.1 Beyond the issues this study identified relating to postal addresses,
the email addresses employed in conjunction with domain points of contact
should also be validated since email is often far more practically
important than postal mail when it comes to resolving time sensitive
V.2 If the experience of ARIN with its whois point of contact clean up
project is any guide (see
https://www.arin.net/resources/request/whois_cleanup.html ), a comparatively
small number of domain whois point of contact email addresses may yield a
response when tested.
V.3 For example in ARIN's case, when they tested 29,929 email addresses
associated with ARIN number resources, 1,192 resulted in a response being
received, while 1,854 resulted in a bounce and the remainder may have been
ignored, silently non-delivered, etc.
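Expressed as rates, the ARIN figures quoted above work out as follows. This is a small sketch using only the three counts given in V.3; the variable names are mine.

```python
# Counts as quoted in V.3 for ARIN's point of contact clean-up project.
TESTED = 29929
RESPONDED = 1192
BOUNCED = 1854

# Remainder: ignored, silently non-delivered, etc.
no_result = TESTED - RESPONDED - BOUNCED

response_rate = RESPONDED / TESTED
bounce_rate = BOUNCED / TESTED
no_result_rate = no_result / TESTED

print(f"responded: {response_rate:.1%}")
print(f"bounced:   {bounce_rate:.1%}")
print(f"no result: {no_result_rate:.1%} ({no_result} addresses)")
```

In other words, only about 4% of the tested addresses produced a response, and roughly nine in ten produced neither a response nor a bounce.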
V.4 I would urge ICANN to undertake a comparable effort to quantify the
accuracy and usability of gTLD domain whois email POC data in addition
to quantifying the accuracy and usability of postal POC data.
VI.1 Thank you for the opportunity to comment on the "Draft Report for the
Study of the Accuracy of WHOIS Registrant Contact Information," and I
hope that ICANN and NORC will address the issues I've identified. I would
be pleased to visit with ICANN staff if they have any questions about any
of the points I've raised in this response.
Joe St Sauver
Disclaimer: all comments expressed are solely the author's and do not
necessarily represent the opinion of any other entity or organization.