ICANN Email List Archives

[whois-accuracy-study]



Comments on whois-accuracy-study-17jan10-en.pdf

  • To: whois-accuracy-study@xxxxxxxxx
  • Subject: Comments on whois-accuracy-study-17jan10-en.pdf
  • From: "Joe St Sauver" <joe@xxxxxxxxxxxxxxxxxx>
  • Date: Mon, 5 Apr 2010 17:19:54 -0700 (PDT)

Thank you for the opportunity to comment on the "Draft Report for the
Study of the Accuracy of WHOIS Registrant Contact Information,"
http://www.icann.org/en/compliance/reports/whois-accuracy-study-17jan10-en.pdf

I. Introduction

I.1 Whois data accuracy is a critical issue for the Internet as a whole
because the ability to register domain names with incomplete or inaccurate 
whois data facilitates spam, scams, phishing, distribution of malware and 
numerous other forms of Internet abuse.

I.2 Ensuring that domain names are registered with accurate data, and 
ensuring that the data remains current, is of pivotal importance to 
efforts to counter Internet abuse. It is thus encouraging to see ICANN 
strive to understand the magnitude and characteristics of whois data 
issues, including incomplete and inaccurate data.

I.3 Unfortunately the current whois study has critical deficiencies which 
obscure both the true extent of the problem we collectively face and the 
steps that are needed to correct that problem.

II. Methodology

II.1 The NORC sampling methodology described in Appendix 1 ("Sample Design 
in Detail") describes an equal probability sample clustered by country of
the registrant.

II.2 NORC's choice to cluster by country is regrettable in that it fails to 
fully capture the underlying problem that abuse analysts and researchers 
routinely see with the gTLD domain whois data: whois inaccuracy is a 
phenomenon which varies dramatically *by registrar*.

II.3 There are some registrars (including some which are large and 
others which are small in size) which do a very professional job of 
validating and vetting customer identities and their customers' whois 
point of contact (POC) information. It is rare to see any domains which 
have been registered via those registrars appearing in abusive contexts,
and if abusers do slip by, they are quickly identified and quashed.

II.4 There are other registrars, however, who apparently fail to do even 
a cursory job of validating and vetting their customer registration 
information. Not surprisingly, those registrars are disproportionately
encountered when investigating incidents of Internet abuse, apparently
because those registrars serve as "registrars of choice" for cyber 
miscreants. Moreover, when complaints are made to those registrars about
abusive customers with bogus or incomplete data, those reports often
appear to fall on deaf ears. 

II.5 Had the NORC whois study's sampling design employed a weighted 
stratified sample (numerically "over sampling" small registrar data 
and numerically "under sampling" large registrar data) it would have 
been possible to use the resulting data *both* to accurately depict
critical inter-registrar differences *and* to produce an 
accurate overall estimate for the domain name population as a whole 
(that is, an estimate for the population as a whole could have been 
obtained by proportionately weighting the estimates obtained for each 
of the registrar strata relative to their overall market share). 
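
As a rough illustration of the weighting described in II.5, the sketch below
combines per-registrar accuracy rates into a single population estimate by
weighting each stratum by its market share. The function name, the three
strata, and all of the counts are hypothetical and are not taken from the
study.

```python
def population_estimate(strata):
    """Combine per-stratum accuracy rates into one overall estimate.

    strata: list of (market_share, accurate_count, sampled_count) tuples,
    one per registrar stratum. Shares are renormalized so they sum to 1.
    """
    total_share = sum(share for share, _, _ in strata)
    estimate = 0.0
    for share, accurate, sampled in strata:
        stratum_rate = accurate / sampled          # accuracy within the stratum
        estimate += (share / total_share) * stratum_rate
    return estimate


# Hypothetical strata: every registrar gets the same sample size (100),
# regardless of market share, so small registrars are "over sampled".
example_strata = [
    (0.60, 90, 100),   # large registrar, 90% accurate in sample
    (0.30, 70, 100),   # mid-sized registrar, 70% accurate in sample
    (0.10, 20, 100),   # small registrar, 20% accurate in sample
]

overall = population_estimate(example_strata)
print(f"Overall weighted accuracy estimate: {overall:.3f}")
```

Each stratum still yields its own usable estimate (here 90%, 70% and 20%),
while the weighted sum gives the figure for the population as a whole.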

II.6 Why is it important to use a weighted sample, stratified by registrar, 
rather than the alternative sampling plan actually employed?

II.7 While there are over 850 ICANN-accredited registrars, the domain 
registration ecosystem is dominated by literally just a handful of large 
registrars. In fact, the top *five* registrars account for over half of 
all domain registrations. For example,
http://www.webhosting.info/registrars/top-registrars/global/ quotes the 
values:

   Rank   Registrar       Market Share
   1      Go Daddy             30.002%
   2      Enom                  8.299%
   3      Tucows                6.749%
   4      Network Solutions     5.835%
   5      Schlund+Partner       4.317%  ==> 55.202% total

II.8 Because those five registrars are as large and influential as they 
are, we obviously need to understand the accuracy of their customers' 
registrations -- these are key players in the domain registration 
ecosystem, no argument there. 

II.9 However, if we imprudently devote a full 55.202% of our total sample 
to *just* those five registrars, we are likely to draw a sample that's far
*larger than needed* to assess the accuracy or inaccuracy associated with 
each of those five registrars. Specifically, if we were to draw 2400 
domains, we might expect to see:

   Go Daddy             30.002% * 2400 ==> ~720 Go Daddy-registered domains
   Enom                  8.299% * 2400 ==> ~199 Enom-registered domains
   Tucows                6.749% * 2400 ==> ~162 Tucows-registered domains
   Network Solutions     5.835% * 2400 ==> ~140 NS-registered domains
   Schlund+Partner       4.317% * 2400 ==> ~104 Schlund-registered domains
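
The arithmetic above can be reproduced with a few lines of Python. The shares
are the webhosting.info figures quoted in II.7, and 2400 is the illustrative
sample size used in II.9; the variable names are simply chosen for this
sketch.

```python
# Market shares (percent) as quoted in II.7.
TOP_FIVE_SHARES = {
    "Go Daddy": 30.002,
    "Enom": 8.299,
    "Tucows": 6.749,
    "Network Solutions": 5.835,
    "Schlund+Partner": 4.317,
}
SAMPLE_SIZE = 2400  # illustrative sample size from II.9

# Expected number of sampled domains per registrar under a
# proportionate (equal probability) sample.
expected = {
    name: share / 100 * SAMPLE_SIZE
    for name, share in TOP_FIVE_SHARES.items()
}

for name, count in expected.items():
    print(f"{name:<20} ~{round(count)} domains")

total_top_five = sum(expected.values())
print(f"Top five combined: ~{round(total_top_five)} of {SAMPLE_SIZE} domains")
```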

II.10 Given that number of observations, we would likely have exceedingly 
tight bounds on estimated characteristics for domains from those five 
registrars, bounds far tighter than are needed for operationally meaningful 
purposes.

II.11 But what of all the other accredited registrars, except for those top 
five? By the time we get down to the 75th most popular registrar out of our 
population of over 850 registrars, the 75th most popular registrar's market 
share is less than 1/10th of 1%. If we were drawing a proportionate sample, 
we might expect to see (on average) only 2.4 domains from each such registrar:

                         0.1% * 2400 ==> 2.4 domains

II.12 Because we can't take "four tenths" of a domain, many times we'd see
two or three domains in our sample from such a registrar (although obviously 
in some cases we might see only one (or none) for a particular registrar,
and other times we might see four or more, it's just the "luck of the draw"). 
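
The "luck of the draw" can be made concrete with a binomial model. The sketch
below assumes, purely for illustration, that each of the 2400 sampled domains
is drawn independently and belongs to a given registrar with probability equal
to that registrar's market share (0.1% here); the study's actual clustered
design would behave somewhat differently.

```python
from math import comb


def prob_k_domains(k, sample_size, share):
    """Binomial probability of drawing exactly k domains from a registrar
    whose market share is `share`, in a simple random sample."""
    return comb(sample_size, k) * share**k * (1 - share) ** (sample_size - k)


SAMPLE_SIZE = 2400   # illustrative sample size from II.9
SHARE = 0.001        # 1/10th of 1%, as in II.11

# Probabilities of seeing exactly 0, 1, 2 or 3 domains from the registrar.
probs = [prob_k_domains(k, SAMPLE_SIZE, SHARE) for k in range(4)]
prob_four_or_more = 1 - sum(probs)

for k, p in enumerate(probs):
    print(f"P(exactly {k} domains) = {p:.3f}")
print(f"P(4 or more domains)  = {prob_four_or_more:.3f}")
```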

II.13 If we have a sample for a registrar that consists of only two or three 
domains, there are only three or four computationally possible outcomes 
for each such registrar:

Sample size -->            2                         3

Possible outcomes -->      100% accurate             100% accurate
                           50% accurate              67% accurate
                           0% accurate               33% accurate
                                                     0% accurate
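
The same enumeration can be written out for any sample size: with n sampled
domains, the estimated accuracy can only take one of n + 1 values. This is a
minimal sketch; the function name is chosen for illustration only.

```python
def possible_accuracy_estimates(sample_size):
    """Return every accuracy rate that a sample of this size can produce,
    from all domains accurate down to none accurate."""
    return [accurate / sample_size for accurate in range(sample_size, -1, -1)]


for n in (2, 3):
    rates = possible_accuracy_estimates(n)
    formatted = ", ".join(f"{rate:.0%}" for rate in rates)
    print(f"Sample size {n}: {len(rates)} possible outcomes ({formatted})")
```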

II.14 Intuitively even a non-statistician can understand that such 
estimates are unlikely to be very helpful when it comes to understanding 
the extent to which there is (or isn't) an accuracy problem *at that
particular registrar* -- you simply need more data (e.g., a larger 
sample for that registrar) if you want to be able to draw any meaningful 
inferences about the level of whois inaccuracy for that particular 
registrar.

II.15 The problem becomes even more acute when you recognize that by the
time you drop down to the 210th registrar by market share, that 210th
registrar's domain name market share amounts to less than 0.01% (i.e., 
less than 1/100th of 1%). At that point, on average, we'd expect to see 
only a quarter of one domain per registrar (in practical terms, we'd 
probably see only one domain (at most), and often no domains at all 
from such registrars). Is that "okay?" If we care about registrar-by-
registrar differences in whois data accuracy, no. Even "small" registrars
can represent a material source of malicious domains with bad whois data.

II.16 That is, even a registrar that has only 1/100th of 1% of all domains 
still represents over 10,000 registered domains. That's a LOT of domains if
they're all being actively abused for malicious purposes and they all have
bogus or incomplete whois POC data. Even small registrars with 1/100th of 
1% of all domains may thus still play a critical role when it comes to the 
safety, security and stability of the Internet, yet with the current 
sampling design, there's virtually no chance that we will see/detect even 
blatant hypothetical levels of abuse or neglect by some small registrars.
The design of the current study simply a priori precludes it.

II.17 Some registrars also employ agents, or registration service providers,
who semi-independently register domains through an upstream registrar. 
There are often significant differences within a given registrar between
domains registered by one registration service provider and the domains
registered by another registration service provider, even though both
registration service providers employ the service of the same upstream
registrar. At a minimum, the study should identify registrars which
use the registration service provider model, and attempt to quantify the
number of registration service providers each such registrar may support.
Ideally, large registrars employing the registration service provider
model should have individual registration service providers independently
assessed on a registration service provider by registration service provider
basis.

II.18 Related to the registration service provider model, some registration
service providers are, themselves, effectively cloaked, offering at most
a web site or a phone number as a point of contact. This lack of ground
truth identity makes it difficult or impossible for the upstream registrar 
(or any other party, including ICANN itself) to hold the registration 
service provider accountable for the potential inaccuracy of domains they 
may register. At most, the registration service provider may lose the 
ability to register domains, but there is virtually nothing to prevent
them from morphing their virtually anonymous online identity to reappear
as a "new" registration service provider. ICANN needs to better understand
the role of registration service providers in addressing whois inaccuracy
and domain abuse. 

III. Improving Whois Accuracy

III.1 The real key to identifying and *improving* whois accuracy is letting 
the data tell its story. In the current study, ICANN and NORC blindly probed 
the population of all domains in an effort to identify those which might 
have bad whois data. In reality, ICANN already *knows* about many domains 
which likely have bad whois data, and they can readily obtain additional
information about many more domains which likely have bad whois data.

III.2 Let's begin with what ICANN itself knows. ICANN routinely receives 
reports about alleged whois inaccuracy via the Whois Data Problem Reporting 
System (WDPRS), see http://wdprs.internic.net/  In addition to explicitly 
flagging alleged inaccuracies, this data collection channel serves a 
potentially important "finger pointing" role: if we consider the set 
of all domains reported via WDPRS, are there registrars which are 
disproportionately over represented in that data? For example, are 
there registrars with 1% of the overall domain market, but 10% or 
even 15% of all WDPRS complaints filed? If so, I believe domains 
registered via those registrars should be subject to a targeted review 
for whois accuracy problems.
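
One way to express the "finger pointing" test in III.2 is as a ratio of each
registrar's share of WDPRS complaints to its share of the domain market. The
sketch below is only an illustration: the registrar names, the market shares,
the complaint counts, and the flagging threshold of 5 are all hypothetical,
and real WDPRS data would need to be supplied by ICANN.

```python
def overrepresentation(market_share, complaint_counts):
    """Return complaint share divided by market share for each registrar.

    A ratio near 1 means complaints are roughly proportional to size;
    a ratio well above 1 means the registrar is over represented.
    """
    total_complaints = sum(complaint_counts.values())
    ratios = {}
    for registrar, share in market_share.items():
        complaint_share = complaint_counts.get(registrar, 0) / total_complaints
        ratios[registrar] = complaint_share / share
    return ratios


# Hypothetical inputs (not real data).
market_share = {"Registrar A": 0.30, "Registrar B": 0.01, "Registrar C": 0.69}
complaint_counts = {"Registrar A": 200, "Registrar B": 100, "Registrar C": 700}

ratios = overrepresentation(market_share, complaint_counts)

THRESHOLD = 5.0  # hypothetical cut-off for a targeted review
flagged = [name for name, ratio in ratios.items() if ratio >= THRESHOLD]

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x its market share")
print("Candidates for targeted review:", flagged)
```

In this made-up data, Registrar B holds 1% of the market but 10% of the
complaints, which is the pattern described above.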

III.3 Similarly, WDPRS also collects information via a "followup 
survey" for each domain reported via WDPRS. Which registrars clean up 
(or "HOLD") the problem domains that are reported to them? Which 
registrars appear to ignore WDPRS reports? Again, the source of the 
whois inaccuracy problems should quickly become clear.

III.4 If ICANN is willing to reach out to the community for data, a 
variety of "domain name block lists" enumerate domains which have 
been seen used in the body of spam or in other malicious contexts. 
Those domains routinely have bogus whois point of contact information, 
with missing or misleading information provided in an effort to hinder 
investigation, prosecution and civil litigation. 

III.5 Examples of these domain name block lists include the SURBL 
(see http://www.surbl.org/ ), the URIBL (see http://www.uribl.com/ )
and the Spamhaus DBL (see http://www.spamhaus.org/dbl/ ). If copies
of those lists are obtained by ICANN and reviewed, what registrars 
are disproportionately represented? If the whois data for those
domains is checked, is it correct?

III.6 Using these approaches -- both in-house and community data --
I believe ICANN can quickly identify the problematic registrars most 
favored by online miscreants, and once that's been done, ICANN could 
then focus its efforts where whois-related problems are most acute.

III.7 Yet another approach that ICANN could employ would be to 
focus on so-called "bullet proof domain name registration" providers. 
These registrars and/or registration service providers are quite candid 
in advertising their willingness to allow registrants to register 
domains while providing arbitrary/inaccurate point of contact data, or 
no point of contact data whatsoever. Providers which appear to offer 
this service can be readily found using common search engines. Once 
such providers are identified by ICANN, ICANN itself can attempt to 
register test domains with bogus or missing data, then map the 
resulting registrations to the registrars servicing that market, 
registrars which are intentionally tolerating inaccurate or missing 
POC data in exchange for substantially above-market registration
fees.

III.8 Even if ICANN didn't want to pursue any of these alternative
analyses, there are still simple but important steps that ICANN could 
take to improve the quality of whois data today.

III.9 For example, automated address verification products exist for 
postal addresses in many countries today, including for postal 
addresses in the United States. This is acknowledged in the current 
whois study: the study's authors used the Smartmailer software to check 
whois addresses against USPS records of deliverable addresses (see the 
current whois study at PDF page 9). Why is Smartmailer (or a similar
product) used only in extraordinary cases, such as during the conduct 
of this whois study? Why isn't this address verification technology 
in routine production use for ALL domain registrations at the time 
those domains are first registered -- and at any point when they are 
subsequently modified?

III.10 I recognize and acknowledge that address verification data may not
be available for all countries, however the current study indicates that
US addresses account for 59% of all domains (the current whois study 
at PDF page 24). Even if mechanical address validation can "only" be done
for 59% of domains, e.g., only for domains registered to U.S. parties, 
that is still nothing to be scoffed at, and it is a concrete step that 
ICANN can and should undertake immediately even if it takes no other 
step to improve whois accuracy. (Obviously automatic validation for 
addresses in other countries should be added to the extent that it is
also readily available.)

III.11 Another idea for ICANN's consideration: ICANN should consider 
shifting its sampling frame from domain *names* to domain *registrants.* 
A relatively small number of domain *registrants* register a 
disproportionately large number of domain *names* for a variety of 
purposes (including domaineering/speculative purposes). Identify those 
highest volume registrants, carefully validate their points of contact, 
and then, having done so, exclude all the domains associated with that 
comparatively small number of registrants now that you've successfully 
verified that their whois data is accurate.

III.12 Having done so, you can then continue to work your way down an 
ordered list of the highest volume registrants until you're left with 
only registrants that have a comparative handful of registrations per 
registrant name/email address. By focussing on the registrants who have 
the largest number of registered domain names first, you can maximize 
the number of domains whose addresses have been verified while 
minimizing the number of registrant POCs which need to be scrutinized.
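
The "work down the ordered list" approach in III.11 and III.12 can be sketched
as a simple cumulative count: sort registrants by the number of domains they
hold, then count how many must be verified before a target fraction of all
domains is covered. The registrant identifiers and domain counts below are
hypothetical.

```python
def registrants_needed(domains_per_registrant, target_fraction):
    """Return how many of the highest-volume registrants must be verified
    to cover at least `target_fraction` of all registered domains."""
    counts = sorted(domains_per_registrant.values(), reverse=True)
    total = sum(counts)
    covered = 0
    for checked, count in enumerate(counts, start=1):
        covered += count
        if covered >= target_fraction * total:
            return checked
    return len(counts)


# Hypothetical portfolio: three high-volume registrants plus
# one hundred small registrants holding ten domains each.
portfolio = {"bulk-1": 5000, "bulk-2": 3000, "bulk-3": 1000}
portfolio.update({f"small-{i}": 10 for i in range(100)})

needed_for_half = registrants_needed(portfolio, 0.5)
needed_for_three_quarters = registrants_needed(portfolio, 0.75)

print(f"Registrants to verify for 50% of domains: {needed_for_half}")
print(f"Registrants to verify for 75% of domains: {needed_for_three_quarters}")
```

In this made-up portfolio, verifying two registrants out of 103 covers three
quarters of the domains, which is the kind of leverage the approach relies on.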

IV. Private/Proxy Registration Services 

IV.1 Appendix 3 of the current whois study explicitly addresses the 
existence of private/proxy registration services in the study's data,
and as such raises the question of the validity of that data for the
purposes of the Registrar Accreditation Agreement. 

IV.2 The boxed description of what constitutes acceptable registrant 
point of contact data, as shown on PDF page 8 of the current study, is
listed as:

   "Under Registrar Accreditation Agreement Section 3.3.1.6, an accurate
   name and postal address of the registered name holder means that there
   is reasonable evidence that the registrant data consists of the
   correct name and a valid postal mailing address for the current
   registered name holder."

IV.3 Given that definition, I find myself puzzled by the existence of 
private/proxy registration services which explicitly state that they will,
as a matter of policy, discard or reject all letters they may receive 
addressed to the registrant. Examples of these sorts of registrar policies 
can be found by Googling for

   "private whois" "DO NOT SEND LETTERS"

IV.4 How can any such registration be deemed to provide acceptable 
registration point of contact data given that the private/proxy 
registration service providers have declared that as a matter 
of policy they will discard or reject all letters they receive for 
the registrant? How can an "automatically discarding" delivery address
possibly be considered to be a "valid address" for the purpose of
the registrar accreditation agreement? Attempting to do so simply
defies common sense.

IV.5 Arguing in the alternative, at least some proxy/private registration 
service providers *do* forward at least some classes of letters they 
receive on behalf of private/proxy registration service customers, a 
service that appears to be functionally indistinguishable from the 
service of a Commercial Mail Receiving Agency ("CMRA") as described in 
the U.S. Postal Service's Domestic Mail Manual (DMM) (see 
http://pe.usps.com/text/dmm300/508.htm at section 1.9 et seq.). 

IV.6 In the United States, CMRA businesses are subject to regulation by 
the U.S.P.S.  Regulations applicable to CMRAs include (among other 
things):

a) requirements for the owner/manager of the CMRA to register with the 
   Post Office on Form 1583-A (see the DMM at 1.9.1 b. and c.),

b) requirements for each addressee customer to register on Form 1583
   (see the DMM at 1.9.2. a. et seq.),

c) requirements for the content/format of the CMRA's postal delivery 
   addresses (see the DMM at 1.9.2. e. through h.),

d) quarterly reporting requirements (see the DMM at 1.9.3 d.), and 

e) limiting the ability of a CMRA to refuse delivery of mail if the
   mail is for an addressee who is a current customer or a former 
   customer (within the past 6 months) (see the DMM at 1.9.3 e.), etc.

IV.7 To the best of my knowledge, U.S. private/proxy domain registration
services do NOT comply with U.S.P.S. CMRA regulations (presumably because
they do not consider themselves to be CMRAs).

IV.8 If private/proxy whois services *are* determined to be acting as 
CMRAs in the U.S. while *NOT* meeting U.S. Postal Service requirements, 
that behavior would call into question the validity of those private/proxy 
registration service postal addresses for the purpose of compliance with 
the Registrar Accreditation Agreement, Section 3.3.1.6. 

IV.9 I did not see this issue addressed in the current study, despite the
fact that 364 cases (out of the 1419 sampled cases) are private/proxy
registrations. At a minimum, I believe private/proxy registrations in the
study sample should be reviewed to segregate potentially invalid U.S.-based
proxy/privacy address registrations (e.g., proxy/private registrations
leveraging unregistered potential CMRAs) from offshore proxy/privacy 
registration services which may be subject to alternative regulation (or 
no regulation whatsoever).

V. POC Email Addresses

V.1 Beyond the issues this study identified relating to postal addresses,
the email addresses employed in conjunction with domain points of contact
should also be validated since email is often far more practically
important than postal mail when it comes to resolving time sensitive
issues.

V.2 If the experience of ARIN with its whois point of contact clean up
project is any guide (see 
https://www.arin.net/resources/request/whois_cleanup.html ), only a comparatively
small fraction of domain whois point of contact email addresses may yield a 
response when tested. 

V.3 For example in ARIN's case, when they tested 29,929 email addresses 
associated with ARIN number resources, 1,192 resulted in a response being 
received, while 1,854 resulted in a bounce and the remainder may have been 
ignored, silently non-delivered, etc. 
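
Expressed as rates, the ARIN figures quoted in V.3 work out as follows. The
counts are the ones given above; the "no response" category is simply the
remainder, which may include ignored and silently non-delivered messages.

```python
# Counts quoted in V.3 from ARIN's whois POC clean-up project.
TESTED = 29_929
RESPONDED = 1_192
BOUNCED = 1_854

# Everything that neither responded nor bounced.
no_response = TESTED - RESPONDED - BOUNCED

response_rate = RESPONDED / TESTED
bounce_rate = BOUNCED / TESTED
no_response_rate = no_response / TESTED

print(f"Responded:   {RESPONDED:>6} ({response_rate:.1%})")
print(f"Bounced:     {BOUNCED:>6} ({bounce_rate:.1%})")
print(f"No response: {no_response:>6} ({no_response_rate:.1%})")
```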

V.4 I would urge ICANN to undertake a comparable effort to quantify the
accuracy and usability of gTLD domain whois email POC data in addition 
to quantifying the accuracy and usability of postal POC data.

VI. Summary/Conclusion

VI.1 Thank you for the opportunity to comment on the "Draft Report for the
Study of the Accuracy of WHOIS Registrant Contact Information," and I
hope that ICANN and NORC will address the issues I've identified. I would
be pleased to visit with ICANN staff if they have any questions about any
of the points I've raised in this response.

Sincerely,

Joe St Sauver

Disclaimer: all comments expressed are solely the author's and do not
necessarily represent the opinion of any other entity or organization.

