Comments on whois-accuracy-study-17jan10-en.pdf
Thank you for the opportunity to comment on the "Draft Report for the Study of the Accuracy of WHOIS Registrant Contact Information," http://www.icann.org/en/compliance/reports/whois-accuracy-study-17jan10-en.pdf

I. Introduction

I.1 Whois data accuracy is a critical issue for the Internet as a whole because the ability to register domain names with incomplete or inaccurate whois data facilitates spam, scams, phishing, distribution of malware and numerous other forms of Internet abuse.

I.2 Ensuring that domain names are registered with accurate data, and ensuring that that data remains current, is of pivotal importance to efforts to counter Internet abuse. It is thus encouraging to see ICANN strive to understand the magnitude and characteristics of whois data issues, including incomplete and inaccurate data.

I.3 Unfortunately the current whois study has critical deficiencies which obscure both the true extent of the problem we collectively face and the steps that are needed to correct that problem.

II. Methodology

II.1 The NORC sampling methodology presented in Appendix 1 ("Sample Design in Detail") is an equal probability sample clustered by country of the registrant.

II.2 NORC's choice to cluster by country is regrettable in that it fails to fully capture the underlying problem that abuse analysts and researchers routinely see with the gTLD domain whois data: whois inaccuracy is a phenomenon which varies dramatically *by registrar*.

II.3 There are some registrars (including some which are large and others which are small in size) which do a very professional job of validating and vetting customer identities and their customers' whois point of contact (POC) information. It is rare to see any domains which have been registered via those registrars appearing in abusive contexts, and if abusers do slip by, they are quickly identified and quashed.
II.4 There are other registrars, however, who apparently fail to do even a cursory job of validating and vetting their customer registration information. Not surprisingly, those registrars are disproportionately encountered when investigating incidents of Internet abuse, apparently because those registrars serve as "registrars of choice" for cyber miscreants. Moreover, when complaints are made to those registrars about abusive customers with bogus or incomplete data, those reports often appear to fall on deaf ears.

II.5 Had the NORC whois study's sampling design employed a weighted stratified sample (numerically "over sampling" small registrar data and numerically "under sampling" large registrar data), it would have been possible to use the resulting data to depict *both* the critical inter-registrar differences *and* an accurate overall estimate for the domain name population as a whole (that is, an estimate for the population as a whole could have been obtained by weighting the estimates obtained for each of the registrar strata in proportion to their overall market share).

II.6 Why is it important to use a weighted sample, stratified by registrar, rather than the alternative sampling plan actually employed?

II.7 While there are over 850 ICANN-accredited registrars, the domain registration ecosystem is dominated by literally just a handful of large registrars. In fact, the top *five* registrars account for over half of all domain registrations. For example, http://www.webhosting.info/registrars/top-registrars/global/ quotes the values:

    Rank  Registrar           Market Share
    1     Go Daddy            30.002%
    2     Enom                 8.299%
    3     Tucows               6.749%
    4     Network Solutions    5.835%
    5     Schlund+Partner      4.317%
                          ==> 55.202% total

II.8 Because those five registrars are as large and influential as they are, we obviously need to understand the accuracy of their customers' registrations -- these are key players in the domain registration ecosystem, no argument there.
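As an illustration only, the weighting described in II.5 might be sketched as follows. The three strata, their sample sizes and their inaccuracy counts are hypothetical values chosen for the example; they are not drawn from the NORC study or from any registrar's actual data.

```python
# Sketch of a stratified estimate: each stratum (a registrar or group of
# registrars) is sampled separately, then the per-stratum inaccuracy rates
# are combined using market share as the weight.
# All figures below are HYPOTHETICAL and exist only to show the arithmetic.

def stratum_rates(strata):
    """Return the observed inaccuracy rate for each stratum."""
    return {name: bad / n for name, (_share, n, bad) in strata.items()}

def stratified_estimate(strata):
    """Combine per-stratum rates, weighting by relative market share."""
    total_share = sum(share for share, _n, _bad in strata.values())
    return sum((share / total_share) * (bad / n)
               for share, n, bad in strata.values())

# name: (market share, domains sampled, domains found inaccurate)
hypothetical_strata = {
    "large registrars": (0.55, 300, 30),
    "mid-size registrars": (0.35, 300, 60),
    "small registrars": (0.10, 300, 150),
}

rates = stratum_rates(hypothetical_strata)
overall = stratified_estimate(hypothetical_strata)

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate:.1%} inaccurate")
    print(f"population estimate: {overall:.1%}")
```

Note that every stratum receives the same sample size in this sketch, so the small registrars are measured as precisely as the large ones, yet the overall figure still reflects each stratum's actual share of the market.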
II.9 However, if we imprudently devote a full 55.202% of our total sample to *just* those five registrars, we are likely to draw a sample that's far *larger than needed* to assess the accuracy or inaccuracy associated with each of those five registrars. Specifically, if we were to draw 2400 domains, we might expect to see:

    Go Daddy           30.002% * 2400 ==> ~720 Go Daddy-registered domains
    Enom                8.299% * 2400 ==> ~199 Enom-registered domains
    Tucows              6.749% * 2400 ==> ~162 Tucows-registered domains
    Network Solutions   5.835% * 2400 ==> ~140 NS-registered domains
    Schlund+Partner     4.317% * 2400 ==> ~104 Schlund-registered domains

II.10 Given that number of observations, we would likely have exceedingly tight bounds on estimated characteristics for domains from those five registrars, bounds far tighter than are needed for operationally meaningful purposes.

II.11 But what of all the other accredited registrars, except for those top five? By the time we get down to the 75th most popular registrar out of our population of over 850 registrars, that registrar's market share is less than 1/10th of 1%. If we were drawing a proportionate sample, we might expect to see (on average) no more than about 2.4 domains from each such registrar:

    0.1% * 2400 ==> 2.4 domains

II.12 Because we can't take "four tenths" of a domain, many times we'd see two or three domains in our sample from such a registrar (although obviously in some cases we might see only one (or none) for a particular registrar, and other times we might see four or more; it's just the "luck of the draw").
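The arithmetic in II.9 through II.12 can be reproduced with a short sketch. The market shares are the ones quoted in II.7. The function names are made up for this example, and the "chance of seeing no domains at all" calculation assumes independent draws from a very large population, which is a simplification of the study's actual clustered design.

```python
# Expected number of sampled domains per registrar under a proportionate
# sample of 2400 domains, using the market shares quoted in II.7.
SAMPLE_SIZE = 2400

TOP_FIVE_SHARE_PCT = {
    "Go Daddy": 30.002,
    "Enom": 8.299,
    "Tucows": 6.749,
    "Network Solutions": 5.835,
    "Schlund+Partner": 4.317,
}

def expected_count(share_pct, n=SAMPLE_SIZE):
    """Average number of sampled domains for a registrar with this share."""
    return n * share_pct / 100.0

def prob_no_domains(share_pct, n=SAMPLE_SIZE):
    """Chance the sample contains no domains from the registrar.

    ASSUMPTION: independent draws (simple random sampling from a very
    large population), not the clustered design NORC actually used.
    """
    return (1.0 - share_pct / 100.0) ** n

expected_top_five = {name: round(expected_count(pct))
                     for name, pct in TOP_FIVE_SHARE_PCT.items()}
expected_small = expected_count(0.1)   # a registrar with 0.1% of the market
p_missed_small = prob_no_domains(0.1)  # chance that registrar is missed entirely

if __name__ == "__main__":
    for name, count in expected_top_five.items():
        print(f"{name}: about {count} domains")
    print(f"0.1% registrar: {expected_small:.1f} domains on average")
    print(f"chance of drawing none from it: {p_missed_small:.1%}")
```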
II.13 If we have a sample for a registrar that consists of only two or three domains, there are only three or four computationally possible outcomes for each such registrar:

    Sample size -->        2               3
    Possible outcomes -->  100% accurate   100% accurate
                           50% accurate    66% accurate
                           0% accurate     33% accurate
                                           0% accurate

II.14 Intuitively even a non-statistician can understand that such estimates are unlikely to be very helpful when it comes to understanding the extent to which there is (or isn't) an accuracy problem *at that particular registrar* -- you simply need more data (e.g., a larger sample for that registrar) if you want to be able to draw any meaningful inferences about the level of whois inaccuracy for that particular registrar.

II.15 The problem becomes even more acute when you recognize that by the time you drop down to the 210th registrar by market share, that 210th registrar's domain name market share amounts to less than 0.01% (i.e., less than 1/100th of 1%). At that point, on average, we'd expect to see only a quarter of one domain per registrar (in practical terms, we'd probably see only one domain (at most), and often no domains at all from such registrars). Is that "okay?" If we care about registrar-by-registrar differences in whois data accuracy, no. Even "small" registrars can represent a material source of malicious domains with bad whois data.

II.16 That is, even a registrar that has only 1/100th of 1% of all domains still represents over 10,000 registered domains. That's a LOT of domains if they're all being actively abused for malicious purposes and they all have bogus or incomplete whois POC data. Even small registrars with 1/100th of 1% of all domains may thus still play a critical role when it comes to the safety, security and stability of the Internet, yet with the current sampling design, there's virtually no chance that we will see/detect even blatant hypothetical levels of abuse or neglect by some small registrars.
The design of the current study simply a priori precludes it.

II.17 Some registrars also employ agents, or registration service providers, who semi-independently register domains through an upstream registrar. There are often significant differences within a given registrar between domains registered by one registration service provider and the domains registered by another registration service provider, even though both registration service providers employ the service of the same upstream registrar. At a minimum, the study should identify registrars which use the registration service provider model, and attempt to quantify the number of registration service providers each such registrar may support. Ideally, large registrars employing the registration service provider model should have individual registration service providers independently assessed on a registration service provider by registration service provider basis.

II.18 Related to the registration service provider model, some registration service providers are, themselves, effectively cloaked, offering at most a web site or a phone number as a point of contact. This lack of ground truth identity makes it difficult or impossible for the upstream registrar (or any other party, including ICANN itself) to hold the registration service provider accountable for the potential inaccuracy of domains they may register. At most, the registration service provider may lose the ability to register domains, but there is virtually nothing to prevent them from morphing their virtually anonymous online identity to reappear as a "new" registration service provider. ICANN needs to better understand the role of registration service providers in addressing whois inaccuracy and domain abuse.

III. Improving Whois Accuracy

III.1 The real key to identifying and *improving* whois accuracy is letting the data tell its story.
In the current study, ICANN and NORC blindly probed the population of all domains in an effort to identify those which might have bad whois data. In reality, ICANN already *knows* about many domains which likely have bad whois data, and it can readily obtain additional information about many more domains which likely have bad whois data.

III.2 Let's begin with what ICANN itself knows. ICANN routinely receives reports about alleged whois inaccuracy via the Whois Data Problem Reporting System (WDPRS; see http://wdprs.internic.net/ ). In addition to explicitly flagging alleged inaccuracies, this data collection channel serves a potentially important "finger pointing" role: if we consider the set of all domains reported via WDPRS, are there registrars which are disproportionately overrepresented in that data? For example, are there registrars with 1% of the overall domain market, but 10% or even 15% of all WDPRS complaints filed? If so, I believe domains registered via those registrars should be subject to a targeted review for whois accuracy problems.

III.3 Similarly, WDPRS also collects information via a "followup survey" for each domain reported via WDPRS. Which registrars clean up (or "HOLD") the problem domains that are reported to them? Which registrars appear to ignore WDPRS reports? Again, the source of the whois inaccuracy problems should quickly become clear.

III.4 If ICANN is willing to reach out to the community for data, a variety of "domain name block lists" enumerate domains which have been seen in the body of spam or in other malicious contexts. Those domains routinely have bogus whois point of contact information, with missing or misleading information provided in an effort to hinder investigation, prosecution and civil litigation.

III.5 Examples of these domain name block lists include the SURBL (see http://www.surbl.org/ ), the URIBL (see http://www.uribl.com/ ) and the Spamhaus DBL (see http://www.spamhaus.org/dbl/ ).
If copies of those lists are obtained by ICANN and reviewed, what registrars are disproportionately represented? If the whois data for those domains is checked, is it correct?

III.6 Using these approaches -- both in-house and community data -- I believe ICANN can quickly identify the problematic registrars most favored by online miscreants, and once that's been done, ICANN could then focus its efforts where whois-related problems are most acute.

III.7 Yet another approach that ICANN could employ would be to focus on so-called "bullet proof domain name registration" providers. These registrars and/or registration service providers are quite candid in advertising their willingness to allow registrants to register domains while providing arbitrary/inaccurate point of contact data, or no point of contact data whatsoever. Providers which appear to offer this service can be readily found using common search engines. Once such providers are identified by ICANN, ICANN itself can attempt to register test domains with bogus or missing data, then map the resulting registrations to the registrars servicing that market, registrars which are intentionally tolerating inaccurate or missing POC data in exchange for substantially above-market registration fees.

III.8 Even if ICANN didn't want to pursue any of these alternative analyses, there are still simple but important steps that ICANN could take to improve the quality of whois data today.

III.9 For example, automated address verification products exist for postal addresses in many countries today, including for postal addresses in the United States. This is acknowledged in the current whois study: the study's authors used the Smartmailer software to check whois addresses against USPS records of deliverable addresses (see the current whois study at PDF page 9). Why is Smartmailer (or a similar product) used only in extraordinary cases, such as during the conduct of this whois study?
Why isn't this address verification technology in routine production use for ALL domain registrations at the time those domains are first registered -- and at any point when they are subsequently modified?

III.10 I recognize and acknowledge that address verification data may not be available for all countries; however, the current study indicates that US addresses account for 59% of all domains (the current whois study at PDF page 24). Even if mechanical address validation can "only" be done for 59% of domains, i.e., only for domains registered to U.S. parties, that is still nothing to be scoffed at, and it is a concrete step that ICANN can and should undertake immediately even if it takes no other step to improve whois accuracy. (Obviously automatic validation for addresses in other countries should be added to the extent that it is also readily available.)

III.11 Another idea for ICANN's consideration: ICANN should consider shifting its sampling frame from domain *names* to domain *registrants.* A relatively small number of domain *registrants* register a disproportionately large number of domain *names* for a variety of purposes (including domaineering/speculative purposes). Identify those highest volume registrants, carefully validate their points of contact, and then, having done so, exclude from further review all the domains associated with that comparatively small number of registrants now that you've successfully verified that their whois data is accurate.

III.12 Having done so, you can then continue to work your way down an ordered list of the highest volume registrants until you're left with only registrants that have a comparative handful of registrations per registrant name/email address. By focussing on the registrants who have the largest number of registered domain names first, you can maximize the number of domains whose addresses have been verified while minimizing the number of registrant POCs which need to be scrutinized.

IV. Private/Proxy Registration Services

IV.1 Appendix 3 of the current whois study explicitly addresses the existence of private/proxy registration services in the study's data, and as such raises the question of the validity of that data for the purposes of the Registrar Accreditation Agreement.

IV.2 The boxed description of what constitutes acceptable registrant point of contact data, as shown on PDF page 8 of the current study, is listed as:

"Under Registrar Accreditation Agreement Section 3.7.7.1, an accurate name and postal address of the registered name holder means that there is reasonable evidence that the registrant data consists of the correct name and a valid postal mailing address for the current registered name holder."

IV.3 Given that definition, I find myself puzzled by the existence of private/proxy registration services which explicitly state that they will, as a matter of policy, discard or reject all letters they may receive addressed to the registrant. Examples of this sort of policy can be found by Googling for "private whois" "DO NOT SEND LETTERS".

IV.4 How can any such registration be deemed to have acceptable registrant point of contact data given that the private/proxy registration service providers have declared that as a matter of policy they will discard or reject all letters they receive for the registrant? How can an "automatically discarding" delivery address possibly be considered to be a "valid address" for the purpose of the Registrar Accreditation Agreement? Attempting to do so simply defies common sense.

IV.5 Arguing in the alternative, at least some proxy/private registration service providers *do* forward at least some classes of letters they receive on behalf of private/proxy registration service customers, a service that appears to be functionally indistinguishable from the service of a Commercial Mail Receiving Agency ("CMRA") as described in the U.S.
Postal Service's Domestic Mail Manual (DMM) (see http://pe.usps.com/text/dmm300/508.htm at section 1.9 et seq.).

IV.6 In the United States, CMRA businesses are subject to regulation by the U.S.P.S. Regulations applicable to CMRAs include (among other things):

a) requirements for the owner/manager of the CMRA to register with the Post Office on Form 1583-A (see the DMM at 1.9.1 b. and c.),
b) requirements for each addressee customer to register on Form 1583 (see the DMM at 1.9.2 a. et seq.),
c) requirements for the content/format of the CMRA's postal delivery addresses (see the DMM at 1.9.2 e. through h.),
d) quarterly reporting requirements (see the DMM at 1.9.3 d.), and
e) limits on the ability of a CMRA to refuse delivery of mail if the mail is for an addressee who is a current customer or a former customer (within the past 6 months) (see the DMM at 1.9.3 e.), etc.

IV.7 To the best of my knowledge, U.S. private/proxy domain registration services do NOT comply with U.S.P.S. CMRA regulations (presumably because they do not consider themselves to be CMRAs).

IV.8 If private/proxy whois services *are* determined to be acting as CMRAs in the U.S. while *NOT* meeting U.S. Postal Service requirements, that behavior would call into question the validity of those private/proxy registration service postal addresses for the purpose of compliance with Registrar Accreditation Agreement Section 3.7.7.1.

IV.9 I did not see this issue addressed in the current study, despite the fact that 364 cases (out of the 1419 sampled cases) are private/proxy registrations. At a minimum, I believe private/proxy registrations in the study sample should be reviewed to segregate potentially invalid U.S.-based proxy/privacy address registrations (e.g., proxy/private registrations leveraging unregistered potential CMRAs) from offshore proxy/privacy registration services which may be subject to alternative regulation (or no regulation whatsoever).

V. POC Email Addresses

V.1 Beyond the issues this study identified relating to postal addresses, the email addresses employed in conjunction with domain points of contact should also be validated, since email is often far more practically important than postal mail when it comes to resolving time sensitive issues.

V.2 If the experience of ARIN with its whois point of contact clean up project is any guide (see https://www.arin.net/resources/request/whois_cleanup.html ), only a comparatively small number of domain whois point of contact email addresses may yield a response when tested.

V.3 For example, in ARIN's case, when they tested 29,929 email addresses associated with ARIN number resources, 1,192 resulted in a response being received, while 1,854 resulted in a bounce and the remainder may have been ignored, silently non-delivered, etc.

V.4 I would urge ICANN to undertake a comparable effort to quantify the accuracy and usability of gTLD domain whois email POC data in addition to quantifying the accuracy and usability of postal POC data.

VI. Summary/Conclusion

VI.1 Thank you for the opportunity to comment on the "Draft Report for the Study of the Accuracy of WHOIS Registrant Contact Information," and I hope that ICANN and NORC will address the issues I've identified. I would be pleased to visit with ICANN staff if they have any questions about any of the points I've raised in this response.

Sincerely,

Joe St Sauver

Disclaimer: all comments expressed are solely the author's and do not necessarily represent the opinion of any other entity or organization.