Some thoughts on root key roll-over
Although I'm not a big player in DNSSEC, I do follow it with interest (and use it for my tiny domains). However, I do have experience with deploying and testing large systems. I hope these thoughts will be useful.
I agree with the 4 comments posted so far, and strongly support the notion advanced by Steve Crocker that (a) we're (**) late doing this and (b) it needs to be done multiple times.
However, it's unlikely to work the first time - and if it does, there will be fear that it won't. And we're making progress getting DNSSEC adopted. If we disrupt what's working, we'll discourage and/or delay adoption, and perhaps even see some regression in its acceptance. Further, this concern is likely to result in less testing than is required for complete test coverage of the process and environment.
It occurs to me that an intermediate step might increase both confidence and test coverage. Suppose a few new servers (or IP addresses on existing servers) were allocated as mirrors of the root zone. These would receive all updates to zone data in (near) real-time, but the root key roll-over process would be tested on these mirrors.
These server addresses could be published to enable testing of the procedures and the resolver software in an orderly fashion. This would allow the key stakeholders (and the adventurous public) to *opt-in* to the testing phase, and allow testing of the process and resolvers to be more extensive than might otherwise happen. For example, the plan might take risks with edge cases or intentional mis-configurations that wouldn't be acceptable for the production root. And we should definitely test algorithm introduction and roll-over - not just key updates.
With the results of this testing published, and any deficiencies corrected, testing in the production root should be a non-event.
This isn't a trivial undertaking, but seems to provide significant advantages over simply attempting to update the existing key on the production root.
Some things to think about in such a plan (in no particular order):
o Capacity. Perhaps these servers should serve registered IP addresses only (e.g. those who sign up on a testing website) to contain costs/deal with vandals(*). Aside from vandals, a resolver will be configured either with the production root servers or the test servers, so net demand should be constant. However, to ensure adequate test coverage, this needs to be an easily permeable barrier.
o Configuration. Need to ensure that the mix of anycast, geography and latency exposes the same timing issues as the production root. Ensure testing plans include what happens when one or more servers are down at critical times, and come back at inopportune ones.
o Phase-in and phase-out plans - once testing is complete, the plan for turning off the test servers needs to accommodate the change control schedules of testers. As processes are validated and corrected, how do they get rolled into the production root? I suggest incrementally, not a big bang.
o Make it easy to opt-in - provide root server files that are 'drop-in' to the known resolver software; perhaps provide implementation scripts/configuration stanzas for the known resolvers.
o Make it easy to provide feedback. And acknowledge/follow-up feedback. Nothing turns off feedback faster than that "I'm wasting my time shouting into this barrel" feeling...
o Plan for gathering data - not just about what the root sees, but about distribution/implementation of any resolver or registrar or other tools that are tested, and those that are discovered to need updates. Automate where possible.
o Plan for the support resources necessary to quickly diagnose and roll back any changes that cause problems. Expect that some testers will - or we hope they will - do non-trivial testing. So this can't be a toy environment with non-deterministic support. It needs to be run as if it were the production root in most respects.
o Plan for communication - both passive (e.g. a website/wiki) and push (e.g. an e-mail list). Announce tests and results so those using the alternate root can plan, and feel engaged.
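To make the 'drop-in' root server file idea above a little more concrete, here is a minimal sketch of generating a hints file for the test servers in the same zone-file format as the published root hints. The server names and addresses are hypothetical placeholders (taken from the documentation address ranges), and render_hints is a name I made up for illustration. A validating resolver would also need the test root's trust anchor, which this sketch does not cover.

```python
import ipaddress

# Hypothetical test-root servers: names and addresses are placeholders drawn
# from the documentation ranges (192.0.2.0/24, 2001:db8::/32), not real hosts.
TEST_ROOT_SERVERS = {
    "a.test-root.example.": ["192.0.2.1", "2001:db8::1"],
    "b.test-root.example.": ["192.0.2.2"],
}

HINT_TTL = 3600000  # same TTL the published root hints file uses


def render_hints(servers, ttl=HINT_TTL):
    """Return root hints text (zone-file format) for the given servers."""
    lines = ["; Root hints for the TEST root - do not use in production"]
    for name in sorted(servers):
        # One NS record at the root for each test server.
        lines.append(f".\t{ttl}\tNS\t{name}")
        for addr in servers[name]:
            ip = ipaddress.ip_address(addr)  # raises ValueError on a typo
            rtype = "A" if ip.version == 4 else "AAAA"
            lines.append(f"{name}\t{ttl}\t{rtype}\t{ip}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_hints(TEST_ROOT_SERVERS), end="")
```

The output could be saved wherever a given resolver expects its root hints, so that opting in (and later opting out) is a one-file change.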
I'm sure there are other considerations, but those more involved in the implementation will doubtless point them out.
This note doesn't pretend to be a complete plan, but rather is intended to stimulate thought.
(*) vandal - those who think DDoS attacks on infrastructure are an acceptable form of entertainment.
(**) As noted, I'm not much of a player in this space, so 'we' is used from habit and to indicate interest in the outcome.
--
Timothe Litt
ACM Distinguished Engineer
--------------------------
This communication may not represent the ACM or my employer's views, if any, on the matters discussed.