On Saturday, 12 February 2011, an incident affecting the signature of the .fr zone had repercussions for most of the users and professionals currently involved in the deployment of DNSSEC. So that as large a community as possible can benefit from that experience, we have chosen to describe and analyse the incident in as much detail as possible, to serve as a basis for future discussions on the issue.
1. Chronology of the incident
- 14:00 -
The .fr zone becomes inaccessible to any validating resolver. The incident occurs outside business hours.
The problem concerns a certain type of record (NSEC3) which is not yet monitored, and for this reason our warning system does not report the incident.
- 15:04 -
We are alerted by other parties through interpersonal communication channels (Twitter, phone calls, etc.) and start the diagnosis.
- 15:30 -
We report several times on the situation and the current state of our analysis on Twitter (@afnic_op) and via the AFNIC operations blog (http://operations.nic.fr).
- 16:30 -
The diagnosis, which took a long time to establish (see below), is only partially completed, but is at least sufficient for action to be successfully undertaken to solve the problem.
- 17:50 -
Access as normal to the zone is restored for any validating resolver.
We announce the return to service and discuss with various observers how to fine-tune the diagnosis and collect as much information as possible on what was observed and recorded from the outside.
2. Analysis and diagnosis
First of all, it is perhaps worth recalling some of the features of the AFNIC publication architecture.
AFNIC refreshes the .fr zone hourly using DNS Dynamic Update, in accordance with RFC 2136.
In order to integrate the DNSSEC signature mechanism in a manner consistent with this technology, it was decided to use the corresponding BIND features.
In order to have a key and repository management system that would be workable in-house, we chose the OpenDNSSEC software and deployed a synchronization mechanism with BIND. Finally, we implemented HSM (Hardware Security Module) boxes for key and signature storage.
This highly specific architecture led us to implement relatively advanced BIND features that are probably used very little in most of the other signature systems in operation. As a result, we encounter incidents in which we are the first to report bugs involving these features.
At the time of the signature incident, i.e. 14:00, a key (ZSK 43893, which had not signed for a very long time and theoretically could therefore be deleted without any risk) was deleted: it is probably this event which triggered the incident.
When this already inactive key was actually deleted, BIND performed a recalculation of the NSEC3 tree which took surprisingly long. During the recalculation, a signature bug occurred as detailed below.
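To give a sense of what recalculating the NSEC3 tree involves, the following is a minimal sketch of how NSEC3 owner names are derived and linked into a chain, following RFC 5155. It is not BIND's implementation: the function names are illustrative, the sample names a.fr. and b.fr. are made up, and it assumes SHA-1 (hash algorithm 1).

```python
import base64
import hashlib

# Map the standard Base32 alphabet onto the "base32hex" alphabet used by NSEC3.
_TO_BASE32HEX = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567",
    "0123456789ABCDEFGHIJKLMNOPQRSTUV",
)


def wire_name(name):
    """Encode a domain name in lowercase DNS wire format (length-prefixed labels)."""
    out = b""
    for label in name.lower().rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def nsec3_hash(name, salt_hex, iterations):
    """Return the hashed owner label for a name (RFC 5155, section 5)."""
    salt = bytes.fromhex(salt_hex)
    digest = hashlib.sha1(wire_name(name) + salt).digest()
    for _ in range(iterations):  # "iterations" counts the additional rounds
        digest = hashlib.sha1(digest + salt).digest()
    encoded = base64.b32encode(digest).decode("ascii")
    return encoded.translate(_TO_BASE32HEX).lower()


def nsec3_chain(names, salt_hex, iterations):
    """Sort the hashed names and link each one to the next, wrapping at the end."""
    hashes = sorted({nsec3_hash(n, salt_hex, iterations) for n in names})
    return [(h, hashes[(i + 1) % len(hashes)]) for i, h in enumerate(hashes)]


# Hypothetical three-name zone, for illustration only.
for owner, next_owner in nsec3_chain(["fr.", "a.fr.", "b.fr."], "BADFE11A", 1):
    print(owner, "->", next_owner)
```

Each link in the chain is an NSEC3 record that carries its own signature, so any change to a record means a new signature for it. This gives a rough idea of why a recalculation over a large zone can take time.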
We use BIND version 9.7.1-P2 in production to dynamically sign the zone to be published. In the case of dynamic signatures (which are used with dynamic updates, among other cases), BIND produces records of a private type at the apex (TYPE65534, see the BIND documentation, doc/arm/Bv9ARM.ch04.html, section "Private-type records") which require the NSEC3 records to be updated. It is here that BIND behaves erratically.
The following is an example from the .fr zone on the day of the blackout.
rr: meqimi6fje5ni47pjahv5qigu1lv3jlj.fr. 5400 IN NSEC3
1 1 1 BADFE11A O5SMCS6CUNUQC5RFJ6S94TGGRFH1TVC7 NS SOA TXT NAPTR
RRSIG DNSKEY NSEC3PARAM TYPE65534
sig: meqimi6fje5ni47pjahv5qigu1lv3jlj.fr. 5400 IN RRSIG
NSEC3 8 2 5400 20110408081500 20110207081500 2331
The signature above does not match the NSEC3 record but corresponds to its private type "TYPE65534". BIND added a type to the NSEC3 record without changing the signature, which therefore became invalid.
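The effect can be illustrated with a short sketch of the type bitmap encoding (RFC 4034, section 4.1.2, reused by NSEC3 in RFC 5155). The bitmap is part of the record data covered by the RRSIG, so, under the assumption that the signature was computed before TYPE65534 appeared in the bitmap, the bytes that were signed differ from the bytes that were published. The helper name and the "before"/"after" lists are illustrative and only reconstruct the type list shown in the example.

```python
# RR type codes for the types listed in the NSEC3 record above.
RR_TYPES = {
    "NS": 2, "SOA": 6, "TXT": 16, "NAPTR": 35, "RRSIG": 46,
    "DNSKEY": 48, "NSEC3PARAM": 51, "TYPE65534": 65534,
}


def type_bitmap(mnemonics):
    """Encode a set of RR types as NSEC/NSEC3 window blocks."""
    windows = {}
    for mnemonic in mnemonics:
        window, low = divmod(RR_TYPES[mnemonic], 256)
        bitmap = windows.setdefault(window, bytearray(32))
        bitmap[low // 8] |= 0x80 >> (low % 8)  # most significant bit first
    out = b""
    for window in sorted(windows):
        trimmed = bytes(windows[window]).rstrip(b"\x00")
        out += bytes([window, len(trimmed)]) + trimmed
    return out


# Assumed state when the signature was made, then the state that was published.
signed_types = ["NS", "SOA", "TXT", "NAPTR", "RRSIG", "DNSKEY", "NSEC3PARAM"]
published_types = signed_types + ["TYPE65534"]

before = type_bitmap(signed_types)
after = type_bitmap(published_types)
print(before.hex())
print(len(after) - len(before), "extra bytes in the published record data")
```

Because type 65534 falls in window block 255, adding it appends a whole new block to the record data. A validator recomputing the signature input from the published record therefore cannot match a signature made over the shorter data.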
A bug fairly similar to the one observed during the incident was reproduced in the laboratory, but in an environment different from the one AFNIC uses in production. It affects all released versions of BIND, even the recent version 9.7.3. It has been reported to the ISC [ISC-Bugs # 23232], and recognized and corrected in the source code by patch 3020. The patch is not yet available in an officially released version of BIND.
Another bug, much closer to that which caused the blackout, was reproduced in the laboratory under identical AFNIC production conditions. According to our analysis, the patch for the bug would provide a direct solution to the problems we encountered in the .fr zone. To be as detailed and precise as possible, we prefer to push our experimentation even further before issuing a bug report to the ISC. The bug has not yet been tested with versions of BIND other than 9.7.1 and we do not know, for example, if it has been corrected by a more recent version. In addition, apparently the bug no longer affects operations after a period which seems to depend on the size of the zone.
In parallel to the work carried out to correct the bug, we identified and analysed in greater detail the events that occur during the rotation and deletion of old keys. The rotation phases are the ones which seem to trigger the signature incident, due to the cross-effect of the BIND bug and the specific nature of our signature architecture.
3. Action plan
Given the worrying nature of the situation, we have seriously studied the possibility of removing the .fr DS records from the root zone to avoid any problem with validating resolvers.
In light of the analysis, it seems that the bug only occurs when keys are deleted during rotations.
We therefore decided to implement the following two-phase strategy:
17 to 21 February: study the BIND bug in greater detail and open a ticket with the ISC.
21 February to 10 March: implementation of an additional monitoring mechanism based on full validating resolvers before transfer of the zone to slave servers.
x March: delivery and deployment of a patch for the identified bug and update of all BIND instances.
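The monitoring step in the plan above could take many forms. As a purely hypothetical sketch (not the mechanism AFNIC deployed), a pre-publication check could query a validating resolver with dig for names in the freshly signed zone, including a non-existent name so that the NSEC3 denial-of-existence proof is exercised, and block the transfer if any answer fails validation. The resolver address and the function names are assumptions; the parsing relies on the standard header lines printed by dig.

```python
import re
import subprocess


def run_dig(resolver, name, rrtype="A"):
    """Query a validating resolver with dig and return its text output."""
    result = subprocess.run(
        ["dig", "+dnssec", f"@{resolver}", name, rrtype],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout


def parse_dig_header(output):
    """Extract the response status and the set of header flags from dig output."""
    status = re.search(r"status: ([A-Z]+)", output)
    flags = re.search(r"flags: ([a-z ]*);", output)
    return (
        status.group(1) if status else None,
        set(flags.group(1).split()) if flags else set(),
    )


def validates(output):
    """True if the resolver answered and marked the data as authenticated (AD)."""
    status, flags = parse_dig_header(output)
    return status in {"NOERROR", "NXDOMAIN"} and "ad" in flags


def safe_to_publish(outputs):
    """Allow the transfer only if every probe validated (and at least one ran)."""
    return bool(outputs) and all(validates(o) for o in outputs)


# Hypothetical usage against a local validating resolver fed with the new zone:
# probes = [run_dig("127.0.0.1", "fr.", "SOA"),
#           run_dig("127.0.0.1", "name-that-does-not-exist.fr.")]
# if not safe_to_publish(probes):
#     raise SystemExit("validation failed, zone not transferred")
```

A validating resolver returns SERVFAIL when the signatures it needs are invalid, which is the symptom seen from outside during the incident, so a check of this kind would have caught an invalid NSEC3 signature before publication.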
If the actions scheduled for Phase 1 prove insufficient (which does not depend on us) or have to be extended beyond 21 March, or if another production incident on the .fr zone affects validating resolvers, AFNIC may request the deletion of the DS from the root until the software layers used have stabilised, followed by revalidation of the signature architecture.
These actions could have an impact on AFNIC's operational action plan, which to date schedules the opening of DNSSEC delegation for 5 April 2011. If appropriate, we shall communicate at a later date on this point.
DNSSEC is still a complex technology, and not all the software layers used in it have yet been tested in production in every configuration. It is therefore understandable that a certain number of bugs may still be encountered. The responsiveness of the ISC in particular, and the sharing of feedback by and between registries, nevertheless make us feel confident that this stabilisation period will be as short as possible and will not affect the prospects for a large-scale go-live of DNSSEC.
We are interested in sharing feedback with all the registries that have an architecture similar to ours, and in particular those that use BIND with Dynamic Update.
We remind operators of validating resolvers that the operation of DNSSEC is relatively recent, and therefore invite them to remain on their guard, and not hesitate to cooperate with us in preventive or corrective fashion with respect to the implementation of DNSSEC technology.
The address for further discussions on this issue is that of our DNSSEC supervisor, Vincent Levigneron: email@example.com