Last week, MyNIC suffered a massive outage that took out any website with a
.my domain, including local banks like maybank2u.com.my and even government websites hosted on
Here’s a great report on what happened from IANIX. I’m no DNSSEC expert, but here’s my layman’s reading of what happened:
- Up to 11-Jun,
.my used a DNSKEY with
- For some reason, this key went missing on 15-Jun, and was replaced with DNSKEY
key tag:63366, which is still a valid SEP for
- Unfortunately, the DS record on root was still pointing to
- So DNSSEC started failing
- 15 hours later, instead of correcting the error, someone tried to switch off DNSSEC by removing all the signatures (RRSIG)
- But this didn’t work, as the parent zone still had a DS entry that pointed to
key tag:25992 and hence was still expecting DNSSEC to be turned on.
- 5 hours after that, they added back the missing DNSKEY
key tag:25992 (oh we found it!), but added invalid signatures for all entries, so it was still failing.
- Only 4 hours after that did they fix it, with the proper DS entry on root for DNSKEY
key tag:63366 and valid signatures.
- That’s a 24-hour outage on all
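To make the timeline above concrete, here’s a toy model of how a validating resolver would have judged the zone at each step. It’s a rough sketch, not how real validation works: a real resolver does cryptographic checks on the DS digest and the RRSIGs, while this only compares key tags and a signature flag. The `ZoneState` class, the `validate` function, and the exact set of DNSKEYs published at each step are my own assumptions for illustration. Only the key tags and the order of events come from the report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ZoneState:
    """Hypothetical snapshot of what a resolver sees for a signed zone."""
    label: str
    parent_ds_tag: Optional[int]  # key tag the parent's DS points at (None = no DS)
    dnskey_tags: frozenset        # key tags of the DNSKEYs the zone publishes
    signatures: str               # "valid", "invalid" or "missing"


def validate(state: ZoneState) -> str:
    """Simplified chain-of-trust check (key tags only, no real crypto)."""
    if state.parent_ds_tag is None:
        # No DS in the parent: the zone is treated as unsigned, names still resolve.
        return "insecure"
    if state.parent_ds_tag not in state.dnskey_tags:
        # The parent vouches for a key that the zone no longer publishes.
        return "bogus"
    if state.signatures != "valid":
        # The parent says the zone is signed, so missing or bad RRSIGs are fatal.
        return "bogus"
    return "secure"


# The published DNSKEY sets below are assumptions; the key tags are from the report.
TIMELINE = [
    ZoneState("before the incident", 25992, frozenset({25992}), "valid"),
    ZoneState("DNSKEY swapped to 63366", 25992, frozenset({63366}), "valid"),
    ZoneState("RRSIGs removed", 25992, frozenset({63366}), "missing"),
    ZoneState("DNSKEY 25992 re-added, bad signatures", 25992,
              frozenset({25992, 63366}), "invalid"),
    ZoneState("DS updated to 63366, valid signatures", 63366,
              frozenset({25992, 63366}), "valid"),
]

results = [validate(state) for state in TIMELINE]
for state, result in zip(TIMELINE, results):
    print(f"{state.label}: {result}")
```

In this model every step between the first and the last comes out as "bogus", which a validating resolver reports as a failure. It also shows why removing the signatures couldn’t work on its own: as long as the parent has a DS, the zone is expected to be signed. Only a state with no DS at all (`parent_ds_tag=None`) comes out as "insecure", meaning unsigned but still resolvable.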
So basically, something broke, they sat on it for 15 hours, then tried a fix that didn’t work. They tried something else 5 hours after that, which didn’t work either! And finally, after presumably a lot of praying to the Gods of the Internet and a couple of animal sacrifices, they managed to fix it after a 24-hour downtime.
I defend my fellow IT practitioners a lot on this blog, but this is a difficult one. Clearly this was the work of someone who didn’t know what they were doing and refused to ask for help, instead trying one failed fix after another, which made things worse. As my good friend Mark Twain would say, it’s like a mouse trying to fix a pumpkin.
I don’t fully understand DNSSEC (it’s complicated), but I’m not in charge of a TLD. It’s unacceptable that someone could screw up this badly — and for that screw up to impact so many people, and all we got was a lousy press release.
The point is, it shouldn’t take 24 hours to resolve a DNSSEC issue, especially for such a critical piece of infrastructure. I’ve gone through reports of similar DNSSEC failures, and in most cases recovery takes 1-5 hours. The nasa.gov domain (which is not a TLD, just a zone under .gov) had a similar issue that was resolved in an hour. Very rarely do we see a 24-hour outage, so what gives?
I look forward to an official report from MyNIC to our spanking new communications ministry, and for that to be shared with the public.