Microsoft just showed the world what happens when you screw up your DNS records

Okay, this might actually be worse than the infamous “blue screen of death” that people used to tease Microsoft about. Yesterday, a huge array of core Microsoft services like Azure, Microsoft 365, Dynamics, DevOps and more went down…for two grueling hours.

How did it happen?

Like the title of my post says, someone made a big mistake while updating DNS records. Here’s the scoop:

Summary of impact: Between 19:43 and 22:35 UTC on 02 May 2019, customers may have experienced intermittent connectivity issues with Azure and other Microsoft services (including M365, Dynamics, DevOps, etc). Most services were recovered by 21:30 UTC with the remaining recovered by 22:35 UTC. 

Preliminary root cause: Engineers identified the underlying root cause as a nameserver delegation change affecting DNS resolution and resulting in downstream impact to Compute, Storage, App Service, AAD, and SQL Database services. During the migration of a legacy DNS system to Azure DNS, some domains for Microsoft services were incorrectly updated. No customer DNS records were impacted during this incident, and the availability of Azure DNS remained at 100% throughout the incident. The problem impacted only records for Microsoft services.

(Source – Microsoft Azure Status History)

While Microsoft calls it “intermittent connectivity issues,” the reality is that most services were down for almost a full two hours, and the last of them took closer to three to come back.
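If you want to check the math yourself, here is a quick Python sketch that works out the durations from the three timestamps in the status summary quoted above. The variable and function names are my own; only the times come from Microsoft's summary.

```python
from datetime import datetime

# Timestamps copied from the Azure status summary quoted above (all UTC).
FMT = "%Y-%m-%d %H:%M"
start = datetime.strptime("2019-05-02 19:43", FMT)
most_recovered = datetime.strptime("2019-05-02 21:30", FMT)
all_recovered = datetime.strptime("2019-05-02 22:35", FMT)


def minutes_between(begin, end):
    """Whole minutes elapsed between two datetimes."""
    return int((end - begin).total_seconds() // 60)


most_minutes = minutes_between(start, most_recovered)
all_minutes = minutes_between(start, all_recovered)

print(f"Most services: {most_minutes // 60}h {most_minutes % 60}m")
print(f"Everything:    {all_minutes // 60}h {all_minutes % 60}m")
```

That works out to 1 hour 47 minutes for most services and 2 hours 52 minutes until the last ones were back.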

A few things came to mind when I read the news. First, yikes! This is a great warning for anyone making DNS changes: double, triple, heck, quadruple check your changes, because if you screw it up, bad things can happen.
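One way to do that double-checking during a migration is to compare the records in the old zone against the records in the new one before you flip anything. The Python sketch below is purely illustrative: the `diff_zone` function, the data layout, and the example records are all made up for this post, and I have no idea what tooling Microsoft actually uses.

```python
def diff_zone(legacy, migrated):
    """Compare two {(name, record_type): set_of_values} mappings.

    Returns a list of human-readable problems; an empty list means
    the two zones hold the same records.
    """
    problems = []
    for key in sorted(set(legacy) | set(migrated)):
        name, rtype = key
        old = legacy.get(key)
        new = migrated.get(key)
        if new is None:
            problems.append(f"{name} {rtype}: missing from migrated zone")
        elif old is None:
            problems.append(f"{name} {rtype}: unexpected new record")
        elif old != new:
            problems.append(f"{name} {rtype}: {sorted(old)} -> {sorted(new)}")
    return problems


# Made-up example data using documentation-only names and addresses.
legacy = {
    ("www.example.com.", "A"): {"192.0.2.10"},
    ("mail.example.com.", "CNAME"): {"mx.example.net."},
}
migrated = {
    ("www.example.com.", "A"): {"192.0.2.99"},  # typo introduced while copying
    ("mail.example.com.", "CNAME"): {"mx.example.net."},
}

problems = diff_zone(legacy, migrated)
for problem in problems:
    print(problem)
```

In this toy example the check flags the mistyped A record for `www.example.com.` and nothing else. A real check would pull both record sets from the actual DNS providers rather than from hand-written dictionaries.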

The second thing that comes to mind is, why were they down for so long? I’ll be the first to admit that I’ve screwed up DNS records in the past, and it took a minute or two to get everything back in order. Two hours is an insanely long amount of time to get everything back up and running.

So here’s what I think happened.

If it was you or me that made this mistake, we would have realized it and fixed it in a few minutes. The problem here is that Microsoft is a HUGE company with a lot of laborious, draconian processes. So, once the issue was identified, rather than immediately fixing it, they probably had to hold a meeting, document exactly what happened, document the fix, and get the fix approved by a Director who would then confirm with a VP.

If this isn’t what happened, then I’m completely confused as to why it would take so long to correct this. I don’t think things were down for two hours because nobody knew how to fix the problem. I think things were down for two hours because Microsoft has so many rules and regulations for what to do when there is a problem that it took them a million times longer to fix it.

Either way, it’s fixed now, and it should serve as a good lesson to anyone updating their DNS records: really make sure you’re doing it right.

What do you think? Is my theory for why this was down for so long right…or do you have another theory? I want to hear from you. Comment and let your voice be heard!


  • Rob Monster - Epik.com May 3, 2019, 9:19 am

    Good topic, Morgan.

    Long TTLs are a double-edged sword. If you have a long TTL and your DNS is out of commission, your active users hobble along and nobody really notices.

    However, if you screw up an update, you are now hosed for the duration of that TTL for anyone who got the wrong record.

    The default TTL at Epik is 300 seconds. This is a tight TTL. Users can lengthen it but that is not a free lunch.

    Last week we deployed our own Anycast DNS powered by BitMitigate. So, this is now a TTL of 300 seconds, with resolution times as low as 2 milliseconds. That’s pretty cool.

    In the meantime, the object lesson is that DNS is a whole lot more vulnerable than most people want to acknowledge. The scope for misrouting is as large as ever.

    Hence Resilient Domains:

    https://www.epik.com/resilient/

  • Logan May 3, 2019, 3:41 pm

    What I don’t understand is why didn’t Clippy jump out and say, “It looks like you’re making a DNS change. Are you sure you want to do that?”
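
To put rough numbers on the TTL trade-off Rob describes in the comments, here is a toy Python simulation of a caching resolver that picked up a bad record just before it was corrected. Everything in it (the `CachingResolver` class, the `seconds_until_fixed` helper, the example addresses) is a simplified assumption for illustration; real resolvers are more complicated than this.

```python
class CachingResolver:
    """Toy resolver cache: remembers an answer until its TTL runs out."""

    def __init__(self):
        self._cache = {}  # name -> (value, expires_at)

    def resolve(self, name, authoritative, now):
        cached = self._cache.get(name)
        if cached and now < cached[1]:
            return cached[0]  # still fresh, so upstream is never asked
        value, ttl = authoritative[name]
        self._cache[name] = (value, now + ttl)
        return value


def seconds_until_fixed(ttl):
    """Seconds a resolver keeps serving a bad record cached at time 0."""
    name = "www.example.com."
    resolver = CachingResolver()
    zone = {name: ("203.0.113.66", ttl)}  # the bad record (made-up address)
    resolver.resolve(name, zone, now=0)   # the bad answer gets cached
    zone[name] = ("192.0.2.10", ttl)      # the fix goes in right afterwards
    now = 1
    while resolver.resolve(name, zone, now) != "192.0.2.10":
        now += 1
    return now


for ttl in (300, 86400):
    print(f"TTL {ttl}s: bad answer served for {seconds_until_fixed(ttl)}s")
```

With a 300-second TTL the bad answer sticks around for five minutes after the fix. With a one-day TTL (86400 seconds) the same resolver keeps serving it for a full day, which is the "hosed for the duration of that TTL" problem.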
