by meetpateltech on 10/23/25, 1:19 AM with 166 comments
by tptacek on 10/23/25, 7:28 PM
https://how.complexsystems.fail/
You can literally check off the things from Cook's piece that apply directly here. Also: when I wrote this comment, most of the thread was about root-causing the DNS thing that happened, which I don't think is the big story behind this outage. (Cook rejects the whole idea of a "root cause", and I'm pretty sure he's dead-on right about why.)
by stefan_bobev on 10/23/25, 7:27 PM
But the stale read didn't scare me nearly as much as this quote:
> Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues
Everyone can make a distributed-systems mistake (these things are hard). But I did not expect something as core as the service managing the leases on the physical EC2 nodes to have no recovery procedure. Maybe I am reading too much into it, and what they meant was that they didn't have a recovery procedure for "this exact" set of circumstances, but that is a little worrying even if so. EC2 is one of the original AWS services; at this point I expect it to be so battle-hardened that very few edge cases remain unidentified. The EC2 failure was arguably the more impactful one in a way, as it cascaded to more and more services (like NLB and Lambda) and took more time to fully recover. I'd be interested to know what gets put in place there to make it even more resilient.
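For context on what a "lease" service like that does, here is a generic sketch (invented names and structure, not DWFM itself): each physical host holds a short-lived lease and renews it continuously, and anything that stops renewals for long enough drops a large chunk of the fleet at once, which is exactly the kind of mass re-establishment that is hard to have a rehearsed runbook for.

    package main

    import (
        "fmt"
        "time"
    )

    // Generic sketch of a lease-based node manager (invented names, not
    // DWFM itself): each host must renew before its lease TTL runs out;
    // hosts whose leases expire drop out of the pool until a new lease
    // is established.
    type lease struct {
        host      string
        expiresAt time.Time
    }

    type manager struct {
        ttl    time.Duration
        leases map[string]lease
    }

    func (m *manager) renew(host string, now time.Time) {
        m.leases[host] = lease{host: host, expiresAt: now.Add(m.ttl)}
    }

    // expire drops every lease that wasn't renewed in time. After a long
    // dependency outage this can be most of the fleet at once, and
    // re-establishing all of those leases together is the rarely
    // rehearsed recovery path.
    func (m *manager) expire(now time.Time) []string {
        var dropped []string
        for host, l := range m.leases {
            if now.After(l.expiresAt) {
                delete(m.leases, host)
                dropped = append(dropped, host)
            }
        }
        return dropped
    }

    func main() {
        m := &manager{ttl: 30 * time.Second, leases: map[string]lease{}}
        now := time.Now()
        m.renew("host-a", now)
        m.renew("host-b", now)

        // Nothing renews for a few minutes (e.g. a dependency is down)...
        fmt.Println("dropped:", m.expire(now.Add(5*time.Minute)))
    }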
by jasode on 10/23/25, 8:02 AM
>[...] Right before this event started, one DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints. As it was slowly working through the endpoints, several other things were also happening. First, the DNS Planner continued to run and produced many newer generations of plans. Second, one of the other DNS Enactors then began applying one of the newer plans and rapidly progressed through all of the endpoints. The timing of these events triggered the latent race condition. When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them. At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan. The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing. [...]
It outlines some of the mechanics, but some might argue it still isn't a "Root Cause Analysis" because there's no satisfying explanation of _why_ there were "unusually high delays in Enactor processing". Hardware problem?!? A human-error misconfiguration causing unintended delays in Enactor behavior?!? Either the sequence of events leading up to that is considered unimportant, or Amazon is still investigating what made the Enactor behave in such an unpredictable way.
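Whatever caused the delays, the mechanism in the quoted paragraph is a classic check-then-act race: the "is this plan newer?" check happens once, before a slow apply, so by the time the write actually lands it can be stale. A minimal sketch (invented names, deliberately sequential to show the interleaving; not AWS's actual code):

    package main

    import "fmt"

    // Invented model of a DNS endpoint's record set; not AWS's actual code.
    type endpoint struct {
        appliedGen int      // generation of the plan currently applied
        records    []string // addresses currently served for the name
    }

    type plan struct {
        gen     int
        records []string
    }

    // Last-writer-wins: no freshness check at write time. In the pattern
    // above, the "is this plan newer?" check already happened much earlier,
    // before a slow apply, so by now it can be stale.
    func applyUnconditional(e *endpoint, p plan) {
        e.appliedGen = p.gen
        e.records = p.records
    }

    // Conditional write: the freshness check is re-done at write time, so a
    // delayed enactor's stale plan is rejected instead of applied.
    func applyIfNewer(e *endpoint, p plan) bool {
        if p.gen <= e.appliedGen {
            return false
        }
        e.appliedGen = p.gen
        e.records = p.records
        return true
    }

    func main() {
        e := &endpoint{}
        oldPlan := plan{gen: 100, records: []string{"10.0.0.1"}}
        newPlan := plan{gen: 200, records: []string{"10.0.0.2", "10.0.0.3"}}

        // Delayed enactor A passed its up-front check (100 > 0), then stalled.
        // Enactor B applied the newer plan in the meantime and finished first.
        applyUnconditional(e, newPlan)
        // A finally writes its stale plan, silently rolling the endpoint back.
        applyUnconditional(e, oldPlan)
        fmt.Println("after race:", e.appliedGen, e.records)

        // With the check done at write time, the stale apply is refused.
        e = &endpoint{}
        applyIfNewer(e, newPlan)
        fmt.Println("stale apply accepted?", applyIfNewer(e, oldPlan), e.records)
    }

Any compare-and-swap style primitive gives you the second behaviour; a check done before a long-running apply does not.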
by ecnahc515 on 10/23/25, 6:36 PM
by asim on 10/24/25, 1:24 PM
by al_be_back on 10/24/25, 1:24 AM
The internet was born out of the need for distributed networks during the Cold War - to reduce central points of failure - a hedging mechanism, if you will.
Now it has consolidated into ever smaller mono-nets. A simple mistake in one deployment could bring banking, shopping, and travel to a halt globally. This can only get much worse when cyber warfare gets involved.
Personally, I think the cloud metaphor has been overstretched and has long since burst.
For R&D, early-stage start-ups, and occasional/seasonal computing, cloud works perfectly (similar to how time-sharing systems used to work).
For well-established/growth businesses and government, you'd better become self-reliant and tech-independent: own physical servers + own cloud + own essential services (db, messaging, payment).
There's no shortage of affordable tech, know-how or workforce.
by gslin on 10/23/25, 9:12 AM
by baalimago on 10/24/25, 8:11 AM
776 words in a single paragraph
by __turbobrew__ on 10/23/25, 7:45 PM
If we assume that the system will fail, I think the logical thing to think about is how to limit the effects of that failure. In practice this means cell based architecture, phased rollouts, and isolated zones.
To my knowledge AWS does attempt to implement cell based architecture, but there are some cross region dependencies specifically with us-east-1 due to legacy. The real long term fix for this is designing regions to be independent of each other.
This is a hard thing to do, but it is possible. I have personally been involved in disaster testing where a region was purposely firewalled off from the rest of the infrastructure. You find out very quickly where those cross-region dependencies lie, and many of them are in unexpected places.
Usually this work is not done due to lack of upper-level VP support and funding, and it is easier to stick your head in the sand and hope bad things don't happen. The strongest supporters of this work are going to be the shareholders who are in it for the long run. If the company goes poof due to improper disaster testing, the shareholders are going to be the main bag holders. Making the board aware of the risks and the estimated probability of fundamentally company-ending events can help get this work funded.
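A toy sketch of the cell idea mentioned above (made-up names, not AWS's real cell implementation): customers are pinned to a cell by hashing, so a bad deploy or failure in one cell only touches that slice of traffic.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // Toy sketch of cell-based routing. Each cell is an independent copy of
    // the stack; a customer is pinned to exactly one cell, so a failure (or
    // bad rollout) in one cell is contained to the customers mapped there.
    type cell struct {
        name    string
        healthy bool
    }

    func cellFor(customerID string, cells []cell) *cell {
        h := fnv.New32a()
        h.Write([]byte(customerID))
        return &cells[h.Sum32()%uint32(len(cells))]
    }

    func main() {
        cells := []cell{
            {"use1-cell-1", true},
            {"use1-cell-2", true},
            {"use1-cell-3", true},
        }
        // Simulate a bad rollout taking out one cell.
        cells[1].healthy = false

        for _, id := range []string{"cust-a", "cust-b", "cust-c", "cust-d"} {
            c := cellFor(id, cells)
            fmt.Printf("%s -> %s (healthy=%v)\n", id, c.name, c.healthy)
        }
    }

The same pinning is what makes phased rollouts meaningful: deploy to one cell, watch it, and only then widen the blast radius.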
by JCM9 on 10/24/25, 2:32 AM
The region model is a lot less robust if core things in other regions require US-East-1 to operate. This has been an issue in previous outages and appears to have struck again this week.
It is what it is, but AWS consistently oversells the robustness of regions as fully separate, when events like Monday's reveal they're really not.
by JohnMakin on 10/24/25, 12:05 PM
Anyway, I appreciate that this seems pretty honest and descriptive.
by shayonj on 10/23/25, 7:40 AM
Definitely a painful one with good learnings and kudos to AWS for being so transparent and detailed :hugops:
by giamma on 10/24/25, 7:17 AM
by rr808 on 10/24/25, 2:06 AM
by everfrustrated on 10/23/25, 1:46 PM
Does that mean a DNS query for dynamodb.us-east-1.amazonaws.com can resolve to one of a hundred thousand IP addresses?
That's insane!
And also well beyond the limits of Route53.
I'm wondering if they're constantly updating Route53 with a smaller subset of records and using a low TTL to somewhat work around this.
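If that guess is right, the publishing loop would look roughly like this (purely a hypothetical sketch; the constants and the pool are invented, and the actual Route53 API call is elided): pick a small healthy subset each cycle, publish it with a short TTL, and let rotation spread clients across far more addresses than one response can carry.

    package main

    import (
        "fmt"
        "math/rand"
    )

    // Hypothetical sketch of the "small rotating subset, low TTL" idea: the
    // backing fleet has far more addresses than fit in a single record set,
    // so each publish cycle picks a fresh healthy subset and relies on a
    // short TTL to keep clients re-resolving.
    const (
        recordSize = 8 // addresses actually published per cycle
        ttlSeconds = 5 // short TTL so the subset rotates quickly
    )

    func publishCycle(healthy []string) []string {
        subset := make([]string, 0, recordSize)
        for _, i := range rand.Perm(len(healthy))[:recordSize] {
            subset = append(subset, healthy[i])
        }
        // In a real system this is where the DNS provider's API would be
        // called with the new record set and TTL.
        fmt.Printf("publish %d A records, TTL=%ds: %v\n", len(subset), ttlSeconds, subset)
        return subset
    }

    func main() {
        // Stand-in for a very large pool of load balancer addresses.
        pool := make([]string, 1000)
        for i := range pool {
            pool[i] = fmt.Sprintf("10.0.%d.%d", (i/256)%256, i%256)
        }
        publishCycle(pool)
        publishCycle(pool) // a later cycle publishes a different subset
    }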
by grogers on 10/24/25, 3:10 AM
I feel like I am missing something here... They make it sound like the DNS enactor basically diffs the current state of DNS with the desired state, and then submits the adds/deletes needed to make the DNS go to the desired state.
With the racing writers, wouldn't that have just made the DNS go back to an older state? Why did it remove all the IPs entirely?
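One reading of the summary (hedged, since the write-up stops short of spelling it out): the delayed Enactor's apply first rolled the endpoint back to the old plan, and the other Enactor's clean-up then deleted that same old plan because it was "significantly older" than the newest one, leaving the endpoint pointing at a plan that no longer exists, i.e. an empty answer rather than merely a stale one. Roughly (invented structure):

    package main

    import "fmt"

    // Hedged illustration of one way the endpoint could end up empty, based
    // on the sequence in the summary; names and structure are invented.
    type planStore struct {
        plans     map[int][]string // generation -> records in that plan
        activeGen int              // generation currently applied to the endpoint
    }

    // applyPlan makes a generation the active one (the delayed Enactor doing
    // this with a stale generation is the rollback described above).
    func (s *planStore) applyPlan(gen int) { s.activeGen = gen }

    // cleanup deletes plans significantly older than the newest applied plan.
    func (s *planStore) cleanup(newestGen, keepWindow int) {
        for gen := range s.plans {
            if gen < newestGen-keepWindow {
                delete(s.plans, gen)
            }
        }
    }

    // resolve returns whatever the active plan says; if that plan has been
    // deleted, the answer is simply empty.
    func (s *planStore) resolve() []string { return s.plans[s.activeGen] }

    func main() {
        s := &planStore{plans: map[int][]string{
            100: {"10.0.0.1"},
            200: {"10.0.0.2", "10.0.0.3"},
        }}

        s.applyPlan(200)         // fast Enactor applies the newest plan
        s.applyPlan(100)         // delayed Enactor then applies its stale plan
        s.cleanup(200, 50)       // clean-up removes plans much older than 200
        fmt.Println(s.resolve()) // active plan 100 is gone -> empty answer
    }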
by lazystar on 10/23/25, 6:01 AM
interesting.
by yla92 on 10/23/25, 3:02 AM
by pelagicAustral on 10/23/25, 5:58 PM
by ericpauley on 10/23/25, 11:19 AM
by WaitWaitWha on 10/23/25, 5:52 PM
Correct?
by dilyevsky on 10/23/25, 8:47 PM
by qrush on 10/23/25, 5:25 PM
by joeyhage on 10/23/25, 1:57 AM
It isn't explicitly stated in the RCA, but it is likely these new endpoints were the straw that broke the camel's back for the DynamoDB load balancer DNS automation.
by polyglotfacto2 on 10/24/25, 8:32 AM
by 827a on 10/23/25, 11:37 PM
> Many of the largest AWS services rely extensively on DNS to provide seamless scale, fault isolation and recovery, low latency, and locality...
by martythemaniak on 10/24/25, 3:23 AM
by galaxy01 on 10/23/25, 5:18 AM
by Velocifyer on 10/23/25, 8:44 PM
by bithavoc on 10/23/25, 8:28 PM
by danpalmer on 10/23/25, 11:37 PM
by alexnewman on 10/23/25, 9:00 PM
by LaserToy on 10/23/25, 4:09 PM
by shrubble on 10/23/25, 7:11 PM
So if you made a change you had to increase the number, usually a timestamp like 20250906114509, which would be older/lower-numbered than 20250906114702, making it easy to determine which zone file had the newest data.
Seems like they had roughly the same setup, but with less rigidity about refusing to load older files.
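That rigidity is easy to express if the serial comparison happens atomically with the swap, at load time, rather than once up front, which is exactly the property the Enactor's early check lacked. A small sketch using timestamp-style serials like the ones above (invented code):

    package main

    import (
        "fmt"
        "sync"
    )

    // Sketch of "refuse to load an older zone file": the serial comparison
    // and the swap happen under the same lock, so a delayed writer with an
    // old serial can never overwrite a newer zone.
    type zoneLoader struct {
        mu      sync.Mutex
        serial  uint64 // e.g. 20250906114702
        records []string
    }

    func (z *zoneLoader) load(serial uint64, records []string) error {
        z.mu.Lock()
        defer z.mu.Unlock()
        if serial <= z.serial {
            return fmt.Errorf("refusing serial %d: current zone is %d", serial, z.serial)
        }
        z.serial = serial
        z.records = records
        return nil
    }

    func main() {
        z := &zoneLoader{}
        _ = z.load(20250906114702, []string{"10.0.0.2"})          // newer zone loads fine
        err := z.load(20250906114509, []string{"10.0.0.1"})       // stale zone is rejected
        fmt.Println(err, "| serving serial", z.serial, z.records) // still on the newer zone
    }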