from Hacker News

Summary of the Amazon DynamoDB Service Disruption in US-East-1 Region

by meetpateltech on 10/23/25, 1:19 AM with 166 comments

Recent and related: AWS multiple services outage in us-east-1 - https://news.ycombinator.com/item?id=45640838 (2045 comments)
  • by tptacek on 10/23/25, 7:28 PM

    I'm a tedious broken record about this (among many other things) but if you haven't read this Richard Cook piece, I strongly recommend you stop reading this postmortem and go read Cook's piece first. It won't take you long. It's the single best piece of writing about this topic I have ever read and I think the piece of technical writing that has done the most to change my thinking:

    https://how.complexsystems.fail/

    You can literally check off the things from Cook's piece that apply directly here. Also: when I wrote this comment, most of the thread was about root-causing the DNS thing that happened, which I don't think is the big story behind this outage. (Cook rejects the whole idea of a "root cause", and I'm pretty sure he's dead on right about why.)

  • by stefan_bobev on 10/23/25, 7:27 PM

    I appreciate the details this went through, especially laying out the exact timelines of operations and how overlaying those timelines produces unexpected effects. One of my all-time favourite bits about distributed systems comes from the (legendary) talk at GDC - I Shot You First[1] - where the speaker describes drawing sequence diagrams with tilted arrows to represent the flow of time and asking "Where is the lag?". This method has saved me many times, all throughout my career, from making games, to livestream and VoD services, to now fintech. Always account for the flow of time when doing a distributed operation - time's arrow always marches forward, but your systems might not.

    But the stale read didn't scare me nearly as much as this quote:

    > Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues

    Everyone can make a distributed system mistake (these things are hard). But I did not expect something as core as the service managing the leases on the physical EC2 nodes to not have a recovery procedure. Maybe I am reading too much into it, maybe what they meant was that they didn't have a recovery procedure for "this exact" set of circumstances, but it is a little worrying even if that were the case. EC2 is one of the original services in AWS. At this point I expect it to be so battle-hardened that very few edge cases would not have been identified. It seems that the EC2 failure was more impactful in a way, as it cascaded to more and more services (like the NLB and Lambda) and took more time to fully recover. I'd be interested to know what gets put in place there to make it even more resilient.

    [1] https://youtu.be/h47zZrqjgLc?t=1587

  • by jasode on 10/23/25, 8:02 AM

    So the DNS records' if-stale-then-needs-update logic was basically a variation of one of the "2 Hard Things In Computer Science" - cache invalidation. Excerpt from the giant paragraph:

    >[...] Right before this event started, one DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints. As it was slowly working through the endpoints, several other things were also happening. First, the DNS Planner continued to run and produced many newer generations of plans. Second, one of the other DNS Enactors then began applying one of the newer plans and rapidly progressed through all of the endpoints. The timing of these events triggered the latent race condition. When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them. At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan. The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing. [...]

    It outlines some of the mechanics, but some might think it still isn't a "Root Cause Analysis" because there's no satisfying explanation of _why_ there were "unusually high delays in Enactor processing". Hardware problem?!? A human-error misconfiguration causing unintended delays in Enactor behavior?!? Either the sequence of events leading up to that is considered unimportant, or Amazon is still investigating what made the Enactor behave in such an unpredictable way.
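
    For anyone trying to picture the window: a toy sketch (structure and names entirely made up, not AWS's code) of how a freshness check done once, before a slow apply loop, goes stale by the time the final write lands:

      import threading
      import time

      endpoints = {"regional": 0}   # endpoint -> plan version currently "in DNS"
      applied_version = 0           # what an enactor's freshness check reads

      def apply_plan(version: int, delay: float) -> None:
          global applied_version
          # Check-then-act: freshness is verified once, before the slow apply loop.
          if version <= applied_version:
              return
          time.sleep(delay)                 # the "unusually high delays" happen here
          for name in endpoints:
              endpoints[name] = version     # a delayed enactor lands its stale plan last
          applied_version = max(applied_version, version)

      # Enactor A picks up plan 7 but stalls; Enactor B applies plan 9 quickly.
      a = threading.Thread(target=apply_plan, args=(7, 2.0))
      b = threading.Thread(target=apply_plan, args=(9, 0.1))
      a.start(); b.start(); a.join(); b.join()
      print(endpoints)   # {'regional': 7} - the older plan overwrote the newer one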

  • by ecnahc515 on 10/23/25, 6:36 PM

    Seems like the enactor should be checking the version/generation of the current record before it applies the new value, to ensure it never applies an old plan on top of a record updated by a newer plan. It wouldn't be as efficient, but that's just how it is. It's a basic compare-and-swap operation, so it could be handled easily within DynamoDB itself, where these records are stored.
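
    A minimal sketch of that kind of conditional write, assuming (purely hypothetically) the per-endpoint plan lives in a DynamoDB item with a numeric plan_version attribute; the table and attribute names here are invented for illustration:

      import boto3
      from botocore.exceptions import ClientError

      plans = boto3.resource("dynamodb").Table("dns-plans")  # hypothetical table

      def apply_plan(endpoint_id: str, version: int, records: list) -> bool:
          """Write the plan only if it is strictly newer than whatever is already applied."""
          try:
              plans.put_item(
                  Item={"endpoint_id": endpoint_id, "plan_version": version, "records": records},
                  # The compare-and-swap: reject the write if an equal or newer plan exists.
                  ConditionExpression="attribute_not_exists(plan_version) OR plan_version < :v",
                  ExpressionAttributeValues={":v": version},
              )
              return True
          except ClientError as e:
              if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                  return False  # stale plan; never overwrite newer state
              raise
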
  • by asim on 10/24/25, 1:24 PM

    On the one hand it's an incentive to shift toward smaller-scale self-management where you don't need AWS, e.g. as an individual I just run a single DigitalOcean VPS. But on the other hand, if you're a large business the evaluation process is basically: can I tolerate this kind of incident once in a while versus the massive operational cost of doing it myself? It's really going to be a case-by-case study of who stays, who moves and who tries some multi-cloud failover. It's not one of those situations where you can just blanket-say "oh this is terrible, stupid, should never happen, let's get off AWS". This is the slow build-up of dependency on something people value. That's not going to change quickly. It might never change. The too-big-to-fail mantra of banks applies. What happens next is essentially very anticlimactic, which is to say, nothing.
  • by al_be_back on 10/24/25, 1:24 AM

    Postmortem all you want - the internet is breaking, hard.

    The internet was born out of the need for distributed networks during the Cold War - to reduce central points of failure - a hedging mechanism, if you will.

    Now it has consolidated into an ever smaller number of mono-nets. A simple mistake in one deployment could bring banking, shopping and travel to a halt globally. This can only get much worse when cyber warfare gets involved.

    Personally, I think the cloud metaphor has been overstretched and has long since burst.

    For R&D, early stage start-ups and occasional/seasonal computing, cloud works perfectly (similar to how time-sharing systems used to work).

    For well-established/growth businesses and gov, you'd better become self-reliant and tech-independent: own physical servers + own cloud + own essential services (db, messaging, payments).

    There's no shortage of affordable tech, know-how or workforce.

  • by gslin on 10/23/25, 9:12 AM

    I believe a report with timestamps not in UTC is a crime.
  • by baalimago on 10/24/25, 8:11 AM

    Did they intentionally make it dense and complicated to discourage anyone from actually reading it...?

    776 words in a single paragraph

  • by __turbobrew__ on 10/23/25, 7:45 PM

    From a meta-analysis level: bugs will always happen, formal verification is hard, and sometimes it just takes a number of years to have some bad luck (I have hit bugs which were over 10 years old, but due to the low probability of them occurring they didn't surface for a long time).

    If we assume that the system will fail, I think the logical thing to think about is how to limit the effects of that failure. In practice this means cell-based architecture, phased rollouts, and isolated zones.

    To my knowledge AWS does attempt to implement cell-based architecture, but there are some cross-region dependencies, specifically with us-east-1, due to legacy. The real long-term fix for this is designing regions to be independent of each other.

    This is a hard thing to do, but it is possible. I have personally been involved in disaster testing where a region was purposely firewalled off from the rest of the infrastructure. You find out very quickly where those cross-region dependencies lie, and many of them are in unexpected places.

    Usually this work is not done due to lack of upper-level VP support and funding, and it is easier to stick your head in the sand and hope bad things don't happen. The strongest supporters of this work are going to be the shareholders who are in it for the long run. If the company goes poof due to improper disaster testing, the shareholders are going to be the main bag holders. Making the board aware of the risks and the estimated probability of fundamentally company-ending events can help get this work funded.

  • by JCM9 on 10/24/25, 2:32 AM

    Good to see a detailed summary. The frustration from a customer perspective is that AWS continues to have these cross-region issues and they continue to be very secretive about where these single points of failure exist.

    The region model is a lot less robust if core things in other regions require US-East-1 to operate. This has been an issue in previous outages and appears to have struck again this week.

    It is what it is, but AWS consistently oversells the robustness of regions as fully separate when events like Monday's reveal they're really not.

  • by JohnMakin on 10/24/25, 12:05 PM

    Was still seeing SQS latency affecting my systems a full day after they gave the "all clear." There are red flags all over this summary to me, particularly the case where they had no operational procedure for recovery. That seems impossible to me for a hyperscaler - you never considered this failure scenario, ever? Or did you lose the engineers who did know?

    Anyway, I appreciate that this seems pretty honest and descriptive.

  • by shayonj on 10/23/25, 7:40 AM

    I was kinda surprised by the lack of CAS on the per-endpoint plan version, or of patterns like rejecting stale writes via 2PC or a single-writer lease per endpoint.
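
    As a sketch of the single-writer-lease flavour of that idea, assuming (hypothetically) a lease table keyed by endpoint, with invented table and attribute names: an enactor would only write an endpoint's plan while it holds an unexpired lease.

      import time
      import boto3
      from botocore.exceptions import ClientError

      leases = boto3.resource("dynamodb").Table("enactor-leases")  # hypothetical table

      def acquire_lease(endpoint_id: str, owner: str, ttl_seconds: int = 60) -> bool:
          """Claim the write lease for one endpoint; at most one enactor holds it at a time."""
          now = int(time.time())
          try:
              leases.put_item(
                  Item={"endpoint_id": endpoint_id, "owner": owner, "expires_at": now + ttl_seconds},
                  # Succeed only if no lease exists yet or the previous one has expired.
                  ConditionExpression="attribute_not_exists(endpoint_id) OR expires_at < :now",
                  ExpressionAttributeValues={":now": now},
              )
              return True
          except ClientError as e:
              if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                  return False  # someone else holds the lease; skip this endpoint
              raise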

    Definitely a painful one with good learnings and kudos to AWS for being so transparent and detailed :hugops:

  • by rr808 on 10/24/25, 2:06 AM

    https://newsletter.pragmaticengineer.com/p/what-caused-the-l... has a better explanation than the wall of text from AWS.
  • by everfrustrated on 10/23/25, 1:46 PM

    >Services like DynamoDB maintain hundreds of thousands of DNS records to operate a very large heterogeneous fleet of load balancers in each Region

    Does that mean a DNS query for dynamodb.us-east-1.amazonaws.com can resolve to one of a hundred thousand IP addresses?

    That's insane!

    And also well beyond the limits of Route53.

    I'm wondering if they're constantly updating Route53 with a smaller subset of records and using a low TTL to somewhat work around this.

  • by grogers on 10/24/25, 3:10 AM

    > As this plan was deleted, all IP addresses for the regional endpoint were immediately removed.

    I feel like I am missing something here... They make it sound like the DNS enactor basically diffs the current state of DNS with the desired state, and then submits the adds/deletes needed to make the DNS go to the desired state.

    With the racing writers, wouldn't that have just made the DNS go back to an older state? Why did it remove all the IPs entirely?

  • by lazystar on 10/23/25, 6:01 AM

    > Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues.

    interesting.

  • by yla92 on 10/23/25, 3:02 AM

    So the root cause is basically a race-condition-101 stale read?
  • by pelagicAustral on 10/23/25, 5:58 PM

    Had no idea Dynamo was so intertwined with the whole AWS stack.
  • by ericpauley on 10/23/25, 11:19 AM

    Interesting use of the phrase “Route53 transaction” for an operation that has no hard transactional guarantees. Especially given that the lack of transactional updates is what caused the outage…
  • by WaitWaitWha on 10/23/25, 5:52 PM

    I gather the root cause was a latent race condition in the DynamoDB DNS management system that allowed an outdated DNS plan to overwrite the current one, resulting in an empty DNS record for the regional endpoint.

    Correct?

  • by dilyevsky on 10/23/25, 8:47 PM

    Sounds like they went with Availability over Correctness with this design, but the problem is that if your core foundational config is not correct, you get no availability either.
  • by qrush on 10/23/25, 5:25 PM

    Sounds like DynamoDB is going to continue to be a hard dependency for EC2, etc. I at least appreciate the transparency and hearing the names of their internal systems.
  • by joeyhage on 10/23/25, 1:57 AM

    > as is the case with the recently launched IPv6 endpoint and the public regional endpoint

    It isn't explicitly stated in the RCA but it is likely these new endpoints were the straw that broke the camel's back for the DynamoDB load balancer DNS automation

  • by polyglotfacto2 on 10/24/25, 8:32 AM

    Use TLA+ (which I thought they did)
  • by 827a on 10/23/25, 11:37 PM

    I made it about ten lines into this before realizing that, against all odds, I wasn't reading a postmortem, I was reading marketing material designed to sell AWS.

    > Many of the largest AWS services rely extensively on DNS to provide seamless scale, fault isolation and recovery, low latency, and locality...

  • by martythemaniak on 10/24/25, 3:23 AM

    It's not DNS
    There's no way it's DNS
    It was DNS
  • by galaxy01 on 10/23/25, 5:18 AM

    Would a conditional read/write solve this? Looks like some kind of stale read.
  • by Velocifyer on 10/23/25, 8:44 PM

    This is unreadable and terribly formatted.
  • by bithavoc on 10/23/25, 8:28 PM

    Does DynamoDB run on EC2? If I read it right, EC2 depends on DynamoDB.
  • by danpalmer on 10/23/25, 11:37 PM

    A 776-word paragraph and a 28-word screen width; this is practically unreadable.
  • by alexnewman on 10/23/25, 9:00 PM

    Is it the internal dynamodb that other people use?
  • by LaserToy on 10/23/25, 4:09 PM

    TLDR: A DNS automation bug removed all the IP addresses for the regional endpoints. The tooling that was supposed to help with recovery depends on the system it needed to recover. That’s a classic “we deleted prod” failure mode at AWS scale.
  • by shrubble on 10/23/25, 7:11 PM

    The BIND nameserver required each zone to have an increasing serial number.

    So if you made a change you had to increase the number, usually a timestamp like 20250906114509, which would be older / lower-numbered than 20250906114702, making it easier to determine which zone file had the newest data.

    Seems like they sort of had the same setup but with less rigidity in terms of refusing to load older files.
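
    A tiny sketch of that rule (a hypothetical helper, not actual BIND code, and ignoring RFC 1982 serial-number wraparound): refuse to load any zone whose serial is not strictly newer than the one already being served.

      def should_load(current_serial: int, candidate_serial: int) -> bool:
          """Only a strictly higher SOA serial replaces the zone already loaded."""
          return candidate_serial > current_serial

      # Timestamp-style serials make the ordering obvious:
      assert should_load(20250906114509, 20250906114702)      # newer zone file wins
      assert not should_load(20250906114702, 20250906114509)  # stale file is refused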