by doppelgunner on 8/17/25, 2:31 PM with 11 comments
All that time lost for one tiny dot.
by SvenL on 8/17/25, 2:54 PM
by mikewarot on 8/18/25, 2:12 PM
We recently repaired a TV he acquired from 1948.... we eventually got it working, albeit with a very dim display, as the CRT was older than most of us here on HN. All was well, but there was an intermittent... I eventually traced it down to a single wire... that looked like it was connected, but measured as an open circuit. It hadn't been soldered correctly in 1948, and percussive maintenance would make it contact well enough to then arc and stay connected (it was in a high-voltage focus line).
We had a Kilowatt RF Power Amplifier, a "grounded grid" design, that just wasn't working correctly: no gain. After much grief, it turned out that the grid wasn't grounded... the bolt grounding the grid looked like it was well connected, but measured as an open circuit!
In software, it's amazing how many times = and == get confused, one of the many reasons I prefer Pascal, with = vs :=
Of course the ultimate story is the "magic/more magic" story from MIT.[1]
[1] https://web.archive.org/web/20250713004447/http://catb.org/j...
by john-tells-all on 8/17/25, 2:43 PM
Errors are no problem. Silent errors, not so much.
by carlnewton on 8/17/25, 10:07 PM
by doctor_radium on 8/17/25, 3:14 PM
by rmb177 on 8/17/25, 5:40 PM
by aspenmayer on 8/17/25, 9:06 PM
by incomingpain on 8/17/25, 5:43 PM
At some point after my design and implementation, a junior sysadmin on my team added an inline, fail-open security device in front of my firewall. It was a requirement from our cyber security insurance. The insurer claimed that it would always fail open.
The junior admin never told me, never updated documentation, never set up anything in network monitoring to indicate anything about that device. But we also didn't have any access to it. The only thing I knew was that it used the public IP of my firewall to transparently tunnel back to the insurer for management.
Months later I get a ticket about a 4-second outage. Literally nothing in my Cisco logs, but I can confirm a 4-second outage in the network monitor. So I go to the datacenter operators who handle the transit and such. They come back and they don't see anything at all; even their network stats show nothing. Presumably they're not polling as often as we are.
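A rough sketch of why a coarser poller can miss a blip like this entirely. The numbers here (1-second vs 60-second polling, an outage starting at t=61 s) and the function name are made up for illustration; they are not the actual intervals either monitoring system used.

```python
def polls_that_see_outage(outage_start, outage_len, poll_interval, horizon):
    """Return the poll times (in seconds) that land inside the outage window.

    Hypothetical model: the monitor samples at t = 0, poll_interval, 2*poll_interval, ...
    and only notices the outage if a sample falls in [outage_start, outage_start + outage_len).
    """
    outage_end = outage_start + outage_len
    return [
        t for t in range(0, horizon, poll_interval)
        if outage_start <= t < outage_end
    ]


# Assumed numbers: a 4-second outage starting at t = 61 s, watched for 5 minutes.
fast = polls_that_see_outage(61, 4, poll_interval=1, horizon=300)
slow = polls_that_see_outage(61, 4, poll_interval=60, horizon=300)

print(fast)  # polls from the 1-second monitor that hit the outage
print(slow)  # polls from the 60-second monitor that hit the outage
```

Under those assumptions the 1-second monitor catches the outage on several samples, while the 60-second monitor samples on either side of it and records nothing, which would be consistent with one side seeing the drop and the other not.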
It's 4 seconds, so it didn't get much attention, until it became an intermittent issue about a week later. Not every day, not at some specific time of day. Sometimes it happened in the middle of the night and virtually nobody noticed except our network monitor. Thin clients usually, but not always, reconnected within seconds. At most it was like a 10-second window when the organization, at minimum hundreds of people, stopped working and then started working again.
Cisco TAC gets on it and I get a proper CCIE, and we do packet captures showing traffic leaving our firewall properly. It's just never coming back. The datacenter op, on the other hand, had the same experience with their Dell 10 Gbit switches. Traffic was leaving our links. Link status never dropped on either side.
TAC is like, go check and see if a repeater in between is failing. We have onsite hands do the check, but the DC had no repeaters. It was a solid fiber line in the same room into our rack. Onsite hands saw the tap in our rack but figured that was the firewall and failed to mention the device; they just said there was a solid fiber line into the rack. The DC then closed the case, and TAC didn't think it was on us. The firewall clearly sends the packets out the live link.
At this point we'd had maybe 60 seconds of outage; ~10 events.
An outage completely unrelated to me now happens, with minutes of downtime. The DC is at fault now, but the last messaging to the CIO was that the DC wasn't at fault for the previous outages. Not to mention the DC hasn't said squat about any outage, even though they always send out maintenance emails, even if it's unrelated. So I get about a 4-hour meeting in the hot seat to address the problems with my network design. The logic being that if the problem isn't possible according to the experts from the DC and TAC... it's on me.
Top severity case with TAC next, and the engineer is new, as my current engineer isn't online. They find the outage was the DC, and I even get an email SUPER LATE (at least 12 hours late) saying they were done fixing the new outage.
Too late now... so TAC ends up being completely unhelpful. I make the 4-hour drive from Detroit to find an unmarked black box in front of my firewall. I had no idea what it was. I took it out of the loop, causing a brief outage. Then I took it out of the rack and opened it up to see what was inside. I took pictures, thinking industrial espionage or something.
Obviously our problem was solved after this.
I later got in trouble for voiding the warranty on that device, which was rented from the insurer, and got an explanation of what the device was, etc. All I could do was laugh.