Community Confessions
The Wall of Pain
42 entries. Every single one is true. Names have been changed to protect the guilty.
Spent 6 hours debugging an intermittent outage. It was a duplex mismatch from 2009. The fix was one command. I will never speak of this.
Replaced the fiber. Replaced the GBICs. Replaced the switch. Replaced the router. It was the patch cable. The patch cable that came with the printer in 2016.
User's 'urgent, business-critical' ticket: emojis in Outlook were 'too colorful' and giving them a headache.
Traced a 400ms latency spike to a spanning-tree TCN being generated every 30 seconds. The source was a cheap unmanaged switch some developer plugged in under their desk 'just temporarily' in 2021.
On-call at 3am for a 'full network outage.' The building's power had been cut for scheduled maintenance. Facilities forgot to tell IT. Again.
The OSPF adjacency between two routers was flapping every 47 minutes like clockwork. Took three days to find it: a scheduled script was running 'traceroute' with a TTL of 1 toward the neighbor, generating ICMP errors that reset the dead-interval timer.
A developer asked me why his TCP connections were dropping every 350 seconds. I explained NAT table timeouts. He said 'can you just make it not do that?' I said no. He filed a ticket with my manager.
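For anyone stuck with the same developer: the usual workaround, short of changing the NAT policy, is application-side TCP keepalives set well below the idle timeout. A minimal Python sketch, assuming a Linux host; the endpoint and probe timings are placeholders, not recommendations.

```python
# A minimal sketch: enable TCP keepalives shorter than the NAT idle timeout
# so the translation entry never goes stale. The 350-second timeout is from
# the story; host, port, and the probe timing below are assumptions.
import socket

NAT_IDLE_TIMEOUT = 350                   # seconds, per the confession
KEEPALIVE_IDLE = NAT_IDLE_TIMEOUT // 3   # start probing well before the entry expires

def open_keepalive_connection(host: str, port: int) -> socket.socket:
    s = socket.create_connection((host, port))
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knobs; other platforms expose different option names.
    if hasattr(socket, "TCP_KEEPIDLE"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPALIVE_IDLE)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # probe every 30s
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # give up after 3 misses
    return s

# conn = open_keepalive_connection("app.example.internal", 5432)   # hypothetical endpoint
```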
Received a P1 at 2am: 'Core router is on fire.' It was metaphorical. The CPU was at 78% due to 'debug ip packet' being left on. Nobody knew who enabled it.
Site-to-site VPN dropped every night at exactly 22:17. For two weeks I thought it was a timeout issue. It was a scheduled task on the remote side that ran 'ipconfig /release' followed by 'ipconfig /renew' and nobody thought to mention it.
The QoS policy marking voice traffic as Best Effort has been there for 8 years. Voice has been 'slightly degraded' for 8 years. Everyone assumed it was the phones.
Submitted a change request to fix a critical routing loop. It was approved 6 weeks later. The loop had fixed itself by then. The change request asked me to document what changed. I wrote 'the universe healed.'
Got paged for 'internet is down.' Ran over. One user's WiFi adapter was disabled. The user had 34 coworkers convinced the internet was down. I re-enabled the adapter. I received no credit.
Cisco TAC told me to upgrade the IOS. I upgraded the IOS. It broke three other things. Cisco TAC told me to upgrade the IOS again.
We had a BGP peer that kept resetting. After two weeks and a Wireshark capture reviewed by 4 people, we found that a firewall was resetting TCP sessions over 60 minutes old. The BGP keepalive was 59 minutes. Someone had done this intentionally. They no longer work here.
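The general moral: a keepalive needs real margin under whatever middlebox timeout it is racing, not one minute of it. A back-of-the-napkin Python check using the story's numbers; the one-third rule of thumb is a common convention, not a standard.

```python
# Rough sanity check: a keepalive racing a 60-minute session timeout with a
# 59-minute interval has no margin at all. The one-third rule is an assumption.
def keepalive_has_margin(keepalive_s: int, session_timeout_s: int) -> bool:
    return keepalive_s * 3 <= session_timeout_s

print(keepalive_has_margin(59 * 60, 60 * 60))   # False: what we inherited
print(keepalive_has_margin(60, 60 * 60))        # True: a sane keepalive interval
```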
The 'mystery packet drops' that happened only on Tuesday afternoons: a backup job ran, pegged the switch CPU, and the control plane couldn't process BPDUs, causing an STP topology change. Every Tuesday. For a year.
User reported slow internet. Ran speed test from their desk: 850Mbps down, 920Mbps up on a Gigabit connection. They said 'see, it's slow, it used to be 1000.' I stared at the wall for a while.
The data center had 'mysterious packet loss' at layer 2. We ran packet captures for 4 days. It was a network engineer testing a new IDS in inline mode and forgetting to tell anyone.
Vendor's recommended configuration guide said 'do not use in production environments.' We asked why. They said 'that's just boilerplate.' We used it in production. It was not boilerplate.
Received an escalation that 'BGP is broken.' Checked the BGP table. 47 prefixes in the global table. We had accidentally set a maximum-prefix limit of 47 and hit it at exactly 9am on a Monday.
Decommissioned a router that was supposedly unused. Four services went down. None of them were documented. Two of them were load-testing tools pointed at prod that nobody had shut off.
A 10-year-old ACL had 'permit ip any any' at line 999. Everything above it was so specific that it only caught 3 packets in 10 years. The ACL was labeled 'DO NOT REMOVE – UNKNOWN PURPOSE.'
Traced an authentication failure to an NTP drift of 34 seconds between the RADIUS server and the authenticating device. The NTP server was our own. It had been syncing to itself for 3 years.
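A cheap way to catch this before the authentication tickets start: compare the clock against a reference that is not your own NTP server. A sketch assuming the third-party ntplib package; the reference host and the 5-second alarm threshold are arbitrary assumptions.

```python
# A sketch: compare the local clock against an outside reference so an NTP
# server quietly syncing to itself gets noticed. Assumes the third-party
# `ntplib` package; threshold and reference host are assumptions.
import ntplib

ALARM_SECONDS = 5.0
REFERENCE = "pool.ntp.org"   # any reference that is not our own NTP server

response = ntplib.NTPClient().request(REFERENCE, version=3)
print(f"offset vs {REFERENCE}: {response.offset:+.3f}s")
if abs(response.offset) > ALARM_SECONDS:
    print("clock drift exceeds threshold; RADIUS/OTP auth is about to get weird")
```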
The office moved to a new building. The network worked perfectly. We got zero credit. Three weeks later one user's VLAN was wrong and IT was described in the all-hands as 'always causing problems.'
A P1 at midnight: 'authentication is broken for all VPN users.' Root cause: someone had enabled two-factor auth on the RADIUS server and then gone on vacation without telling anyone. The MFA app had been decommissioned in 2022.
I was asked to 'make the internet faster.' I optimized routing, tuned TCP parameters, deployed a CDN. 30% improvement. They noticed a YouTube ad buffer once and filed another ticket.
The 'new' firewall that had been running for 8 months had never been given a license. It had been running in evaluation mode, silently missing feature updates. The vendor called it a 'known limitation.'
Discovered a GRE tunnel between two routers with no documentation. It was carrying production traffic. It had a TTL of 64. It had been there for 11 years. Nobody knew who built it. It is still there.
The management VLAN was VLAN 1. The native VLAN was VLAN 1. The voice VLAN was VLAN 1. We inherited this network. It had been in production for 6 years with no incidents. I still have nightmares.
Got paged because a server team's monitoring dashboard showed 100% packet loss to a host. The host was behind a firewall that blocked ICMP. It had always blocked ICMP. The dashboard was new.
A critical application was unreachable from a branch office. After 3 hours: the application team had moved the server to a new IP address and updated the DNS record but not the firewall rule. The old IP was unreachable. The DNS record was correct. The firewall had the old IP. The application team described this as 'a network issue.'
The wireless survey recommended 22 APs. Budget approved 8. Performance complaints started on day one. This is in the ticket system under 'network team to investigate.'
We had a 100% packet loss event that lasted 90 seconds every Monday at 10:02am. It was a weekly conference call that started at 10:00. The building's wireless was overwhelmed by 200 people joining simultaneously. The 'fix' was to stagger the invite times by 3 minutes.
Spent two weeks chasing intermittent connectivity to a partner. Their firewall admin had configured a 'temporary' permit rule that was scheduled to expire — set in 2019 with a 5-year timer. It expired during my on-call shift. I do not believe in coincidence anymore.
The firewall ruleset had 4,847 rules. The first rule was 'permit any any log' for 'troubleshooting.' Added in 2017. The logs had been filling up and silently rotating for 7 years. Nothing else mattered.
BGP session to our transit provider kept resetting every few hours. Their NOC insisted it was on our side. After 3 days of captures we proved it was their MD5 password rotation script running on a stale config. They told us to 'just match it on our end.' We did not.
We were leaking a /24 to the internet for 11 minutes before our upstream filtered it. We didn't notice. A stranger emailed us from a NANOG mailing list to let us know. He was kinder about it than my manager was.
Inherited a firewall with rule descriptions like 'ask Brian,' 'temporary - 2014,' 'do not delete,' and 'works do not touch.' Brian retired in 2018. Removing any rule causes something to break. Adding rules has been banned by management. The firewall is now sentient.
iBGP full mesh of 9 routers. Someone added a 10th and only peered it with two others. For three months, certain prefixes only existed on certain routers. Traffic took scenic routes. Users called it 'the slow days.'
DHCP scope exhaustion at 4:47pm every Friday. For months. Turned out the guest WiFi was bridged into the corporate VLAN by a misconfigured AP, and a nearby café's customers were leasing our addresses on their walk home.
Customer reported their VPN was slow. We ran tests: 940Mbps throughput. They insisted it was slow. After an hour of back-and-forth, they admitted they were comparing it to their home internet. Their home internet was a 200Mbps connection. The VPN was 'slow' because it was 'only 940.'
We had a route-map with a regex that matched 'AS path contains 7018.' It was meant to prefer one transit provider. Someone added a second clause years later that contradicted the first. Both clauses were 'permit.' Traffic chose its own adventure for 4 years.
DNS resolution was slow company-wide. Our recursive resolver had been forwarding to a public DNS service that had quietly stopped sending UDP responses larger than 512 bytes, so every truncated answer forced a retry over TCP. Latency went from 2ms to 180ms. The DNS team said 'DNS is fine, it's the network.' It is always the network.
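If you ever need to prove this to a DNS team: check for the TC flag yourself and time the TCP retry. A sketch assuming the third-party dnspython package; the resolver address and query name are placeholders.

```python
# A sketch: send a plain (no EDNS0) query, check the TC flag, and time the
# TCP retry that a truncating forwarder forces on every large answer.
# Assumes the third-party `dnspython` package; resolver and name are placeholders.
import time

import dns.flags
import dns.message
import dns.query

RESOLVER = "192.0.2.53"            # placeholder forwarder address
QNAME = "big-txt-record.example"   # placeholder name with a large answer

query = dns.message.make_query(QNAME, "TXT")   # no EDNS0, so 512-byte UDP limit
start = time.monotonic()
reply = dns.query.udp(query, RESOLVER, timeout=2)
if reply.flags & dns.flags.TC:
    reply = dns.query.tcp(query, RESOLVER, timeout=2)   # the silent retry
    print("truncated over UDP; the answer needed a TCP round trip")
print(f"resolved in {(time.monotonic() - start) * 1000:.1f} ms")
```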
All stories are based on real events in the network engineering community.
Any resemblance to your specific disaster is coincidental and also entirely plausible.
No blame has been assigned. Change management is still reviewing the blame assignment request.
ETA: next maintenance window.