Talk
On-demand
Virtual
The 900-second ghost: Debugging half-open TCP
A deep dive into a 15-minute production outage caused by half-open TCP connections, covering FIN vs. RST packets, Zero Window effects, and lessons on kernel keepalives and on debugging silent network failures at hyperscale.
15 mins
What happens when a production incident lasts 15 minutes, yet monitoring systems report everything as "green"? While running on-demand services at hyperscale for mobile applications in Southeast Asia's most populous countries, a team encountered a silent and elusive failure mode: half-open TCP connections.
In this deep-dive session, the speaker uses Wireshark to conduct a packet-level autopsy of a real-world incident that impacted millions of messages. The talk examines the critical differences between FIN and RST packets and demonstrates how the absence of a single 40-byte segment resulted in 900 seconds of effective downtime. Attendees will learn why relying on the default Linux kernel tcp_keepalive_* settings is unsafe for high-availability systems.
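To make the keepalive point concrete, here is a minimal Python sketch (not taken from the talk). It assumes the stock Linux defaults of tcp_keepalive_time=7200, tcp_keepalive_intvl=75, and tcp_keepalive_probes=9. The helper names and the tighter example values (30 s idle, 10 s interval, 3 probes) are hypothetical, chosen only for illustration.

```python
import socket

# Stock Linux defaults for the net.ipv4.tcp_keepalive_* sysctls.
LINUX_DEFAULT_IDLE_S = 7200      # tcp_keepalive_time
LINUX_DEFAULT_INTERVAL_S = 75    # tcp_keepalive_intvl
LINUX_DEFAULT_PROBES = 9         # tcp_keepalive_probes


def worst_case_detection_s(idle_s: int, interval_s: int, probes: int) -> int:
    """Approximate seconds before keepalive gives up on a silent, idle peer."""
    return idle_s + interval_s * probes


def enable_keepalive(sock: socket.socket, idle_s: int = 30,
                     interval_s: int = 10, probes: int = 3) -> None:
    """Hypothetical helper: turn on keepalive with tighter per-socket timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket overrides are platform-specific, so set only those
    # that this Python build actually exposes.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    print(worst_case_detection_s(LINUX_DEFAULT_IDLE_S,
                                 LINUX_DEFAULT_INTERVAL_S,
                                 LINUX_DEFAULT_PROBES))  # 7875
    print(worst_case_detection_s(30, 10, 3))             # 60

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
```

Under these assumptions, the defaults take more than two hours to notice a dead idle peer, while the illustrative values bring that down to about a minute. Suitable values depend on the workload and on any idle timeouts enforced by intermediaries such as load balancers.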
The session also explores the Zero Window phenomenon and how TCP backpressure can cause message timestamp drift. This is a story of persistence, spanning detailed packet captures and collaboration with a major cloud provider's networking team to correct load balancer FIN-delivery behavior.
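The timestamp drift caused by backpressure can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the incident's actual data: one message is created per second, delivery is instantaneous while the receiver's window is open, and everything queued during a Zero Window stall drains the moment the window reopens. The function name and parameters are hypothetical.

```python
def delivery_delays(n_messages: int, stall_start: int, stall_end: int) -> list[int]:
    """Return per-message delay (delivery time minus creation time), in seconds.

    Message i is created at second i. While the receiver advertises a zero
    window during [stall_start, stall_end), the sender cannot transmit, so
    messages created in that interval wait until stall_end.
    """
    delays = []
    for created_at in range(n_messages):
        if stall_start <= created_at < stall_end:
            delivered_at = stall_end   # held back until the window reopens
        else:
            delivered_at = created_at  # window open: delivered immediately
        delays.append(delivered_at - created_at)
    return delays


if __name__ == "__main__":
    # Receiver stalls from second 2 up to (not including) second 6.
    print(delivery_delays(8, 2, 6))  # [0, 0, 4, 3, 2, 1, 0, 0]
```

In this model, a consumer that compares the creation timestamp with the arrival time sees the gap jump to the full stall length and then shrink. This is one way backpressure can show up as drift even though no message is lost.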
