Talk

Virtual

The 900-second ghost: Debugging half-open TCP

A deep-dive into a 15-minute production outage caused by half-open TCP connections, exploring FIN vs. RST packets, Zero Window effects, and lessons on kernel keepalives and debugging silent network failures at hyperscale.

CEST

What happens when a production incident lasts 15 minutes, yet monitoring systems report everything as "green"? While running on-demand services at hyperscale for mobile applications in Southeast Asia's most populous countries, a team encountered a silent and elusive failure mode: half-open TCP connections.

In this deep-dive session, the speaker uses Wireshark to conduct a packet-level autopsy of a real-world incident that impacted millions of messages. The talk examines the critical differences between FIN and RST packets, demonstrating how the absence of a single 40-byte segment resulted in 900 seconds of effective downtime. Attendees will learn why relying on the default Linux kernel TCP keepalive settings (the tcp_keepalive_* sysctls) is unsafe for high-availability systems.
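To make the keepalive point concrete, here is a minimal Python sketch, not taken from the talk, of overriding the kernel defaults on a single socket. The stock Linux values (tcp_keepalive_time=7200, tcp_keepalive_intvl=75, tcp_keepalive_probes=9) mean an idle dead peer is detected only after roughly two hours. The helper names and the 30/10/3 values are illustrative assumptions, not recommendations from the session.

```python
import socket

# Stock Linux sysctl defaults: tcp_keepalive_time, _intvl, _probes.
LINUX_DEFAULTS = {"idle": 7200, "interval": 75, "probes": 9}


def worst_case_detection_seconds(idle, interval, probes):
    """Idle time before the first probe plus all unanswered probes."""
    return idle + interval * probes


def enable_keepalive(sock, idle=30, interval=10, probes=3):
    """Turn on keepalive for one socket with tighter, illustrative timers.

    Hypothetical helper: the default values here are assumptions chosen
    for the example, not settings quoted from the talk.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket options exist on Linux; skip any the platform lacks.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return worst_case_detection_seconds(idle, interval, probes)


if __name__ == "__main__":
    print(worst_case_detection_seconds(**LINUX_DEFAULTS))  # 7875
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(enable_keepalive(s))  # 60
```

Setting the options per socket leaves the system-wide sysctls untouched, which is one way to tighten detection for a single service without affecting everything else on the host.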

The session also explores the Zero Window phenomenon and how TCP backpressure can cause message timestamp drift. This is a story of persistence, spanning detailed packet captures and collaboration with a major cloud provider's networking team to correct load balancer FIN-delivery behavior.
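The backpressure effect can be reproduced on a single machine. The sketch below is an illustration under stated assumptions, not the setup from the incident: it opens a loopback TCP connection, never reads on the receiving side, and writes until the kernel refuses more data. At that point the receiver's advertised window has typically shrunk to zero and the sender's buffer is full, so any message stamped before being queued would be delivered later than its timestamp suggests. The function name and buffer sizes are hypothetical.

```python
import socket


def fill_until_blocked(chunk_size=65536, max_bytes=256 * 1024 * 1024):
    """Write to a peer that never reads; return (bytes_accepted, blocked)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.connect(listener.getsockname())
    receiver, _ = listener.accept()  # accepted, but recv() is never called

    sender.setblocking(False)
    chunk = b"x" * chunk_size
    accepted = 0
    blocked = False
    try:
        while accepted < max_bytes:
            try:
                accepted += sender.send(chunk)
            except BlockingIOError:
                # Kernel buffers are full: backpressure reached the writer.
                blocked = True
                break
    finally:
        sender.close()
        receiver.close()
        listener.close()
    return accepted, blocked


if __name__ == "__main__":
    accepted, blocked = fill_until_blocked()
    print(f"kernel accepted {accepted} bytes, blocked={blocked}")
```

The number of bytes accepted before blocking depends on the kernel's buffer autotuning, so it varies between hosts.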


Register for PlatformCon 2026