Talk
Virtual
When cluster upgrades deadlock: Topology, PVCs & autoscaling under stress
Cluster upgrades are supposed to be routine. Ours weren’t. During a GKE upgrade, topology constraints, PVC binding, autoscaling limits, and PDBs interacted in unexpected ways. We share what broke and how we redesigned for safer disruption.
CEST
Meet the speakers
Maintaining parity between production and non-production clusters sounds like best practice, and that was the intent. But during a GKE upgrade, that parity exposed unanticipated trade-offs. As nodes drained, PodDisruptionBudgets slowed eviction, zonal PVC binding restricted where pods could be scheduled, and autoscaler scale-ups did not always add usable capacity. In staging, where resources were tighter, the behavior was even more pronounced. What looked like isolated issues turned out to be deeply connected.
In this talk, the speakers walk through what happened, how topology and autoscaling interacted under stress, and the practical changes they made to balance environment parity, cost efficiency, and upgrade reliability.
