Talk

Virtual

Same code, same GPUs, same result: Our Kubernetes platform story

AI workloads kept breaking whenever the GPU stack or cluster changed. This talk shows how the platform team designed versioned GPU environments on Kubernetes so the same code reliably produces the same result across time and teams.


Adit Modi explains how a platform team discovered that the same Ray workload on the same NVIDIA GPUs could behave differently depending on when and where it ran on AWS. Small, uncoordinated changes to Kubernetes versions, GPU node groups, NVIDIA drivers, CUDA stacks, and base images (often applied through Terraform in separate repositories) made GPU jobs fragile and difficult to reproduce during incidents.

He describes how the team introduced versioned GPU environments as a first-class platform concept on AWS EKS and how AI teams now select a version instead of a cluster.

• How mismatched versions of Kubernetes, NVIDIA drivers, CUDA, and Ray on AWS lead to "it worked last week" GPU bugs
• How to model GPU environments as versioned Terraform modules and Kubernetes manifests instead of ad hoc clusters
• A practical approach to evolving these versions safely through testing, deprecation, and migration without breaking existing Ray workloads
• How this change improved on-call debugging, rollback strategies, and trust in the platform for AI teams
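To give a flavor of the "select a version instead of a cluster" idea described above, a minimal sketch of what a versioned GPU environment might look like as a consumed Terraform module. All names here (the module source, variables, and version tag) are illustrative assumptions, not taken from the talk:

```hcl
# Hypothetical sketch: an AI team pins a published GPU environment
# version instead of assembling Kubernetes, driver, CUDA, and AMI
# details per cluster. (All names are illustrative.)
module "gpu_env" {
  # The ref tag bundles a tested, mutually compatible set of
  # Kubernetes, NVIDIA driver, CUDA, and base-image pins.
  source = "git::https://example.com/platform/gpu-environment.git?ref=v2.3.0"

  cluster_name = "ml-prod"
  node_group   = "ray-workers"
}
```

Under this model, upgrading or rolling back is a one-line change to the `ref`, which is what makes testing, deprecation, and on-call rollback tractable.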


Register for PlatformCon 2026