Talk

Virtual

Lessons learned orchestrating multi-tenant GPUs on OpenShift AI with NVIDIA KAI (G/H200)

Shared production GPUs for AI/ML, done right: lessons from multi-tenant orchestration on OpenShift AI with NVIDIA KAI on G/H200 hardware, fronted by Traefik ingress, on Dell AI servers. Covers isolation, scheduling, MIG trade-offs, tuning, upgrades, and day-2 ops.

How do teams run shared, production-grade GPUs for AI/ML safely and efficiently? This experience report distills hard-won lessons from implementing multi-tenant GPU orchestration on OpenShift AI using NVIDIA KAI on G/H200 hardware, fronted by Traefik and backed by Dell Technologies platforms. It covers tenant isolation patterns (namespaces, quotas, priority classes), scheduling on heterogeneous nodes, MIG versus full-GPU trade-offs, throughput versus latency tuning, driver and firmware pitfalls, upgrade and rollback strategies, and day-2 operations (observability, autoscaling, chargeback). Attendees can expect practical manifests and guardrails they can apply immediately.
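To give a flavor of the "practical manifests and guardrails" mentioned above, here is a minimal sketch of one tenant-isolation guardrail (a per-namespace GPU ResourceQuota plus a PriorityClass), written with the official Kubernetes Python client. The namespace, quota values, and priority-class name are hypothetical illustrations, not material from the talk.

```python
# Minimal sketch of a per-tenant GPU guardrail, assuming the official
# Kubernetes Python client (pip install kubernetes) and cluster-admin access.
# Namespace, quota values, and priority-class name are illustrative only.
from kubernetes import client, config


def apply_tenant_guardrails(namespace: str = "tenant-a") -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside a pod

    # Cap how many GPUs (and how much CPU/memory) the tenant can request in total.
    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="gpu-quota", namespace=namespace),
        spec=client.V1ResourceQuotaSpec(
            hard={
                "requests.nvidia.com/gpu": "4",
                "requests.cpu": "64",
                "requests.memory": "256Gi",
            }
        ),
    )
    client.CoreV1Api().create_namespaced_resource_quota(
        namespace=namespace, body=quota
    )

    # Give the tenant's batch jobs a lower scheduling priority than
    # latency-critical serving workloads, so they can be preempted.
    priority = client.V1PriorityClass(
        metadata=client.V1ObjectMeta(name="tenant-a-batch"),
        value=1000,
        global_default=False,
        description="Preemptible batch workloads for tenant-a",
    )
    client.SchedulingV1Api().create_priority_class(body=priority)


if __name__ == "__main__":
    apply_tenant_guardrails()
```

In practice, objects like these would typically be managed declaratively per tenant namespace (for example via GitOps) rather than created imperatively as shown.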

