Resilient on-premises AI workloads on Kubernetes with hyperconverged infrastructure
This session will explore how platform engineers can build resilient on-premises infrastructure for AI workloads on OpenShift. It will cover best practices in networking, storage, and compute, as well as strategies for backup, disaster recovery, and automation to ensure high availability and operational efficiency.
As AI workloads continue to grow in complexity and demand, platform engineers are tasked with building resilient, scalable infrastructure. This talk focuses on deploying OpenShift clusters on hyperconverged infrastructure (HCI), which integrates compute, storage, and networking into a single system, simplifying management and improving performance. Shajeer Mohammed will discuss how to design a fault-tolerant system with redundant servers and networks that eliminates single points of failure, and in particular will explore the role of software-defined storage (SDS) in providing scalability, resilience, and seamless data access for AI workloads.

Beyond infrastructure design, ensuring business continuity is crucial. The session will cover implementing backup policies and disaster recovery (DR) plans that minimize downtime and safeguard data against loss in the event of a disaster. Attendees will also compare the benefits and trade-offs of running workloads on bare metal versus virtual machines, with an emphasis on performance and reliability. The talk closes with guidance on automated monitoring and alerting, firmware upgrades, auto-scaling, and proactive issue resolution to streamline day-two operations.
Shajeer Mohammed
Lead Architect-STSM, Spectrum Fusion