Convergence of AI and HPC Workloads – Implications for Cloud-Native and Hybrid Architectures

Gaurav Kaul, HPE

AI workloads, driven by the rise in popularity of deep learning, are increasingly run on HPC infrastructure, both on-prem and in the cloud. On the surface, this convergence can be attributed to the reliance of AI workloads on accelerators such as GPUs and ASICs. While deep learning is leading a renaissance in new hardware development, even more impact can be expected when deep learning algorithms are used within HPC simulations. This creates several challenges, especially around data management, versioning, and archiving, areas that the machine learning community has spent a lot of time perfecting. Similarly, the rise of containers and container orchestration, such as Docker and Kubernetes, eases deployment in hybrid environments and scaling out. This creates something of a divergence in the software stack, as containers and orchestration, à la Kubernetes, are not yet mainstream in HPC, which still relies on traditional batch schedulers. In this talk, we look at the building blocks of such an HPC and AI platform, which is built from modules that customers can add and customize to provide a seamless experience for their AI and HPC users on a common hardware platform.