Efficient Access to Shared GPU Resources: Part 1

Mechanisms, Motivations and Use Cases for GPU concurrency on Kubernetes

By Dejan Golubovic, Diana Gaponcic, Diogo Guerra, Ricardo Rocha | Monday, January 09, 2023

Tags: gpu, public

GPUs are shaping the way organizations access and use their data, and CERN is no exception. Traditional High Energy Physics (HEP) analysis and deployments are being rethought, and accelerators remain the key to enabling efficient Machine Learning (ML).

In this series of blog posts we will cover the use cases and technologies that motivate and enable efficient sharing of GPUs on Kubernetes. For both on-premises and public cloud (on-demand) access to accelerators, this can be a key factor in the cost-effective use of these resources.

Note

This post focuses on NVIDIA cards. Similar mechanisms might be offered by other vendors.

Motivation

CERN’s main facility today is the Large Hadron Collider. Its experiments generate billions of particle collisions per second, and these numbers are set to grow with planned upgrades. The result is hundreds of petabytes of data to be reconstructed and analyzed using large amounts of computing resources.

Even more data is generated from physics simulation, which remains a cost-effective way to guide the design and optimization of these giant machines, as well as a basis for comparing results with a well-defined physics model.

GPUs are taking a central role in different areas:

As a more efficient replacement for traditional CPU cores in simulation, or even to replace custom hardware with more flexible resources in online triggers.
In ML for particle classification during event reconstruction (GNN), much faster generation of simulation data (3DGAN), reinforcement learning for beam calibration, among others.

As demand grows, one important aspect is to ensure this type of (expensive) hardware is optimally utilized. This can be a challenge given:

Workloads are often not capable of taking full advantage of the resources due to usage patterns, suboptimal code, etc. As with CPU virtualization, enabling resource sharing can mitigate this loss.
Many of these workloads are spiky, which can trigger significant waste if resources are locked for long periods. This is often the case during the interactive and development phase of the different components, or for services with uneven load.

Kubernetes has supported different types of GPUs for a while now, although not as first-class resources and only through dedicated, full-card allocation. With demand growing and Kubernetes established as the de facto platform in many areas, multiple solutions exist today to enable concurrent access to GPU resources from independent workloads.
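To make dedicated, full-card allocation concrete, here is a minimal sketch of a Pod manifest that requests one whole GPU. It assumes the NVIDIA device plugin is deployed on the cluster and advertises cards under the `nvidia.com/gpu` resource name. The helper `full_gpu_pod`, the pod name and the image are hypothetical and only serve the illustration.

```python
import json


def full_gpu_pod(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that requests whole GPUs.

    Assumption: the NVIDIA device plugin exposes cards as the extended
    resource "nvidia.com/gpu". Extended resources are requested in whole
    units, so a fraction of a card cannot be expressed this way.
    """
    if not isinstance(gpus, int) or gpus < 1:
        raise ValueError("GPUs must be requested as a positive whole number")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,  # hypothetical image name
                    "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                }
            ],
        },
    }


# Print the manifest as JSON, which kubectl accepts as well as YAML.
manifest = full_gpu_pod("gpu-example", "example.org/cuda-workload:latest")
print(json.dumps(manifest, indent=2))
```

With such a request the card is reserved for this pod until the pod terminates, whether or not the workload keeps it busy.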

It is essential to understand each solution’s benefits and tradeoffs to enable an informed decision.

GPU Concurrency Mechanisms

Note

By concurrency we mean going beyond simple GPU sharing. GPU sharing covers deployments where a given pool of GPUs is shared, but each card is assigned to only one workload at a time, whether for a limited or an unlimited period.
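The distinction can be sketched with two toy allocators. This is a conceptual illustration only: the classes `ExclusivePool` and `ConcurrentPool` and the `slots_per_card` parameter are hypothetical and do not correspond to any NVIDIA or Kubernetes mechanism.

```python
class ExclusivePool:
    """GPU sharing: a pool of cards, each serving one workload at a time."""

    def __init__(self, cards):
        self.owner = {card: None for card in cards}

    def acquire(self, workload):
        for card, owner in self.owner.items():
            if owner is None:
                self.owner[card] = workload
                return card
        return None  # every card is taken, so the workload has to wait

    def release(self, card):
        self.owner[card] = None


class ConcurrentPool:
    """GPU concurrency: several workloads may run on the same card."""

    def __init__(self, cards, slots_per_card):
        self.tenants = {card: [] for card in cards}
        self.slots_per_card = slots_per_card  # hypothetical per-card limit

    def acquire(self, workload):
        # Place the workload on the least loaded card.
        card = min(self.tenants, key=lambda c: len(self.tenants[c]))
        if len(self.tenants[card]) >= self.slots_per_card:
            return None
        self.tenants[card].append(workload)
        return card


shared = ExclusivePool(["gpu0"])
print(shared.acquire("training"), shared.acquire("notebook"))

concurrent = ConcurrentPool(["gpu0"], slots_per_card=2)
print(concurrent.acquire("training"), concurrent.acquire("notebook"))
```

In the first pool the second workload gets no card until the first one releases it. In the second pool both workloads land on `gpu0` at the same time.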

The figure below summarizes the multiple concurrency options with NVIDIA cards.

Of the different mechanisms above, we will not cover those that are CUDA-specific (single- and multi-process CUDA), and we will only briefly cover the possibility of simply co-locating workloads on a single card.

Co-located Workloads

Co-locating workloads refers to uncontrolled access to a single GPU. At CERN, an example of such an offering is the lxplus interactive service, which has dedicated nodes with GPUs. Users log in to shared virtual machines, each exposing a single card via PCI passthrough.

Advantages:

The easiest way to provide GPU concurrency.
Works by simply exposing the card via PCI passthrough.