Efficient Access to Shared GPU Resources: Part 5
This is part 5 of a series of blog posts about GPU concurrency mechanisms. In part 1 we focused on the pros and cons of different solutions available on Kubernetes, in part 2 we dove into the setup and configuration details, in part 3 we analyzed the benchmarking use cases, and in part 4 we benchmarked the time slicing mechanism.
In this part we focus on benchmarking the MIG (Multi-Instance GPU) performance of NVIDIA cards.
Setup
The benchmark setup will be the same for every use case:
- Full GPU (MIG disabled) vs MIG enabled (7g.40gb partition)
- MIG enabled, different partitions:
- 7g.40gb
- 3g.20gb
- 2g.10gb
- 1g.5gb
Keep in mind that the nvidia-smi command will not report GPU utilization when MIG is enabled. This is the expected behaviour, as NVML (the NVIDIA Management Library) does not support attributing utilization metrics to MIG devices. For monitoring MIG-capable GPUs, it is recommended to rely on NVIDIA DCGM instead.
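As a hedged example (assuming DCGM is installed and its host engine is running; the field IDs are the standard DCGM profiling fields, and the exact dcgmi flags may vary between DCGM versions), per-device metrics can be sampled with dcgmi:

```python
import subprocess

# Sample DCGM profiling fields once per second, ten times:
#   1001 = graphics engine active, 1004 = tensor pipe active, 1005 = DRAM active
subprocess.run(
    ["dcgmi", "dmon", "-e", "1001,1004,1005", "-d", "1000", "-c", "10"],
    check=True,
)
```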
More details about the driver installation, MIG configuration, and environment setup can be found in part 2 of this series.
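For convenience, here is a minimal sketch of enabling MIG and creating instances with nvidia-smi on GPU 0. The profile names are those of the A100 40GB, and the layout is only an example, not necessarily the one used for this benchmark:

```python
import subprocess

def run(cmd: str) -> None:
    """Echo and execute a shell command, failing loudly on errors."""
    print(f"$ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

# Enable MIG mode on GPU 0 (may require stopping GPU clients and a reset).
run("nvidia-smi -i 0 -mig 1")

# List the GPU instance profiles supported by the card.
run("nvidia-smi mig -lgip")

# Create one 3g.20gb and two 2g.10gb GPU instances, together with their
# default compute instances (-C). Adjust the layout to your needs.
run("nvidia-smi mig -i 0 -cgi 3g.20gb,2g.10gb,2g.10gb -C")

# Verify the resulting MIG devices.
run("nvidia-smi -L")
```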
Theoretical MIG performance penalty
When sharing a GPU between multiple processes using time slicing, there is a performance loss caused by context switching. Consequently, if time slicing is enabled but only a single process is scheduled on the GPU, the penalty is negligible.
With MIG, on the other hand, the assumptions are very different: simply enabling MIG sacrifices a part of the Streaming Multiprocessors, but no additional penalty is introduced when further partitioning the GPU.
For instance, as can be seen in the image below, a whole A100 40GB NVIDIA GPU (the GPU used for the benchmarking that follows) has 108 Streaming Multiprocessors (SMs). When MIG is enabled, 10 SMs are lost, which corresponds to roughly 9.25% of the total number of compute cores.
As a result, when enabling MIG, a performance penalty of ~9.25% is expected. Since partitions are isolated, there should be no additional overhead when sharing the GPU between many users, and the scaling between partitions should be linear: a 2g.10gb partition should perform twice as well as a 1g.5gb one because it has double the resources.
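As a quick sanity check, the expected numbers can be derived directly from the SM counts (the 14 SMs per 1g slice is the A100 value implied by the figure above):

```python
# Expected MIG penalty and ideal partition scaling for an A100 40GB.
TOTAL_SMS = 108              # full GPU, MIG disabled
SMS_PER_SLICE = 14           # SMs in one 1g.5gb slice
MIG_SMS = 7 * SMS_PER_SLICE  # 98 SMs usable once MIG is enabled (7g.40gb)

penalty = 1 - MIG_SMS / TOTAL_SMS
print(f"expected penalty when enabling MIG: {penalty:.2%}")  # 10/108, the ~9.25% above

# Ideal compute scaling between partitions, proportional to the slice count.
for big, small in [(7, 3), (3, 2), (2, 1)]:
    print(f"{big}g vs {small}g ideal ratio: {big / small:.2f}")
```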
FLOPS Counting
Floating Point Operations Per Second (FLOPS) is the metric used to show how powerful a GPU is when working with different data formats. To count FLOPS we rely on dcgmproftester, a CUDA-based test load generator from NVIDIA. For more information, consult the previous blog post.
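For reference, a minimal sketch of how such a run can be driven, assuming dcgmproftester11 is on the PATH; the DCGM field IDs select which pipeline is loaded and correspond to the formats in the tables below:

```python
import subprocess

# DCGM profiling fields targeted by dcgmproftester:
#   1008 -> fp16 pipe, 1007 -> fp32 pipe, 1006 -> fp64 pipe, 1004 -> Tensor Cores
for field_id in (1008, 1007, 1006, 1004):
    subprocess.run(
        ["dcgmproftester11", "--no-dcgm-validation",
         "-t", str(field_id), "-d", "30"],  # generate load for 30 seconds
        check=True,
    )
```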
Full GPU (MIG disabled) vs MIG enabled (7g.40gb partition)
Formats | Full GPU (MIG disabled) [TFLOPS] | MIG enabled (7g.40gb) [TFLOPS] | Loss [%] |
---|---|---|---|
fp16, Cuda Cores | 32.785 | 30.583 | 6.71 |
fp32, Cuda Cores | 16.773 | 15.312 | 8.71 |
fp64, Cuda Cores | 8.128 | 7.386 | 9.12 |
fp16, Tensor Cores | 164.373 | 151.701 | 7.70 |
FLOPS counting per MIG partition
Formats | 7g.40gb [TFLOPS] | 3g.20gb [TFLOPS] | 2g.10gb [TFLOPS] | 1g.5gb [TFLOPS] |
---|---|---|---|---|
fp16, Cuda Cores | 30.583 | 13.714 | 9.135 | 4.348 |
fp32, Cuda Cores | 15.312 | 6.682 | 4.418 | 2.132 |
fp64, Cuda Cores | 7.386 | 3.332 | 2.206 | 1.056 |
fp16, Tensor Cores | 151.701 | 94.197 | 65.968 | 30.108 |
FLOPS scaling between MIG partitions
Formats | 7g.40gb / 3g.20gb | 3g.20gb / 2g.10gb | 2g.10gb / 1g.5gb |
---|---|---|---|
fp16, Cuda Cores | 2.23 | 1.50 | 2.10 |
fp32, Cuda Cores | 2.29 | 1.51 | 2.07 |
fp64, Cuda Cores | 2.21 | 1.51 | 2.08 |
fp16, Tensor Cores | 1.61 | 1.42 | 2.19 |
Ideal Scale | 7/3=2.33 | 3/2=1.5 | 2/1=2 |
Memory bandwidth scaling between MIG partitions
Partition | Memory bandwidth [GB/s] | Multiplying factor |
---|---|---|
7g.40gb | 1555.2 | 8 |
3g.20gb | 777.6 | 4 |
2g.10gb | 388.8 | 2 |
1g.5gb | 194.4 | 1 |
Conclusions
- Enabling MIG:
- For fp32 and fp64 on CUDA Cores, the drop in performance is close to the theoretical one.
- For fp16 on CUDA Cores and Tensor Cores, the loss is much smaller than the expected value; this might need further investigation.
- There is no loss of memory bandwidth.
- The scaling between partitions:
- On CUDA Cores (fp16, fp32, fp64) the scaling converges to the theoretical value.
- On Tensor Cores the scaling diverges considerably from the expected one (especially when comparing 7g.40gb and 3g.20gb); this might need further investigation.
- The scaling of the memory bandwidth follows powers of 2 (1, 2, 4, and 8 respectively).
Compute-Intensive Particle Simulation
An important part of CERN computing is dedicated to simulation. These are compute-intensive operations that can significantly benefit from GPU usage. For this benchmarking, we rely on the lhc simpletrack simulation. For more information, consult the previous blog post.
Full GPU (MIG disabled) vs MIG enabled (7g.40gb partition)
Number of particles | Full GPU (MIG disabled) [seconds] | MIG enabled (7g.40gb) [seconds] | Loss [%] |
---|---|---|---|
5 000 000 | 26.365 | 28.732 | 8.97 |
10 000 000 | 51.135 | 55.930 | 9.37 |
15 000 000 | 76.374 | 83.184 | 8.91 |
Running simulation on different MIG partitions
Number of particles | 7g.40gb [seconds] | 3g.20gb [seconds] | 2g.10gb [seconds] | 1g.5gb [seconds] |
---|---|---|---|---|
5 000 000 | 28.732 | 62.268 | 92.394 | 182.32 |
10 000 000 | 55.930 | 122.864 | 183.01 | 362.10 |
15 000 000 | 83.184 | 183.688 | 273.700 | 542.300 |
Scaling between MIG partitions
Number of particles | 3g.20gb / 7g.40gb | 2g.10gb / 3g.20gb | 1g.5gb / 2g.10gb |
---|---|---|---|
5 000 000 | 2.16 | 1.48 | 1.97 |
10 000 000 | 2.19 | 1.48 | 1.97 |
15 000 000 | 2.20 | 1.49 | 1.98 |
Ideal Scale | 7/3=2.33 | 3/2=1.5 | 2/1=2 |
Conclusions
- The performance loss when enabling MIG is very close to the theoretical 9.25%.
- The scaling between partitions converges to the ideal values.
- The results are very close to the theoretical assumptions because the benchmarked script is very compute-intensive and is not limited by memory accesses, CPU bottlenecks, etc.
Machine Learning Training
For benchmarking, we use a pre-trained model and fine-tune it with PyTorch. To maximize GPU utilization, we make sure the script is not CPU-bound by increasing the number of data loader workers and the batch size. More details can be found in the previous blog post.
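For illustration, a minimal fine-tuning sketch with these knobs exposed; the model and dataset are hypothetical placeholders (not necessarily the ones benchmarked) and the Hugging Face Trainer API is assumed:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical pre-trained model and dataset, for illustration only.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb", split="train[:10000]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=1,
    per_device_train_batch_size=48,  # large batches keep the GPU busy
    per_device_eval_batch_size=48,
    dataloader_num_workers=8,        # enough workers to avoid a CPU bottleneck
)
Trainer(model=model, args=args, train_dataset=dataset).train()
```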
Full GPU (MIG disabled) vs MIG enabled (7g.40gb partition)
- dataloader_num_workers=8
- per_device_train_batch_size=48
- per_device_eval_batch_size=48
Dataset size | Full GPU (MIG disabled) [seconds] | MIG enabled (7g.40gb) [seconds] | Loss [%] |
---|---|---|---|
2 000 | 63.30 | 65.77 | 3.90 |
5 000 | 152.91 | 157.86 | 3.23 |
10 000 | 303.95 | 313.22 | 3.04 |
15 000 | 602.70 | 622.24 | 3.24 |
7g.40gb vs 3g.20gb
- per_device_train_batch_size=24
- per_device_eval_batch_size=24
- dataloader_num_workers=4
Dataset size | 7g.40gb [seconds] | 3g.20gb [seconds] | 3g.20gb / 7g.40gb (Expected 7/3=2.33) |
---|---|---|---|
2 000 | 67.1968 | 119.4738 | 1.77 |
5 000 | 334.2252 | 609.2308 | 1.82 |
10 000 | 334.2252 | 609.2308 | 1.82 |
When comparing a 7g.40gb instance with a 3g.20gb one, the number of cores becomes 2.33 (7/3) times smaller. This is the scale we expect to see experimentally as well, but the results converge to 1.8 rather than 2.3. For machine learning training, the results are heavily influenced by the available memory and bandwidth, how the data is stored, how efficient the data loader is, etc.
To simplify the benchmarking, we will use the 4g.20gb partition instead of the 3g.20gb. This way all the resources (bandwidth, CUDA Cores, Tensor Cores, memory) are exactly double those of the 2g.10gb, and the ideal scaling factor is 2.
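For reference, a sketch of how the layout can be switched to 4g.20gb + 2g.10gb (assuming no workloads are still attached to the existing instances):

```python
import subprocess

# Destroy the current compute instances, then the GPU instances, and create
# a 4g.20gb + 2g.10gb layout with the default compute instances (-C).
for cmd in (
    "nvidia-smi mig -dci",
    "nvidia-smi mig -dgi",
    "nvidia-smi mig -cgi 4g.20gb,2g.10gb -C",
    "nvidia-smi -L",
):
    subprocess.run(cmd, shell=True, check=True)
```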
4g.20gb vs 2g.10gb
- per_device_train_batch_size=12
- per_device_eval_batch_size=12
- dataloader_num_workers=4
Dataset size | 4g.20gb [seconds] | 2g.10gb [seconds] | 2g.10gb / 4g.20gb (Expected 4/2=2) |
---|---|---|---|
2 000 | 119.2099 | 223.188 | 1.87 |
5 000 | 294.6218 | 556.4449 | 1.88 |
10 000 | 589.0617 | 1112.927 | 1.88 |
2g.10gb vs 1g.5gb
- per_device_train_batch_size=4
- per_device_eval_batch_size=4
- dataloader_num_workers=2
Dataset size | 2g.10gb [seconds] | 1g.5gb [seconds] | 1g.5gb / 2g.10gb (Expected 2/1=2) |
---|---|---|---|
2 000 | 271.6612 | 525.9507 | 1.93 |
5 000 | 676.3226 | 1316.2178 | 1.94 |
10 000 | 1356.9108 | 2625.1624 | 1.93 |
Conclusions
- The performance loss when enabling MIG is much smaller than the theoretical 9.25%. This can be caused by many reasons:
- complex operations to be performed
- being IO bound
- executing at a different clock frequency
- variable tensor core utilization, etc.
- The training time depends heavily on the number of CUDA Cores, Tensor Cores, but also the memory bandwidth, the data loader, the batch size, etc. Consult the previous blog for some ideas on how to profile and detect performance bottlenecks of models.
- The scaling between partitions is linear and converges to the expected value, especially on smaller partitions.
Takeaways
- When using MIG technology, the GPU partitions are isolated and can run securely without influencing each other.
- A part of the available Streaming Multiprocessors is lost when enabling MIG:
- In the case of an A100 40GB NVIDIA GPU, 10 out of 108 SMs are lost, which means losing 9.25% of the available CUDA and Tensor Cores.
- Never enable MIG without actually partitioning the GPU (i.e., keeping only the 7g.40gb partition): it means losing performance without gaining GPU sharing.
- When enabling MIG:
- The performance loss for compute-intensive applications can reach ~9.25%.
- For machine learning training, the loss is reduced by memory accesses and by IO-bound or CPU-bound phases, so experimentally it is much smaller than expected.
- The scaling between partitions is linear: doubling the resources halves the execution time.
- For GPU monitoring rely on NVIDIA Data Center GPU Manager (DCGM). It is a suite of tools that includes active health monitoring, diagnostics, power and clock management, etc.
- A current limitation (with CUDA 11/R450 and CUDA 12/R525) that might be relaxed in the future is that, regardless of how many MIG devices are created (or made available to a container), a single CUDA process can only enumerate a single MIG device (see the sketch after this list). The implications are:
- We cannot use 2 MIG instances in a single process.
- If at least one device is in MIG mode, CUDA will not see the non-MIG GPUs.
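A minimal sketch of this enumeration limitation (the MIG UUIDs are placeholders; list the real ones with nvidia-smi -L):

```python
import os

# Expose two MIG devices to this process (placeholder UUIDs).
os.environ["CUDA_VISIBLE_DEVICES"] = (
    "MIG-11111111-2222-3333-4444-555555555555,"
    "MIG-66666666-7777-8888-9999-000000000000"
)

import torch  # import after setting CUDA_VISIBLE_DEVICES

# Even though two MIG devices are exposed, a single CUDA process
# only enumerates the first one.
print(torch.cuda.device_count())      # expected: 1
print(torch.cuda.get_device_name(0))  # e.g. an A100 MIG 3g.20gb instance
```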
Next episode
In the next blog post, we will use NVIDIA A100 GPUs and MIG to train a high energy physics (HEP) neural network in distributed mode on on-premises Kubeflow. Stay tuned!