Efficient Access to Shared GPU Resources: Part 4
This is part 4 of a series of blog posts about GPU concurrency mechanisms. In part 1 we focused on the pros and cons of different solutions available on Kubernetes, in part 2 we dove into the setup and configuration details, and in part 3 we analyzed the benchmarking use cases.
In this fourth part, we focus on the benchmarking results for time slicing of NVIDIA GPUs.
Setup
The benchmark setup will be the same for every use case:
- Time slicing disabled (GPU Passthrough)
- Time slicing enabled (the number denotes how many processes are scheduled on the same GPU; see the configuration sketch after this list):
  - Shared x1
  - Shared x2
  - Shared x4
  - Shared x8
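The “Shared xN” modes come from the NVIDIA device plugin’s time-slicing configuration described in part 2. As a reminder, the sketch below shows how a “Shared x4” ConfigMap could be created with the Kubernetes Python client; the ConfigMap name, the data key, and the `gpu-operator` namespace are assumptions for illustration, not the exact setup from part 2.

```python
# Sketch: create a time-slicing ConfigMap for the NVIDIA device plugin ("Shared x4").
# ConfigMap name, data key, and namespace are assumptions; adapt them to your cluster.
from kubernetes import client, config

TIME_SLICING_CONFIG = """\
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4        # "Shared x4": 4 processes scheduled on each GPU
"""

def create_time_slicing_config(namespace: str = "gpu-operator") -> None:
    config.load_kube_config()                   # use the local kubeconfig
    core = client.CoreV1Api()
    cm = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name="time-slicing-config"),
        data={"any": TIME_SLICING_CONFIG},      # key the device plugin is pointed at
    )
    core.create_namespaced_config_map(namespace=namespace, body=cm)

if __name__ == "__main__":
    create_time_slicing_config()
```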
Benchmarking time slicing can be complicated because the processes need to start at exactly the same moment. Using a Deployment or a ReplicaSet will not work, because the pods are launched in a best-effort manner, with some pods starting earlier than others.
The GPU alternates execution between the processes in a round-robin fashion. To eliminate the need for start-up synchronization, we start longer-running GPU processes in advance. For example, to benchmark a script in a “Shared x4” GPU setup, we can (see the sketch after this list):
- start 3 pods running the same script, but for a longer period of time;
- in the meantime, start the fourth pod, making sure it starts and finishes while sharing the GPU with the other 3.
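A minimal sketch of this start-up check, assuming the three long-running pods carry a hypothetical `app=gpu-warmup` label in the `default` namespace; it simply waits until all of them report the `Running` phase before the benchmarked pod is launched:

```python
# Sketch: wait for the long-running "warm-up" pods to occupy the GPU before
# launching the benchmarked pod. Label, namespace, and pod count are assumptions.
import time
from kubernetes import client, config

def wait_for_warmup_pods(expected: int = 3,
                         namespace: str = "default",
                         label_selector: str = "app=gpu-warmup",
                         poll_seconds: float = 2.0) -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    while True:
        pods = core.list_namespaced_pod(namespace=namespace,
                                        label_selector=label_selector)
        running = [p for p in pods.items if p.status.phase == "Running"]
        if len(running) >= expected:
            return                      # all warm-up processes now share the GPU
        time.sleep(poll_seconds)

if __name__ == "__main__":
    wait_for_warmup_pods()
    # The benchmarked pod can be created at this point (e.g. with kubectl or
    # core.create_namespaced_pod), so its entire run overlaps with the three
    # warm-up processes on the same GPU.
```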
Find out more about the driver installation, time slicing configuration, and environment setup.
FLOPS Counting
Floating Point Operations Per Second (FLOPS) is the metric used to show how powerful a GPU is when working with different data formats. To count FLOPS we rely on dcgmproftester, a CUDA-based test load generator from NVIDIA. For more information, consult the previous blog post.
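As an illustration, the load generator can be driven with a small wrapper like the sketch below. The binary name (`dcgmproftester11`) and the profiling field IDs are assumptions based on NVIDIA’s DCGM documentation, so verify them against your installation (details in the previous post).

```python
# Sketch: run dcgmproftester for each precision used in the tables below.
# Binary name and field IDs are assumptions; verify against your DCGM version.
import subprocess

# DCGM profiling field IDs (per NVIDIA DCGM docs): 1004 = Tensor Cores active,
# 1006 = fp64 pipe, 1007 = fp32 pipe, 1008 = fp16 pipe.
PRECISIONS = {"fp16": 1008, "fp32": 1007, "fp64": 1006, "fp16-tensor": 1004}

def run_flops_test(field_id: int, duration_s: int = 120) -> None:
    subprocess.run(
        ["dcgmproftester11",
         "--no-dcgm-validation",   # only generate load, skip metric validation
         "-t", str(field_id),      # which pipe to stress
         "-d", str(duration_s)],   # duration of the test, in seconds
        check=True,
    )

if __name__ == "__main__":
    for name, field_id in PRECISIONS.items():
        print(f"Running {name} load...")
        run_flops_test(field_id)
```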
fp16
|  | Passthrough | Shared x1 | Shared x2 | Shared x4 | Shared x8 |
|---|---|---|---|---|---|
| Average TFLOPS per process | 32.866 | 32.700 | 15.933 | 7.956 | 3.968 |
| Average TFLOPS per process * number of processes | 32.866 | 32.700 | 31.867 | 31.824 | 31.745 |
| Performance Loss (compared to Passthrough) | - | 0.5% | 3.03% | 3.17% | 3.41% |
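To make the last two rows explicit, here is the arithmetic behind the fp16 numbers (the same formulas are used for the other precisions; small differences versus the table come from rounding of the per-process values):

```python
# Reproduce the aggregate throughput and loss rows of the fp16 table.
passthrough = 32.866                    # TFLOPS, single process, no time slicing
per_process = {1: 32.700, 2: 15.933, 4: 7.956, 8: 3.968}

for n, tflops in per_process.items():
    aggregate = tflops * n                          # "per process * number of processes"
    loss = (passthrough - aggregate) / passthrough  # relative to Passthrough
    print(f"Shared x{n}: aggregate = {aggregate:.3f} TFLOPS, loss = {loss:.2%}")
# Shared x1: aggregate = 32.700 TFLOPS, loss = 0.51%
# Shared x2: aggregate = 31.866 TFLOPS, loss = 3.04%
# ...
```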
fp32
|  | Passthrough | Shared x1 | Shared x2 | Shared x4 | Shared x8 |
|---|---|---|---|---|---|
| Average TFLOPS per process | 16.898 | 16.879 | 7.880 | 3.945 | 1.974 |
| Average TFLOPS per process * number of processes | 16.898 | 16.879 | 15.76 | 15.783 | 15.795 |
| Performance Loss (compared to Passthrough) | - | 0.11% | 6.73% | 6.59% | 6.52% |
fp64
|  | Passthrough | Shared x1 | Shared x2 | Shared x4 | Shared x8 |
|---|---|---|---|---|---|
| Average TFLOPS per process | 8.052 | 8.050 | 3.762 | 1.871 | 0.939 |
| Average TFLOPS per process * number of processes | 8.052 | 8.050 | 7.524 | 7.486 | 7.515 |
| Performance Loss (compared to Passthrough) | - | 0.02% | 6.55% | 7.03% | 6.67% |
fp16 Tensor Cores
|  | Passthrough | Shared x1 | Shared x2 | Shared x4 | Shared x8 |
|---|---|---|---|---|---|
| Average TFLOPS per process | 165.992 | 165.697 | 81.850 | 41.161 | 20.627 |
| Average TFLOPS per process * number of processes | 165.992 | 165.697 | 163.715 | 164.645 | 165.021 |
| Performance Loss (compared to Passthrough) | - | 0.17% | 1.37% | 0.81% | 0.58% |
Conclusions
- If time slicing is enabled but only one process is using the GPU (shared x1), the time slicing penalty is negligible (≤0.5%).
- When the GPU needs to perform context switching (shared x2), there is a ~6% performance loss for fp32 and fp64, ~3% for fp16, and ~1.4% for fp16 on Tensor Cores.
- Increasing the number of processes sharing the GPU introduces no additional penalty: the loss is roughly the same for shared x2, shared x4, and shared x8.
- It is not yet understood why the performance loss differs between data formats.
Compute-Intensive Particle Simulation
An important part of CERN computing is dedicated to simulation. These are compute-intensive operations that can benefit significantly from GPU usage. For this benchmark we rely on the lhc simpletrack simulation. For more information, consult the previous blog post.
Passthrough vs Shared x1
Number of particles | Passthrough [s] | Shared x1 [s] | Loss [%] |
---|---|---|---|
5 000 000 | 26.365 | 27.03 | 2.52 |
10 000 000 | 51.135 | 51.93 | 1.55 |
15 000 000 | 76.374 | 77.12 | 0.97 |
20 000 000 | 99.55 | 99.91 | 0.36 |
30 000 000 | 151.57 | 152.61 | 0.68 |
Shared x1 vs Shared x2
Number of particles | Shared x1 [s] | Expected Shared x2 = 2*Shared x1 [s] | Actual Shared x2 [s] | Loss [%] |
---|---|---|---|---|
5 000 000 | 27.03 | 54.06 | 72.59 | 34.27 |
10 000 000 | 51.93 | 103.86 | 138.76 | 33.6 |
15 000 000 | 77.12 | 154.24 | 212.71 | 37.9 |
20 000 000 | 99.91 | 199.82 | 276.23 | 38.23 |
30 000 000 | 152.61 | 305.22 | 423.08 | 38.61 |
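For clarity, the “Loss” column in these scaling tables compares the measured time against the ideal doubling of the previous configuration’s time. A quick check for the first row above:

```python
# Check of the "Loss [%]" column: loss relative to the ideal 2x scaling.
def scaling_loss(previous_s: float, actual_s: float) -> float:
    expected_s = 2 * previous_s        # ideal: exactly twice the previous runtime
    return (actual_s - expected_s) / expected_s

# 5 000 000 particles, Shared x1 -> Shared x2
print(f"{scaling_loss(27.03, 72.59):.2%}")   # ~34.3%, matching the table up to rounding
```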
Shared x2 vs Shared x4
Number of particles | Shared x2 [s] | Expected Shared x4 = 2*Shared x2 [s] | Actual Shared x4 [s] | Loss [%] |
---|---|---|---|---|
5 000 000 | 72.59 | 145.18 | 142.63 | 0 |
10 000 000 | 138.76 | 277.52 | 281.98 | 1.6 |
15 000 000 | 212.71 | 425.42 | 421.55 | 0 |
20 000 000 | 276.23 | 552.46 | 546.19 | 0 |
30 000 000 | 423.08 | 846.16 | 838.55 | 0 |
Shared x4 vs Shared x8
Number of particles | Shared x4 [s] | Expected Shared x8 = 2*Shared x4 [s] | Actual Shared x8 [s] | Loss [%] |
---|---|---|---|---|
5 000 000 | 142.63 | 285.26 | 282.56 | 0 |
10 000 000 | 281.98 | 563.96 | 561.98 | 0 |
15 000 000 | 421.55 | 843.1 | 838.22 | 0 |
20 000 000 | 546.19 | 1092.38 | 1087.99 | 0 |
30 000 000 | 838.55 | 1677.1 | 1672.95 | 0 |
Conclusions
- The performance loss when enabling time slicing (shared x1) is small (at most ~2.5%).
- If the GPU needs to perform context switching (going from shared x1 to shared x2), the execution time grows by roughly 2.7x instead of the ideal 2x. This corresponds to a performance loss of ~34-39%.
- Further increasing the number of processes (shared x4, shared x8) doesn’t introduce additional performance loss.
Machine Learning Training
For benchmarking we fine-tune a pretrained model with PyTorch. To maximize GPU utilization, make sure the script is not CPU-bound by increasing the number of data loader workers and the batch size. More details can be found in the previous blog post; a minimal sketch of the setup is shown below.
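The sketch assumes the Hugging Face Trainer API; the model checkpoint, dataset, and output directory are placeholders, and the TrainingArguments values shown match the Passthrough / Shared x1 runs below (they are reduced for the more heavily shared configurations, as listed before each table).

```python
# Sketch of the fine-tuning setup, assuming the Hugging Face Trainer API.
# Checkpoint and dataset are placeholders; the TrainingArguments shown are the
# Passthrough / Shared x1 values and are reduced for the more shared setups.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"                 # placeholder pretrained model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

dataset = load_dataset("imdb", split="train[:5000]")   # placeholder dataset
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                              padding="max_length"), batched=True)

args = TrainingArguments(
    output_dir="./benchmark-output",
    num_train_epochs=1,
    per_device_train_batch_size=48,   # larger batches keep the GPU busy
    per_device_eval_batch_size=48,
    dataloader_num_workers=8,         # avoid being CPU/IO-bound on data loading
)

Trainer(model=model, args=args, train_dataset=dataset).train()
```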
Passthrough vs Shared x1
TrainingArguments:
- per_device_train_batch_size=48
- per_device_eval_batch_size=48
- dataloader_num_workers=8
Number of samples | Passthrough [s] | Shared x1 [s] | Loss [%] |
---|---|---|---|
500 | 16.497 | 16.6078 | 0.67 |
1 000 | 31.2464 | 31.4142 | 0.53 |
2 000 | 61.1451 | 61.3885 | 0.39 |
5 000 | 150.8432 | 151.1182 | 0.18 |
10 000 | 302.2547 | 302.4283 | 0.05 |
Shared x1 vs Shared x2
TrainingArguments:
- per_device_train_batch_size=24
- per_device_eval_batch_size=24
- dataloader_num_workers=4
Number of samples | Shared x1 [s] | Expected Shared x2 = 2*Shared x1 [s] | Actual Shared x2 [s] | Loss [%] |
---|---|---|---|---|
500 | 16.9597 | 33.9194 | 36.7628 | 8.38 |
1 000 | 32.8355 | 65.671 | 72.9985 | 11.15 |
2 000 | 64.2533 | 128.5066 | 143.3033 | 11.51 |
5 000 | 161.5249 | 323.0498 | 355.0302 | 9.89 |
Shared x2 vs Shared x4
TrainingArguments:
- per_device_train_batch_size=12
- per_device_eval_batch_size=12
- dataloader_num_workers=2
Number of samples | Shared x2 [s] | Expected Shared x4 = 2*Shared x2 [s] | Actual Shared x4 [s] | Loss [%] |
---|---|---|---|---|
500 | 39.187 | 78.374 | 77.2388 | 0 |
1 000 | 77.3014 | 154.6028 | 153.4177 | 0 |
2 000 | 154.294 | 308.588 | 306.0012 | 0 |
5 000 | 385.6539 | 771.3078 | 762.5113 | 0 |
Shared x4 vs Shared x8
TrainingArguments:
- per_device_train_batch_size=4
- per_device_eval_batch_size=4
- dataloader_num_workers=1
Number of samples | Shared x4 [s] | Expected Shared x8 = 2*Shared x4 [s] | Actual Shared x8 [s] | Loss [%] |
---|---|---|---|---|
500 | 104.6849 | 209.3698 | 212.6313 | 1.55 |
1 000 | 185.1633 | 370.3266 | 381.7454 | 3.08 |
2 000 | 397.8525 | 795.705 | 816.353 | 2.59 |
5 000 | 1001.752 | 2003.504 | 1999.2395 | 0 |
Conclusions
- The loss when performing ML training on a GPU with time slicing enabled is negligible (<0.7%).
- When scaling from shared x1 to shared x2, the overall execution time increases by ~2.2x (versus the ideal 2x). The time slicing loss is around 8-12% in this case.
- Further increasing the number of processes (shared x4, shared x8) barely influences performance (0-3% additional loss).
Takeaways
- Considering the potential GPU utilization improvements, the penalty introduced by enabling time slicing while only one process uses the GPU (shared x1) can be disregarded.
- There is a variable penalty introduced when the GPU needs to perform context switching (shared x1 vs shared x2).
- For more than 2 processes sharing the GPU, the execution time scales linearly (no extra penalty if we increase the number of processes sharing a GPU).
- The penalty introduced by time slicing can be very large, depending on the use case:
  - When the GPU was running context-switch-sensitive workloads, the penalty introduced by time slicing was about 38%.
  - If the program is I/O-bound, CPU-bound, heavy on Tensor Core utilization, etc., the penalty is smaller. For instance, with the ML training the penalty dropped to ~11%.
- If the processes consume more memory than is available, some of them will be killed with an out-of-memory (OOM) error. Memory limits or priorities cannot be set with time slicing; we discussed earlier how we try to work around this (see more).
Time slicing can potentially introduce a big performance penalty. Even so, when applied to the right use cases, it can be a powerful way of boosting GPU utilization. Consult the available overview to find out more.
Next episode
In the next blog post, we will dive into the extensive MIG benchmarking. Stay tuned!