
Efficient Access to Shared GPU Resources: Part 4


This is part 4 of a series of blog posts about GPU concurrency mechanisms. In part 1 we focused on the pros and cons of different solutions available on Kubernetes, in part 2 we dove into the setup and configuration details, and in part 3 we analyzed the benchmarking use cases.

In this fourth part, we focus on the benchmarking results for time slicing of NVIDIA cards.

Setup

The benchmark setup will be the same for every use case:

  • Time slicing disabled (GPU Passthrough)
  • Time slicing enabled (the number denotes how many processes are scheduled on the same GPU):
    • Shared x1
    • Shared x2
    • Shared x4
    • Shared x8

Benchmarking time slicing can be complicated because the processes need to start at exactly the same moment. This means using a Deployment or a ReplicaSet will not work: their pods are launched in a best-effort manner, with some starting earlier than others.

The GPU alternates between processes in a round-robin fashion. To benchmark, we start longer-running GPU processes in advance, which eliminates the need for start-up synchronization. For example, to benchmark a script in a “Shared x4” GPU setup, we can:

  • start 3 pods running the same script, but for a longer period of time;
  • in the meantime, start the fourth pod, making sure it starts and finishes while sharing the GPU with the other 3 (a minimal orchestration sketch follows below).
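
As an illustration, such a “Shared x4” run could be orchestrated with the Kubernetes Python client along the following lines; the namespace, image, pod names, and benchmark script are placeholders rather than the exact ones used in this series:

```python
# Hypothetical sketch using the official Kubernetes Python client.
# Pod names, namespace, image, and the benchmark script are placeholders.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

def gpu_pod(name, command):
    """Build a pod requesting one (time-sliced) GPU replica."""
    container = client.V1Container(
        name=name,
        image="benchmark-image:latest",  # placeholder image with CUDA and the script
        command=command,
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
    )

# Three background pods run the same script for a longer period of time...
for i in range(3):
    core.create_namespaced_pod(
        namespace="default",
        body=gpu_pod(f"background-{i}", ["python", "benchmark.py", "--long"]),
    )

# ...and the measured pod is started while the other three already occupy the GPU.
core.create_namespaced_pod(
    namespace="default", body=gpu_pod("measured", ["python", "benchmark.py"])
)
```
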

Find out more about the driver installation, time-slicing configuration, and environment setup.

FLOPS Counting

Floating Point Operations Per Second (FLOPS) is the metric used to show how powerful a GPU is when working with different data formats. To count FLOPS we rely on dcgmproftester, a CUDA-based test load generator from NVIDIA. For more information, consult the previous blog post.
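
As a rough illustration, the load generator can be driven per data format along these lines (shown outside Kubernetes for brevity; the binary suffix depends on the installed CUDA/DCGM version, and the field-ID mapping follows DCGM's profiling metrics rather than anything specific to our setup):

```python
# Hypothetical driver script, shown outside Kubernetes for brevity.
# The field IDs are DCGM profiling metrics (1006 = fp64, 1007 = fp32,
# 1008 = fp16, 1004 = Tensor Cores); the binary suffix depends on the
# installed CUDA/DCGM version.
import subprocess

FIELDS = {"fp64": 1006, "fp32": 1007, "fp16": 1008, "fp16 tensor": 1004}

for precision, field_id in FIELDS.items():
    print(f"--- generating {precision} load ---")
    subprocess.run(
        ["dcgmproftester11", "--no-dcgm-validation", "-t", str(field_id), "-d", "60"],
        check=True,  # fail loudly if the load generator exits with an error
    )
```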

fp16

|  | Passthrough | Shared x1 | Shared x2 | Shared x4 | Shared x8 |
| --- | --- | --- | --- | --- | --- |
| Average TFLOPS per process | 32.866 | 32.700 | 15.933 | 7.956 | 3.968 |
| Average TFLOPS per process × number of processes | 32.866 | 32.700 | 31.867 | 31.824 | 31.745 |
| Performance Loss (compared to Passthrough) | - | 0.5% | 3.03% | 3.17% | 3.41% |
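
The performance loss corresponds to the drop in aggregate throughput relative to passthrough; for fp16 shared between two processes, for example:

$$\text{Loss} = 1 - \frac{N \times \text{TFLOPS}_{\text{per process}}}{\text{TFLOPS}_{\text{Passthrough}}} = 1 - \frac{2 \times 15.933}{32.866} \approx 3\%$$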

fp32

|  | Passthrough | Shared x1 | Shared x2 | Shared x4 | Shared x8 |
| --- | --- | --- | --- | --- | --- |
| Average TFLOPS per process | 16.898 | 16.879 | 7.880 | 3.945 | 1.974 |
| Average TFLOPS per process × number of processes | 16.898 | 16.879 | 15.76 | 15.783 | 15.795 |
| Performance Loss (compared to Passthrough) | - | 0.11% | 6.73% | 6.59% | 6.52% |

fp64

|  | Passthrough | Shared x1 | Shared x2 | Shared x4 | Shared x8 |
| --- | --- | --- | --- | --- | --- |
| Average TFLOPS per process | 8.052 | 8.050 | 3.762 | 1.871 | 0.939 |
| Average TFLOPS per process × number of processes | 8.052 | 8.050 | 7.524 | 7.486 | 7.515 |
| Performance Loss (compared to Passthrough) | - | 0.02% | 6.55% | 7.03% | 6.67% |

fp16 Tensor Cores

|  | Passthrough | Shared x1 | Shared x2 | Shared x4 | Shared x8 |
| --- | --- | --- | --- | --- | --- |
| Average TFLOPS per process | 165.992 | 165.697 | 81.850 | 41.161 | 20.627 |
| Average TFLOPS per process × number of processes | 165.992 | 165.697 | 163.715 | 164.645 | 165.021 |
| Performance Loss (compared to Passthrough) | - | 0.17% | 1.37% | 0.81% | 0.58% |

Conclusions

  • If time slicing is enabled, but only one process is using the GPU (shared x1), the time slicing penalty is negligible (<0.5%).
  • When the GPU needs to do context switching (shared x2), there is a ~6% performance loss for fp32 and fp64, ~3% for fp16, and ~1.4% for fp16 on Tensor Cores.
  • Increasing the number of processes sharing the GPU introduces no additional penalty: the loss is the same for shared x2, shared x4, and shared x8.
  • It is not yet understood why the performance loss differs between data formats.

Compute-Intensive Particle Simulation

An important part of CERN computing is dedicated to simulation. These are compute-intensive workloads that can benefit significantly from GPUs. For this benchmark we rely on the lhc simpletrack simulation. For more information, consult the previous blog post.

Passthrough vs Shared x1

| Number of particles | Passthrough [s] | Shared x1 [s] | Loss [%] |
| --- | --- | --- | --- |
| 5 000 000 | 26.365 | 27.03 | 2.52 |
| 10 000 000 | 51.135 | 51.93 | 1.55 |
| 15 000 000 | 76.374 | 77.12 | 0.97 |
| 20 000 000 | 99.55 | 99.91 | 0.36 |
| 30 000 000 | 151.57 | 152.61 | 0.68 |

Shared x1 vs Shared x2

| Number of particles | Shared x1 [s] | Expected Shared x2 = 2 × Shared x1 [s] | Actual Shared x2 [s] | Loss [%] |
| --- | --- | --- | --- | --- |
| 5 000 000 | 27.03 | 54.06 | 72.59 | 34.27 |
| 10 000 000 | 51.93 | 103.86 | 138.76 | 33.6 |
| 15 000 000 | 77.12 | 154.24 | 212.71 | 37.9 |
| 20 000 000 | 99.91 | 199.82 | 276.23 | 38.23 |
| 30 000 000 | 152.61 | 305.22 | 423.08 | 38.61 |
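
The loss column compares the measured time against the ideal doubling of the single-process time; for 5 000 000 particles, for example:

$$\text{Loss} = \frac{t_{\text{actual}} - t_{\text{expected}}}{t_{\text{expected}}} = \frac{72.59 - 54.06}{54.06} \approx 34.3\%$$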

Shared x2 vs Shared x4

| Number of particles | Shared x2 [s] | Expected Shared x4 = 2 × Shared x2 [s] | Actual Shared x4 [s] | Loss [%] |
| --- | --- | --- | --- | --- |
| 5 000 000 | 72.59 | 145.18 | 142.63 | 0 |
| 10 000 000 | 138.76 | 277.52 | 281.98 | 1.6 |
| 15 000 000 | 212.71 | 425.42 | 421.55 | 0 |
| 20 000 000 | 276.23 | 552.46 | 546.19 | 0 |
| 30 000 000 | 423.08 | 846.16 | 838.55 | 0 |

Shared x4 vs Shared x8

| Number of particles | Shared x4 [s] | Expected Shared x8 = 2 × Shared x4 [s] | Actual Shared x8 [s] | Loss [%] |
| --- | --- | --- | --- | --- |
| 5 000 000 | 142.63 | 285.26 | 282.56 | 0 |
| 10 000 000 | 281.98 | 563.96 | 561.98 | 0 |
| 15 000 000 | 421.55 | 843.1 | 838.22 | 0 |
| 20 000 000 | 546.19 | 1092.38 | 1087.99 | 0 |
| 30 000 000 | 838.55 | 1677.1 | 1672.95 | 0 |

Conclusions

  • The performance loss when enabling time slicing (shared x1) is very small (<2.5%).
  • If the GPU needs to perform context switching (going from shared x1 to shared x2), the execution time grows to roughly 2.7 times the single-process time instead of the ideal 2, which corresponds to a performance loss of ~38%.
  • Further increasing the number of processes (shared x4, shared x8) doesn't introduce additional performance loss.

Machine Learning Training

For this benchmark we use a pretrained model and fine-tune it with PyTorch. To maximize GPU utilization, we make sure the script is not CPU-bound by increasing the number of data loader workers and the batch size. More details can be found in the previous blog post.
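
As an illustration, a fine-tuning script along these lines might look as follows; the model, dataset, sequence length, and epoch count are placeholders rather than the exact setup used here, while the batch size and data loader settings mirror the passthrough configuration listed further down:

```python
# Hypothetical sketch: model, dataset, sequence length and epoch count are
# arbitrary placeholders; only the batch size and data loader settings mirror
# the passthrough configuration below.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Take a small slice of a public dataset to mimic the varying sample counts.
dataset = load_dataset("imdb", split="train[:5000]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=1,
    per_device_train_batch_size=48,  # larger batches keep the GPU busy
    per_device_eval_batch_size=48,
    dataloader_num_workers=8,        # more workers avoid a CPU-bound input pipeline
)

Trainer(model=model, args=args, train_dataset=dataset).train()
```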

Passthrough vs Shared x1

TrainingArguments:

  • per_device_train_batch_size=48
  • per_device_eval_batch_size=48
  • dataloader_num_workers=8

| Number of samples | Passthrough [s] | Shared x1 [s] | Loss [%] |
| --- | --- | --- | --- |
| 500 | 16.497 | 16.6078 | 0.67 |
| 1 000 | 31.2464 | 31.4142 | 0.53 |
| 2 000 | 61.1451 | 61.3885 | 0.39 |
| 5 000 | 150.8432 | 151.1182 | 0.18 |
| 10 000 | 302.2547 | 302.4283 | 0.05 |

Shared x1 vs Shared x2

TrainingArguments:

  • per_device_train_batch_size=24
  • per_device_eval_batch_size=24
  • dataloader_num_workers=4

| Number of samples | Shared x1 [s] | Expected Shared x2 = 2 × Shared x1 [s] | Actual Shared x2 [s] | Loss [%] |
| --- | --- | --- | --- | --- |
| 500 | 16.9597 | 33.9194 | 36.7628 | 8.38 |
| 1 000 | 32.8355 | 65.671 | 72.9985 | 11.15 |
| 2 000 | 64.2533 | 128.5066 | 143.3033 | 11.51 |
| 5 000 | 161.5249 | 323.0498 | 355.0302 | 9.89 |

Shared x2 vs Shared x4

TrainingArguments:

  • per_device_train_batch_size=12
  • per_device_eval_batch_size=12
  • dataloader_num_workers=2

| Number of samples | Shared x2 [s] | Expected Shared x4 = 2 × Shared x2 [s] | Actual Shared x4 [s] | Loss [%] |
| --- | --- | --- | --- | --- |
| 500 | 39.187 | 78.374 | 77.2388 | 0 |
| 1 000 | 77.3014 | 154.6028 | 153.4177 | 0 |
| 2 000 | 154.294 | 308.588 | 306.0012 | 0 |
| 5 000 | 385.6539 | 771.3078 | 762.5113 | 0 |

Shared x4 vs Shared x8

TrainingArguments:

  • per_device_train_batch_size=4
  • per_device_eval_batch_size=4
  • dataloader_num_workers=1

| Number of samples | Shared x4 [s] | Expected Shared x8 = 2 × Shared x4 [s] | Actual Shared x8 [s] | Loss [%] |
| --- | --- | --- | --- | --- |
| 500 | 104.6849 | 209.3698 | 212.6313 | 1.55 |
| 1 000 | 185.1633 | 370.3266 | 381.7454 | 3.08 |
| 2 000 | 397.8525 | 795.705 | 816.353 | 2.59 |
| 5 000 | 1001.752 | 2003.504 | 1999.2395 | 0 |

Conclusions

  • The loss when performing ML training on a GPU with time slicing enabled is negligible (<0.7%).
  • When scaling from shared x1 to shared x2, the overall computation time increases by a factor of ~2.2 (vs the ideal 2). The time slicing loss is around 11% in this case.
  • If the number of processes increases further (shared x4, shared x8), the performance is barely affected (0-3%).

Takeaways

  • Considering the potential GPU utilization improvements, the penalty introduced by enabling time slicing but having only one process using the GPU (Shared x1) can be disregarded.
  • There is a variable penalty introduced when the GPU needs to perform context switching (shared x1 vs shared x2).
  • For more than 2 processes sharing the GPU, the execution time scales linearly (no extra penalty if we increase the number of processes sharing a GPU).
  • The penalty introduced by time slicing can be substantial, depending on the use case:
    • When the GPU was running context-switch-sensitive workloads, the penalty introduced by time slicing was about 38%.
    • If the program is I/O bound, CPU bound, heavy on Tensor Core utilization, etc., the penalty is smaller. For instance, with the ML training the penalty dropped to ~11%.
  • If the processes consume more memory than is available, some of them will be killed with an OOM error. Time slicing offers no way to set memory limits or priorities; we discussed earlier how we try to work around this (see more).

Time slicing can introduce a significant performance penalty. Even so, when applied to the right use cases, it can be a powerful way of boosting GPU utilization. Consult the available overview to find out more.

Next episode

In the next blog post, we will dive into the extensive MIG benchmarking. Stay tuned!