Efficient Access to Shared GPU Resources: Part 4
This is part 4 of a series of blog posts about GPU concurrency mechanisms. In part 1 we focused on the pros and cons of different solutions available on Kubernetes, in part 2 we dove into the setup and configuration details, and in part 3 we analyzed the benchmarking use cases.
In this fourth part, we focus on the benchmarking results for time slicing of NVIDIA GPUs.
Setup
The benchmark setup will be the same for every use case:
- Time slicing disabled (GPU Passthrough)
- Time slicing enabled (the number denotes how many processes are scheduled on the same GPU; see the configuration sketch after this list):
  - Shared x1
  - Shared x2
  - Shared x4
  - Shared x8
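The “Shared xN” modes come from the NVIDIA device plugin’s time-slicing configuration described in part 2. As a reminder, the sketch below shows how a “Shared x4” ConfigMap could be created with the Kubernetes Python client; the ConfigMap name, the data key, and the `gpu-operator` namespace are assumptions for illustration, not the exact setup from part 2.

```python
# Sketch: create a time-slicing ConfigMap for the NVIDIA device plugin ("Shared x4").
# ConfigMap name, data key, and namespace are assumptions; adapt them to your cluster.
from kubernetes import client, config

TIME_SLICING_CONFIG = """\
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4        # "Shared x4": 4 processes scheduled on each GPU
"""

def create_time_slicing_config(namespace: str = "gpu-operator") -> None:
    config.load_kube_config()                   # use the local kubeconfig
    core = client.CoreV1Api()
    cm = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name="time-slicing-config"),
        data={"any": TIME_SLICING_CONFIG},      # key the device plugin is pointed at
    )
    core.create_namespaced_config_map(namespace=namespace, body=cm)

if __name__ == "__main__":
    create_time_slicing_config()
```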
Benchmarking time slicing can be complicated because the processes need to start at exactly the same moment. Using a Deployment or a ReplicaSet will not work, because the pods are launched in a best-effort manner, with some pods starting earlier than others.
The GPU alternates execution between the processes in a round-robin fashion. To eliminate the need for start-up synchronization, we start longer-running GPU processes in advance. For example, to benchmark a script in a “Shared x4” GPU setup, we can (see the sketch after this list):
- start 3 pods running the same script, but for a longer period of time;
- in the meantime, start the fourth pod, making sure it starts and finishes while sharing the GPU with the other 3.
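A minimal sketch of this start-up check, assuming the three long-running pods carry a hypothetical `app=gpu-warmup` label in the `default` namespace; it simply waits until all of them report the `Running` phase before the benchmarked pod is launched:

```python
# Sketch: wait for the long-running "warm-up" pods to occupy the GPU before
# launching the benchmarked pod. Label, namespace, and pod count are assumptions.
import time
from kubernetes import client, config

def wait_for_warmup_pods(expected: int = 3,
                         namespace: str = "default",
                         label_selector: str = "app=gpu-warmup",
                         poll_seconds: float = 2.0) -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    while True:
        pods = core.list_namespaced_pod(namespace=namespace,
                                        label_selector=label_selector)
        running = [p for p in pods.items if p.status.phase == "Running"]
        if len(running) >= expected:
            return                      # all warm-up processes now share the GPU
        time.sleep(poll_seconds)

if __name__ == "__main__":
    wait_for_warmup_pods()
    # The benchmarked pod can be created at this point (e.g. with kubectl or
    # core.create_namespaced_pod), so its entire run overlaps with the three
    # warm-up processes on the same GPU.
```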
Find out more about the driver installation, time slicing configuration, and environment setup.
FLOPS Counting
Floating Point Operations Per Second (FLOPS) is the metric used to show how powerful a GPU is when working with different data formats. To count FLOPS we rely on dcgmproftester, a CUDA-based test load generator from NVIDIA. For more information, consult the previous blog post.
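As an illustration, the load generator can be driven with a small wrapper like the sketch below. The binary name (`dcgmproftester11`) and the profiling field IDs are assumptions based on NVIDIA’s DCGM documentation, so verify them against your installation (details in the previous post).

```python
# Sketch: run dcgmproftester for each precision used in the tables below.
# Binary name and field IDs are assumptions; verify against your DCGM version.
import subprocess

# DCGM profiling field IDs (per NVIDIA DCGM docs): 1004 = Tensor Cores active,
# 1006 = fp64 pipe, 1007 = fp32 pipe, 1008 = fp16 pipe.
PRECISIONS = {"fp16": 1008, "fp32": 1007, "fp64": 1006, "fp16-tensor": 1004}

def run_flops_test(field_id: int, duration_s: int = 120) -> None:
    subprocess.run(
        ["dcgmproftester11",
         "--no-dcgm-validation",   # only generate load, skip metric validation
         "-t", str(field_id),      # which pipe to stress
         "-d", str(duration_s)],   # duration of the test, in seconds
        check=True,
    )

if __name__ == "__main__":
    for name, field_id in PRECISIONS.items():
        print(f"Running {name} load...")
        run_flops_test(field_id)
```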
fp16
|  | Passthrough | Shared x1 | Shared x2 | Shared x4 | Shared x8 |
|---|---|---|---|---|---|
| Average TFLOPS per process | 32.866 | 32.700 | 15.933 | 7.956 | 3.968 |
| Average TFLOPS per process * number of processes | 32.866 | 32.700 | 31.867 | 31.824 | 31.745 |
| Performance Loss (compared to Passthrough) | - | 0.5% | 3.03% | 3.17% | 3.41% |
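To make the last two rows explicit, here is the arithmetic behind the fp16 numbers (the same formulas are used for the other precisions; small differences versus the table come from rounding of the per-process values):

```python
# Reproduce the aggregate throughput and loss rows of the fp16 table.
passthrough = 32.866                    # TFLOPS, single process, no time slicing
per_process = {1: 32.700, 2: 15.933, 4: 7.956, 8: 3.968}

for n, tflops in per_process.items():
    aggregate = tflops * n                          # "per process * number of processes"
    loss = (passthrough - aggregate) / passthrough  # relative to Passthrough
    print(f"Shared x{n}: aggregate = {aggregate:.3f} TFLOPS, loss = {loss:.2%}")
# Shared x1: aggregate = 32.700 TFLOPS, loss = 0.51%
# Shared x2: aggregate = 31.866 TFLOPS, loss = 3.04%
# ...
```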
fp32
|  | Passthrough | Shared x1 | Shared x2 | Shared x4 | Shared x8 |
|---|---|---|---|---|---|
| Average TFLOPS per process | 16.898 | 16.879 | 7.880 | 3.945 | 1.974 |
| Average TFLOPS per process * number of processes | 16.898 | 16.879 | 15.76 | 15.783 | 15.795 |
| Performance Loss (compared to Passthrough) | - | 0.11% | 6.73% | 6.59% | 6.52% |
fp64
|  | Passthrough | Shared x1 | Shared x2 | Shared x4 | Shared x8 |
|---|---|---|---|---|---|
| Average TFLOPS per process | 8.052 | 8.050 | 3.762 | 1.871 | 0.939 |
| Average TFLOPS per process * number of processes | 8.052 | 8.050 | 7.524 | 7.486 | 7.515 |
| Performance Loss (compared to Passthrough) | - | 0.02% | 6.55% | 7.03% | 6.67% |
fp16 Tensor Cores
|  | Passthrough | Shared x1 | Shared x2 | Shared x4 | Shared x8 |
|---|---|---|---|---|---|
| Average TFLOPS per process | 165.992 | 165.697 | 81.850 | 41.161 | 20.627 |
| Average TFLOPS per process * number of processes | 165.992 | 165.697 | 163.715 | 164.645 | 165.021 |
| Performance Loss (compared to Passthrough) | - | 0.17% | 1.37% | 0.81% | 0.58% |
Conclusions
- If time slicing is enabled but only one process is using the GPU (shared x1), the time slicing penalty is negligible (≤0.5%).
- When the GPU needs to perform context switching (shared x2), there is a ~6% performance loss for fp32 and fp64, ~3% for fp16, and ~1.4% for fp16 on Tensor Cores.
- Increasing the number of processes sharing the GPU introduces no additional penalty: the loss is roughly the same for shared x2, shared x4, and shared x8.
- It is not yet understood why the performance loss differs between data formats.
Compute-Intensive Particle Simulation
An important part of CERN computing is dedicated to simulation. These are compute-intensive operations that can benefit significantly from GPU usage. For this benchmark we rely on the lhc simpletrack simulation. For more information, consult the previous blog post.
Passthrough vs Shared x1
Number of particles | Passthrough [s] | Shared x1 [s] | Loss [%] |
---|---|---|---|
5 000 000 | 26.365 | 27.03 | 2.52 |
10 000 000 | 51.135 | 51.93 | 1.55 |
15 000 000 | 76.374 | 77.12 | 0.97 |
20 000 000 | 99.55 | 99.91 | 0.36 |
30 000 000 | 151.57 | 152.61 | 0.68 |
Shared x1 vs Shared x2
Number of particles | Shared x1 [s] | Expected Shared x2 = 2*Shared x1 [s] | Actual Shared x2 [s] | Loss [%] |
---|---|---|---|---|
5 000 000 | 27.03 | 54.06 | 72.59 | 34.27 |
10 000 000 | 51.93 | 103.86 | 138.76 | 33.6 |
15 000 000 | 77.12 | 154.24 | 212.71 | 37.9 |
20 000 000 | 99.91 | 199.82 | 276.23 | 38.23 |
30 000 000 | 152.61 | 305.22 | 423.08 | 38.61 |
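For clarity, the “Loss” column in these scaling tables compares the measured time against the ideal doubling of the previous configuration’s time. A quick check for the first row above:

```python
# Check of the "Loss [%]" column: loss relative to the ideal 2x scaling.
def scaling_loss(previous_s: float, actual_s: float) -> float:
    expected_s = 2 * previous_s        # ideal: exactly twice the previous runtime
    return (actual_s - expected_s) / expected_s

# 5 000 000 particles, Shared x1 -> Shared x2
print(f"{scaling_loss(27.03, 72.59):.2%}")   # ~34.3%, matching the table up to rounding
```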
Shared x2 vs Shared x4
Number of particles | Shared x2 [s] | Expected Shared x4 = 2*Shared x2 [s] | Actual Shared x4 [s] | Loss [%] |
---|---|---|---|---|
5 000 000 | 72.59 | 145.18 | 142.63 | 0 |
10 000 000 | 138.76 | 277.52 | 281.98 | 1.6 |
15 000 000 | 212.71 | 425.42 | 421.55 | 0 |
20 000 000 | 276.23 | 552.46 | 546.19 | 0 |
30 000 000 | 423.08 | 846.16 | 838.55 | 0 |
Shared x4 vs Shared x8
Number of particles | Shared x4 [s] | Expected Shared x8 = 2*Shared x4 [s] | Actual Shared x8 [s] | Loss [%] |
---|---|---|---|---|
5 000 000 | 142.63 | 285.26 | 282.56 | 0 |
10 000 000 | 281.98 | 563.96 | 561.98 | 0 |
15 000 000 | 421.55 | 843.1 | 838.22 | 0 |
20 000 000 | 546.19 | 1092.38 | 1087.99 | 0 |
30 000 000 | 838.55 | 1677.1 | 1672.95 | 0 |
Conclusions
- The performance loss when enabling time slicing (shared x1) is small (at most ~2.5%).
- If the GPU needs to perform context switching (going from shared x1 to shared x2), the execution time grows by roughly 2.7x instead of the ideal 2x. This corresponds to a performance loss of ~34-39%.
- Further increasing the number of processes (shared x4, shared x8) doesn’t introduce additional performance loss.
Machine Learning Training
For benchmarking we fine-tune a pretrained model with PyTorch. To maximize GPU utilization, make sure the script is not CPU-bound by increasing the number of data loader workers and the batch size. More details can be found in the previous blog post; a minimal sketch of the setup is shown below.
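The sketch assumes the Hugging Face Trainer API; the model checkpoint, dataset, and output directory are placeholders, and the TrainingArguments values shown match the Passthrough / Shared x1 runs below (they are reduced for the more heavily shared configurations, as listed before each table).

```python
# Sketch of the fine-tuning setup, assuming the Hugging Face Trainer API.
# Checkpoint and dataset are placeholders; the TrainingArguments shown are the
# Passthrough / Shared x1 values and are reduced for the more shared setups.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"                 # placeholder pretrained model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

dataset = load_dataset("imdb", split="train[:5000]")   # placeholder dataset
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                              padding="max_length"), batched=True)

args = TrainingArguments(
    output_dir="./benchmark-output",
    num_train_epochs=1,
    per_device_train_batch_size=48,   # larger batches keep the GPU busy
    per_device_eval_batch_size=48,
    dataloader_num_workers=8,         # avoid being CPU/IO-bound on data loading
)

Trainer(model=model, args=args, train_dataset=dataset).train()
```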
Passthrough vs Shared x1
TrainingArguments:
- per_device_train_batch_size=48
- per_device_eval_batch_size=48
- dataloader_num_workers=8
Number of samples | Passthrough [s] | Shared x1 [s] | Loss [%] |
---|---|---|---|
500 | 16.497 | 16.6078 | 0.67 |
1 000 | 31.2464 | 31.4142 | 0.53 |
2 000 | 61.1451 | 61.3885 | 0.39 |
5 000 | 150.8432 | 151.1182 | 0.18 |
10 000 | 302.2547 | 302.4283 | 0.05 |
Shared x1 vs Shared x2
TrainingArguments:
- per_device_train_batch_size=24
- per_device_eval_batch_size=24
- dataloader_num_workers=4
Number of samples | Shared x1 [s] | Expected Shared x2 = 2*Shared x1 [s] | Actual Shared x2 [s] | Loss [%] |
---|---|---|---|---|
500 | 16.9597 | 33.9194 | 36.7628 | 8.38 |
1 000 | 32.8355 | 65.671 | 72.9985 | 11.15 |
2 000 | 64.2533 | 128.5066 | 143.3033 | 11.51 |
5 000 | 161.5249 | 323.0498 | 355.0302 | 9.89 |
Shared x2 vs Shared x4
TrainingArguments:
- per_device_train_batch_size=12
- per_device_eval_batch_size=12
- dataloader_num_workers=2
Number of samples | Shared x2 [s] | Expected Shared x4 = 2*Shared x2 [s] | Actual Shared x4 [s] | Loss [%] |
---|---|---|---|---|
500 | 39.187 | 78.374 | 77.2388 | 0 |
1 000 | 77.3014 | 154.6028 | 153.4177 | 0 |
2 000 | 154.294 | 308.588 | 306.0012 | 0 |
5 000 | 385.6539 | 771.3078 | 762.5113 | 0 |
Shared x4 vs Shared x8
TrainingArguments:
- per_device_train_batch_size=4
- per_device_eval_batch_size=4
- dataloader_num_workers=1
Number of samples | Shared x4 [s] | Expected Shared x8 = 2*Shared x4 [s] | Actual Shared x8 [s] | Loss [%] |
---|---|---|---|---|
500 | 104.6849 | 209.3698 | 212.6313 | 1.55 |
1 000 | 185.1633 | 370.3266 | 381.7454 | 3.08 |
2 000 | 397.8525 | 795.705 | 816.353 | 2.59 |
5 000 | 1001.752 | 2003.504 | 1999.2395 | 0 |
Conclusions
- The loss when performing ML training on a GPU with time slicing enabled is negligible (<0.7%).
- When scaling from shared x1 to shared x2, the overall execution time increases by ~2.2x (versus the ideal 2x). The time slicing loss is around 8-12% in this case.
- Further increasing the number of processes (shared x4, shared x8) barely influences performance (0-3% additional loss).
Takeaways
- Considering the potential GPU utilization improvements, the penalty introduced by enabling time slicing while only one process uses the GPU (shared x1) can be disregarded.
- There is a variable penalty introduced when the GPU needs to perform context switching (shared x1 vs shared x2).
- For more than 2 processes sharing the GPU, the execution time scales linearly (no extra penalty if we increase the number of processes sharing a GPU).
- The penalty introduced by time slicing can be very large, depending on the use case:
  - When the GPU was running context-switch-sensitive workloads, the penalty introduced by time slicing was about 38%.
  - If the program is I/O-bound, CPU-bound, heavy on Tensor Core utilization, etc., the penalty is smaller. For instance, with the ML training the penalty dropped to ~11%.
- If the processes consume more memory than is available, some of them will be killed with an out-of-memory (OOM) error. Memory limits or priorities cannot be set with time slicing; we discussed earlier how we try to work around this (see more).
Time slicing can potentially introduce a big performance penalty. Even so, when applied to the right use cases, it can be a powerful way of boosting GPU utilization. Consult the available overview to find out more.
Next episode
In the next blog post, we will dive into the extensive MIG benchmarking. Stay tuned!