GPU Performance Optimisation
Copyright Continuum Labs - 2023
https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html
This guide provides background on the structure of a GPU, how operations are executed, and common limitations with deep learning operations.
Understanding the basics of GPU execution is helpful when reasoning about how efficiently particular layers or neural networks utilize a given GPU. This guide describes:
The basic structure of a GPU (GPU Architecture Fundamentals)
How operations are divided and executed in parallel (GPU Execution Model)
How to estimate performance limitations with arithmetic intensity (Understanding Performance)
Loose categories of deep learning operations and the performance limitations that apply to each (DNN Operation Categories)
The GPU is a highly parallel processor architecture, consisting of processing elements and a memory hierarchy.
NVIDIA® GPUs typically include a number of Streaming Multiprocessors (SMs), on-chip L2 cache, and high-bandwidth DRAM.
Arithmetic and other instructions are executed by the SMs; data and code are accessed from DRAM via the L2 cache.
For example, an NVIDIA A100 GPU contains 108 SMs, a 40 MB L2 cache, and up to 2039 GB/s bandwidth from 80 GB of HBM2 memory.
Each SM has its own instruction schedulers and various instruction execution pipelines.
Multiply-add operations are the most frequent operations in modern neural networks, acting as the building block for fully-connected and convolutional layers, both of which can be viewed as collections of vector dot-products.
A single SM's multiply-add throughput per clock depends on the data type and the GPU architecture; NVIDIA's deep learning performance documentation tabulates these rates for recent architectures.
Each multiply-add comprises two operations, so multiplying the multiply-add throughput by 2 gives the FLOP count per clock.
To get the GPU's FLOPS rate, multiply that by the number of SMs and the SM clock rate. For example, an A100 GPU with 108 SMs and a 1.41 GHz clock rate has peak dense throughputs of 156 TF32 TFLOPS and 312 FP16 TFLOPS.
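As a rough sanity check on those A100 figures, the arithmetic can be scripted. The per-SM rates below (512 TF32 and 1024 FP16 dense Tensor Core multiply-adds per clock) are assumptions that should be verified against NVIDIA's published tables.

```python
# A100 peak throughput sketch: SMs x clock x multiply-adds/clock x 2 FLOPs.
num_sms = 108
sm_clock_hz = 1.41e9                                # boost clock
fma_per_sm_per_clock = {"TF32": 512, "FP16": 1024}  # dense Tensor Core rates (assumed)

for dtype, fma_rate in fma_per_sm_per_clock.items():
    peak_flops = num_sms * sm_clock_hz * fma_rate * 2   # each multiply-add = 2 FLOPs
    print(f"{dtype}: {peak_flops / 1e12:.0f} TFLOPS")
# TF32: 156 TFLOPS, FP16: 312 TFLOPS
```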
FP16 operations can be executed in either Tensor Cores or NVIDIA CUDA® cores.
Furthermore, the NVIDIA Turing™ architecture can execute INT8 operations in either Tensor Cores or CUDA cores.
Tensor Cores were introduced in the NVIDIA Volta™ GPU architecture to accelerate matrix multiply and accumulate operations for machine learning and scientific applications.
These instructions operate on small matrix blocks (for example, 4x4 blocks). Note that Tensor Cores can compute and accumulate products in higher precision than the inputs.
For example, during training with FP16 inputs, Tensor Cores can compute products without loss of precision and accumulate in FP32.
When math operations cannot be formulated in terms of matrix blocks they are executed in other CUDA cores. For example, the element-wise addition of two half-precision tensors would be performed by CUDA cores, rather than Tensor Cores.
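The value of higher-precision accumulation is easy to see even outside GPU code. The following NumPy sketch is purely illustrative (host-side arithmetic, not Tensor Core code): it accumulates the same FP16 products in an FP16 accumulator and an FP32 accumulator and compares both against a double-precision reference.

```python
import numpy as np

# Illustrative only: summing many FP16 products in an FP16 accumulator loses
# precision, while accumulating the same products in FP32 stays close to a
# double-precision reference.
rng = np.random.default_rng(0)
a = rng.standard_normal(16384).astype(np.float16)
b = rng.standard_normal(16384).astype(np.float16)

acc16 = np.float16(0.0)   # FP16 accumulator: each addition rounds to ~3 decimal digits
acc32 = np.float32(0.0)   # FP32 accumulator, as Tensor Cores use for FP16 inputs
for x, y in zip(a, b):
    acc16 = np.float16(acc16 + x * y)
    acc32 += np.float32(x) * np.float32(y)

ref = np.dot(a.astype(np.float64), b.astype(np.float64))
print(f"FP16-accumulate error: {abs(acc16 - ref):.4f}")
print(f"FP32-accumulate error: {abs(acc32 - ref):.6f}")
```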
To utilise their parallel resources, GPUs execute many threads concurrently. There are two critical concepts to understanding how thread count relates to GPU performance:
GPUs execute functions using a 2-level hierarchy of threads. Threads for a given function are grouped into equally-sized thread blocks, and a set of thread blocks are launched to execute the function.
GPUs hide dependent instruction latency by switching to the execution of other threads. Thus, the number of threads needed to effectively utilize a GPU is much higher than the number of cores or instruction pipelines.
The 2-level thread hierarchy results from GPUs having many SMs, each capable of executing many threads and enabling its threads to communicate via shared memory and synchronization.
At runtime, a thread block is placed on an SM for execution, allowing all threads in a thread block to communicate and synchronize efficiently.
Launching a function with a single thread block would only activate a single SM; therefore, to fully utilize a GPU with multiple SMs, one needs to launch many thread blocks.
Since an SM can execute multiple thread blocks concurrently, typically one wants the number of thread blocks to be several times higher than the number of SMs to minimize the "tail" effect, where at the end of a function execution only a few active thread blocks remain, thus underutilizing the GPU.
Consider, for example, launching 12 thread blocks on a GPU with 8 SMs: the blocks execute in 2 waves, where the first wave utilizes 100% of the GPU and the second wave utilizes only 50%.
This "tail effect" shows the inefficiency that occurs when fewer thread blocks are active towards the end of a function’s execution. Optimizing the number of thread blocks and understanding the execution model are crucial for achieving maximum GPU utilization.
Performance of a function on a given processor is determined by memory bandwidth, math bandwidth, and latency.
Consider a function that reads its input from memory, performs math operations, and then writes its output back to memory. The time spent can be analyzed as follows:
Memory Time: Time spent accessing memory.
Math Time: Time spent performing math operations.
Assuming memory and math operations can overlap, the total time for the function can be represented as:
Total Time = max(Memory Time, Math Time)
This demonstrates the performance limitation:
If Math Time > Memory Time, the function is math-limited.
If Memory Time > Math Time, the function is memory-limited.
The time spent on memory and math depends on the algorithm, its implementation, and the processor's capabilities:
Memory Time = Number of bytes accessed / Memory bandwidth
Math Time = Number of operations / Math bandwidth
To determine whether a function is math- or memory-limited, consider the following inequality (math time exceeding memory time):
Number of operations / Math bandwidth > Number of bytes accessed / Memory bandwidth
This can be rearranged to:
Number of operations / Number of bytes accessed > Math bandwidth / Memory bandwidth
Where:
Arithmetic Intensity is the ratio of the number of operations to the number of bytes accessed (the left-hand side): Arithmetic Intensity = Number of operations / Number of bytes accessed
Ops:Byte Ratio is the ratio of the processor's math bandwidth to its memory bandwidth (the right-hand side): Ops:Byte Ratio = Math bandwidth / Memory bandwidth
Thus, a function is math-limited if its arithmetic intensity is higher than the processor's ops:byte ratio. Conversely, it is memory-limited if the arithmetic intensity is lower.
Let's consider examples from deep neural networks on an NVIDIA Volta V100 GPU:
V100 Specifications:
Peak math rate: 125 FP16 Tensor TFLOPS
Off-chip memory bandwidth: approx. 900 GB/s
On-chip L2 bandwidth: 3.1 TB/s
Ops:Byte ratio between 40 and 139, depending on the source of an operation’s data (on-chip or off-chip memory).
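The 40 to 139 range follows directly from these specifications, dividing the peak math rate by the relevant memory bandwidth:

```python
peak_math = 125e12        # V100 FP16 Tensor Core peak, FLOPS
dram_bandwidth = 900e9    # off-chip HBM2, bytes/s
l2_bandwidth = 3.1e12     # on-chip L2, bytes/s

print(f"ops:byte vs off-chip DRAM: {peak_math / dram_bandwidth:.0f}")  # ~139
print(f"ops:byte vs on-chip L2:    {peak_math / l2_bandwidth:.0f}")    # ~40
```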
This table lists typical neural network operations, their arithmetic intensity values, and the typical limiting factor (whether they are arithmetic or memory limited) when using FP16 data and an NVIDIA Volta V100 GPU.
| Operation | Arithmetic Intensity | Usually limited by |
| --- | --- | --- |
| Linear layer (4096 outputs, 1024 inputs, batch size 512) | 315 FLOPS/B | arithmetic |
| Linear layer (4096 outputs, 1024 inputs, batch size 1) | 1 FLOPS/B | memory |
| Max pooling with 3x3 window and unit stride | 2.25 FLOPS/B | memory |
| ReLU activation | 0.25 FLOPS/B | memory |
| Layer normalization | < 10 FLOPS/B | memory |
As the table illustrates, many common operations have low arithmetic intensities - sometimes only performing a single operation per two-byte element read from and written to memory.
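The linear-layer and ReLU rows can be reproduced from the definitions above. The sketch below assumes FP16 data (2 bytes per element), counts each multiply-add as 2 FLOPs, and assumes every tensor is moved to or from memory exactly once.

```python
def linear_layer_ai(batch, inputs, outputs, bytes_per_element=2):
    """Arithmetic intensity of a fully-connected layer treated as a
    (batch x inputs) by (inputs x outputs) matrix multiply."""
    flops = 2 * batch * inputs * outputs                # each multiply-add = 2 FLOPs
    bytes_moved = (batch * inputs + inputs * outputs + batch * outputs) * bytes_per_element
    return flops / bytes_moved

print(linear_layer_ai(512, 1024, 4096))  # ~315 FLOPS/B -> math-limited on V100
print(linear_layer_ai(1, 1024, 4096))    # ~1 FLOPS/B   -> memory-limited

# ReLU on FP16: 1 operation per element, 2 bytes read + 2 bytes written.
print(1 / (2 + 2))                       # 0.25 FLOPS/B
```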
This type of analysis is a simplification, as it counts only the algorithmic operations used. In practice, functions also contain instructions for operations not explicitly expressed in the algorithm, such as memory access instructions, address calculation instructions, control flow instructions, and so on.
The arithmetic intensity and ops:byte ratio analysis assumes that a workload is sufficiently large to saturate a given processor’s math and memory pipelines.
However, if the workload is not large enough, or does not have sufficient parallelism, the processor will be under-utilised, and performance will be limited by latency.
For example, consider the launch of a single thread that will access 16 bytes and perform 16,000 math operations.
While the arithmetic intensity is 1000 FLOPS/B and the execution should be math-limited on a V100 GPU, creating only a single thread grossly under-utilises the GPU, leaving nearly all of its math pipelines and execution resources idle.
Furthermore, the arithmetic intensity calculation assumes that inputs and outputs are accessed from memory exactly once.
It is not unusual for algorithm implementations to read input elements multiple times, which would effectively reduce arithmetic intensity. Thus, the arithmetic intensity is a first-order approximation; profiler information should be used if more accurate analysis is needed.
Deep Neural Networks (DNNs) utilize various layers, categorized based on their computational characteristics:
Elementwise Operations
These operations include unary and binary functions that apply a mathematical operation independently to each element of a tensor. Examples include ReLU, sigmoid, and addition. These are generally memory-limited because they perform a relatively small number of operations per byte of data accessed.
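For example, adding two FP16 tensors elementwise performs one operation per output element while moving six bytes (two 2-byte reads and one 2-byte write):

```python
# Elementwise FP16 add: 1 FLOP per element, 2 + 2 bytes read, 2 bytes written.
print(f"{1 / (2 + 2 + 2):.2f} FLOPS/B")   # 0.17 -- far below any GPU's ops:byte ratio
```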
Reduction Operations
Reduction operations generate outputs by aggregating over ranges of inputs, such as pooling layers or batch normalization. These typically have low arithmetic intensities and are usually memory-limited due to their operational nature.
Dot-Product Operations
This category covers operations expressed as dot products between tensors, including fully-connected layers and convolutions. These operations can be either math-limited or memory-limited, depending on the size of the matrices involved. Large matrix operations tend to be math-limited, while smaller ones might be memory-limited.
Figure 4. Diagrammatic representation of dot-product operations.
Optimizing GPU performance involves adjusting the scale of thread operations and maximizing the use of the GPU’s mathematical and memory handling capabilities. Understanding these categories helps in designing efficient neural networks that make the best use of the available hardware resources.
To create a comprehensive process for assessing functions and modeling GPU requirements for deep learning or other GPU-intensive applications, follow this structured approach. It outlines how to determine what limits the performance of a particular function on a given GPU and what you might do to address these limitations.
Step 1: Understand GPU Specifications
Number of SMs: Look up the number of Streaming Multiprocessors (SMs) on the GPU. This will give you an indication of the parallel processing power of the GPU.
Ops:Byte Ratio: Determine the ops:byte ratio for the GPU (its math bandwidth divided by its memory bandwidth). This ratio helps in understanding the balance between computational power and memory bandwidth.
Step 2: Compute Arithmetic Intensity
Arithmetic Intensity Calculation: Compute the arithmetic intensity of the algorithm, which is the ratio of the number of operations (FLOPs) to the number of bytes accessed. This measure helps determine whether the algorithm is computation-heavy or memory-heavy.
Step 3: Estimate GPU Utilisation
Parallelism Assessment: Determine if there is sufficient parallelism to effectively utilize the GPU. This involves estimating the number and size of thread blocks:
If the number of thread blocks is at least roughly four times higher than the number of SMs and each thread block consists of a few hundred threads, then there is likely sufficient parallelism.
Insufficient thread blocks or threads per block may indicate that the GPU will not be fully utilized.
Step 4: Reference Specific Guides for Optimization
Layer-Specific Optimization Guides: Consult NVIDIA's specific optimization guides based on the type of layers or operations you are using. For example:
Linear/Fully-Connected Layers: Look for techniques in the NVIDIA Optimizing Linear/Fully-Connected Layers User's Guide.
Convolutional Layers: Refer to the NVIDIA Optimizing Convolutional Layers User's Guide.
Recurrent Layers: Check the NVIDIA Optimizing Recurrent Layers User's Guide.
Memory-Bound Layers: While typically memory-limited, you can find useful tips in the NVIDIA Optimizing Memory-Bound Layers User's Guide.
Step 5: Determine the Performance Limiter
Identifying Limiters: Based on the arithmetic intensity and parallelism, determine the most likely performance limiter (a combined sketch follows this list):
Latency: If there is not sufficient parallelism, latency due to inadequate utilization of computational resources is likely the limiter.
Math: If there is sufficient parallelism and the arithmetic intensity is higher than the GPU's ops:byte ratio, then the performance is likely math-limited.
Memory: If there is sufficient parallelism and the arithmetic intensity is lower than the GPU's ops:byte ratio, then the performance is likely memory-limited.
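Putting Steps 1 through 5 together, a minimal triage function might look like the sketch below. The thresholds encode the Step 3 rules of thumb, and the example launch configuration and byte counts are illustrative assumptions rather than measured kernel parameters.

```python
def likely_limiter(num_ops, num_bytes, num_blocks, threads_per_block,
                   num_sms, math_bandwidth, memory_bandwidth):
    """First-order performance triage: latency-, math-, or memory-limited."""
    # Step 3 rule of thumb: ~4x more blocks than SMs, a few hundred threads per block.
    if num_blocks < 4 * num_sms or threads_per_block < 128:
        return "latency-limited (insufficient parallelism)"
    arithmetic_intensity = num_ops / num_bytes              # FLOPS per byte
    ops_byte_ratio = math_bandwidth / memory_bandwidth      # FLOPS per byte
    return "math-limited" if arithmetic_intensity > ops_byte_ratio else "memory-limited"

# Batch-512 linear layer from the table above, on a V100 (80 SMs, 125 TFLOPS
# FP16 Tensor Core peak, ~900 GB/s HBM2); grid/block sizes are illustrative.
flops = 2 * 512 * 1024 * 4096
bytes_moved = 2 * (512 * 1024 + 1024 * 4096 + 512 * 4096)
print(likely_limiter(flops, bytes_moved, num_blocks=1024, threads_per_block=256,
                     num_sms=80, math_bandwidth=125e12, memory_bandwidth=900e9))
# -> math-limited
```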
This process helps in systematically evaluating and optimizing the performance of various functions on GPUs, especially for deep learning applications.
By understanding the interplay between hardware specifications, algorithm characteristics, and execution models, developers can better harness the computational capabilities of GPUs.