GPU Performance Optimisation


Abstract

This guide provides background on the structure of a GPU, how operations are executed, and common limitations with deep learning operations.


Overview

Understanding the basics of GPU execution is helpful when reasoning about how efficiently particular layers or neural networks utilize a given GPU. This guide describes:

  • The basic structure of a GPU (GPU Architecture Fundamentals)

  • How operations are divided and executed in parallel (GPU Execution Model)

  • How to estimate performance limitations with arithmetic intensity (Understanding Performance)

  • Loose categories of deep learning operations and the performance limitations that apply to each (DNN Operation Categories)


GPU Architecture Fundamentals

The GPU is a highly parallel processor architecture, consisting of processing elements and a memory hierarchy.

NVIDIA® GPUs typically include a number of Streaming Multiprocessors (SMs), on-chip L2 cache, and high-bandwidth DRAM.

Arithmetic and other instructions are executed by the SMs; data and code are accessed from DRAM via the L2 cache.

For example, an NVIDIA A100 GPU contains 108 SMs, a 40 MB L2 cache, and up to 2039 GB/s bandwidth from 80 GB of HBM2 memory.

Each SM has its own instruction schedulers and various instruction execution pipelines.

The multiply-add is the most frequent operation in modern neural networks; it is the building block of fully-connected and convolutional layers, both of which can be viewed as collections of vector dot-products.

Figure 2 lists a single SM's multiply-add operations per clock for various data types on NVIDIA's recent GPU architectures.

Each multiply-add comprises two operations, so the per-clock throughputs in Figure 2 are multiplied by 2 to obtain FLOP counts per clock.

To get the FLOPS rate for a GPU, these figures are then multiplied by the number of SMs and the SM clock rate. For example, an A100 GPU with 108 SMs and a 1.41 GHz clock rate has peak dense throughputs of 156 TF32 TFLOPS and 312 FP16 TFLOPS.
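As a rough sanity check, these peak figures can be reproduced from per-SM rates. The sketch below assumes A100 Tensor Core throughputs of 512 TF32 and 1,024 FP16 multiply-adds per clock per SM; it is illustrative only.

```python
# Minimal sketch: reproduce A100 peak dense Tensor Core throughput from
# per-SM multiply-add rates (per-SM figures are assumptions, not from this guide).

def peak_tflops(num_sms: int, clock_ghz: float, fma_per_clock_per_sm: int) -> float:
    """Peak throughput in TFLOPS: each multiply-add (FMA) counts as 2 FLOPs."""
    flops = num_sms * clock_ghz * 1e9 * fma_per_clock_per_sm * 2
    return flops / 1e12

# NVIDIA A100: 108 SMs, 1.41 GHz clock rate (as quoted in the text above)
print(f"TF32 Tensor Core: {peak_tflops(108, 1.41, 512):.0f} TFLOPS")   # ~156
print(f"FP16 Tensor Core: {peak_tflops(108, 1.41, 1024):.0f} TFLOPS")  # ~312
```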

As shown in Figure 2, FP16 operations can be executed in either Tensor Cores or NVIDIA CUDA® cores.

Furthermore, the NVIDIA Turing™ architecture can execute INT8 operations in either Tensor Cores or CUDA cores.

Tensor Cores were introduced in the NVIDIA Volta™ GPU architecture to accelerate matrix multiply and accumulate operations for machine learning and scientific applications.

These instructions operate on small matrix blocks (for example, 4x4 blocks). Note that Tensor Cores can compute and accumulate products in higher precision than the inputs.

For example, during training with FP16 inputs, Tensor Cores can compute products without loss of precision and accumulate in FP32.

When math operations cannot be formulated in terms of matrix blocks they are executed in other CUDA cores. For example, the element-wise addition of two half-precision tensors would be performed by CUDA cores, rather than Tensor Cores.


GPU Execution Model

To utilise their parallel resources, GPUs execute many threads concurrently. There are two concepts critical to understanding how thread count relates to GPU performance:

  • GPUs execute functions using a 2-level hierarchy of threads. Threads for a given function are grouped into equally-sized thread blocks, and a set of thread blocks are launched to execute the function.

  • GPUs hide dependent instruction latency by switching to the execution of other threads. Thus, the number of threads needed to effectively utilize a GPU is much higher than the number of cores or instruction pipelines.

The 2-level thread hierarchy results from GPUs having many SMs, each capable of executing many threads and enabling its threads to communicate via shared memory and synchronization.

At runtime, a thread block is placed on an SM for execution, allowing all threads in a thread block to communicate and synchronize efficiently.

Launching a function with a single thread block would only activate a single SM, therefore, to fully utilize a GPU with multiple SMs one needs to launch many thread blocks.

Since an SM can execute multiple thread blocks concurrently, typically one wants the number of thread blocks to be several times higher than the number of SMs to minimize the "tail" effect, where at the end of a function execution only a few active thread blocks remain, thus underutilizing the GPU.

Figure 3 illustrates this with 12 thread blocks launched on an 8-SM GPU at an occupancy of 1 block per SM: the blocks execute in 2 waves, where the first wave utilises 100% of the GPU and the second wave utilises only 50%.

This "tail effect" shows the inefficiency that occurs when fewer thread blocks are active towards the end of a function’s execution. Optimizing the number of thread blocks and understanding the execution model are crucial for achieving maximum GPU utilization.

Understanding Performance

Performance of a function on a given processor is determined by memory bandwidth, math bandwidth, and latency.

Consider a function that reads its input from memory, performs math operations, and then writes its output back to memory. Let $t_{\text{memory}}$ be the time spent accessing memory and $t_{\text{math}}$ the time spent performing math operations. Assuming memory and math operations can overlap, the total time for the function is:

$$ T = \max(t_{\text{memory}}, t_{\text{math}}) $$

This demonstrates the performance limitation: if $t_{\text{math}} > t_{\text{memory}}$, the function is math-limited; if $t_{\text{memory}} > t_{\text{math}}$, it is memory-limited.

The time spent on memory and math depends on the algorithm, its implementation, and the processor's capabilities:

$$ t_{\text{memory}} = \frac{\text{Number of bytes accessed}}{\text{Memory bandwidth}}, \qquad t_{\text{math}} = \frac{\text{Number of operations}}{\text{Math bandwidth}} $$

To determine whether a function is math- or memory-limited, consider the following inequality:

$$ \frac{\text{Number of operations}}{\text{Number of bytes accessed}} > \frac{\text{Math bandwidth}}{\text{Memory bandwidth}} $$

The left-hand side is the algorithm's arithmetic intensity; the right-hand side is the processor's ops:byte ratio:

$$ \text{Arithmetic Intensity} = \frac{\text{Number of operations}}{\text{Number of bytes accessed}}, \qquad \text{Ops:Byte Ratio} = \frac{\text{Math bandwidth}}{\text{Memory bandwidth}} $$

Thus, a function is math-limited if its arithmetic intensity is higher than the processor's ops:byte ratio. Conversely, it is memory-limited if the arithmetic intensity is lower.

Example Analysis

Let's consider examples from deep neural networks on an NVIDIA Volta V100 GPU:

  • V100 Specifications:

    • Peak math rate: 125 FP16 Tensor TFLOPS

    • Off-chip memory bandwidth: approx. 900 GB/s

    • On-chip L2 bandwidth: 3.1 TB/s

    • Ops:Byte ratio between 40 and 139, depending on the source of an operation’s data (on-chip or off-chip memory).
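The 40 to 139 range quoted above follows directly from the ops:byte definition. A quick check, treating the listed specifications as assumptions:

```python
# V100 ops:byte ratios for FP16 Tensor Core math (spec values from the list above)
peak_math_flops = 125e12        # 125 FP16 Tensor TFLOPS
hbm_bandwidth   = 900e9         # ~900 GB/s off-chip memory bandwidth
l2_bandwidth    = 3.1e12        # 3.1 TB/s on-chip L2 bandwidth

print(peak_math_flops / hbm_bandwidth)  # ~139 FLOPS/B when data comes from off-chip memory
print(peak_math_flops / l2_bandwidth)   # ~40 FLOPS/B when data comes from the L2 cache
```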

The performance of a function on a GPU is influenced by three primary factors: memory bandwidth, mathematical operation bandwidth (math bandwidth), and latency.

Simplified Performance Model

Consider a scenario where:

  • Memory Time: The time spent accessing input from memory and writing output to memory.

  • Math Time: The time spent performing mathematical computations.

If these operations can overlap (concurrent execution of memory and math tasks), the total time for a function is determined by the longer of the two durations.

This leads to:

  • Math-limited: The function is considered math-limited if the math time exceeds the memory time.

  • Memory-limited: Conversely, if the memory time is longer, the function is memory-limited.

The time spent on memory or mathematical operations depends on both the algorithm's design and its implementation, as well as the processor’s capabilities:

  • Memory Time = (Number of bytes accessed / Memory bandwidth)

  • Math Time = (Number of operations / Math bandwidth)

A function is math-limited if the following condition holds true:

$$ \text{Arithmetic Intensity} > \text{Ops:Byte Ratio} $$

Where:

  • Arithmetic Intensity is the ratio of the number of operations to the number of bytes accessed:

$$ \text{Arithmetic Intensity} = \frac{\text{Number of operations}}{\text{Number of bytes accessed}} $$

  • Ops:Byte Ratio is the ratio of the processor's math bandwidth to its memory bandwidth:

$$ \text{Ops:Byte Ratio} = \frac{\text{Math bandwidth}}{\text{Memory bandwidth}} $$
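A minimal sketch of this model, using made-up operation and byte counts and V100-like bandwidth figures as assumptions:

```python
def limiter(ops: float, bytes_accessed: float,
            math_bw_flops: float, mem_bw_bytes: float) -> str:
    """Classify a function as math- or memory-limited under the simplified model."""
    t_math = ops / math_bw_flops              # time spent on math
    t_memory = bytes_accessed / mem_bw_bytes  # time spent moving data
    return "math-limited" if t_math > t_memory else "memory-limited"

# Example inputs: ~4.3 GFLOP with ~13.6 MB moved, then ~8.4 MFLOP with ~8.4 MB moved,
# on a V100-like device (125 TFLOPS FP16 Tensor math, 900 GB/s memory bandwidth)
print(limiter(4.3e9, 13.6e6, 125e12, 900e9))   # math-limited
print(limiter(8.4e6, 8.4e6, 125e12, 900e9))    # memory-limited
```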

Examples of Neural Network Operations with Their Arithmetic Intensities

This table lists typical neural network operations, their arithmetic intensity values, and the typical limiting factor (whether they are arithmetic or memory limited) when using FP16 data and an NVIDIA Volta V100 GPU.

| Operation | Arithmetic Intensity | Usually limited by |
| --- | --- | --- |
| Linear layer (4096 outputs, 1024 inputs, batch size 512) | 315 FLOPS/B | arithmetic |
| Linear layer (4096 outputs, 1024 inputs, batch size 1) | 1 FLOPS/B | memory |
| Max pooling with 3x3 window and unit stride | 2.25 FLOPS/B | memory |
| ReLU activation | 0.25 FLOPS/B | memory |
| Layer normalization | < 10 FLOPS/B | memory |

As the table illustrates, many common operations have low arithmetic intensities - sometimes only performing a single operation per two-byte element read from and written to memory.
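The two linear-layer rows can be reproduced by counting GEMM FLOPs and FP16 bytes, assuming each operand is read or written exactly once (the same simplifying assumption discussed later in this section):

```python
def linear_layer_arithmetic_intensity(batch: int, inputs: int, outputs: int,
                                      bytes_per_element: int = 2) -> float:
    """Arithmetic intensity of a dense layer treated as a batch x inputs x outputs GEMM,
    assuming activations, weights and outputs are each accessed once (FP16 = 2 bytes)."""
    flops = 2 * batch * inputs * outputs                      # multiply + add per weight use
    bytes_accessed = bytes_per_element * (batch * inputs      # input activations
                                          + inputs * outputs  # weights
                                          + batch * outputs)  # output activations
    return flops / bytes_accessed

print(linear_layer_arithmetic_intensity(512, 1024, 4096))  # ~315 FLOPS/B -> math-limited on V100
print(linear_layer_arithmetic_intensity(1, 1024, 4096))    # ~1 FLOPS/B   -> memory-limited
```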

This type of analysis is a simplification, as it counts only the algorithmic operations used. In practice, functions also contain instructions for operations not explicitly expressed in the algorithm, such as memory access instructions, address calculation instructions, control flow instructions, and so on.

The arithmetic intensity and ops:byte ratio analysis assumes that a workload is sufficiently large to saturate a given processor’s math and memory pipelines.

However, if the workload is not large enough, or does not have sufficient parallelism, the processor will be under-utilised, and performance will be limited by latency.

For example, consider the launch of a single thread that will access 16 bytes and perform 16,000 math operations.

While the arithmetic intensity is 1000 FLOPS/B and the execution should be math-limited on a V100 GPU, creating only a single thread grossly under-utilises the GPU, leaving nearly all of its math pipelines and execution resources idle.

Furthermore, the arithmetic intensity calculation assumes that inputs and outputs are accessed from memory exactly once.

It is not unusual for algorithm implementations to read input elements multiple times, which would effectively reduce arithmetic intensity. Thus, the arithmetic intensity is a first-order approximation; profiler information should be used if more accurate analysis is needed.

DNN Operation Categories

Deep Neural Networks (DNNs) utilize various layers, categorized based on their computational characteristics:

Elementwise Operations

These operations include unary and binary functions that apply a mathematical operation independently to each element of the tensor. Examples include ReLU, sigmoid, and addition. These are generally memory-limited because they perform a relatively small number of operations per byte of data accessed.

Reduction Operations

Reduction operations generate outputs by aggregating over ranges of inputs, such as pooling layers or batch normalization. These typically have low arithmetic intensities and are usually memory-limited due to their operational nature.

Dot-Product Operations

This category covers operations expressed as dot products between tensors, including fully-connected layers and convolutions. These operations can be either math-limited or memory-limited, depending on the size of the matrices involved. Large matrix operations tend to be math-limited, while smaller ones might be memory-limited.

Figure 4. Diagrammatic Representation of Dot-Product Operations

Optimizing GPU Utilisation

Optimizing GPU performance involves adjusting the scale of thread operations and maximizing the use of the GPU’s mathematical and memory handling capabilities. Understanding these categories helps in designing efficient neural networks that make the best use of the available hardware resources.

The structured process below assesses functions and models GPU requirements for deep learning or other GPU-intensive applications. It outlines how to determine what limits the performance of a particular function on a given GPU and what might be done to address those limitations.

Summary of Assessment Process for GPU Requirements

Step 1: Understand GPU Specifications

  • Number of SMs: Look up the number of Streaming Multiprocessors (SMs) on the GPU. This will give you an indication of the parallel processing power of the GPU.

  • Ops:Byte Ratio: Determine the ops:byte ratio for the GPU (its math bandwidth divided by its memory bandwidth). This ratio helps in understanding the balance between computational power and memory bandwidth.

Step 2: Compute Arithmetic Intensity

  • Arithmetic Intensity Calculation: Compute the arithmetic intensity of the algorithm, which is the ratio of the number of operations (FLOPs) to the number of bytes accessed. This measure helps determine whether the algorithm is compute-heavy or memory-heavy.

Step 3: Estimate GPU Utilisation

  • Parallelism Assessment: Determine if there is sufficient parallelism to effectively utilize the GPU. This involves estimating the number and size of thread blocks:

    • If the number of thread blocks is at least roughly four times higher than the number of SMs and each thread block consists of a few hundred threads, then there is likely sufficient parallelism.

    • Insufficient thread blocks or threads per block may indicate that the GPU will not be fully utilized.

Step 4: Reference Specific Guides for Optimization

  • Layer-Specific Optimization Guides: Consult NVIDIA's specific optimization guides based on the type of layers or operations you are using. For example:

    • Linear/Fully-Connected Layers: Look for techniques in the NVIDIA Optimizing Linear/Fully-Connected Layers User's Guide.

    • Convolutional Layers: Refer to the NVIDIA Optimizing Convolutional Layers User's Guide.

    • Recurrent Layers: Check the NVIDIA Optimizing Recurrent Layers User's Guide.

    • Memory-Bound Layers: While typically memory-limited, you can find useful tips in the NVIDIA Optimizing Memory-Bound Layers User's Guide.

Step 5: Determine the Performance Limiter

  • Identifying Limiters: Based on the arithmetic intensity and parallelism, determine the most likely performance limiter:

    • Latency: If there is not sufficient parallelism, latency due to inadequate utilization of computational resources is likely the limiter.

    • Math: If there is sufficient parallelism and the arithmetic intensity is higher than the GPU's ops:byte ratio, then the performance is likely math-limited.

    • Memory: If there is sufficient parallelism and the arithmetic intensity is lower than the GPU's ops:byte ratio, then the performance is likely memory-limited.
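The five steps can be folded into a single illustrative helper. The thresholds (roughly four waves of blocks, a few hundred threads per block) mirror the heuristics above; this is a sketch, not a substitute for profiling:

```python
def assess_kernel(ops: float, bytes_accessed: float,
                  thread_blocks: int, threads_per_block: int,
                  num_sms: int, math_bw_flops: float, mem_bw_bytes: float) -> str:
    """Return the likely performance limiter, following Steps 1-5 above."""
    # Step 2: arithmetic intensity of the algorithm
    arithmetic_intensity = ops / bytes_accessed
    # Step 1: ops:byte ratio of the GPU
    ops_byte_ratio = math_bw_flops / mem_bw_bytes
    # Step 3: crude parallelism check (~4 waves of blocks, a few hundred threads each)
    enough_parallelism = thread_blocks >= 4 * num_sms and threads_per_block >= 128
    # Step 5: pick the limiter
    if not enough_parallelism:
        return "latency-limited (insufficient parallelism)"
    return "math-limited" if arithmetic_intensity > ops_byte_ratio else "memory-limited"

# V100-like device: 80 SMs, 125 TFLOPS FP16 Tensor math, 900 GB/s memory bandwidth
print(assess_kernel(4.3e9, 13.6e6, 512, 256, 80, 125e12, 900e9))  # math-limited
print(assess_kernel(16e3, 16, 1, 1, 80, 125e12, 900e9))           # latency-limited
```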

Conclusion

This process helps in systematically evaluating and optimizing the performance of various functions on GPUs, especially for deep learning applications.

By understanding the interplay between hardware specifications, algorithm characteristics, and execution models, developers can better harness the computational capabilities of GPUs.

Reference: NVIDIA Optimizing Linear/Fully-Connected Layers User's Guide (https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html)
Figure 1. Simplified view of the GPU architecture
Figure 2. Multiply-add operations per clock per SM
Figure 3. Utilisation of an 8-SM GPU when 12 thread blocks with an occupancy of 1 block/SM at a time are launched for execution.