The NVIDIA H100 GPU


The H100 is a graphics processing unit (GPU) chip manufactured by Nvidia.

It is currently the most powerful GPU chip on the market and is designed specifically for artificial intelligence (AI) applications, with each chip costing around US$75,000.

The H100 is in high demand due to its performance and its ability to accelerate AI workloads.

Technological Foundations and Architecture

Hopper Architecture

The H100 is built around the Hopper architecture - the latest GPU architecture developed by NVIDIA, named after the renowned computer scientist Grace Hopper.

It succeeds the previous Ampere architecture and introduces various improvements and new features to enhance performance and efficiency.

The Hopper architecture represents a significant step forward in GPU technology, enabling faster processing, improved memory bandwidth, and more advanced features compared to its predecessor.

It lays the foundation for the H100 GPU to deliver exceptional performance across a wide range of AI, high performance computing, and data analytics workloads.

TSMC's Custom 4N Process

TSMC (Taiwan Semiconductor Manufacturing Company) is a leading semiconductor foundry that manufactures chips for various companies, including NVIDIA.

The custom 4N process is a specialised manufacturing process developed by TSMC specifically for NVIDIA's H100 GPU, optimised for high performance and energy efficiency.

The custom 4N process allows NVIDIA to pack more transistors into a smaller area, enabling higher clock speeds and improved power efficiency compared to previous manufacturing processes.

This advanced process technology is critical for achieving the H100's performance.

Transistor Count and Die Size

Transistors are the basic building blocks of modern electronics, and the number of transistors on a chip is a key indicator of its complexity and potential performance. The die size refers to the physical dimensions of the chip.

The H100 GPU features an astonishing 80 billion transistors, which is a massive increase compared to the previous generation A100 GPU, which had 54.2 billion transistors.

To put this into perspective, 80 billion is roughly ten times the number of people on Earth.

The large 814 mm² die size enables the integration of more processing units and memory interfaces, contributing to the H100's exceptional performance.

Streaming Multiprocessors (SMs)

SMs (Streaming Multiprocessors) are the primary processing units within an NVIDIA GPU, responsible for executing parallel threads and performing complex computations.

They are loosely analogous to the cores in a CPU, but are designed specifically for massively parallel processing.

SMs are composed of several types of arithmetic units, including INT32 units for mixed-precision integer operations, FP32 units (also known as CUDA cores) for single-precision floating-point operations, and FP64 units for double-precision floating-point operations.

What are these different data types?

The terms INT32, FP32, and FP64 refer to different data types and their corresponding arithmetic units within the GPU architecture.

INT32

  • INT32 stands for 32-bit integer data type.

  • It represents whole numbers (positive, negative or zero) within the range of −2^31 to 2^31 − 1.

  • The smallest value for a 32-bit integer is −2^31, which equals −2,147,483,648.

  • The largest value for a 32-bit integer is 2^31 − 1, which equals 2,147,483,647.

  • INT32 units in GPUs are designed to perform integer arithmetic operations, such as addition, subtraction, multiplication, and division on 32-bit integers.

  • These units are particularly useful for tasks that require precise integer calculations, such as indexing, addressing, and certain computer graphics operations.

FP32 (CUDA cores)

  • FP32 stands for 32-bit floating-point data type, also known as single-precision floating-point.

  • It follows the IEEE 754 standard for floating-point arithmetic.

  • FP32 numbers have a significand (mantissa) of 24 bits, an exponent of 8 bits, and a sign bit, allowing for a wide range of representable values.

  • FP32 units, often referred to as CUDA cores in NVIDIA GPUs, are specialised arithmetic units designed to perform single-precision floating-point operations.

  • These units are crucial for many computational tasks, including graphics rendering, scientific simulations, and machine learning, where high precision is not always necessary, and the focus is on performance and memory efficiency.

FP64

  • FP64 stands for 64-bit floating-point data type, also known as double-precision floating-point.

  • It also follows the IEEE 754 standard but provides higher precision and a wider range of representable values compared to FP32.

  • FP64 numbers have a significand of 53 bits, an exponent of 11 bits, and a sign bit.

  • FP64 units in GPUs are designed to perform double-precision floating-point arithmetic operations.

  • These units are essential for scientific and engineering applications that require high accuracy, such as computational fluid dynamics, finite element analysis, and certain deep learning models.

The presence of INT32, FP32, and FP64 units within the Streaming Multiprocessors (SMs) of a GPU allows for efficient execution of various types of arithmetic operations.

The SMs are the fundamental processing units of a GPU, and they contain multiple arithmetic units working in parallel to achieve high throughput.

The ratio and number of these arithmetic units within an SM can vary depending on the GPU architecture and the intended target applications.

For example, GPUs designed for gaming and graphics rendering may have a higher ratio of FP32 units to FP64 units, as single-precision is often sufficient for these tasks.

On the other hand, GPUs aimed at scientific computing and simulation may have a more balanced ratio or even a higher number of FP64 units to cater to the demands of double-precision computations.

The availability of these different arithmetic units allows developers to optimise their algorithms and computations based on the specific requirements of their applications.

By carefully mapping the computational tasks to the appropriate data types and using the corresponding arithmetic units, developers can achieve optimal performance and efficiency on GPUs.
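To make this precision trade-off concrete, here is a minimal, hypothetical CUDA sketch (not taken from NVIDIA documentation): the same a·x + y computation is instantiated once in FP32 and once in FP64, so each launch exercises the corresponding execution units described above. All names and sizes are illustrative.

```cuda
// Illustrative sketch: the same computation expressed in FP32 and FP64,
// so it maps onto the GPU's FP32 (CUDA core) or FP64 execution units.
#include <cstdio>
#include <cuda_runtime.h>

template <typename T>
__global__ void axpy(int n, T a, const T* x, T* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];  // runs on FP32 or FP64 units depending on T
}

int main() {
    const int n = 1 << 20;
    float  *xf, *yf;
    double *xd, *yd;
    cudaMallocManaged(&xf, n * sizeof(float));
    cudaMallocManaged(&yf, n * sizeof(float));
    cudaMallocManaged(&xd, n * sizeof(double));
    cudaMallocManaged(&yd, n * sizeof(double));
    for (int i = 0; i < n; ++i) { xf[i] = 1.0f; yf[i] = 2.0f; xd[i] = 1.0; yd[i] = 2.0; }

    // Single precision: high throughput, sufficient for graphics and much of machine learning.
    axpy<<<(n + 255) / 256, 256>>>(n, 2.0f, xf, yf);
    // Double precision: lower throughput, needed where numerical accuracy is critical.
    axpy<<<(n + 255) / 256, 256>>>(n, 2.0, xd, yd);
    cudaDeviceSynchronize();

    printf("FP32 result: %f, FP64 result: %f\n", yf[0], yd[0]);
    cudaFree(xf); cudaFree(yf); cudaFree(xd); cudaFree(yd);
    return 0;
}
```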

In NVIDIA's consumer graphics GPUs, each SM contains CUDA Cores (for general-purpose computing), Tensor Cores (for AI and deep learning), and RT Cores (for ray tracing); the data-centre-focused H100 SM is built around CUDA Cores and Tensor Cores.

The H100 GPU boasts up to 132 SMs, a significant increase from the A100's 108 SMs.

Each SM in the H100 is equipped with 128 FP32 CUDA Cores and 4 fourth-generation Tensor Cores.

What are Ray Tracing Cores (RT Cores)?

Ray tracing cores (RT Cores) are dedicated hardware units within NVIDIA GPUs that are specifically designed to accelerate ray tracing operations.

Ray tracing is a rendering technique that simulates the physical behavior of light by tracing the path of rays as they interact with objects in a virtual scene.

RT Cores work in conjunction with NVIDIA's RTX software to enable real-time ray tracing in graphics applications.

Practical applications of RT Cores and real-time ray tracing include:

Photorealistic rendering: RT Cores allow developers to create highly realistic scenes with physically accurate lighting, shadows, reflections, and global illumination. This enhances the visual fidelity of games, architectural visualisations, product designs, and other graphics-intensive applications.

Real-time global illumination: NVIDIA's RTX Global Illumination (RTXGI) technology leverages RT Cores to provide multi-bounce indirect lighting in real-time without the need for time-consuming baking processes or expensive per-frame computations.

Real-time dynamic illumination: RTX Dynamic Illumination (RTXDI) uses RT Cores to enable the rendering of millions of dynamic lights in real-time, enhancing the realism of night and indoor scenes with a high number of light sources.

Improved performance and scalability: RT Cores offload ray tracing computations from the main GPU cores, allowing for efficient execution of ray tracing operations. This enables developers to incorporate ray tracing into their applications while maintaining real-time performance, even with limited rays per pixel.

Integration with AI-based techniques: RT Cores can be used in combination with AI-based techniques like NVIDIA DLSS (Deep Learning Super Sampling) to further improve performance and image quality. DLSS uses deep learning to reconstruct higher-resolution images from lower-resolution inputs, reducing the computational burden on the GPU.

Enhanced visual effects: Ray tracing enables the creation of realistic visual effects such as accurate reflections, refractions, and translucency. RT Cores accelerate these computations, allowing developers to incorporate these effects into their applications without sacrificing performance.

In summary, RT Cores are specialised hardware units in NVIDIA GPUs that accelerate ray tracing operations, enabling real-time ray tracing in various graphics applications.

They offer benefits such as photorealistic rendering, real-time global illumination, dynamic lighting, improved performance and scalability, integration with AI-based techniques, and enhanced visual effects.

RT Cores have practical applications in gaming, architectural visualisation, product design, and other domains where high-quality, real-time graphics are essential.

This substantial increase in the number and capability of the SMs allows the H100 to handle more complex and demanding workloads with improved efficiency.

The picture below shows the SM architecture. The smallest boxes labelled "INT32", "FP32" or "FP64" represent the hardware 'execution units' that perform single 32-bit integer, 32-bit floating point or 64-bit floating point operations.

In this case there are thirty-two 32-bit floating point execution units, and sixteen each of the 32-bit integer and 64-bit floating point execution units, per processing block within the SM.

You can see a large green box called Tensor Core. We will discuss that later on.

Fourth-Generation Tensor Cores

Tensor Cores are a key component of modern NVIDIA GPUs, and they work in conjunction with the other arithmetic units within the SMs (Streaming Multiprocessors) to deliver high-performance computing capabilities.

Tensor Cores are processing units designed for accelerating matrix multiplication and convolution operations, which are the foundation of deep learning and AI algorithms.

The fourth-generation Tensor Cores in the H100 introduce support for new precision formats and offer higher performance compared to the previous generation.

The H100's fourth-generation Tensor Cores deliver up to 6x higher performance than the A100's Tensor Cores, enabling faster training and inference of AI models.

They support a wide range of precision formats, including FP8, FP16, bfloat16, TF32, and FP64, allowing users to choose the optimal balance between precision and performance for their specific workloads.

Dynamic Programming Acceleration with DPX Instructions

Dynamic Programming (DP) is an algorithmic technique that solves complex problems by breaking them down into simpler sub-problems.

The solutions to these sub-problems are stored and reused, reducing the overall computational complexity.

The H100 GPU introduces DPX instructions, which accelerate the performance of DP algorithms by up to 7 times compared to the previous generation NVIDIA Ampere GPUs.

DPX instructions provide support for advanced fused operands in the inner loop of many DP algorithms, resulting in faster times-to-solution for applications in fields like genomics, robotics (for example, optimal route finding), and graph analytics.

H100 GPU hierarchy and asynchrony improvements

Two essential keys to achieving high performance in parallel programs are data locality and asynchronous execution.

By moving program data as close as possible to the execution units, a programmer can exploit the performance that comes from having lower latency and higher bandwidth access to local data.

Asynchronous execution involves finding independent tasks to overlap with memory transfers and other processing. The goal is to keep all the units in the GPU fully used.

The NVIDIA Hopper architecture adds an important new tier to the GPU programming hierarchy that exposes locality at a scale larger than a single thread block on a single SM.

Thread block clusters

The H100 introduces a new thread block cluster architecture that exposes control of locality at a granularity larger than a single thread block on a single SM.

A cluster is a group of thread blocks that are guaranteed to be concurrently scheduled onto a group of streaming multiprocessors (SMs) that are physically close together.

This allows for efficient cooperation of threads across multiple SMs.

The CUDA programming model has long relied on a GPU compute architecture that uses grids containing multiple thread blocks to leverage locality in a program.

Thread block clusters extend the CUDA programming model and add another level to the GPU’s physical programming hierarchy to include threads, thread blocks, thread block clusters, and grids.

The clusters in H100 run concurrently across SMs within a GPC (GPU processing cluster). In CUDA, thread blocks in a grid can optionally be grouped into clusters at kernel launch, and cluster capabilities can be accessed through the CUDA API, as sketched below.
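As a rough illustration of how this looks in code, the hedged sketch below launches a hypothetical kernel with a cluster size of two thread blocks, using the runtime launch attributes introduced with CUDA 12 (cudaLaunchKernelEx and cudaLaunchAttributeClusterDimension). The kernel name, grid shape and data are illustrative only.

```cuda
// Hedged sketch of a thread block cluster launch (CUDA 12+, Hopper).
#include <cuda_runtime.h>

__global__ void scale_kernel(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    float* d;
    cudaMalloc(&d, 8 * 128 * sizeof(float));

    cudaLaunchConfig_t config = {};
    config.gridDim  = dim3(8);    // total thread blocks; must be divisible by the cluster size
    config.blockDim = dim3(128);

    // Request clusters of 2 thread blocks; blocks in a cluster are co-scheduled on nearby SMs.
    cudaLaunchAttribute attr[1];
    attr[0].id = cudaLaunchAttributeClusterDimension;
    attr[0].val.clusterDim.x = 2;
    attr[0].val.clusterDim.y = 1;
    attr[0].val.clusterDim.z = 1;
    config.attrs    = attr;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, scale_kernel, d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

A fixed cluster shape can alternatively be declared on the kernel at compile time with the __cluster_dims__ attribute.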

A detailed explanation of Cooperative Groups in CUDA

Cooperative Groups is an extension to the CUDA programming model that allows developers to organise groups of threads that can communicate and synchronise with each other.

It provides a way to express the granularity at which threads are cooperating, enabling richer and more efficient parallel patterns.

Traditionally, CUDA programming relied on a single construct for synchronising threads: the __syncthreads() function, which creates a barrier across all threads within a thread block.

However, developers often needed more flexibility to define and synchronise groups of threads at different granularities to achieve better performance and design flexibility.

Relation to the GPU

In CUDA, the GPU executes threads in groups called warps (typically 32 threads per warp).

Warps are further organised into thread blocks, which can contain multiple warps. Thread blocks are then grouped into a grid, which represents the entire kernel launch.

Cooperative Groups allows developers to work with these different levels of thread hierarchy and define their own groups of threads for synchronisation and communication purposes.

It provides a way to express the cooperation among threads at the warp level, thread block level, or even across multiple thread blocks.

Key Concepts in Cooperative Groups

Thread Groups: Cooperative Groups introduces the concept of thread groups, which are objects that represent a set of threads that can cooperate. Thread groups can be created based on the existing CUDA thread hierarchy (e.g., thread blocks) or by partitioning larger groups into smaller subgroups.

Synchronisation: Cooperative Groups provides synchronisation primitives, such as group-wide barriers (e.g., thread_group::sync()), which allow threads within a group to synchronise and ensure that all threads have reached a certain point before proceeding.

Collective Operations: Cooperative Groups supports collective operations, such as reductions (e.g., reduce()) and scans (e.g., exclusive_scan()), which perform operations across all threads in a group. These operations can take advantage of hardware acceleration on supported devices.

Partitioning: Cooperative Groups allows partitioning larger thread groups into smaller subgroups using operations like tiled_partition(). This enables developers to divide the workload among subgroups and perform more fine-grained synchronisation and communication.

Optimisation with Cooperative Groups

Cooperative Groups can be used to optimise CUDA code in several ways:

Improved Synchronisation: By using group-wide synchronisation primitives instead of global barriers, developers can minimise the synchronisation overhead and avoid unnecessary waiting for threads that are not involved in a particular operation.

Data Locality: Cooperative Groups allows developers to express data locality by partitioning thread groups and assigning specific tasks to subgroups. This can lead to better cache utilisation and reduced memory access latency.

Parallel Reduction: Collective operations like reduce() can be used to perform efficient parallel reductions within a group of threads. This can significantly speed up operations that involve aggregating values across threads.

Fine-Grained Parallelism: By partitioning thread groups into smaller subgroups, developers can exploit fine-grained parallelism and distribute the workload more effectively across the available GPU resources.

Warp-Level Primitives: Cooperative Groups provides warp-level primitives, such as shuffle operations, which enable efficient communication and data exchange among threads within a warp. These primitives can be used to optimise algorithms that require data sharing and collaboration among threads.

Example Use Case

Let's consider a scenario where you have a large array of data, and you want to perform a parallel reduction to calculate the sum of all elements.

With Cooperative Groups, you can partition the thread block into smaller subgroups (e.g., tiles of 32 threads) and perform the reduction within each subgroup using the reduce() operation.

Then, you can further reduce the partial sums from each subgroup to obtain the final result. This approach can lead to faster and more efficient parallel reduction compared to a naïve implementation that relies solely on global barriers.
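A minimal sketch of that pattern is shown below, using Cooperative Groups' tiled_partition() and reduce(). For brevity it combines the per-tile partial sums with an atomicAdd rather than a second reduction pass; all names and sizes are illustrative.

```cuda
// Hedged sketch: per-tile parallel reduction with Cooperative Groups (CUDA 11+).
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

__global__ void block_sum(const float* in, float* out, int n) {
    cg::thread_block block = cg::this_thread_block();
    // Partition the thread block into tiles of 32 threads (one warp each).
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;

    // Reduce within the tile using warp-level primitives; no __syncthreads() needed.
    float tile_sum = cg::reduce(tile, val, cg::plus<float>());

    // One thread per tile contributes its partial sum to the global result.
    if (tile.thread_rank() == 0) atomicAdd(out, tile_sum);
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    block_sum<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %.0f (expected %d)\n", *out, n);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```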

Conclusion

Cooperative Groups is a powerful extension to the CUDA programming model that enables developers to express thread cooperation and synchronisation at different granularities.

By leveraging the concepts of thread groups, synchronisation primitives, and collective operations, developers can optimise their CUDA code for better performance, data locality, and parallel efficiency.

Cooperative Groups allows for fine-grained control over thread collaboration, enabling more sophisticated parallel patterns and algorithms on the GPU.

It's important to note that the effective use of Cooperative Groups requires careful consideration of the problem at hand and the specific GPU architecture being targeted.

Developers should experiment with different group sizes, partitioning strategies, and collective operations to find the optimal configuration for their specific use case.

Overall, Cooperative Groups provides a flexible and expressive way to harness the power of GPU parallelism, allowing developers to write more efficient and scalable CUDA code.

In the legacy CUDA programming model, as on the A100, a grid is composed directly of thread blocks, shown in the left half of the diagram. The NVIDIA Hopper architecture adds an optional cluster hierarchy, shown in the right half of the diagram.

Asynchronous Execution Enhancements

The H100 introduces new features to improve asynchronous execution and enable further overlap of memory copies with computation and other independent work:

Tensor Memory Accelerator (TMA)

The TMA is a new unit that efficiently transfers large blocks of data and multidimensional tensors between global memory and shared memory. It reduces addressing overhead and improves efficiency by supporting different tensor layouts, memory access modes, and reductions.

Asynchronous Transaction Barrier

This new type of barrier counts both thread arrivals and transactions (byte counts).

It allows threads to sleep until all other threads arrive and the sum of all transaction counts reaches an expected value, enabling more efficient synchronisation for asynchronous memory copies and data exchanges.
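These hardware features build on an asynchronous-copy programming pattern that is already exposed to developers. The sketch below illustrates the general idea using the Cooperative Groups memcpy_async/wait API (available since Ampere); it is an illustration of the pattern rather than of the Hopper-specific TMA or transaction-barrier hardware, and all names and sizes are hypothetical.

```cuda
// Hedged sketch: asynchronously stage a tile from global into shared memory,
// then wait on it before use. Hopper's TMA and transaction barriers accelerate
// this same pattern in hardware.
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void staged_scale(const float* in, float* out, int n) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();
    int base = blockIdx.x * 256;

    // Start an asynchronous copy of one 256-element tile into shared memory.
    cg::memcpy_async(block, tile, in + base, sizeof(float) * 256);

    // ...independent work could be overlapped with the copy here...

    // Block until the asynchronous copy has completed.
    cg::wait(block);

    int i = block.thread_rank();
    if (base + i < n) out[base + i] = tile[i] * 2.0f;
}

int main() {
    const int n = 1 << 12;            // a multiple of the 256-element tile size
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    staged_scale<<<n / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[10] = %f\n", out[10]);  // expect 20.0

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```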

HBM3 Memory

HBM (High Bandwidth Memory) is a type of high-performance memory that offers higher bandwidth and lower power consumption than traditional GDDR memory, and HBM3 is the latest generation of this technology.

The H100 GPU comes with up to 80GB of HBM3 memory, double the capacity of the original A100's 40GB of HBM2 memory.

HBM3 provides a memory bandwidth of up to 3 TB/s, enabling fast data transfer between the GPU and memory, which is crucial for memory-intensive workloads like AI training and scientific simulations.

To put this into perspective, 3 TB/s is equivalent to transferring the entire content of a 1TB hard drive in just 0.33 seconds.

Multi-Instance GPU (MIG)

MIG was a feature first introduced in NVIDIA's Ampere architecture and further enhanced in the Hopper architecture.

It allows a single physical GPU to be partitioned into multiple isolated instances, each with its own dedicated resources, such as compute units, memory, and cache.

With MIG, the H100 GPU can be divided into up to seven independent instances, providing flexibility and efficiency in resource allocation.

This feature is particularly useful in cloud and data centre environments, where multiple users or applications can share a single GPU, ensuring optimal utilisation and predictable performance.

MIG enables better resource management, improved security, and increased versatility in deploying GPU-accelerated workloads.

Let's review the specifications of the NVIDIA H100 GPU

FP32 Performance

FP32 performance measures the GPU's ability to perform single-precision arithmetic operations, commonly used in scientific simulations, computer graphics, and some AI workloads.

TFLOPS (Tera Floating-Point Operations Per Second) is a unit that represents the number of floating-point operations the GPU can perform per second.

The H100's FP32 performance of 67 TFLOPS is a substantial increase from the A100's 19.5 TFLOPS, indicating a significant boost in raw computational power.

This high FP32 performance is important for applications that require fast and precise single-precision calculations, enabling faster execution of complex simulations and rendering tasks.
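As a back-of-the-envelope check (a sketch, assuming an SXM boost clock of roughly 1.98 GHz, which is not stated in this article), 132 SMs × 128 FP32 units × 2 FLOPs per clock (one fused multiply-add) × 1.98 GHz comes out at roughly 67 TFLOPS:

```cuda
// Rough sanity check of the quoted FP32 figure; the boost clock is an assumption.
#include <cstdio>

int main() {
    const double sms = 132;                    // H100 SXM streaming multiprocessors
    const double fp32_units_per_sm = 128;      // FP32 CUDA cores per SM
    const double flops_per_unit_per_cycle = 2; // fused multiply-add = 2 FLOPs
    const double boost_clock_hz = 1.98e9;      // assumed, not from the article

    double tflops = sms * fp32_units_per_sm * flops_per_unit_per_cycle * boost_clock_hz / 1e12;
    printf("Theoretical FP32 throughput: %.1f TFLOPS\n", tflops);  // ~66.9
    return 0;
}
```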

Tensor Performance

Tensor performance refers to the GPU's capability to perform mixed-precision matrix multiplication and convolution operations, which are the backbone of deep learning and AI workloads.

FP16, bfloat16, and FP8 are reduced-precision formats that offer higher performance and memory efficiency compared to FP32.

The H100's tensor performance is groundbreaking, with up to 1,979 TFLOPS for FP16 and bfloat16 operations and a massive 3,958 TFLOPS for FP8 operations (figures quoted with sparsity).

This represents a massive leap from the A100's tensor performance, enabling faster training and inference of large-scale AI models.

The support for lower-precision formats allows users to leverage the trade-off between precision and performance, achieving higher efficiency and faster results in AI workloads.

Memory Bandwidth

Memory bandwidth is the rate at which data can be read from or written to the GPU's memory.

It is measured in bytes per second (B/s) and is a critical factor in determining the GPU's performance in memory-intensive workloads.

As highlighted, the H100's HBM3 memory subsystem provides 3 TB/s of memory bandwidth, a significant improvement over the A100's 1.6 TB/s.

This high memory bandwidth enables faster data transfer between the GPU and its memory, reducing latency and improving overall performance.

It is particularly beneficial for workloads that involve large datasets, such as high-resolution video processing, scientific simulations, and training of complex AI models.

NVLink and PCIe Gen5

The H100 supports fourth-generation NVLink, NVIDIA's proprietary high-speed interconnect for GPU-to-GPU communication, providing up to 900 GB/s of bidirectional bandwidth between two GPUs.

This high-speed interconnect enables efficient data exchange and collaboration between GPUs for scaling up multi-GPU systems.

The H100 also supports PCIe Gen5, the standard interface for connecting GPUs to the CPU and other components in a system, offering up to 128 GB/s of bidirectional bandwidth, double that of PCIe Gen4. This increased bandwidth allows for faster data transfer between the GPU and the CPU, reducing bottlenecks and enhancing overall system performance.

TDP and Power Efficiency

TDP (Thermal Design Power) is the maximum amount of heat the GPU is expected to generate under typical workloads, and it determines the cooling requirements for the system.

Power efficiency refers to the GPU's ability to deliver high performance while consuming minimal power.

The H100's TDP ranges from 300W to 700W, depending on the specific model and cooling solution.

While the higher TDP models offer the highest performance, they also require more advanced cooling solutions.

The lower TDP models provide a balance between performance and power consumption, making them suitable for systems with limited power and cooling capacity.

NVIDIA has made significant improvements in power efficiency with the Hopper architecture, enabling the H100 to deliver better performance per watt compared to previous generations.

In summary, the NVIDIA H100 GPU's specifications demonstrate its computational power and memory capabilities, representing a massive leap over its predecessor, the A100.

Power Usage

To determine the power usage of the NVIDIA DGX H100 system, we need to consider the power specifications provided for its power supplies.

Power Specifications

  • Power Supply Units (PSUs): The system includes six power supplies.

  • Each Power Supply:

    • Wattage: 3300 W (at 200-240 V, 16 A, 50-60 Hz)

    • Total Power Consumption: 10.2 kW max

Since the system is designed for redundancy, it operates with at least four PSUs actively supplying power. To calculate the total installed power supply capacity, however, we consider all six PSUs.

Total PSU Capacity Calculation

Each PSU provides 3,300 W. With six PSUs:

Total PSU capacity = 6 × 3,300 W = 19,800 W

In kilowatts, this is:

19,800 W ÷ 1,000 = 19.8 kW

Thus, the six power supplies provide up to 19.8 kW of combined capacity, although NVIDIA specifies the system's maximum power consumption as 10.2 kW.

Considerations for Power Usage

  • PSU Redundancy: The system includes 4+2 redundancy, meaning it can continue to operate with up to two PSUs failing.

  • Operational Power: The system is designed to function efficiently with at least four active PSUs. If fewer than four PSUs are operational, the system may operate at a reduced performance level or not boot at all.

Summary

The NVIDIA DGX H100 system has a specified maximum power consumption of 10.2 kW, backed by six 3,300 W power supplies (19.8 kW of installed capacity) in a 4+2 redundant configuration.

This high power usage supports the system's advanced capabilities, including eight NVIDIA H100 GPUs, two Intel Xeon CPUs, and substantial memory and storage resources.

For more detailed information on the power specifications and other features of the NVIDIA DGX H100 system, refer to the official documentation provided by NVIDIA.

Integration with Other Technologies

The NVIDIA H100 GPU is integrated with various NVIDIA technologies and server solutions

NVIDIA AI Enterprise

  • NVIDIA AI Enterprise is a suite of software tools and frameworks optimised for AI workloads.

  • It includes deep learning frameworks, CUDA-X libraries, and NVIDIA's CUDA Toolkit.

  • The H100 GPU is fully supported by NVIDIA AI Enterprise, enabling development and deployment of AI applications.

  • NVIDIA AI Enterprise simplifies the installation, management, and scaling of AI infrastructure, making it easier for organisations to adopt and use their H100 GPUs.

NVIDIA Magnum IO

  • Magnum IO is a suite of technologies that optimise I/O performance for accelerated computing. It includes GPUDirect Storage (GDS) and GPUDirect RDMA (Remote Direct Memory Access).

  • GDS enables direct data transfer between storage and GPU memory, bypassing the CPU and reducing latency.

  • GPUDirect RDMA allows direct data transfer between GPUs across different nodes, minimising data movement overhead.

  • The H100 GPU leverages Magnum IO technologies to maximise I/O performance and efficiency in multi-node environments.

NVIDIA Quantum InfiniBand

  • NVIDIA Quantum InfiniBand is a high-performance, low-latency networking solution for AI and HPC workloads. It enables fast communication between GPUs across multiple nodes, allowing efficient scaling of AI and HPC applications.

  • The H100 GPU can be used with NVIDIA Quantum InfiniBand to create high-speed, low-latency clusters for distributed AI and HPC workloads.

  • NVIDIA Quantum InfiniBand's high bandwidth and low latency complement the H100's computational power, enabling efficient scaling and faster time-to-solution.

NVIDIA NVLink Switch System

  • The NVLink Switch System is a high-speed interconnect that allows multiple GPUs to communicate directly with each other, enabling up to 256 H100 GPUs to be connected in a high-speed, low-latency fabric.

  • The NVLink Switch System provides a scalable infrastructure for AI and HPC workloads, allowing seamless scaling of H100 GPU clusters.

  • With the NVLink Switch System, H100 GPUs can efficiently exchange data and collaborate on large-scale AI and HPC tasks, achieving massive parallelism and performance.

NVIDIA H100 Tensor Core GPU performance specifications

| Specification | H100 SXM | H100 PCIe | H100 NVL |
| --- | --- | --- | --- |
| FP64 | 34 teraFLOPS | 26 teraFLOPS | 68 teraFLOPS |
| FP64 Tensor Core | 67 teraFLOPS | 51 teraFLOPS | 134 teraFLOPS |
| FP32 | 67 teraFLOPS | 51 teraFLOPS | 134 teraFLOPS |
| TF32 Tensor Core* | 989 teraFLOPS | 756 teraFLOPS | 1,979 teraFLOPS |
| BFLOAT16 Tensor Core* | 1,979 teraFLOPS | 1,513 teraFLOPS | 3,958 teraFLOPS |
| FP16 Tensor Core* | 1,979 teraFLOPS | 1,513 teraFLOPS | 3,958 teraFLOPS |
| FP8 Tensor Core* | 3,958 teraFLOPS | 3,026 teraFLOPS | 7,916 teraFLOPS |
| INT8 Tensor Core* | 3,958 TOPS | 3,026 TOPS | 7,916 TOPS |
| GPU memory | 80GB | 80GB | 188GB |
| GPU memory bandwidth | 3.35TB/s | 2TB/s | 7.8TB/s |
| Decoders | 7 NVDEC, 7 JPEG | 7 NVDEC, 7 JPEG | 14 NVDEC, 14 JPEG |
| Max thermal design power (TDP) | Up to 700W (configurable) | 300-350W (configurable) | 2x 350-400W (configurable) |
| Multi-instance GPUs | Up to 7 MIGs @ 10GB each | Up to 7 MIGs @ 10GB each | Up to 14 MIGs @ 12GB each |
| Form factor | SXM | PCIe, dual-slot, air-cooled | 2x PCIe, dual-slot, air-cooled |
| Interconnect | NVLink: 900GB/s, PCIe Gen5: 128GB/s | NVLink: 600GB/s, PCIe Gen5: 128GB/s | NVLink: 600GB/s, PCIe Gen5: 128GB/s |
| Server options | NVIDIA HGX H100 partner and NVIDIA-Certified Systems with 4 or 8 GPUs; NVIDIA DGX H100 with 8 GPUs | Partner and NVIDIA-Certified Systems with 1-8 GPUs | Partner and NVIDIA-Certified Systems with 2-4 pairs |
| NVIDIA AI Enterprise | Add-on | Included | Included |

*Tensor Core figures are quoted with sparsity enabled.


  • H100 SXM: Uses NVIDIA's SXM socketed module form factor, designed for HGX and DGX server boards.

  • H100 PCIe: Uses the PCIe (Peripheral Component Interconnect Express) form factor with a dual-slot, air-cooled design.

  • H100 NVL: Uses two PCIe cards with a dual-slot, air-cooled design.

Explanation of the IEEE 754 standard

The IEEE 754 standard, officially known as IEEE Standard for Floating-Point Arithmetic, is a technical standard for floating-point computation established by the Institute of Electrical and Electronics Engineers (IEEE).

The standard was first published in 1985 and has since undergone revisions to include more features and accommodate advancements in computing technology. A major revision was published in 2008 (commonly referred to as IEEE 754-2008), followed by a further revision in 2019.

Purpose of IEEE 754

The primary goal of IEEE 754 is to provide a uniform standard for floating-point arithmetic.

Before this standard, many different floating-point implementations could lead to discrepancies in calculations across different systems. This variability was problematic, especially for applications requiring consistent and reliable results, such as scientific computations.

IEEE 754 addresses these issues by defining:

Formats for Number Representation

  • Binary formats: These include single precision (32-bit), double precision (64-bit), and extended precision (which can be 80-bit or more).

  • Decimal formats: Introduced in the 2008 revision, these are useful in financial computations where decimal rounding precision is required.

Arithmetic Operations

  • The standard specifies the results of arithmetic operations like addition, subtraction, multiplication, division, square root, and remainder. It also includes rounding rules and handling of exceptional cases like division by zero and overflow.

Rounding Rules

  • IEEE 754 defines several rounding modes: round to nearest value, towards zero, and towards positive or negative infinity. This is crucial for ensuring that floating-point operations can be consistently replicated across different computing platforms and environments.

Handling of Special Cases

  • The standard provides a detailed mechanism for dealing with special values like infinity (positive and negative), NaNs (Not a Number), and denormalised numbers. Handling these special cases ensures that the floating-point operations do not crash unexpectedly and provide meaningful outputs even under exceptional conditions.

Key Features of IEEE 754

  • Normalised and Denormalised Numbers: Normalised numbers have a normalised mantissa where the most significant digit is non-zero. Denormalised numbers allow for representation of numbers closer to zero than would otherwise be possible with normalised representations.

  • Special Numbers:

    • Infinities: Positive and negative infinities are used to represent results of operations that exceed the maximum representable value.

    • NaN (Not a Number): Used to represent undefined or unrepresentable values, such as 0/0 or the square root of a negative number.

    • Zero: IEEE 754 makes a distinction between positive and negative zeros, which can be relevant in certain mathematical operations.

Impact of IEEE 754

The adoption of IEEE 754 has had a profound impact on the reliability and portability of software:

  • Consistency: Programs that use IEEE 754 floating-point arithmetic can expect consistent results across compliant systems, crucial for software portability and reproducibility of results.

  • Optimisation: Hardware manufacturers have optimised their processors to efficiently handle IEEE 754 operations, leading to improved performance of applications relying on floating-point calculations.

  • Software Development: The clear rules and definitions provided by IEEE 754 have simplified the development of numerical applications, as developers can rely on standardized behavior of floating-point arithmetic.

Overall, IEEE 754 continues to be a fundamental standard in the computing world, particularly valuable in fields like scientific computing, engineering, and finance, where precision and correctness of numerical computations are critical.

The components of the floating-point representation of numbers

The terms mantissa, exponent, and sign bit are components of the floating-point representation of numbers, as defined by the IEEE 754 standard for floating-point arithmetic.

Each of these components plays a role in how numbers are stored and processed in computers, particularly in the area of scientific calculations where a wide range of values and precision is necessary.

Let’s explore each of these components:

Mantissa (Significand)

The mantissa, also known as the significand, is the part of a floating-point number that contains its significant digits. In the context of the binary systems used in computers:

  • Function: The mantissa represents the precision of the number and essentially carries the "actual" digits of the number. For a given floating-point number, the mantissa represents the number's digits in a scaled format.

  • Details: In a normalised floating-point number, the mantissa is adjusted so the first digit is always a 1 (except in the case of denormalised numbers, where it can be 0). This digit is not stored (known as the "hidden bit" technique) in many floating-point representations to save space, effectively giving an extra bit of precision.

Exponent

The exponent in a floating-point representation scales the mantissa to provide a very large or very small range of values.

  • Function: The exponent determines the scale of the number, effectively shifting the decimal (or binary) point to the right or left. This allows floating-point formats to represent very large or very small numbers compactly.

  • Details: The exponent is stored with a bias in binary systems. For example, an 8-bit exponent in IEEE 754 single-precision floating-point format is stored with a bias of 127. The actual exponent value is calculated by subtracting 127 from the stored exponent value. This bias allows the representation of both positive and negative exponents.

Sign Bit

The sign bit is the simplest of the three; it indicates the sign of the number.

  • Function: It tells whether the number is positive or negative.

  • Details: In floating-point formats, a sign bit of 0 usually represents a positive number, and a sign bit of 1 represents a negative number.

Example in Context:

Consider a 32-bit single-precision floating-point number under IEEE 754:

  • Sign bit: 1 bit

  • Exponent: 8 bits (with a bias of 127)

  • Mantissa: 23 bits (plus 1 hidden bit)

For example, a binary floating-point number could look something like this:

  • Sign bit: 0

  • Exponent: 10000001 (which represents 129 in decimal; with the bias subtracted, it becomes 129 − 127 = 2)

  • Mantissa: 10100000000000000000000 (where the leading 1 is the hidden bit)

This number represents 1.10100000000000000000000 × 2^2 in binary. The mantissa 1.101 in binary equals 1.625 in decimal, so the value is 1.625 × 4 = 6.5.
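A quick host-side check of this worked example (illustrative code, compiled as ordinary C++ or with nvcc):

```cuda
// Assemble the bit pattern from the worked example and decode it again.
#include <cstdio>
#include <cstdint>
#include <cstring>

int main() {
    uint32_t bits = 0;
    bits |= 0u << 31;                     // sign bit: 0 (positive)
    bits |= 129u << 23;                   // biased exponent: 10000001 = 129
    bits |= 0b10100000000000000000000u;   // 23-bit stored mantissa (hidden 1 not stored)

    float value;
    std::memcpy(&value, &bits, sizeof(value));

    int exponent = int((bits >> 23) & 0xFF) - 127;                       // remove the bias
    double mantissa = 1.0 + double(bits & 0x7FFFFF) / double(1 << 23);   // restore hidden bit

    printf("value = %g, sign = %u, exponent = %d, mantissa = %g\n",
           value, bits >> 31, exponent, mantissa);
    // Prints: value = 6.5, sign = 0, exponent = 2, mantissa = 1.625
    return 0;
}
```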

Understanding these components is fundamental for software development and hardware design involving floating-point arithmetic, ensuring precise and efficient numerical computations.

Number types

Peak FP64

  • Definition: This is the maximum number of double-precision floating-point operations per second, measured in teraflops (TFLOPS). Double precision (64-bit) offers high accuracy by using 52 bits for the mantissa, 11 bits for the exponent, and 1 sign bit.

  • Importance: It is crucial for scientific computing and engineering simulations that require high numerical precision to ensure the correctness of results.

Peak FP64 Tensor Core

  • Definition: This measures the same as Peak FP64 but specifically using the Tensor Cores, specialised processing units within NVIDIA GPUs designed to accelerate deep learning tasks.

  • Importance: Provides enhanced performance for certain types of calculations, such as those involving matrices and deep learning models, leveraging optimised hardware.

Peak FP32

  • Definition: The maximum number of single-precision floating-point operations per second. Single precision (32-bit) uses 23 bits for the mantissa, 8 bits for the exponent, and 1 sign bit.

  • Importance: Balances precision and performance. It is widely used in gaming, graphics rendering, and machine learning where high precision is less critical than performance.

Peak FP16

  • Definition: The maximum number of half-precision floating-point operations per second. Half precision (16-bit) allocates 10 bits to the mantissa, 5 bits to the exponent, and 1 sign bit.

  • Importance: Offers a good compromise between storage space and precision, suitable for mobile devices, image processing, and certain types of machine learning models where high precision is less necessary.

Peak BF16

  • Definition: The maximum number of bfloat16 floating-point operations per second. Bfloat16 is a truncated floating point format that uses the same 8-bit exponent as FP32 but with a shorter 7-bit mantissa.

  • Importance: Particularly useful in machine learning and deep learning, providing near-single precision range with reduced precision, which is typically sufficient for these applications.

Peak TF32 Tensor Core

  • Definition: This is a specialised metric for the performance of Tensor Float 32 operations per second using Tensor Cores. TF32 is designed to provide the range of FP32 operations while delivering the performance of FP16.

  • Importance: Critical for AI training tasks, offering a balance of performance and accuracy, enhanced by hardware acceleration.

Peak FP16 Tensor Core

  • Definition: Measures the performance of half-precision operations using Tensor Cores, potentially with and without the use of Sparsity.

  • Importance: Enhances the computational speed for tasks that can tolerate reduced precision, with sparsity providing additional performance gains by skipping zero values in data.

Peak BF16 Tensor Core

  • Definition: Similar to Peak BF16 but specifically using Tensor Cores, again with potential variations for sparsity.

  • Importance: Optimises performance for AI workloads, allowing for faster training times and efficient model deployment.

Peak FP8 Tensor Core

  • Definition: This measures the performance of 8-bit floating-point operations per second using Tensor Cores, designed to provide even lower precision with extremely high throughput.

  • Importance: Useful in scenarios where ultra-high volume computations with lower accuracy requirements are needed, such as certain inference tasks in deep learning.

Peak INT8 Tensor Core

  • Definition: The maximum number of 8-bit integer operations per second using Tensor Cores. This format is common in deep learning inference where high precision is not necessary.

  • Importance: Allows for rapid computation in large-scale and real-time applications, like video processing and real-time AI services, with significant acceleration when using sparsity.

Each of these performance metrics highlights a trade-off between precision and speed, where the choice of number type depends on the specific requirements of the application, such as the need for accuracy versus the need for fast processing and reduced memory usage.
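To make the precision side of that trade-off tangible, here is a small, hedged sketch (plain host code, illustrative only) that emulates bfloat16 storage by truncating an FP32 value to its top 16 bits, which keeps the full 8-bit exponent but only 7 mantissa bits. Real conversions round to nearest rather than truncate.

```cuda
// Emulate bfloat16 storage by keeping only the top 16 bits of an FP32 value.
// This illustrates the precision trade-off; it is not a production conversion routine.
#include <cstdio>
#include <cstdint>
#include <cstring>

float to_bf16_and_back(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFF0000u;                  // keep sign, 8-bit exponent, top 7 mantissa bits
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

int main() {
    float pi = 3.14159265f;
    printf("FP32 value : %.7f\n", pi);                   // ~7 significant decimal digits
    printf("BF16 value : %.7f\n", to_bf16_and_back(pi)); // ~2-3 significant decimal digits
    return 0;
}
```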
