High Bandwidth Memory (HBM3)

High Bandwidth Memory 3 (HBM3) is the latest generation of the High Bandwidth Memory (HBM) standard.

It is an advanced memory system that provides very high data transfer speeds (bandwidth), uses low power, and packs a large amount of memory (high capacity) into a small physical size (form factor).

HBM is a type of memory architecture used in high-performance computing, known for its ability to provide extremely high memory bandwidth.

HBM3e is the newest variant in the HBM series, following HBM2, HBM2E, and HBM3.

The 'e' in HBM3e denotes an enhanced version of the HBM3 standard.

And development continues: the market may soon see a second generation of HBM3 devices, following the trend set by LPDDR5, which has already seen speed upgrades.

HBM also uses a very wide interface to the processor chip.

An interface is how two different parts of a system connect and communicate with each other. By using many parallel connections (like having many lanes on a highway), HBM can send and receive a massive amount of data to/from the processor simultaneously.

One of the most significant advantages of HBM3 is its increased storage capacity.

Supporting die densities of up to 32 Gb and stacks up to 16 dies high, HBM3 can provide a maximum of 64 GB per stack, almost triple the capacity of HBM2E.
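
As a quick sanity check, the per-stack capacity follows directly from die density times stack height, as in this small calculation based on the figures above:

```python
# Capacity implied by the HBM3 figures above (die density x stack height).
die_density_gbit = 32            # up to 32 Gb per DRAM die
stack_height = 16                # up to 16 dies per stack
capacity_gbyte = die_density_gbit * stack_height / 8
print(capacity_gbyte)            # 64.0 GB per HBM3 stack
```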

This expanded memory capacity is crucial for handling the increasing demands of advanced applications.

Data Transfer

In addition to its storage capabilities, HBM3 boasts speed, with a top data transfer rate of 6.4 Gbps per pin, nearly double that of HBM2E (3.6 Gbps).
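
Combined with the 1024-bit interface described below, that per-pin rate implies the following back-of-the-envelope peak bandwidth per stack (ignoring protocol overhead):

```python
# Peak per-stack bandwidth implied by the per-pin data rate and bus width.
pin_rate_gbps = 6.4              # HBM3 data rate per pin
bus_width_bits = 1024            # total interface width of one stack
peak_gbyte_s = pin_rate_gbps * bus_width_bits / 8
print(peak_gbyte_s)              # 819.2 GB/s (vs ~460.8 GB/s for HBM2E at 3.6 Gbps)
```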

HBM memory stacks several chips vertically on a substrate, which is then connected to a processor or GPU via a silicon interposer.

Vertical Stacking and Silicon Interposer

HBM uses an innovative approach of stacking multiple DRAM dies on top of each other vertically.

DRAM stands for Dynamic Random Access Memory, which is the type of memory commonly used in computers. A die is a small block of semiconducting material on which a given functional circuit is fabricated. So an HBM stack has several DRAM dies stacked up.

  • HBM (High Bandwidth Memory) uses a unique architecture where multiple DRAM chips are stacked vertically on a substrate, rather than being placed side by side like in traditional memory layouts.

  • The stacked DRAM chips are connected to a processor or GPU using a silicon interposer, which is a thin layer of silicon that sits between the memory stack and the processor/GPU.

  • The silicon interposer contains a large number of tiny wires (interconnects) that enable high-speed communication between the stacked memory and the processor/GPU.

  • This vertical stacking and use of a silicon interposer allow for a much wider interface and higher bandwidth compared to traditional memory configurations.

The benefits of stacking

The DRAM dies are linked together using vertical interconnects called Through-Silicon Vias (TSVs).

A TSV is a vertical electrical connection that passes completely through a silicon die. It allows the stacked dies to communicate with each other much faster than traditional wire-bonding. Think of it as an elevator shaft that lets data move between different floors (dies) quickly.

This vertical stacking, combined with a wider interface, enables much higher bandwidth compared to traditional flat, planar layouts of DRAM.

This significant increase in bandwidth enables faster processing and improved overall system performance.

Power Efficiency

Another key benefit of HBM3 is its improved power efficiency.

HBM3 reduces the core voltage (the voltage supplied to the DRAM chips) to 1.1 volts, down from HBM2E's 1.2 volts.

This lower voltage means less power is consumed by the memory. This allows HBM3 to offer substantial power savings without compromising performance.

Remember, power consumption is proportional to the square of the voltage (P = V²/R). So even a small reduction in voltage can have a significant impact on power efficiency. The challenge is maintaining signal integrity and data retention at lower voltages.
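
A quick worked example of that voltage-squared relationship, using the HBM2E and HBM3 core voltages quoted above:

```python
# Dynamic power scales roughly with the square of the supply voltage.
v_hbm2e, v_hbm3 = 1.2, 1.1
relative_power = (v_hbm3 / v_hbm2e) ** 2                            # ~0.84
print(f"~{(1 - relative_power) * 100:.0f}% lower dynamic power")    # ~16%
```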

This improved power efficiency has, in turn, permitted improvements in bandwidth and reliability.

Bandwidth

HBM3 achieves its higher bandwidth through an enhanced channel architecture, dividing its 1024-bit interface into 16 64-bit channels or 32 32-bit pseudo-channels.

What are pseudo-channels? HBM3 splits each physical 64-bit channel into two 32-bit "pseudo-channels", effectively doubling the number of independent sub-channels from 16 to 32.

More pseudo-channels allow greater parallelism - more data can be accessed simultaneously from different regions of the DRAM. This improves bandwidth utilisation and performance.

However, the pseudo-channel logic does consume some additional power. The power savings from the core voltage reduction help offset this.

Nonetheless, this doubled number of pseudo-channels, combined with the increased data rate, results in a substantial performance improvement over HBM2E.
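
A minimal sketch of how that interface division works out arithmetically:

```python
# How the 1024-bit HBM3 interface divides into channels and pseudo-channels.
interface_bits = 1024
channels = 16
channel_width = interface_bits // channels       # 64 bits per channel
pseudo_channels = channels * 2                   # each channel split in two
pseudo_channel_width = channel_width // 2        # 32 bits per pseudo-channel
print(channels, channel_width, pseudo_channels, pseudo_channel_width)   # 16 64 32 32
```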

Reliability

HBM3 also incorporates advanced RAS (reliability, availability, and serviceability) features that enhance data integrity and system reliability.

On-die ECC

Error-Correcting Code (ECC) is a method of detecting and correcting bit errors in memory. HBM3 introduces on-die ECC, where the ECC bits are stored and the correction is performed within each DRAM die.

On-die ECC improves reliability by catching and fixing errors locally before data is transmitted to the host. However, the ECC circuits do add some power overhead. Careful design is needed to minimise this.

Error Check and Scrub (ECS)

This is a background process that periodically reads data from the DRAM, checks the ECC for errors, and writes back corrected data if necessary.

ECS helps maintain data integrity over time, preventing the accumulation of bit errors. The scrubbing does consume some additional power, but it is essential for mission-critical applications.
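
Conceptually, an ECS pass behaves like the loop below. This is only an illustrative sketch, not a real memory-controller API; the names (`scrub_pass`, `read_with_ecc`, `write`) are hypothetical.

```python
# Illustrative error-check-and-scrub sweep: read every word, let the ECC decode
# repair single-bit errors, and write corrected data back so errors don't accumulate.
def scrub_pass(num_words, read_with_ecc, write):
    corrected = 0
    for addr in range(num_words):
        data, had_error = read_with_ecc(addr)   # decode returns corrected data + error flag
        if had_error:
            write(addr, data)                   # write the corrected word back
            corrected += 1
    return corrected                            # number of errors scrubbed this pass
```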

Refresh Management

DRAM cells lose their data over time due to charge leakage and must be periodically refreshed. HBM3 introduces advanced refresh management techniques like Refresh Management (RFM) and Adaptive Refresh Management (ARFM).

These allow the refresh rate to be optimised based on temperature and usage conditions, so unnecessary refreshes can be avoided, saving power. The refresh logic does add some complexity and power, but the net effect is a power saving.
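
The sketch below illustrates the general idea of temperature-aware refresh. The thresholds and intervals are assumptions chosen for illustration, not JEDEC-specified HBM3 values.

```python
# Illustrative temperature-aware refresh policy: refresh more often when hot,
# relax the interval when cool to save power.
def refresh_interval_us(temp_c, base_trefi_us=3.9):   # base interval is an assumption
    if temp_c > 95:
        return base_trefi_us / 4      # very hot: refresh 4x as often
    if temp_c > 85:
        return base_trefi_us / 2      # hot: refresh 2x as often
    return base_trefi_us              # normal operating range

print(refresh_interval_us(70), refresh_interval_us(90))   # 3.9 1.95
```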

Latency

The new clocking architecture in HBM3 decouples the traditional host-to-device clock signal from the data strobe signals, allowing a lower-latency, higher-performance solution when migrating from HBM2E to HBM3.

Clock architecture

HBM3 decouples the command/address clock from the data bus clock. The command clock runs at half the frequency of the data clock. This allows the DRAM I/O to run faster without burdening the core DRAM arrays.

Splitting the clocks does require some additional clock generation and synchronisation logic which consumes power. But it enables a significant data rate increase without a proportional power increase.

Summary

In conclusion, HBM3 represents a leap forward in memory technology, offering increased storage capacity, faster data transfer rates, improved power efficiency, and advanced features.

With its ability to meet the growing demands of high-performance computing applications, HBM3 is poised to become the memory solution of choice for industries seeking cutting-edge performance and efficiency.

As the adoption of HBM3 grows, we can expect to see groundbreaking advancements in graphics, cloud computing, networking, AI, and automotive sectors, propelling us into a new era of technological innovation.

Other Technical Details

Pseudo-Channels:

  • Each HBM channel is quite wide (128 bits in HBM2). To further increase parallelism, each channel can be split into narrower "pseudo-channels".

  • HBM2 could split each 128-bit channel into 2 pseudo-channels of 64 bits, while HBM3 splits each of its 16 narrower 64-bit channels into 2 pseudo-channels of 32 bits.

  • This gives 32 independent pseudo-channels across the interface, allowing even more simultaneous data access.

Wide Data Interface

  • HBM has a very wide data bus - 1024 bits in total. In HBM2 this was split into 8 channels of 128 bits each, while in HBM3 it's 16 channels of 64 bits each.

  • This wide interface allows a high data rate (amount of data transferred per second) to be achieved at a relatively lower clock speed, which helps manage power consumption.
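
To see the trade-off, the sketch below compares the clock needed to hit the same bandwidth over the 1024-bit HBM bus versus a conventional 64-bit bus, assuming double-data-rate signalling (two transfers per clock):

```python
# Same bandwidth target, two bus widths: a wide bus needs a far lower clock.
target_gbyte_s = 819.2
def clock_ghz(bus_width_bits, transfers_per_clock=2):   # DDR: 2 transfers/clock
    transfers_per_s = target_gbyte_s * 8 / bus_width_bits
    return transfers_per_s / transfers_per_clock

print(clock_ghz(1024))   # 3.2 GHz for the 1024-bit HBM interface
print(clock_ghz(64))     # 51.2 GHz for a conventional 64-bit bus, which is impractical
```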

High-Speed Signaling:

  • HBM uses advanced circuit techniques to achieve very high signaling rates on the data interface.

  • HBM2 could transfer data at up to 2 Gigabits per second (Gbps) per pin. HBM2E increased this to 3.2 to 3.6 Gbps, and HBM3 reaches 6.4 Gbps or even higher.

  • To ensure reliable operation at these high speeds, HBM uses techniques like equalization (adjusting the signal strength and shape), Forward Error Correction (adding redundant data that allows errors to be corrected), and careful alignment and training of the signal timing.

Separate Row and Column Commands

  • The DRAM chips in HBM are arranged in a grid of rows and columns. Accessing data requires first selecting a row (called activating the row) and then reading or writing the desired columns.

  • HBM has separate command buses for row commands (like activate and precharge) and column commands (like read and write). This allows the memory controller to prepare the next row while still reading or writing data from the current row, improving utilization.

Additive Latency and Read-Modify-Write

  • Additive latency allows the memory controller to send a read command before the associated row activate command. The DRAM internally delays the read until the row is ready. This hides some of the row activation time, reducing overall latency.

  • Read-modify-write is a feature that allows a small piece of data within a larger block to be updated without having to read the entire block, modify it in the processor, and write the entire block back. This saves time and power.

To achieve its high bandwidth goals, HBM requires careful engineering of the electrical signals and power delivery:

Signal Integrity

  • Signal integrity refers to ensuring that the electrical signals representing the data maintain their correct shape and timing as they travel from the sender to the receiver.

  • At the high speeds used by HBM, this is challenging. The engineers must carefully model the entire signal path, including the microscopic bumps and TSVs in the HBM stack and the wiring in the interposer.

  • They use advanced simulation models (like IBIS-AMI) that capture the behavior of the circuits and the physics of the signals. They run many simulations with different patterns of data to statistically predict the likelihood of signal errors (the Bit Error Rate or BER).

  • The goal is to achieve a clean "eye diagram" - a visual representation of the signal quality that shows there is adequate margin for the signal voltage and timing to correctly represent the data bits even with the expected manufacturing variability and changes in operating conditions like temperature and voltage.

Power Integrity

  • Power integrity refers to ensuring that the voltage supplied to the HBM remains stable and noise-free despite the large, fast changes in current draw as the HBM operates.

  • The electrical current consumed by the HBM has to flow through the wiring in the interposer and package. The resistance and inductance of this wiring cause the voltage to droop when the current changes rapidly, which can cause errors if the voltage goes too low.

  • To mitigate this, the engineers place decoupling capacitors (small reservoirs of charge) very close to the HBM stack - on the same die, on the interposer, and on the package. These capacitors supply current to the HBM when needed and help smooth out the voltage fluctuations.

  • The placement and size of these capacitors, as well as the geometry of the power delivery network wiring, must be carefully optimized to ensure a clean power supply.

  • In some cases, specialized voltage regulators designed specifically for HBM may be placed very close to the stack to further improve the power integrity.

In addition to raw performance, HBM includes several reliability, availability, and serviceability (RAS) features:

Error Checking

  • HBM includes parity bits on critical signals like the command bus and address bus. Parity is a simple form of error detection where an extra bit is added to a group of bits to make the total number of '1's either even or odd. If a single bit gets corrupted, the parity will no longer match, indicating an error (a minimal example is sketched after this list).

  • HBM3 introduces an even more advanced error detection method called Pulse Amplitude Modulation 4-level (PAM4). This allows the receiver to not only detect errors but also to assess the signal quality in real-time.
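
A minimal even-parity check, as described in the first bullet above (illustrative Python, not device logic):

```python
# Even parity: the extra bit makes the total number of 1s even, so any
# single-bit corruption produces a mismatch at the receiver.
def even_parity(bits):
    return sum(bits) % 2

cmd = [1, 0, 1, 1, 0, 1, 0, 0]
p = even_parity(cmd)                 # parity bit sent alongside the command/address bits
received = cmd.copy()
received[3] ^= 1                     # a single bit flips on the bus
print(even_parity(received) != p)    # True -- the mismatch flags an error
```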

Error Correction

  • While earlier HBM versions could only detect errors, HBM3 adds the ability to correct errors using Error Correction Codes (ECC). ECC works by adding redundant bits to the data that allow the receiver to not only detect but also correct a certain number of bit errors (a toy example is sketched after this list).

  • HBM3 includes ECC within the DRAM chips themselves, which can correct single-bit errors and detect multi-bit errors.

  • The HBM also performs background scrubbing, where it periodically reads out the data, checks and corrects any errors, and writes the corrected data back. This prevents the accumulation of errors over time.
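
As a toy illustration of single-error correction, the sketch below uses a Hamming(7,4) code: 4 data bits protected by 3 parity bits. Real HBM3 on-die ECC uses much wider codewords, but the principle of locating and flipping the bad bit is the same.

```python
# Toy Hamming(7,4) code: 3 parity bits protect 4 data bits and let the decoder
# locate (and flip back) any single-bit error in the 7-bit codeword.
def encode(d1, d2, d3, d4):
    p1 = d1 ^ d2 ^ d4            # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4            # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4            # covers positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def correct(c):
    p1, p2, d1, p3, d2, d3, d4 = c
    s1 = p1 ^ d1 ^ d2 ^ d4
    s2 = p2 ^ d1 ^ d3 ^ d4
    s3 = p3 ^ d2 ^ d3 ^ d4
    syndrome = s1 + 2 * s2 + 4 * s3      # 1-based position of the flipped bit, 0 = clean
    if syndrome:
        c[syndrome - 1] ^= 1             # repair the single-bit error
    return [c[2], c[4], c[5], c[6]]      # recovered data bits

code = encode(1, 0, 1, 1)
code[5] ^= 1                             # inject a single-bit error
print(correct(code))                     # [1, 0, 1, 1] -- original data recovered
```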

Lane Repair

  • HBM includes spare data and command/address lanes (like extra highway lanes). If a lane is not performing well or has failed, the controller can swap in one of the spare lanes to replace it.

  • This repair can be done by programming the HBM's configuration registers, or it can be made permanent by blowing fuses (essentially tiny electrical switches that can be permanently opened).
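
A toy model of the remapping idea (the table and lane counts are purely illustrative; real devices implement this through configuration registers or fuses):

```python
# Lane repair as a remapping table: when a physical lane fails, route its
# logical lane onto one of the spare physical lanes.
lane_map = {lane: lane for lane in range(64)}    # logical -> physical, identity to start
spare_lanes = [64, 65]                           # spare physical lanes held in reserve

def repair(failed_logical_lane):
    lane_map[failed_logical_lane] = spare_lanes.pop(0)

repair(17)
print(lane_map[17])   # 64 -- traffic for logical lane 17 now uses a spare lane
```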

Temperature Monitoring

  • The performance and reliability of DRAM is sensitive to temperature. If the DRAM gets too hot, it may not be able to meet its timing requirements, leading to errors.

  • HBM includes several temperature sensors within the stack that continuously monitor the temperature.

  • The memory controller can read out these temperature values and adjust its operation accordingly. For example, it can increase the frequency of refresh operations (which are necessary to maintain data integrity in DRAM) at higher temperatures. If the temperature gets too high, the controller can throttle down the data transfer rate to reduce power consumption and heat generation.

Designing an HBM system that achieves multi-gigabit per second data rates while maintaining signal integrity, power integrity, thermal control, and high reliability is a complex multi-disciplinary challenge.

It requires close collaboration and co-design across the entire system, from the DRAM chips themselves to the interposer, the package, the PHY circuits, and the memory controller.

The engineering teams must carefully budget and allocate the available timing and voltage margins across these different components.

They rely heavily on detailed modeling and simulation to predict and optimise the system's behavior.

Specialised circuit designs, advanced packaging technologies, sophisticated error correction and calibration methods, and dynamic monitoring and adaptation techniques are all essential to make the HBM system robust and reliable.

When all of these elements come together successfully, HBM provides a step-change improvement in memory performance within a manageable power envelope.

This has made HBM a critical enabler for applications like high-performance computing, artificial intelligence and machine learning training, and high-speed networking, which all demand the highest possible memory bandwidth.

As the latest generation, HBM3, ramps up into volume production, and future generations like HBM4 are developed, we can expect to see HBM continue to advance the leading edge of computing systems. It's a complex and fascinating technology that showcases the ingenuity and perseverance of the engineers and scientists pushing the boundaries of what's possible in the world of semiconductors and computing architecture.

Figure: I/O speed of different HBM versions
Figure: Evolution of HBM cheat sheet