Batch Size and Model loss

The relationship between batch size and model loss is complex.

Initial findings suggest that increasing the batch size may lower performance; however, adjusting the learning rate in conjunction with batch size changes can yield similar performance across varying batch sizes.

The 2018 paper "Don't Decay the Learning Rate, Increase the Batch Size" provides several key insights and recommendations on the relationship between learning rate, batch size, and model performance in stochastic gradient descent (SGD) optimisation:

  • Decaying the learning rate during training is equivalent to increasing the batch size in terms of the model's performance on the test set, because both strategies reduce the scale of random fluctuations (the noise scale) in the SGD dynamics.

  • Increasing the batch size instead of decaying the learning rate can significantly reduce the number of parameter updates required to train a model, leading to shorter training times and improved computational efficiency, as the sketch below illustrates.
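
A minimal PyTorch sketch of these two strategies, using an invented toy model and dataset (the 5x factor and the phase lengths are illustrative, not taken from the paper):

```python
# Sketch: two schedules that reduce the SGD noise scale in the same way.
# Toy data; the 5x factor per phase is illustrative, not from the paper.
import torch
from torch.utils.data import DataLoader, TensorDataset

X, y = torch.randn(10_000, 20), torch.randint(0, 2, (10_000,))
dataset = TensorDataset(X, y)
loss_fn = torch.nn.CrossEntropyLoss()

def run_schedule(phases, epochs_per_phase=10):
    """phases: list of (batch_size, learning_rate) pairs, applied in order."""
    model = torch.nn.Linear(20, 2)
    for batch_size, lr in phases:
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        for _ in range(epochs_per_phase):
            for xb, yb in loader:
                opt.zero_grad()
                loss_fn(model(xb), yb).backward()
                opt.step()
    return model

# Strategy A: fixed batch size, decay the learning rate 5x per phase.
run_schedule([(128, 0.1), (128, 0.02), (128, 0.004)])

# Strategy B: fixed learning rate, grow the batch size 5x per phase.
# The noise scale falls by the same factor each phase, but each phase
# takes roughly 5x fewer parameter updates than the one before it.
run_schedule([(128, 0.1), (640, 0.1), (3200, 0.1)])
```

Both runs traverse the same number of epochs; the second simply packs each epoch into fewer, larger steps, which is where the reduction in parameter updates comes from.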

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is an optimisation algorithm commonly used in machine learning to update the model parameters (weights) in the direction of the negative gradient of the loss function. The key steps in SGD are:

  1. Initialise the model parameters randomly.

  2. For each training iteration:

     a. Sample a mini-batch of examples from the training set.

     b. Compute the average gradient of the loss function with respect to the model parameters over the mini-batch.

     c. Update the model parameters by taking a step in the direction of the negative gradient, scaled by the learning rate.

  3. Repeat step 2 until convergence or for a fixed number of iterations.

The learning rate determines the size of the steps taken in the parameter space during each update.

A higher learning rate leads to larger steps, while a lower learning rate results in smaller steps.

The batch size is the number of training examples used to compute the gradient in each iteration.

Larger batch sizes provide a more accurate estimate of the true gradient but require more computation per update.
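
As a concrete illustration of steps 1 to 3 above, here is a minimal from-scratch sketch; the linear regression problem and the hyperparameter values are invented for illustration:

```python
# Sketch of the SGD update rule without torch.optim: sample a mini-batch,
# average the gradient over it, then step against the gradient scaled by
# the learning rate.
import torch

N, D = 10_000, 5
X = torch.randn(N, D)
true_w = torch.randn(D)
y = X @ true_w + 0.1 * torch.randn(N)            # synthetic regression data

w = torch.zeros(D, requires_grad=True)           # step 1: initialise the parameters
learning_rate, batch_size = 0.05, 64

for step in range(2_000):                        # step 2: training iterations
    idx = torch.randint(0, N, (batch_size,))     # 2a: sample a mini-batch
    loss = ((X[idx] @ w - y[idx]) ** 2).mean()   # mean loss over the mini-batch
    loss.backward()                              # 2b: average gradient w.r.t. parameters
    with torch.no_grad():
        w -= learning_rate * w.grad              # 2c: step along the negative gradient
        w.grad.zero_()
```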

The authors show that the convergence of SGD is governed by the noise scale, which depends on both the learning rate and the batch size.

By carefully adjusting these hyperparameters during training, the noise scale can be reduced, leading to faster convergence and more efficient training.

The proposed strategies of increasing the batch size and scaling the learning rate and momentum coefficient enable practitioners to significantly reduce the number of parameter updates required to train a model without sacrificing performance.

Rethinking Learning Rate Tuning in the Era of Large Language Models

This 2023 paper addresses the challenges and opportunities of learning rate tuning in the era of Large Language Models (LLMs).

The authors argue that existing learning rate policies, primarily designed for traditional deep neural networks (DNNs), may not work well for LLM fine-tuning due to the unique characteristics of LLMs, such as high model complexity and expensive training costs.

The paper makes three main contributions:

  1. It revisits existing learning rate policies to analyze the critical challenges of learning rate tuning for LLMs.

  2. It presents LRBench++, a benchmarking tool for learning rate policies, to facilitate learning rate tuning for both traditional DNNs and LLMs.

  3. It conducts experimental analyses using LRBench++ to demonstrate the key differences between LLM fine-tuning and traditional DNN training, validating their analysis.

Analysis

The paper highlights an important issue in the era of LLMs - the need to reassess and adapt existing learning rate tuning strategies to the unique characteristics of LLMs.

The authors identify the key differences between LLM fine-tuning and traditional DNN training, such as the much higher model complexity (billions vs. millions of parameters), prohibitively expensive training costs, different model initialization (pre-trained vs. random), fewer training epochs, and different evaluation strategies.

The paper's contribution of LRBench++ is valuable, as it provides a benchmarking tool specifically designed for learning rate tuning, which can be used for both traditional DNNs and LLMs.

This tool can help researchers and practitioners compare and evaluate different learning rate policies more effectively.

LRBench++ is a benchmarking tool designed to facilitate learning rate tuning for both traditional deep neural networks (DNNs) and large language models (LLMs). The tool allows researchers and practitioners to evaluate and compare the performance of different learning rate policies and their impact on the training/fine-tuning process.

How LRBench++ works

LRBench++ provides a unified framework for defining, implementing, and evaluating various learning rate policies, including formula-based, state-based, and exploration-based policies.

The tool integrates with popular deep learning frameworks, such as TensorFlow and PyTorch, making it easy to incorporate into existing training/fine-tuning pipelines.

LRBench++ allows users to define custom metrics for evaluating the performance of learning rate policies, such as the validation loss, accuracy, or computational cost.

The tool provides visualization capabilities to analyze the behavior of different learning rate policies during the training/fine-tuning process, helping users gain insights into the optimization paths and the impact of learning rate values on model performance.

LRBench++ also includes a collection of pre-defined learning rate policies and benchmark datasets, enabling users to quickly compare and evaluate different policies on standard tasks.
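
The sketch below is not the LRBench++ API; it is a generic illustration, using stock PyTorch schedulers, of what benchmarking a few formula-based learning rate policies against a user-defined metric might look like:

```python
# Generic comparison of formula-based learning rate policies with stock
# PyTorch schedulers. This is NOT the LRBench++ API, only an illustration.
import torch
from torch.optim.lr_scheduler import StepLR, ExponentialLR, CosineAnnealingLR

X, y = torch.randn(512, 10), torch.randn(512, 1)      # toy regression data
loss_fn = torch.nn.MSELoss()

policies = {
    "step_decay": lambda opt: StepLR(opt, step_size=10, gamma=0.5),
    "exp_decay":  lambda opt: ExponentialLR(opt, gamma=0.95),
    "cosine":     lambda opt: CosineAnnealingLR(opt, T_max=30),
}

results = {}
for name, make_scheduler in policies.items():
    model = torch.nn.Linear(10, 1)                    # fresh model per policy
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = make_scheduler(opt)
    for epoch in range(30):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
        scheduler.step()                              # apply the policy once per epoch
    results[name] = loss.item()                       # user-defined metric: final loss

print(results)
```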

On Large-Batch Training for Deep Learning

This widely cited 2017 paper, titled "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima" by Keskar et al., investigates the phenomenon of performance degradation in deep learning models when trained with large batch sizes.

The authors aim to understand the cause of this generalization gap and provide numerical evidence to support their findings.

The main observations and contributions of the paper are as follows:

Generalization gap in large-batch training

The authors observe that when training deep learning models with large batch sizes, there is a significant drop in the model's ability to generalise to unseen data (testing accuracy), despite achieving similar performance on the training data as models trained with small batch sizes.

Convergence to sharp minima

The paper provides numerical evidence that suggests large-batch methods tend to converge to sharp minimisers of the training function, which are characterised by a significant number of large positive eigenvalues in the Hessian matrix.

In contrast, small-batch methods converge to flat minimisers, which have numerous small eigenvalues.

Visualization of loss function landscape

The authors use parametric plots to visualise the loss function landscape around the minimisers obtained by small-batch and large-batch methods.

These plots demonstrate that the large-batch minimisers are significantly sharper than the small-batch minimisers.
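
A hedged sketch of the idea behind such a parametric plot, using an invented quadratic loss and stand-in "minimisers" rather than real trained networks: evaluate the loss along the straight line connecting the two solutions and observe how quickly it rises.

```python
# Sketch of a 1-D parametric plot: loss along the line between two solutions.
# The regression problem and the two "minimisers" below are stand-ins; in the
# paper they are the parameters found by small-batch and large-batch training.
import torch

X, y = torch.randn(256, 8), torch.randn(256)

def loss_fn(w):
    return ((X @ w - y) ** 2).mean()

w_small = torch.linalg.lstsq(X, y.unsqueeze(1)).solution.squeeze(1)  # one solution
w_large = w_small + 0.1 * torch.randn(8)                             # a nearby stand-in

for alpha in torch.linspace(-0.5, 1.5, 21):          # extend slightly past both endpoints
    w = (1 - alpha) * w_small + alpha * w_large      # point on the connecting line
    print(f"alpha={alpha.item():+.2f}  loss={loss_fn(w).item():.4f}")
```

A sharp minimum appears as a narrow, steep valley along this curve; a flat minimum appears as a wide, shallow one.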

Sharpness metric

To quantify the sharpness of a minimiser, the authors propose a metric that measures the maximum value of the loss function within a small neighbourhood of the minimiser.

They use this metric to compare the sharpness of minimisers obtained by small-batch and large-batch methods, confirming that large-batch methods lead to sharper minimisers.
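
A simplified sketch of such a sharpness proxy. The paper maximises the loss over a small box (or random subspace) around the minimiser; the random-sampling approximation and the toy regression loss below are simplifications, not the paper's exact procedure:

```python
# Simplified sharpness proxy: the largest normalised loss rise found among
# random perturbations inside a small box around the minimiser.
import torch

X, y = torch.randn(256, 8), torch.randn(256)
w_star = torch.linalg.lstsq(X, y.unsqueeze(1)).solution.squeeze(1)   # toy minimiser

def loss_fn(w):
    return ((X @ w - y) ** 2).mean()

def sharpness(w, epsilon=1e-3, trials=200):
    base = loss_fn(w).item()
    worst = base
    for _ in range(trials):
        delta = epsilon * (2 * torch.rand_like(w) - 1)   # uniform in the [-eps, eps] box
        worst = max(worst, loss_fn(w + delta).item())
    return 100 * (worst - base) / (1 + base)             # normalisation in the paper's style

print(sharpness(w_star))
```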

Relationship between batch size and generalization

The paper shows that there exists a threshold for the batch size, above which there is a significant drop in the model's generalization performance.

This threshold varies depending on the network architecture and dataset.

Success of small-batch methods

The authors hypothesise that the noise in the stochastic gradient used by small-batch methods helps in escaping the basins of attraction of sharp minimisers, leading to convergence towards flatter minimisers that generalise better.

They support this hypothesis through experiments involving warm-starting large-batch training with iterates obtained from small-batch training.

Discussion and future directions

The paper discusses the implications of their findings and raises several questions for future research, such as proving the convergence of large-batch methods to sharp minimisers, understanding the relative density of sharp and flat minima, and designing algorithms or architectures that can steer large-batch methods away from sharp minimisers.

Throughout the paper, the authors provide extensive numerical experiments on various deep learning architectures and datasets to support their claims.

They also explore potential remedies to the generalization problem of large-batch methods, such as data augmentation, conservative training, and adversarial training, but find that these approaches do not completely solve the issue.

In conclusion, this paper sheds light on the generalization gap observed in large-batch training for deep learning and provides empirical evidence that the convergence to sharp minimizers is a primary cause of this phenomenon. The findings have significant implications for the development of efficient training methods for deep learning models, as large-batch training is crucial for leveraging parallelism and reducing training time.

Returning to the analysis in "Don't Decay the Learning Rate, Increase the Batch Size": the noise scale is defined as g = ε(N/B − 1), where N is the training set size, B is the batch size, and ε is the learning rate. Reducing this noise scale during training is beneficial, and it can be achieved either by decaying the learning rate or by increasing the batch size.

The learning rate and batch size can be scaled together according to the linear scaling rule B ∝ ε. By increasing the learning rate and scaling the batch size accordingly, the number of parameter updates can be further reduced without sacrificing model performance.

The momentum coefficient m can also be increased to reduce the number of parameter updates, by scaling the batch size as B ∝ 1/(1 − m). However, this may lead to a slight reduction in test accuracy.
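
A small worked example of these rules, with illustrative numbers:

```python
# Worked example of the noise scale and the scaling rules (numbers are illustrative).
N   = 1_000_000     # training set size
B   = 256           # batch size
eps = 0.1           # learning rate
m   = 0.9           # momentum coefficient

g = eps * (N / B - 1)                      # noise scale; ~ eps * N / B when B << N
print(f"noise scale g = {g:.1f}")

# Halving the noise scale by decaying the learning rate ...
print("decay lr:      ", (eps / 2) * (N / B - 1))
# ... or, equivalently, by doubling the batch size (linear scaling rule B ∝ eps).
print("double B:      ", eps * (N / (2 * B) - 1))

# Raising momentum allows a proportionally larger batch size, via B ∝ 1 / (1 - m):
# moving m from 0.9 to 0.95 permits roughly double the batch size.
print("batch scaling: ", (1 - m) / (1 - 0.95))
```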

References

  • Don't Decay the Learning Rate, Increase the Batch Size (arXiv)
  • Rethinking Learning Rate Tuning in the Era of Large Language Models (arXiv)
  • On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (arXiv)

Figure: A conceptual sketch of flat and sharp minima. The Y-axis indicates the value of the loss function and the X-axis the variables (parameters).