Why is inference important?
Speed and cost count
Inference refers to the process of generating predictions or output from a trained model given new input data.
While training models to understand and generate human language is a significant part of developing LLMs, the true utility of these models is realised during inference, when they are applied to real-world tasks.
Optimising inference is critical in the deployment of neural language models because it is where:
The user experience occurs: faster inference makes applications feel more responsive
Costs are incurred: cheaper inference lowers the cost of using the models
Inference optimisation aims to reduce the time taken to generate the initial output (time to first token), reduce the total time taken to generate a full response, and increase the number of outputs per second across all concurrent requests (throughput).
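As a rough illustration, the sketch below measures these three quantities for a single streaming request. The `stream_tokens` callable is hypothetical: substitute the streaming API of whichever serving engine or client library you use.

```python
import time

def measure_generation(stream_tokens, prompt: str) -> dict:
    """Time-to-first-token, total latency, and tokens/second for one request.

    `stream_tokens` is a hypothetical callable that yields generated tokens
    one by one for the given prompt.
    """
    start = time.perf_counter()
    time_to_first_token = None
    n_tokens = 0

    for _ in stream_tokens(prompt):
        if time_to_first_token is None:
            time_to_first_token = time.perf_counter() - start
        n_tokens += 1

    total_time = time.perf_counter() - start
    return {
        "time_to_first_token_s": time_to_first_token,
        "total_time_s": total_time,
        "tokens_per_second": n_tokens / total_time if total_time > 0 else 0.0,
    }
```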
LLM serving engines provide the critical functionality and optimisations needed to deploy large language models in production.
Capabilities such as memory-efficient attention, request batching, and model-specific compilation make it possible to achieve high throughput and low latency.
Frameworks such as TensorRT-LLM and vLLM package these techniques into easy-to-use APIs to accelerate inference on GPUs.
Understanding the capabilities of serving engines is key for practitioners looking to put generative AI into real-world use.
Memory management of the key-value (KV) cache
A pre-emption mechanism that evicts cache blocks when GPU memory is full, using techniques such as all-or-nothing eviction of a sequence's related blocks
A reservation strategy that pre-allocates GPU memory for the KV cache to avoid eviction (a simplified sketch follows this capability list)
Memory and model optimisations
Techniques to reduce memory footprint and accelerate inference
Batching of requests
Combining multiple requests into batches to improve throughput
Language model-specific optimisations
Custom kernels and compilation for transformer models
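To make the KV cache behaviour above concrete, here is a deliberately simplified, illustrative sketch of block-based cache bookkeeping with a reservation and all-or-nothing pre-emption. Real engines such as vLLM implement far more sophisticated versions of this idea; the class and its behaviour here are purely for explanation.

```python
from collections import defaultdict

class ToyKVCacheManager:
    """Illustrative block-based KV cache bookkeeping (not a real engine)."""

    def __init__(self, total_blocks: int, reserved_blocks: int = 0):
        # Reservation: hold some blocks back so that growing sequences are
        # less likely to trigger evictions when new requests arrive.
        self.free_blocks = total_blocks - reserved_blocks
        self.reserved_blocks = reserved_blocks
        self.blocks_per_seq = defaultdict(int)

    def allocate(self, seq_id: str, n_blocks: int) -> bool:
        """Grant `seq_id` more cache blocks, pre-empting other sequences if needed."""
        while self.free_blocks < n_blocks:
            if not self._evict_one_sequence(exclude=seq_id):
                return False  # nothing left to evict; caller must wait or reject
        self.free_blocks -= n_blocks
        self.blocks_per_seq[seq_id] += n_blocks
        return True

    def _evict_one_sequence(self, exclude: str) -> bool:
        # All-or-nothing eviction: free every block owned by the victim sequence,
        # so no sequence is left with a partially usable cache.
        victims = [s for s in self.blocks_per_seq if s != exclude]
        if not victims:
            return False
        victim = victims[0]
        self.free_blocks += self.blocks_per_seq.pop(victim)
        return True

# Example: the second request forces all-or-nothing eviction of the first.
manager = ToyKVCacheManager(total_blocks=8, reserved_blocks=2)
assert manager.allocate("request-1", 4)
assert manager.allocate("request-2", 3)
```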
TensorRT-LLM
Accelerates inference on NVIDIA GPUs by wrapping the TensorRT compiler and FasterTransformer kernels
Supports tensor parallelism across multiple GPUs and servers
Provides optimised versions of popular LLMs such as GPT and Llama
Supports advanced features like in-flight batching and PagedAttention
vLLM
High-performance inference library emphasising throughput and memory efficiency
Uses PagedAttention to optimise memory and support larger batch sizes
Includes features such as continuous batching, GPU parallelism, and streaming output
Provides Python API for offline inference and launching API servers
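As a brief illustration of vLLM's offline inference API, the snippet below generates completions for a single prompt. The model name is only an example; any Hugging Face model supported by vLLM will work.

```python
from vllm import LLM, SamplingParams

prompts = ["Explain why KV caching speeds up LLM inference."]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# vLLM downloads the weights from Hugging Face; the model here is just an example.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```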
Several libraries and frameworks have been developed to optimise the inference process:
ONNX Runtime
An open-source project that accelerates machine learning model inference. It supports various hardware optimisations and is compatible with different platforms.
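A minimal sketch of running an exported ONNX model with ONNX Runtime; the file name and input shape are placeholders for whatever your exported model expects.

```python
import numpy as np
import onnxruntime as ort

# "model.onnx" and the (1, 3, 224, 224) input shape are placeholders.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Run the model; passing None returns all outputs.
outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```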
TensorRT
NVIDIA’s SDK for high-performance deep learning inference, which includes a deep learning inference optimizer and runtime for production environments.
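A hedged sketch of building a TensorRT engine from an ONNX file with the Python API; exact flags and builder calls vary between TensorRT versions, so treat this as an outline rather than a drop-in script.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Explicit-batch networks are required when importing ONNX models.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # mixed precision, if the GPU supports it

serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```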
TorchScript (PyTorch)
Allows PyTorch models to be run in a high-performance environment without dependency on the Python interpreter, making inference more efficient.
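For example, a model can be traced into a TorchScript module and later loaded without its original Python class definition; torchvision's ResNet-18 is used here purely as a convenient stand-in.

```python
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()
example_input = torch.randn(1, 3, 224, 224)

# Tracing records the operations executed on the example input.
traced = torch.jit.trace(model, example_input)
traced.save("resnet18_traced.pt")

# The saved module runs without the original Python class (even from C++).
loaded = torch.jit.load("resnet18_traced.pt")
with torch.no_grad():
    print(loaded(example_input).shape)
```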
Quantization
Reduces the precision of model parameters (e.g., from float32 to float16 or int8), thereby speeding up inference and reducing memory usage.
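A small PyTorch sketch of post-training dynamic quantization, which converts the Linear layers of a toy model to int8:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
).eval()

# Dynamic quantization: weights stored as int8, activations quantized on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)
```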
Model Pruning
Pruning reduces model size by trimming unimportant weights or connections while aiming to preserve the model's capacity (accuracy).
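For illustration, PyTorch's pruning utilities can zero out low-magnitude weights in a single layer:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)

# Zero out the 30% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent by removing the re-parameterisation.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Weight sparsity: {sparsity:.2%}")
```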
Batch Inference
Processing multiple input data points simultaneously, increasing throughput and efficiency, especially in GPU environments.
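A toy example of the idea: eight independent inputs are stacked into one tensor and processed in a single forward pass instead of eight separate ones.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).eval()

inputs = [torch.randn(128) for _ in range(8)]
batch = torch.stack(inputs)   # shape: (8, 128)

with torch.no_grad():
    outputs = model(batch)    # shape: (8, 10), one forward pass for all requests

print(outputs.shape)
```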
Asynchronous Inference
Improving real-time responses by decoupling the process of prediction generation from the main application flow.
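A minimal asyncio sketch of the pattern: the blocking prediction runs in a worker thread so the main event loop (for example, a web server) stays responsive while results are produced. The model and request shapes are placeholders.

```python
import asyncio
import torch
import torch.nn as nn

model = nn.Linear(128, 10).eval()  # placeholder model

def predict(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return model(x)

async def handle_request(x: torch.Tensor) -> torch.Tensor:
    # Offload the blocking call so the event loop keeps serving other requests.
    return await asyncio.to_thread(predict, x)

async def main() -> None:
    requests = [torch.randn(1, 128) for _ in range(4)]
    results = await asyncio.gather(*(handle_request(r) for r in requests))
    print(len(results), results[0].shape)

asyncio.run(main())
```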