
Why is inference important?

Speed and cost count

Inference refers to the process of generating predictions or output from a trained model given new input data.

While training models to understand and generate human language is a significant part of developing LLMs, the true utility of these models is realised during inference, when they are applied to real-world tasks.

Optimising inference is critical in the deployment of neural language models because it is where:

  1. The user experience occurs - faster inference means a more responsive application

  2. Costs are incurred - cheaper inference lowers the cost of using the models

Inference optimisation aims to reduce the time to first output (time to first token), reduce the total time to generate a full response, and increase the number of outputs served per second across all concurrent requests (throughput).
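
As a concrete illustration, the sketch below times a single streamed generation to report time to first token, total latency, and tokens per second. The `stream_generate` function is a hypothetical stand-in for whatever streaming client your serving stack exposes.

```python
import time

def measure_latency(stream_generate, prompt):
    """Time one streamed generation.

    `stream_generate` is a hypothetical function that yields tokens one at a
    time for a given prompt; substitute your serving client's streaming call.
    """
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _ in stream_generate(prompt):
        if first_token_time is None:
            first_token_time = time.perf_counter() - start  # time to first token
        n_tokens += 1
    total = time.perf_counter() - start
    return {
        "time_to_first_token_s": first_token_time,
        "total_time_s": total,
        "tokens_per_second": n_tokens / total if total > 0 else 0.0,
    }
```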

Serving Engines

LLM serving engines provide the critical functionality and optimisations needed to deploy large language models in production.

Capabilities such as memory-efficient attention, request batching, and model-specific compilation make it possible to achieve high throughput and low latency.

Frameworks such as TensorRT-LLM and vLLM package these techniques into easy-to-use APIs to accelerate inference on GPUs.

Understanding the capabilities of serving engines is key for practitioners looking to put generative AI into real-world use.

The architecture of servers and engines (Source: Runway.ai)

Key Capabilities of LLM Serving Engines

  • Memory management of key-value (KV) cache

    • Pre-emption mechanism to evict cache blocks when GPU memory is full, using techniques like all-or-nothing eviction of related sequence blocks

    • Reservation strategy to pre-allocate GPU memory for KV cache to avoid eviction

  • Memory and model optimisations

    • Techniques to reduce memory footprint and accelerate inference

  • Batching of requests

    • Combining multiple requests into a single batch to improve throughput (see the sketch after this list)

  • Language model-specific optimisations

    • Custom kernels and compilation for transformer models
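
To make the batching idea concrete, here is a minimal dynamic-batching sketch (not taken from any particular engine): requests arrive on a queue as `(prompt, Future)` pairs, the loop collects up to `max_batch_size` of them within a short wait window, and `model_fn` (a hypothetical callable mapping a list of prompts to a list of completions) runs them in one forward pass.

```python
import queue
from concurrent.futures import Future

def submit(request_queue: "queue.Queue", prompt: str) -> Future:
    """Called from request-handling code: enqueue a prompt and return a Future."""
    fut: Future = Future()
    request_queue.put((prompt, fut))
    return fut

def batching_loop(request_queue, model_fn, max_batch_size=8, wait_s=0.01):
    """Toy dynamic batcher, intended to run in a background thread."""
    while True:
        batch = [request_queue.get()]                 # block until a request arrives
        try:
            while len(batch) < max_batch_size:        # opportunistically collect more
                batch.append(request_queue.get(timeout=wait_s))
        except queue.Empty:
            pass                                      # wait window expired: run with what we have
        outputs = model_fn([prompt for prompt, _ in batch])  # one batched forward pass
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)                       # unblock the waiting caller
```

Production engines go further with continuous (in-flight) batching, where new requests can join and finished ones leave the batch between individual decoding steps.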

NVIDIA TensorRT-LLM (TRT-LLM)

  • Accelerates inference on NVIDIA GPUs by wrapping TensorRT compiler and FasterTransformer kernels

  • Supports tensor parallelism across multiple GPUs and servers

  • Provides optimised versions of popular LLMs such as GPT and Llama

  • Supports advanced features like in-flight batching and PagedAttention

vLLM

  • High-performance inference library emphasizing throughput and memory efficiency

  • Uses PagedAttention to optimise memory and support larger batch sizes

  • Includes features like continuous batching, GPU parallelism, streaming output

  • Provides a Python API for offline inference and for launching API servers (see the sketch below)
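
For example, offline inference with vLLM's Python API looks roughly like the following; the model name is only an example, and the sampling parameters are arbitrary.

```python
from vllm import LLM, SamplingParams

# Any Hugging Face causal LM supported by vLLM can be used here.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "Explain KV caching in one sentence.",
    "Why does batching improve GPU utilisation?",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

vLLM also ships an OpenAI-compatible HTTP server for online serving, so the same engine can back both batch jobs and live APIs.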

Libraries for Enhanced Inference

Several libraries and frameworks have been developed to optimize the inference process:

ONNX Runtime

An open-source project that accelerates machine learning model inference. It supports various hardware optimisations and is compatible with different platforms.
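
A minimal ONNX Runtime session, assuming a model already exported to `model.onnx` (a placeholder path) with a single image-like input:

```python
import numpy as np
import onnxruntime as ort

# Load the exported model; the provider list picks the execution backend.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # shape depends on your model
outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```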

TensorRT

NVIDIA’s SDK for high-performance deep learning inference, which includes a deep learning inference optimizer and runtime for production environments.
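
A sketch of the TensorRT 8.x Python workflow for building an FP16 engine from an ONNX file; `model.onnx` and `model.plan` are placeholder paths, and real deployments usually add optimisation profiles for dynamic shapes.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the ONNX graph into a TensorRT network definition.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels where supported

# Build and serialise the optimised engine for deployment.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```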

TorchScript (PyTorch)

Allows PyTorch models to be run in a high-performance environment without dependency on the Python interpreter, making inference more efficient.
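
A small sketch of tracing a PyTorch model to TorchScript and reloading it; the tiny `nn.Sequential` here stands in for a real trained network.

```python
import torch
import torch.nn as nn

# A tiny stand-in model; in practice this would be your trained network.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()

# Trace the model with an example input to produce a TorchScript module.
example = torch.randn(1, 16)
scripted = torch.jit.trace(model, example)
scripted.save("model_ts.pt")

# The saved module can be loaded and run without the original Python class
# definitions (e.g. from C++ via libtorch).
loaded = torch.jit.load("model_ts.pt")
with torch.no_grad():
    print(loaded(example))
```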

Techniques for Improved Inference

Quantization

Reduces the precision of model parameters (e.g., from float32 to float16 or int8), thereby speeding up inference and reducing memory usage.
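
As an illustration, PyTorch's post-training dynamic quantization converts the weights of selected layer types to int8; the toy model below is only a stand-in.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Dynamic quantization: nn.Linear weights are stored as int8 and dequantized
# on the fly, shrinking memory use and speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
with torch.no_grad():
    print(quantized(x).shape)
```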

Model Pruning

Network pruning reduces model size by trimming unimportant weights or connections while largely preserving model capacity.
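
A minimal example using PyTorch's pruning utilities to zero out the smallest 30% of a layer's weights; the layer and the 30% ratio are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)

# Zero out the 30% of weights with the smallest absolute value (L1 unstructured pruning).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Fold the pruning mask into the weight tensor to make it permanent.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Weight sparsity: {sparsity:.1%}")
```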

Batch Inference

Processing multiple input data points simultaneously, increasing throughput and efficiency, especially in GPU environments.
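
A toy PyTorch example of the idea: stacking 32 independent inputs into one tensor and running a single forward pass instead of 32 separate calls; the model and shapes are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 8)).eval()

# 32 independent inputs processed in one forward pass; on a GPU the batched
# call keeps the hardware far better utilised than 32 separate calls.
inputs = [torch.randn(64) for _ in range(32)]
batch = torch.stack(inputs)            # shape: (32, 64)

with torch.no_grad():
    outputs = model(batch)             # shape: (32, 8)

results = list(outputs)                # one result per original input
print(len(results), results[0].shape)
```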

Asynchronous Inference

Improving real-time responses by decoupling the process of prediction generation from the main application flow.
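
A minimal asyncio sketch of the pattern: the blocking `run_model` call (a stand-in for a real inference client) is pushed to a worker thread so the event loop can keep handling other requests while predictions are computed.

```python
import asyncio
import time

def run_model(prompt: str) -> str:
    """Stand-in for a blocking inference call."""
    time.sleep(0.5)  # simulate model latency
    return f"completion for: {prompt}"

async def handle_request(prompt: str) -> str:
    # Off-load the blocking call to a worker thread so the event loop
    # (and the rest of the application) keeps serving other work.
    return await asyncio.to_thread(run_model, prompt)

async def main():
    results = await asyncio.gather(*(handle_request(p) for p in ["a", "b", "c"]))
    print(results)

asyncio.run(main())
```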
