# Why is inference important?

Inference refers to the process of generating predictions or outputs from a trained model given new input data.&#x20;

While training models to understand and generate human language is a significant part of developing LLMs, the *<mark style="color:yellow;">**true utility of these models is realised during inference**</mark>*, when they are applied to real-world tasks.

<mark style="color:blue;">**Optimising inference**</mark> is critical in the deployment of neural language models for two reasons:

1. It is where the user experience happens - faster inference means a more responsive application
2. It is where costs are incurred - efficient inference lowers the cost of serving the model

Inference optimisation aims to reduce the time to first token (how quickly initial output appears), reduce the total time to generate a full response, and increase throughput - the number of outputs produced per second across all concurrent requests.
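These three metrics can be made concrete with a small timing harness. The `fake_model_stream` generator below is a hypothetical stand-in for a real streaming model, not any particular framework's API:

```python
import time

def fake_model_stream(prompt, n_tokens=5, delay=0.001):
    """Hypothetical stand-in for a model that streams tokens one at a time."""
    for i in range(n_tokens):
        time.sleep(delay)  # simulate per-token compute
        yield f"tok{i}"

def measure(prompt):
    start = time.perf_counter()
    first_token_time = None
    tokens = []
    for tok in fake_model_stream(prompt):
        if first_token_time is None:
            # Time to first token: how long the user waits for initial output
            first_token_time = time.perf_counter() - start
        tokens.append(tok)
    total = time.perf_counter() - start   # total time for the full response
    throughput = len(tokens) / total      # tokens per second
    return first_token_time, total, throughput

ttft, total, tps = measure("hello")
```

The same three numbers are what serving engines report and optimise for, usually aggregated across many concurrent requests rather than a single call.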

### <mark style="color:purple;">Serving Engines</mark>

LLM serving engines provide the critical functionality and optimisations needed to deploy large language models in production.&#x20;

Capabilities like memory-efficient attention, request batching, and model-specific compilation make it possible to achieve high throughput and low latency.&#x20;

Frameworks such as TensorRT-LLM and vLLM package these techniques into easy-to-use APIs to accelerate inference on GPUs.&#x20;

Understanding the capabilities of serving engines is key for practitioners looking to put generative AI into real-world use.

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2F91qdSRU4gfjgNaF6GSqh%2Fimage.png?alt=media&#x26;token=d6f3e85d-4d83-4850-81c8-9d3636f9ecbc" alt=""><figcaption><p>The Architecture of servers and engines - Source: Runway.ai</p></figcaption></figure>

### <mark style="color:purple;">Key Capabilities of LLM Serving Engines</mark>

* Memory management of key-value (KV) cache
  * Pre-emption mechanism to evict cache blocks when GPU memory is full, using techniques like all-or-nothing eviction of related sequence blocks
  * Reservation strategy to pre-allocate GPU memory for KV cache to avoid eviction
* Memory and model optimisations
  * Techniques to reduce memory footprint and accelerate inference
* Batching of requests
  * Combining multiple requests into batches to improve throughput
* Language model-specific optimisations
  * Custom kernels and compilation for transformer models
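The all-or-nothing eviction idea from the first bullet can be sketched with a toy block pool. The class below, including its FIFO choice of eviction victim, is an illustrative assumption, not an actual engine's implementation:

```python
class KVCachePool:
    """Toy pool of fixed-size KV-cache blocks with all-or-nothing eviction:
    when memory runs out, ALL blocks of a victim sequence are freed together,
    so no sequence is left with a partial (unusable) cache."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # free physical block ids
        self.seq_blocks = {}                 # seq_id -> list of block ids

    def allocate(self, seq_id, n):
        while len(self.free) < n and self.seq_blocks:
            # Pre-emption: evict the oldest sequence's blocks in full (FIFO
            # victim choice is an arbitrary simplification for this sketch).
            victim = next(iter(self.seq_blocks))
            self.free.extend(self.seq_blocks.pop(victim))
        if len(self.free) < n:
            raise MemoryError("pool too small for this request")
        blocks = [self.free.pop() for _ in range(n)]
        self.seq_blocks[seq_id] = blocks
        return blocks

pool = KVCachePool(4)
pool.allocate("seq-a", 3)
pool.allocate("seq-b", 3)  # not enough free blocks: seq-a is evicted in full
```

A real engine would also track reference counts and support re-computation or swapping of evicted sequences; this sketch only shows the eviction granularity.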

### <mark style="color:purple;">Popular Open-Source Frameworks</mark>

#### <mark style="color:green;">NVIDIA TensorRT-LLM (TRT-LLM)</mark>

* Accelerates inference on NVIDIA GPUs by wrapping TensorRT compiler and FasterTransformer kernels
* Supports tensor parallelism across multiple GPUs and servers
* Provides optimised versions of popular LLMs like GPT and Llama
* Supports advanced features like in-flight batching and PagedAttention

#### <mark style="color:green;">vLLM</mark>

* High-performance inference library emphasising throughput and memory efficiency
* Uses PagedAttention to optimise memory and support larger batch sizes
* Includes features like continuous batching, GPU parallelism, streaming output
* Provides Python API for offline inference and launching API servers
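The block-table indirection behind PagedAttention (logical cache positions mapped to arbitrary physical blocks, much like virtual-memory pages) can be sketched in a few lines. The block size and table contents here are made up for illustration and are not vLLM's actual values:

```python
BLOCK_SIZE = 4  # tokens per physical KV-cache block (illustrative value)

def logical_to_physical(block_table, token_pos):
    """Map a token's position in a sequence to (physical_block, offset)."""
    logical_block = token_pos // BLOCK_SIZE
    offset = token_pos % BLOCK_SIZE
    return block_table[logical_block], offset

# A sequence's KV cache can live in non-contiguous physical blocks,
# which is what lets the engine pack memory tightly and batch more requests:
block_table = [7, 2, 9]  # logical block i -> physical block id (made up)
```

Because blocks need not be contiguous, memory fragmentation is greatly reduced compared with reserving one large contiguous buffer per sequence, which is what enables the larger batch sizes mentioned above.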

### <mark style="color:purple;">Libraries for Enhanced Inference</mark>

Several libraries and frameworks have been developed to optimize the inference process:

<mark style="color:green;">**ONNX Runtime**</mark>

An open-source project that accelerates machine learning model inference. It supports various hardware optimisations and is compatible with different platforms.

<mark style="color:green;">**TensorRT**</mark>

NVIDIA’s SDK for high-performance deep learning inference, which includes a deep learning inference optimizer and runtime for production environments.

<mark style="color:green;">**TorchScript (PyTorch)**</mark>

Allows PyTorch models to be run in a high-performance environment without dependency on the Python interpreter, making inference more efficient.

### <mark style="color:purple;">Techniques for Improved Inference</mark>

<mark style="color:green;">**Quantization**</mark>

Reduces the precision of model parameters (e.g., from float32 to float16 or int8), thereby speeding up inference and reducing memory usage.
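A minimal sketch of the idea, using symmetric per-tensor int8 quantization with made-up weights (real libraries use calibrated, often per-channel schemes):

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: store int8 values plus one float scale."""
    scale = max(abs(w) for w in weights) / 127.0  # map the largest |w| to 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)  # close to the original weights
```

The quantized tensor takes a quarter of the memory of float32, and integer arithmetic is typically faster on supporting hardware, at the cost of a small rounding error visible in `restored`.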

<mark style="color:green;">**Model Pruning**</mark>

Network pruning reduces model size by removing unimportant weights or connections while keeping the model architecture, and hence its representational capacity, intact.
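A sketch of magnitude-based pruning, the simplest variant of this idea (the weight values are illustrative):

```python
def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights.
    The tensor shape is unchanged, only its contents become sparse."""
    k = int(len(weights) * sparsity)
    # indices of the k smallest-magnitude weights
    drop = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in drop:
        pruned[i] = 0.0
    return pruned

w = [0.9, -0.05, 0.4, 0.001]
prune_by_magnitude(w, 0.5)  # -> [0.9, 0.0, 0.4, 0.0]
```

The speed benefit only materialises when the runtime can exploit the zeros, e.g. via sparse kernels or by physically removing pruned structures (structured pruning).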

<mark style="color:green;">**Batch Inference**</mark>

Processing multiple input data points simultaneously, increasing throughput and efficiency, especially in GPU environments.
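The throughput benefit comes from amortising fixed per-call overhead (kernel launches, dispatch, data transfer) across a batch. The cost model below uses made-up numbers purely to illustrate the effect:

```python
def run_model(batch):
    """Hypothetical cost (in ms) of one model call: a fixed launch overhead
    plus a small marginal cost per input in the batch."""
    OVERHEAD = 1.0   # paid once per call, regardless of batch size (made up)
    PER_ITEM = 0.1   # marginal cost per input (made up)
    return OVERHEAD + PER_ITEM * len(batch)

inputs = list(range(32))
one_by_one = sum(run_model([x]) for x in inputs)  # overhead paid 32 times (~35.2)
batched = run_model(inputs)                       # overhead paid once (~4.2)
```

On GPUs the effect is usually even stronger, because larger batches also raise arithmetic utilisation; serving engines therefore batch aggressively, trading a little per-request latency for much higher aggregate throughput.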

<mark style="color:green;">**Asynchronous Inference**</mark>

Improving real-time responses by decoupling the process of prediction generation from the main application flow.
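A minimal sketch using Python's `asyncio`, with a sleep standing in for a call to an inference server:

```python
import asyncio

async def infer(prompt):
    """Hypothetical non-blocking model call."""
    await asyncio.sleep(0.01)  # stands in for awaiting an inference server
    return f"response to {prompt!r}"

async def main():
    # The main application flow is not blocked while predictions are computed:
    # all requests are in flight concurrently, and results are gathered as a
    # group once they complete.
    tasks = [asyncio.create_task(infer(p)) for p in ("a", "b", "c")]
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
```

In a real deployment the same decoupling is often achieved with request queues or callbacks, so the application can keep serving users while inference runs elsewhere.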
