# Why is inference important?

Inference is the process of generating predictions or outputs from a trained model given new input data.

While training models to understand and generate human language is a significant part of developing LLMs, the *<mark style="color:yellow;">**true utility of these models is realised during inference**</mark>*, when they are applied to real-world tasks.

<mark style="color:blue;">**Optimising inference**</mark> is critical in the deployment of neural language models because it is where:

1. The user experience happens: faster inference means a more responsive product
2. Costs are incurred: cheaper inference lowers the cost of every request served

Inference optimisation aims to reduce the time to first token (how quickly initial output appears) and the total time to generate a full response, and to increase the number of tokens produced per second across all concurrent requests (throughput).
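These three metrics can be measured directly from a token stream. The sketch below uses a fake generator with `time.sleep` as a stand-in for a real model; the function names and latencies are illustrative, not any library's API:

```python
import time

def measure_stream(token_stream):
    """Measure time-to-first-token (TTFT), total latency, and decode
    throughput for any iterator that yields tokens one at a time."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        count += 1
    end = time.perf_counter()
    ttft = first_token_at - start
    total = end - start
    throughput = count / total  # tokens per second for this request
    return ttft, total, throughput

def fake_model(n_tokens=5, prefill=0.02, per_token=0.01):
    """Toy stand-in for an LLM: a prefill delay, then one delay per token."""
    time.sleep(prefill)           # prompt processing (prefill) phase
    for _ in range(n_tokens):
        time.sleep(per_token)     # one decode step
        yield "tok"

ttft, total, tps = measure_stream(fake_model())
```

The same pattern applies to real streaming APIs: TTFT is dominated by the prefill phase, while throughput is governed by the per-step decode cost.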

### <mark style="color:purple;">Serving Engines</mark>

LLM serving engines provide the critical functionality and optimisations needed to deploy large language models in production.

Capabilities like memory-efficient attention, request batching, and model-specific compilation make it possible to achieve high throughput and low latency.

Frameworks such as TensorRT-LLM and vLLM package these techniques into easy-to-use APIs to accelerate inference on GPUs.

Understanding the capabilities of serving engines is key for practitioners looking to put generative AI into real-world use.

<figure><img src="/files/lP6Ze9bvPG6gX0Rt0Yab" alt=""><figcaption><p>The Architecture of servers and engines - Source: Runway.ai</p></figcaption></figure>

### <mark style="color:purple;">Key Capabilities of LLM Serving Engines</mark>

* Memory management of key-value (KV) cache
  * Pre-emption mechanism to evict cache blocks when GPU memory is full, using techniques like all-or-nothing eviction of related sequence blocks
  * Reservation strategy to pre-allocate GPU memory for KV cache to avoid eviction
* Memory and model optimisations
  * Techniques to reduce memory footprint and accelerate inference
* Batching of requests
  * Combining multiple requests into batches to improve throughput
* Language model-specific optimizations
  * Custom kernels and compilation for transformer models
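The paged KV-cache management described above can be illustrated with a toy allocator. This is a pure-Python sketch of the idea (fixed-size blocks, all-or-nothing eviction of a victim sequence), not the actual implementation of any serving engine; all names are made up:

```python
class PagedKVCache:
    """Toy paged KV-cache allocator. GPU memory is split into fixed-size
    blocks; a sequence grows block by block, and when the pool is
    exhausted, every block of one victim sequence is freed at once
    (all-or-nothing eviction)."""

    def __init__(self, num_blocks, block_size):
        self.free = list(range(num_blocks))   # ids of free blocks
        self.block_size = block_size
        self.blocks = {}                      # seq_id -> allocated block ids
        self.lengths = {}                     # seq_id -> tokens cached so far

    def append_token(self, seq_id, evict_order=()):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block is full
            if not self.free:
                self._evict(evict_order)
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.blocks.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def _evict(self, evict_order):
        for victim in evict_order:            # free ALL blocks of one victim
            if victim in self.blocks:
                self.free.extend(self.blocks.pop(victim))
                del self.lengths[victim]
                return

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(4):
    cache.append_token("a")                   # seq "a" fills 2 blocks
for _ in range(4):
    cache.append_token("b")                   # pool is now full
cache.append_token("c", evict_order=["a"])    # pre-empts "a" entirely
```

Because blocks are allocated on demand rather than reserved for the maximum sequence length, far more sequences fit in the same GPU memory.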

### <mark style="color:purple;">Popular Open-Source Frameworks</mark>

#### <mark style="color:green;">NVIDIA TensorRT-LLM (TRT-LLM)</mark>

* Accelerates inference on NVIDIA GPUs by wrapping TensorRT compiler and FasterTransformer kernels
* Supports tensor parallelism across multiple GPUs and servers
* Provides optimized versions of popular LLMs like GPT and LLaMA
* Supports advanced features like in-flight batching and PagedAttention

#### <mark style="color:green;">vLLM</mark>

* High-performance inference library emphasizing throughput and memory efficiency
* Uses PagedAttention to optimise memory and support larger batch sizes
* Includes features like continuous batching, GPU parallelism, streaming output
* Provides Python API for offline inference and launching API servers
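vLLM's continuous batching can be illustrated with a toy scheduler. This is a simulation of the scheduling idea only, not vLLM's code: finished sequences leave the batch at every decode step and waiting requests join immediately, instead of the whole batch draining first:

```python
def continuous_batching(requests, max_batch=2):
    """Simulate continuous (in-flight) batching.
    `requests` maps a request id to the number of tokens it needs."""
    waiting = list(requests)
    running = {}            # request id -> tokens still to generate
    timeline = []           # batch composition at each decode step
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < max_batch:
            rid = waiting.pop(0)
            running[rid] = requests[rid]
        timeline.append(sorted(running))
        # One decode step for every running request.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # slot is free for the next step
    return timeline

timeline = continuous_batching({"a": 3, "b": 1, "c": 2}, max_batch=2)
```

With static batching, request `c` would wait until both `a` and `b` finished; here it joins as soon as `b` completes, keeping the batch full.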

### <mark style="color:purple;">Libraries for Enhanced Inference</mark>

Several libraries and frameworks have been developed to optimize the inference process:

<mark style="color:green;">**ONNX Runtime**</mark>

An open-source project that accelerates machine learning model inference. It supports various hardware optimisations and is compatible with different platforms.

<mark style="color:green;">**TensorRT**</mark>

NVIDIA’s SDK for high-performance deep learning inference, which includes a deep learning inference optimizer and runtime for production environments.

<mark style="color:green;">**TorchScript (PyTorch)**</mark>

Allows PyTorch models to be run in a high-performance environment without dependency on the Python interpreter, making inference more efficient.

### <mark style="color:purple;">Techniques for Improved Inference</mark>

<mark style="color:green;">**Quantization**</mark>

Reduces the precision of model parameters (e.g., from float32 to float16 or int8), thereby speeding up inference and reducing memory usage.
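A minimal sketch of symmetric per-tensor int8 quantization, assuming a single scale factor for the whole tensor (production quantizers typically use per-channel scales and calibration data):

```python
import numpy as np

def quantize_int8(w):
    """Map float weights onto the int8 range [-127, 127]
    with one per-tensor scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.02, -1.5, 0.7, 3.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)   # approximation of w, within one scale step
```

The int8 tensor occupies a quarter of the float32 memory, and integer matrix multiplies are substantially faster on hardware with int8 support.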

<mark style="color:green;">**Model Pruning**</mark>

Network pruning reduces model size by removing unimportant weights or connections while largely preserving the model's capacity.
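Unstructured magnitude pruning, the simplest variant, zeroes out the smallest-magnitude fraction of weights. A sketch only; real pipelines usually prune gradually and fine-tune afterwards:

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Zero the smallest-magnitude `sparsity` fraction of weights."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # Threshold = magnitude of the k-th smallest weight.
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    mask = np.abs(w) > threshold
    return w * mask

w = np.array([[0.1, -2.0], [0.05, 1.5]])
pruned = magnitude_prune(w, sparsity=0.5)  # half the weights become zero
```

The zeroed weights only speed up inference if the runtime exploits sparsity (via sparse kernels or structured pruning patterns); otherwise the benefit is purely smaller storage.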

<mark style="color:green;">**Batch Inference**</mark>

Processing multiple input data points simultaneously, increasing throughput and efficiency, especially in GPU environments.
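The throughput gain comes from replacing many small operations with one large one. A toy linear "model" makes the equivalence concrete (illustrative shapes, not a real network):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))        # toy linear "model"
inputs = rng.standard_normal((32, 16))  # 32 independent requests

# One request at a time: 32 separate matrix-vector products.
single = np.stack([x @ W for x in inputs])

# Batched: one matrix-matrix product over all 32 requests, which
# amortises per-call overhead and keeps the hardware saturated.
batched = inputs @ W
```

The results are identical; on a GPU, the batched version is dramatically faster because the hardware is utilised in a single large kernel launch.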

<mark style="color:green;">**Asynchronous Inference**</mark>

Improving real-time responses by decoupling the process of prediction generation from the main application flow.
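A minimal asyncio sketch of the idea, with `asyncio.sleep` standing in for a model call (the function names are made up for illustration):

```python
import asyncio

async def predict(request_id, latency=0.05):
    """Stand-in for a model call; sleeps instead of computing."""
    await asyncio.sleep(latency)
    return f"result-{request_id}"

async def main():
    # All four requests are in flight concurrently, so total wall time
    # is roughly one latency, not latency * number of requests.
    return await asyncio.gather(*(predict(i) for i in range(4)))

results = asyncio.run(main())
```

The caller is not blocked while predictions are generated, which is what keeps an application responsive while slow model calls complete in the background.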

