# Triton Inference Server - Introduction

Triton Inference Server is an open-source inference serving platform from NVIDIA for deploying and managing AI models efficiently across cloud, data center, edge, and embedded environments.

It supports a wide range of machine learning and deep learning frameworks, including TensorFlow, PyTorch, ONNX, OpenVINO, and others, making it versatile for different AI applications.

### <mark style="color:purple;">Key capabilities of Triton Inference Server include</mark>

<mark style="color:green;">**Multi-Framework Support:**</mark> Triton is compatible with numerous AI frameworks, allowing teams to deploy models regardless of the framework they were trained in.

<mark style="color:green;">**Cross-Platform Deployment:**</mark> It can be deployed across various platforms, including cloud, data centers, edge devices, and embedded systems, and supports NVIDIA GPUs, x86 and ARM CPUs, and AWS Inferentia.

<mark style="color:green;">**Optimised Performance:**</mark> Triton is tuned for several query types, including real-time, batched, ensemble, and audio/video streaming requests, so that available hardware is used efficiently.

<mark style="color:green;">**Concurrent Model Execution:**</mark> Triton can run multiple models, or multiple instances of the same model, at the same time on the same hardware, which raises throughput and keeps requests from queueing behind one another.

<mark style="color:green;">**Dynamic Batching:**</mark> Triton groups individual inference requests that arrive close together into a single batch on the server side, improving throughput and resource utilisation without requiring clients to batch their own requests.
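To make the idea of dynamic batching concrete, the following is a minimal conceptual sketch in plain Python. It is not Triton's implementation: the `Request` class and `form_batches` function are hypothetical, and the two parameters are only loosely modelled on the `max_batch_size` and `max_queue_delay_microseconds` settings in a Triton model configuration. The sketch assumes a batch is dispatched either when it is full or when a new request arrives too long after the first request in the batch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """Hypothetical inference request with an arrival timestamp."""
    request_id: int
    arrival_us: int  # arrival time in microseconds


def form_batches(
    requests: List[Request],
    max_batch_size: int,
    max_queue_delay_us: int,
) -> List[List[int]]:
    """Group requests into batches and return the request IDs per batch.

    Simplified rule (an assumption, not Triton's scheduler):
    - flush the current batch when it reaches max_batch_size, or
    - flush it when the next request arrives more than
      max_queue_delay_us after the first request in the batch.
    """
    batches: List[List[int]] = []
    current: List[Request] = []

    for req in sorted(requests, key=lambda r: r.arrival_us):
        # The oldest queued request has waited too long: dispatch first.
        if current and req.arrival_us - current[0].arrival_us > max_queue_delay_us:
            batches.append([r.request_id for r in current])
            current = []

        current.append(req)

        # The batch is full: dispatch immediately.
        if len(current) == max_batch_size:
            batches.append([r.request_id for r in current])
            current = []

    # Dispatch whatever is left at the end of the stream.
    if current:
        batches.append([r.request_id for r in current])
    return batches
```

With five requests arriving at 0, 10, 20, 500, and 510 microseconds, a batch size limit of 4, and a delay limit of 100 microseconds, the first three requests share one batch and the last two share another. A larger delay limit trades a little latency for fuller batches.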

<mark style="color:green;">**Sequence Batching and State Management:**</mark> Triton manages stateful models effectively with sequence batching and implicit state management, crucial for applications like time-series analysis.
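The reason stateful models need special handling can be illustrated with a small sketch. The `RunningSumModel` class below is hypothetical and is not part of Triton; it only shows the general pattern in which every request carries a sequence identifier plus start and end flags, so that interleaved requests from different sequences each see their own state.

```python
from typing import Dict


class RunningSumModel:
    """Hypothetical stateful model that keeps a running sum per sequence."""

    def __init__(self) -> None:
        # Maps a sequence ID to that sequence's accumulated state.
        self._state: Dict[int, float] = {}

    def infer(
        self,
        sequence_id: int,
        value: float,
        start: bool = False,
        end: bool = False,
    ) -> float:
        if start:
            # A start flag creates fresh state for the sequence.
            self._state[sequence_id] = 0.0
        if sequence_id not in self._state:
            raise KeyError(f"sequence {sequence_id} was never started")

        self._state[sequence_id] += value
        result = self._state[sequence_id]

        if end:
            # An end flag releases the state held for the sequence.
            del self._state[sequence_id]
        return result
```

Requests for sequences 1 and 2 can arrive interleaved, and each sequence still accumulates only its own values. In a real deployment the server, rather than the model code, is responsible for routing every request in a sequence to the same model instance.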

<mark style="color:green;">**Custom Backends and Pre/Post-Processing:**</mark> The Backend API allows the addition of custom backends and pre/post-processing operations, providing flexibility to tailor the server to specific needs.

<mark style="color:green;">**Model Pipelines:**</mark> With ensembling and Business Logic Scripting (BLS), users can create complex model pipelines for advanced inference scenarios.

<mark style="color:green;">**Multiple Inference Protocols:**</mark> Triton supports HTTP/REST and gRPC protocols, making it accessible from various client applications.
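As an illustration of the HTTP/REST protocol, the sketch below builds the URL and JSON body for an inference request using only the Python standard library. It assumes Triton's default HTTP port of 8000 and the `/v2/models/<model>/infer` endpoint. The helper functions, the model name `my_model`, and the input name `INPUT0` are hypothetical, and real values must match the deployed model's configuration. Nothing is sent over the network here.

```python
import json
from typing import Any, Dict, List


def infer_url(host: str, model_name: str, http_port: int = 8000) -> str:
    """Return the inference URL for a model (8000 is the default HTTP port)."""
    return f"http://{host}:{http_port}/v2/models/{model_name}/infer"


def build_infer_request(
    input_name: str,
    shape: List[int],
    datatype: str,
    data: List[Any],
) -> Dict[str, Any]:
    """Build a single-input request body.

    The input name, shape, and datatype are assumptions in this sketch
    and have to agree with what the served model expects.
    """
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": shape,
                "datatype": datatype,
                "data": data,
            }
        ]
    }


# Hypothetical model and tensor names, for illustration only.
url = infer_url("localhost", "my_model")
body = json.dumps(build_infer_request("INPUT0", [1, 4], "FP32", [0.1, 0.2, 0.3, 0.4]))
```

The resulting `body` string could then be sent to `url` with an HTTP POST using any HTTP client. For production use, the official Triton client libraries wrap both the HTTP and gRPC protocols.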

<mark style="color:green;">**Monitoring and Metrics:**</mark> It provides detailed metrics on GPU utilisation, server throughput, and latency, aiding in performance monitoring and optimisation.
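Triton exposes its metrics in Prometheus text format, by default on port 8002 at `/metrics`. The sketch below shows one way such text could be parsed with the standard library. The `parse_metrics` helper is hypothetical, the sample values are invented for illustration, and the parser assumes each sample line ends with its value (it does not handle the optional trailing timestamp that the Prometheus format allows).

```python
from typing import Dict

# Illustrative sample only: the counts and the model name are made up.
SAMPLE_METRICS = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="my_model",version="1"} 42
nv_inference_request_failure{model="my_model",version="1"} 3
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each metric (name plus labels) to its numeric value."""
    metrics: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token on the line.
        key, _, value = line.rpartition(" ")
        metrics[key] = float(value)
    return metrics


parsed = parse_metrics(SAMPLE_METRICS)
```

In practice these metrics are usually scraped by a Prometheus server instead of being parsed by hand, but a small parser like this can be handy for quick health checks or tests.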

Triton Inference Server is part of NVIDIA AI Enterprise, offering a robust platform for developing and deploying AI models at scale.

{% embed url="<https://www.youtube.com/watch?v=1kOaYiNVgFs>" %}

