# Maximising GPU Utilisation with Kubernetes and NVIDIA GPU Operator

### <mark style="color:purple;">Introduction</mark>

GPUs have become a crucial resource for training neural networks and serving them for inference.

However, GPUs are expensive and often scarce, making it essential to maximise their utilisation. Kubernetes, a popular container orchestration platform, can help manage and share GPU resources efficiently across multiple containers.

{% embed url="https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html" %}

### <mark style="color:purple;">Kubernetes and GPU Resource Management</mark>

Kubernetes manages GPUs as extended resources that can be requested by and allocated to containers.

By installing the NVIDIA GPU Operator, you can configure and manage GPU resources across your Kubernetes cluster: the operator automates deployment of the NVIDIA driver, container toolkit, device plugin, and monitoring components on GPU nodes.

The GPU Operator includes a device plugin that discovers the available GPUs on each node and exposes them to the scheduler as allocatable `nvidia.com/gpu` resources.
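For example, on a node with eight GPUs, the advertised capacity appears in the node's status (an illustrative excerpt; the exact values depend on your hardware):

```yaml
# Illustrative excerpt of `kubectl get node <gpu-node> -o yaml`
status:
  allocatable:
    cpu: "64"
    memory: 512Gi
    nvidia.com/gpu: "8"
```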

To consume GPU resources in a Kubernetes pod, specify the GPU in the container's resource limits (for extended resources such as GPUs, the request is implicitly set equal to the limit). For example:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```

This pod will request one GPU. If your cluster has heterogeneous GPU nodes, you can also use a node selector to target a particular GPU type, as in the sketch below.
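A minimal complete pod sketch, assuming the GPU Operator's GPU Feature Discovery component is running (it sets node labels such as `nvidia.com/gpu.product`; the image and label value here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: OnFailure
  nodeSelector:
    # GPU Feature Discovery label; the value depends on your hardware
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # one whole GPU
```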

<mark style="color:green;">**Sharing GPUs with Multi-Instance GPU (MIG)**</mark>

NVIDIA's <mark style="color:blue;">**Multi-Instance GPU (MIG)**</mark> feature allows you to partition a physical GPU into multiple logical instances. Each instance has its own dedicated memory and compute resources, providing isolation and predictable performance.

To enable MIG in Kubernetes, you <mark style="color:yellow;">**configure the GPU Operator's MIG manager with MIG profiles**</mark>.

These profiles define how each GPU should be partitioned. For example, this profile splits a GPU into seven `1g.5gb` instances (the available instance sizes depend on the GPU model; `1g.5gb` applies to the A100 40GB):

```yaml
version: v1
mig-configs:
  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7
```

This configuration creates seven isolated GPU instances from a single physical GPU. You apply it to a node by setting the node label `nvidia.com/mig.config=all-1g.5gb`, and the MIG manager reconfigures the GPUs accordingly.

Kubernetes then exposes the instances as allocatable resources, either as ordinary `nvidia.com/gpu` resources (the `single` MIG strategy) or as profile-named resources such as `nvidia.com/mig-1g.5gb` (the `mixed` strategy), allowing multiple pods to share the same physical GPU with hardware-level isolation.
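Under the `mixed` strategy, a pod then requests a specific MIG slice rather than a whole GPU. A minimal sketch:

```yaml
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1  # one 1g.5gb MIG instance
```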

<mark style="color:green;">**Time-Slicing GPUs**</mark>

Another approach to sharing GPUs is time-slicing.

With time-slicing, the device plugin advertises each physical GPU as several schedulable replicas. The pods that land on the same GPU take turns executing, with the GPU switching between their contexts. Unlike MIG, time-slicing provides no memory or fault isolation between the sharing workloads.

To enable time-slicing, you define a ConfigMap with the desired time-slicing configuration and point the GPU Operator's device plugin at it.

For example:

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

This configuration advertises each physical GPU as four `nvidia.com/gpu` resources, so up to four pods can share one GPU.

Kubernetes handles only the scheduling onto those replicas; the context switching between the sharing workloads happens on the GPU itself.
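Concretely, the configuration is delivered as a ConfigMap that the GPU Operator's ClusterPolicy references (the names `time-slicing-config` and `any` below are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # illustrative name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

You then point the device plugin at it via the ClusterPolicy, setting `devicePlugin.config.name=time-slicing-config` and `devicePlugin.config.default=any`.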

#### <mark style="color:green;">Optimising Deep Learning Workloads</mark>

In addition to GPU sharing techniques, there are several optimisations you can apply to your deep learning workloads to further improve performance and resource utilisation:

1. Low Precision Arithmetic: Using lower precision, such as half-precision (FP16), can significantly reduce memory footprint and improve computation speed without sacrificing much accuracy.
2. Attention Slicing and Flash Attention: For transformer-based models, attention slicing computes attention in smaller chunks to cut peak memory, while FlashAttention fuses the computation into a more efficient kernel that avoids materialising the full attention matrix, together reducing memory usage and enabling longer sequence lengths.
3. Speculative Decoding: For language models, speculative decoding uses a small draft model to propose several tokens that the large model then verifies in a single forward pass, improving inference latency and throughput.
4. Model Distillation: Distilling knowledge from a large teacher model to a smaller student model can trade off some accuracy for improved performance and reduced resource requirements.

### <mark style="color:purple;">Conclusion</mark>

Maximising GPU utilisation is crucial for cost-effective and efficient AI workloads.

By leveraging Kubernetes and the NVIDIA GPU Operator, you can easily manage and share GPU resources across containers.

Techniques like Multi-Instance GPU (MIG) and time-slicing allow multiple pods to utilise the same physical GPU, while optimisations such as low-precision arithmetic, attention slicing, speculative decoding, and model distillation can further enhance the performance of deep learning workloads.

By combining these approaches, you can significantly improve GPU utilisation and make the most of your valuable GPU resources.

