Setting Up an LLM Server on Your Home Kubernetes Cluster#

The new Llama 3.1 is out, and the fact that a smaller quantized version of the model can be easily deployed in a home environment is quite exciting. If you have some spare computing resources such as a GPU, I highly recommend deploying an LLM server on your PC or home server. On the one hand, you no longer have to pay a subscription fee for a limited-usage service if you primarily use it for less complex tasks. On the other hand, smaller models offer better energy efficiency, privacy, stable availability, and the potential for custom integration with other local services like smart home assistants.

In this blog post, I will go through setting up an LLM server on my home Kubernetes cluster. I will be using Ollama in a container to host the LLM service and, optionally, deploy a client web interface with Open WebUI. While the UI is not a requirement for my own needs, it is nice to have since you can easily share the service with others on the same network, so I will be deploying an Open WebUI instance alongside as well. My Kubernetes cluster (version 1.30) runs on multiple Ubuntu 24.04 Server nodes.

A quick reminder: for those planning to use LLMs on a PC or seeking a simpler installation on a server with Docker, there are more straightforward options available that don’t require a complex Kubernetes configuration. Ollama provides native installers for macOS, Linux, and Windows. Alternatively, spawning a Docker container directly is another option for local deployment. It’s also worth noting that this guide demonstrates the use of CUDA-capable GPUs from NVIDIA. Setting up GPUs from other vendors may require different steps, which won’t be covered here.

I will assume you already have a Kubernetes cluster; if not, you may check out my other post on setting one up.

In this guide, we’ll cover the following tasks to set up an LLM server on your home Kubernetes cluster:

  1. Configuring a worker node for GPU support

  2. Configuring your Kubernetes cluster to use the GPUs

  3. Deploying Ollama on the cluster

  4. Optionally, deploying Open WebUI

Setting Up a GPU-Supported Node (NVIDIA)#

The first step is to ensure that your worker node has its GPU properly configured. While Ollama doesn’t strictly require a GPU to run its model hosting service, having one is crucial for efficient LLM inference. Without a GPU, processing times can become painfully long. In this guide, we’ll focus on setting up NVIDIA GPUs, as they’re the most common choice for deep learning tasks.

0. What do we need to install from NVIDIA?

Please check that you have a CUDA-capable GPU and that the driver supports your operating system. For our purpose, we’ll follow the recommended approach of installing only the NVIDIA drivers.

1. Check If the GPU is Recognized by Your Host Node

You need to ensure that the GPU is recognized by your host node. This step confirms that your hardware setup is correct.

Command: check if the GPU is recognized#
lspci | grep -i nvidia

If your GPU is properly connected, you should see at least one line of output displaying your GPU’s name and revision, even if you have not installed a driver. If you don’t see any output, you may need to check your hardware connection, or your GPU passthrough settings if you’re using a virtual machine.
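
For reference, on my node the output looked roughly like this (the controller name and revision will differ for your GPU):

example lspci output#
01:00.0 VGA compatible controller: NVIDIA Corporation GA106 [GeForce RTX 3060] (rev a1)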

2*. Clean Out All NVIDIA-Related Packages

To avoid potential package conflicts, it’s a good idea to remove any existing NVIDIA packages before installing the driver. This step assumes your GPU will be dedicated to the Kubernetes cluster only.

Command: removing CUDA toolkit and drivers (Ubuntu and Debian only)#
sudo apt-get --purge remove "*cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*"

sudo apt-get --purge remove "*nvidia*" "libxnvctrl*"

sudo apt-get autoremove

3. Install the NVIDIA Drivers (Canonical's Ubuntu-Drivers Tool or the APT Package Manager)

The recommended way of installing NVIDIA drivers on an Ubuntu system is to use the Ubuntu-Drivers tool, especially when your system uses Secure Boot.

Command: install the driver through Ubuntu-Drivers Tool#
sudo ubuntu-drivers list --gpgpu

sudo ubuntu-drivers install --gpgpu nvidia:535-server

nvidia:535-server refers to the server variant of the 535 driver branch. Replace it with the version you want to install; the available driver versions are listed by the first command.

This should work most of the time. Unfortunately for me, I seemed to be caught in a temporary compatibility issue and had to install the driver differently using the APT package manager. You may use NVIDIA's download builder to find the correct driver version for your system.

Command: install the drivers using APT#
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y nvidia-driver-535-server
sudo apt-get install -y cuda-drivers-535
sudo reboot

4. Final Check for Driver Status

After installation and reboot, verify that the driver is properly installed.

Command: check if the driver is installed properly#
dpkg -l | grep nvidia-driver

nvidia-smi

dpkg -l | grep nvidia-driver checks whether a driver package is installed. The nvidia-smi command should produce output similar to this:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   44C    P8              17W / 170W |      3MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

This output confirms that your NVIDIA driver is installed and functioning correctly.

Installing NVIDIA Container Toolkit#

Having the driver installed only allows us to use GPUs natively on the host system. We still need to install the NVIDIA Container Toolkit, which enables running GPU-enabled containers. We can install it using the following steps.

Command: Add NVIDIA repository#
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

This first adds the NVIDIA repository to your system and sets up the GPG key for package verification.

Command: install NVIDIA Container Toolkit#
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

We then update the package index and install the NVIDIA Container Toolkit.
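
As a quick sanity check, you can confirm the toolkit CLI is available (the reported version will vary):

Command: verify the toolkit installation#
nvidia-ctk --version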

Command: configure runtime#
sudo nvidia-ctk runtime configure --runtime=containerd

sudo nano /etc/containerd/config.toml
/etc/containerd/config.toml#
...
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

We configure containerd to use the NVIDIA runtime by default by editing its configuration file and setting default_runtime_name to "nvidia".

Command: restart containerd#
sudo systemctl restart containerd

We have to restart the containerd service to apply the changes.

Installing NVIDIA GPU Operator in Kubernetes#

While many would choose to install the NVIDIA device plugin for basic GPU usage in a Kubernetes environment, the NVIDIA GPU Operator provides a more comprehensive solution for using GPUs in Kubernetes. The GPU Operator is deployed to your Kubernetes cluster to automate the management of all NVIDIA software components needed to provision GPUs. In addition, it allows advanced configuration such as time-slicing, which lets workloads scheduled on oversubscribed GPUs interleave with one another.

In this blog post, we’ll go through the process of installing the NVIDIA GPU Operator in a Kubernetes cluster.

1*. Installing Helm

Helm is a package manager for Kubernetes that simplifies the deployment and management of applications by using charts to pre-configure Kubernetes resources. As NVIDIA’s official guide uses Helm to install the GPU Operator, we will follow and use Helm as well.

If you haven’t installed Helm yet, the following is a snippet from the official Helm installation documentation:

Command: helm installation (on the controller node)#
curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
sudo apt-get install apt-transport-https --yes
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install helm

This series of commands adds the Helm repository, installs necessary dependencies, and finally installs Helm itself.

2. Installing NVIDIA GPU Operator

Now that we have Helm installed, we can use it to install NVIDIA GPU Operator.
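
If you have not yet added the NVIDIA Helm repository, add it first; this is the repository referenced in NVIDIA's official GPU Operator guide:

Command: add the NVIDIA Helm repository#
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
   && helm repo update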

Command: install GPU operator#
# install GPU operator:
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
      nvidia/gpu-operator \
      --set driver.enabled=false \
      --set toolkit.enabled=false \
      --timeout 15m

As we have already installed the NVIDIA GPU driver and NVIDIA Container Toolkit manually, we should disable their installation in the GPU Operator setup by setting driver.enabled and toolkit.enabled to false.

If you have not installed them, you may set both to true, which automatically installs the driver and toolkit and configures the runtime for you. The reason I still went through those steps earlier is that, no matter which method you use, the driver and toolkit need to be installed either explicitly or implicitly. For those who do not use the NVIDIA GPU Operator, you may install an alternative such as the standalone NVIDIA device plugin instead.

Command: check the statuses of the GPU operator pods#
kubectl get all -n gpu-operator

You should expect all pods to show Running or Completed as their status.

3. Label the GPU Nodes for Affinity

The final Kubernetes configuration is to label the nodes that have GPUs. This helps in scheduling GPU workloads to the correct nodes. Here’s how you can do it.

Command: label nodes with GPU#
kubectl label nodes k8s-worker-2 accelerator=nvidia-3060-12GB
kubectl get nodes --show-labels

Running a GPU Enabled Container (For Testing)#

Although this section is not directly related to setting up an LLM server, we can quickly validate the GPU setup in Kubernetes by running a simple container that requests a GPU as a resource. This, along with a few checks we can perform, will help us confirm GPU utilization and ensure our cluster is properly configured for GPU workloads.

1. Check If Your GPU Is Seen as a Kubernetes Resource

First, we need to verify if Kubernetes can detect your GPU. Run the following command.

Command: check if the GPU is recognized by Kubernetes#
kubectl describe node k8s-worker-2
# or
kubectl describe node k8s-worker-2 | grep nvidia.com/gpu
check if the GPU is recognized by Kubernetes#
...
  Hostname:    k8s-worker-2
Capacity:
  cpu:                32
  ephemeral-storage:  306453440Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             74096288Ki
  nvidia.com/gpu:     1
  pods:               110
...

You should expect to see nvidia.com/gpu: 1 in the output (it appears under both Capacity and Allocatable). If you have multiple GPUs, it will show nvidia.com/gpu: 2 or higher. If you don’t see the entry at all, your GPU isn’t properly configured or recognized by Kubernetes.

2. Deploy a Simple Container with GPU

Now, let’s deploy a basic container with a GPU that only runs the nvidia-smi command. This will help us verify if the GPU is properly recognized within the container.

GPU-testing-container.yaml#
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: accelerator
            operator: In
            values:
            - nvidia-3060-12GB
  restartPolicy: Never   # nvidia-smi exits immediately; avoid the default restart loop
  containers:
  - name: gpu-test
    image: nvidia/cuda:12.1.0-base-ubi8
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

It is common to encounter incompatibility issues between the host driver and the CUDA version in the container, e.g., requirement error: unsatisfied condition: cuda>=12.5, please update your driver to a newer version, or use an earlier cuda container: unknown. The NVIDIA driver reports the maximum CUDA version it supports (upper right in the nvidia-smi table) and can run applications built with CUDA Toolkits up to that version. Ensure you pick a compatible CUDA version in the container image tag, or update your driver to a newer version.

If everything goes right, you should see the same table produced by nvidia-smi in the container log. To check the log, run kubectl logs <pod-name> (add -c <container-name> if the pod has more than one container).
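
For reference, a minimal sequence to run the test, assuming the manifest above is saved as GPU-testing-container.yaml:

Command: run the test pod and check its log#
kubectl apply -f GPU-testing-container.yaml
kubectl logs gpu-test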

This should conclude the basic setup of using GPUs in Kubernetes.

GPU Sharing with NVIDIA GPU Operator#

So far, each of your NVIDIA GPUs can only be assigned to a single container instance, which is a significant limitation. This setup means your GPU will sit idle most of the time and be dedicated to a single service. As we learned earlier, the NVIDIA GPU Operator can be used to enable GPU sharing, allowing multiple workloads to share a single GPU. We will configure the GPU Operator to enable GPU sharing with time slicing.

Setting up GPU sharing is rather simple: you add a ConfigMap to the namespace used by the GPU Operator and tell the device plugin to use it. The device plugin then configures GPU time slicing for you.

Here’s an example configuration for GPU time-slicing.

time-slicing-config.yaml#
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-all
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

By applying this configuration, you’re essentially telling the GPU Operator to divide each GPU into 4 virtual GPUs (i.e., advertise the nvidia.com/gpu resource with 4 replicas). This allows up to 4 containers to share a single physical GPU. More options for the time-slicing feature can be found in the official guide.
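
Depending on your GPU Operator version, you may also need to create the ConfigMap in the gpu-operator namespace and explicitly point the device plugin at it; the following commands are adapted from NVIDIA's time-slicing guide (adjust the ConfigMap name if you changed it):

Command: apply the time-slicing config#
kubectl create -n gpu-operator -f time-slicing-config.yaml
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-all", "default": "any"}}}}'

Once the device plugin pods restart, kubectl describe node for the GPU node should report nvidia.com/gpu: 4.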

Setting Up Ollama on Kubernetes#

Ollama is a very user-friendly tool for running LLMs locally. The simplicity of pulling and running popular models makes it accessible to everyone. While Ollama may not yet have the robustness and scalability for large-scale production serving, it is a perfect choice for home use. Here we will go through setting up Ollama on Kubernetes.

1. Create a new Namespace

First, let’s create a dedicated namespace for our Ollama deployment.

namespace.yaml#
apiVersion: "v1"
kind: Namespace
metadata:
  name: ns-ollama

2. Create Deployment

Now, let’s set up the deployment for Ollama.

deployment.yaml#
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
  namespace: ns-ollama
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: In
                values:
                - nvidia-3060-12GB
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        command: ["/bin/sh"]
        args: ["-c", "ollama serve"]  # GPU access comes from the nvidia.com/gpu resource limit, not a --gpus flag
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 16
            memory: "48Gi"
          requests:
            cpu: 8
            memory: "24Gi"
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ns-ollama
spec:
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434

nodeAffinity ensures that the Ollama pods are scheduled on nodes labeled with the NVIDIA GeForce RTX 3060 GPU. To allocate a GPU to each pod, you also need to specify nvidia.com/gpu: 1 in the resources limits.

It seems the newer Ollama release from last month (2024-07) supports parallel requests, so running multiple instances may no longer be necessary just to serve concurrent users. But we will run two replicas anyway for this guide, since we also want to verify that GPU time slicing works.

According to community discussion and Meta's requirements recommendation, you should expect to allocate at least 8 CPU cores, 16 GB of RAM, and a decent GPU with 8 GB+ of VRAM for a decent experience with a quantized 7B model. I might be running multiple models, so I allocate a higher RAM limit to each pod.

For simplicity, we will use an ephemeral volume (emptyDir) in this guide. However, I recommend using a persistent volume for your data, as models can be large and there is user data to keep as well. A shared volume is also handy here, since you don't want to download the same models in every pod.
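
As a rough sketch, a shared PersistentVolumeClaim could replace the emptyDir volume; the claim name and storage size below are illustrative, and ReadWriteMany requires a storage backend that supports it (for example NFS):

ollama-pvc.yaml (optional)#
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ns-ollama
spec:
  accessModes:
    - ReadWriteMany   # lets both replicas share the same model cache
  resources:
    requests:
      storage: 50Gi

You would then swap the emptyDir entry in the deployment for persistentVolumeClaim: claimName: ollama-models-pvc.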

check the pod status of the deployment and if the GPU is recognized in the container#
user@k8s-master-1:~$ kubectl get pods -n ns-ollama
NAME                                 READY   STATUS    RESTARTS   AGE
ollama-deployment-7c76bb65bb-bdpg2   1/1     Running   0          3d15h
ollama-deployment-7c76bb65bb-lp9wf   1/1     Running   0          3d15h
user@k8s-master-1:~$ kubectl exec -it ollama-deployment-7c76bb65bb-bdpg2 -n ns-ollama -- /bin/bash
root@ollama-deployment-7c76bb65bb-bdpg2:/# nvidia-smi
Thu Aug  1 06:56:21 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   45C    P8              17W / 170W |      3MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

As both pods are running, we can log into each container and run nvidia-smi manually. There we go: one GPU is now allocated to two pods.

3. Pulling Ollama Models

Models need to be pulled manually in Ollama before you can use them. There are two ways: one is through the command-line interface inside the pod container, and the other is through the REST API.

Command: pulling a model (from within the containers)#
ollama pull llama3.1
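
If you'd rather not open a shell first, you can run the pull directly through kubectl exec (this targets one pod picked from the deployment; with two replicas and ephemeral volumes, each pod keeps its own copy, so you may need to repeat it per pod):

Command: pulling a model via kubectl exec#
kubectl exec -it deploy/ollama-deployment -n ns-ollama -- ollama pull llama3.1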

For the REST API, you need an externally accessible service endpoint; I am using Ingress with NodePort for this guide. Please refer to the Ingress section below for setting it up.

Command: pulling a model (through the API)#
curl http://ollama.kube.local:32528/api/pull -d '{
  "name": "llama3.1"
}'
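
To quickly verify that the model responds, you can send a generation request to Ollama's /api/generate endpoint through the same Ingress (the prompt is arbitrary):

Command: testing the model (through the API)#
curl http://ollama.kube.local:32528/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?"
}'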

Setting Up Open WebUI for Ollama#

We then want to set up a web interface so that general users can interact with the LLM server. Open WebUI is a popular choice that supports Ollama.

1. Create a new Namespace

First, we’ll create a dedicated namespace for our Open WebUI deployment.

namespace.yaml#
apiVersion: v1
kind: Namespace
metadata:
  name: open-webui

Next, we’ll set up the deployment and a service for Open WebUI.

2. Create Deployment

deployment.yaml#
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui-deployment
  namespace: open-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - k8s-worker-2
      containers:
      - name: open-webui
        image: ghcr.io/open-webui/open-webui:main
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "500m"
            memory: "500Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
        env:
        - name: OLLAMA_BASE_URL
          value: "http://ollama-service.ns-ollama.svc.cluster.local:11434"
        - name: HF_ENDPOINT
          value: https://hf-mirror.com
        tty: true
        volumeMounts:
          - name: open-webui-data
            mountPath: /app/backend/data
      volumes:
        - name: open-webui-data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui-service
  namespace: open-webui
spec:
  selector:
    app: open-webui
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080

I am using the latest official image ghcr.io/open-webui/open-webui:main and specify slightly more resources than the defaults. For environment variables, OLLAMA_BASE_URL is set to the Ollama service within the cluster. HF_ENDPOINT is configured to use a Hugging Face mirror; you may omit this if your internet access is unrestricted. By default, Open WebUI listens on port 8080, so we simply forward that port through the service.
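
Once the manifests are applied, a quick check that the web UI pod and service are up (using the namespace defined above):

Command: check the Open WebUI pod and service#
kubectl get pods,svc -n open-webui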

Accessing Services from Outside the Cluster (Ingress Setup)#

One common issue is making your services accessible from outside the cluster. This is where Ingress comes into play: it defines the routing rules for external access to the services in a cluster. To set up Ingress, you need an Ingress controller installed in your cluster. I will be using the NGINX Ingress Controller for this guide.

To set up Ingress, we’ll create a manifest file that defines both an IngressClass and an Ingress resource. An IngressClass specifies which controller should implement the rules; here we tell Kubernetes that we want the NGINX Ingress Controller to handle the class called nginx. In the Ingress spec, we specify that external requests to ollama-ui.kube.local are directed to open-webui-service on port 8080, which is the port the Open WebUI service is listening on.

ingress-open-webui.yaml#
---
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: nginx
spec:
  controller: k8s.io/ingress-nginx
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: open-webui-ingress
  namespace: open-webui
spec:
  ingressClassName: nginx
  rules:
  - host: ollama-ui.kube.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: open-webui-service
            port:
              number: 8080

To expose the Ollama service, you can create a similar Ingress resource. You’ll need to modify the host, service name, and port number to match your Ollama service configuration.
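
For reference, a sketch of the corresponding Ollama Ingress, matching the host, namespace, and service used earlier in this guide:

ingress-ollama.yaml#
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ns-ollama
spec:
  ingressClassName: nginx
  rules:
  - host: ollama.kube.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama-service
            port:
              number: 11434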

check the Ingress resources and the Ingress controller service#
user@k8s-master-1:~$ kubectl get ingress --all-namespaces
NAMESPACE     NAME                 CLASS   HOSTS                  ADDRESS        PORTS   AGE
ns-ollama     ollama-ingress       nginx   ollama.kube.local      192.168.1.92   80      5d3h
open-webui    open-webui-ingress   nginx   ollama-ui.kube.local   192.168.1.92   80      4d
user@k8s-master-1:~$ kubectl get services -n ingress-nginx
NAME                                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller             NodePort    10.97.18.19    <none>        80:32528/TCP,443:31711/TCP   8d
ingress-nginx-controller-admission   ClusterIP   10.103.62.24   <none>        443/TCP                      8d

The output shows that our Ingress resources are correctly set up for both the Ollama and Open WebUI services. The Ingress controller service is exposed using NodePort, which allows external traffic to reach the Ingress controller; it then routes the requests to the services based on the Ingress rules. You may consider using an external load balancer as well.

With this setup, you should now be able to access your Open WebUI service from outside the cluster at http://ollama-ui.kube.local:32528. Remember to ensure your DNS is configured to resolve this hostname. I have set up a local DNS server using Pi-hole. Alternatively, you may add an entry {node ip} ollama-ui.kube.local to the hosts file on a client machine within the same network.

Demo of Open WebUI for Ollama

It would be handy if you could access the LLM server from your mobile devices. Enchanted is a mobile client app that lets you interact with Ollama from an iOS device. Ensure that you also expose the Ollama service using Ingress. Simply connect the device to the same network where the LLM server is running and, in the app settings, enter the address (and port) of the server.

Demo of Enchanted for Ollama

This concludes the server-side and client-side setup for running and using an LLM service in a home lab environment.