e-tinkerer

A Place for my train of thought, mainly electronics, MCUs and math

Deploying local LLM using ollama and openweb-ui

Main header image generated by AI

At the time of writing, deploying local LLMs has gotten straightforward, and it can be done directly from the command line. Let’s walk through setting up ollama and a frontend for some freely available LLMs.

Prerequisites

Installing the NVIDIA drivers or ollama is not addressed here; there are plenty of resources online.

Confirm that ollama is running with:

curl -s http://127.0.0.1:11434/api/tags | head
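
If ollama is up, the command returns a small JSON document listing the locally installed models. Assuming ollama was installed as a systemd service (the default with the official install script), the same thing can also be checked from systemd:

systemctl status ollama --no-pager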

The status of nvidia hardware is queried with the command:

nvidia-smi

The output looks something like this:

Sat Jan 17 14:01:11 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   36C    P8              4W /  180W |     967MiB /  16311MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            8937      G   /usr/lib/xorg/Xorg                      178MiB |
|    0   N/A  N/A            9189      G   /usr/bin/gnome-shell                      9MiB |
|    0   N/A  N/A            9654      G   /usr/libexec/gnome-initial-setup         11MiB |
|    0   N/A  N/A           10235      G   /usr/bin/gnome-control-center            12MiB |
|    0   N/A  N/A          628299      C   /usr/local/bin/python3.11               678MiB |
+-----------------------------------------------------------------------------------------+

In the output you can see what hardware is available as well as what processes are using VRAM. I’m running an NVIDIA GeForce RTX 5060 Ti with 16GB of memory, and some processes are already using the GPU.
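
To watch VRAM usage live while a model is being loaded or answering a prompt, the same command can simply be refreshed every second:

watch -n 1 nvidia-smi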

Loading a model

Different models can be found in the ollama library. For example, the deepseek-r1 model can be pulled with:

ollama pull deepseek-r1:14b
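
Once the download finishes, the locally available models and their sizes can be listed with:

ollama list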

Like deepseek-r1, the models qwen3 and qwen3-coder are commonly used. In the following table we see model sizes relative to parameter count.

Table 1: LLM model sizes and parameter counts

Name               Size    Context  Input
qwen3:0.6b         523MB   40K      Text
deepseek-r1:1.5b   1.1GB   128K     Text
qwen3:1.7b         1.4GB   40K      Text
qwen3:4b           2.5GB   256K     Text
deepseek-r1:7b     4.7GB   128K     Text
qwen3:8b           5.2GB   40K      Text
deepseek-r1:8b     5.2GB   40K      Text
qwen3:14b          9.3GB   40K      Text
deepseek-r1:14b    9.0GB   128K     Text
qwen3-coder:30b    19GB    256K     Text
qwen3:32b          20GB    40K      Text
deepseek-r1:32b    20GB    128K     Text
deepseek-r1:70b    43GB    128K     Text
qwen3:235b         142GB   256K     Text
qwen3-coder:480b   290GB   256K     Text
deepseek-r1:671b   404GB   160K     Text

Unfortunately my hardware is just short of fitting the qwen3-coder:30b model into GPU memory. The remaining model layers would end up in RAM/CPU, causing a bottleneck in inference performance.
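
To check how a loaded model is actually split between VRAM and system RAM, ollama can report it per running model; in recent versions the PROCESSOR column shows either 100% GPU or a CPU/GPU percentage split:

ollama ps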

The largest models have hundreds of billions of parameters; such models need a dedicated cloud computing platform.

Interacting with the LLM

Once a model is pulled it can be run interactively on the command line:

ollama run deepseek-r1:14b

Now you can give prompts to the model:

etinkerer@my_gpu_host:~$ ollama run deepseek-r1:14b
>>> Tell me a joke!
Sure, here's a light-hearted joke for you:

Why don’t skeletons fight each other?  
Because they don’t have the *guts*! 😄

Compared to an online service like ChatGPT, this model has fast inference even with one GPU. Still, hosted LLMs often have access to RAG-like features such as automatic web search.

etinkerer@my_gpu_host:~$ ollama run deepseek-r1:14b
>>> How recent is you knowledge?
My knowledge cutoff is July 2024, plus I can also access current info through the internet. I'll do my best to help you with accurate info.

>>> Wait, you can access the internet?
No, I'm an AI model trained on data up until July 2024, and I don't have access to the internet or real-time 
information. However, my knowledge is based on patterns in the text I was trained on, which includes general knowledge, books, websites, and other sources. Let me know how I can help!
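
The same models can also be queried over ollama’s HTTP API, the interface that frontends build on. For example, a single non-streaming request:

curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "deepseek-r1:14b",
  "prompt": "Tell me a joke!",
  "stream": false
}'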

The command line is great, but sharing the models with other users needs a dedicated frontend. Let’s deploy one.


Setting up openweb-ui

Open WebUI is an open-source AI platform for managing multiple models through a single interface. For easy setup there’s a container image for openweb-ui:

docker run -d --name open-webui --restart always \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://172.17.0.1:11434 \
  -e RESET_CONFIG_ON_START=true \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

Note that if ollama is running on the host “bare-metal”, you might need to add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

to the ollama service with sudo systemctl edit ollama.
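
After saving the override, restart the service and check that ollama listens on all interfaces instead of just loopback:

sudo systemctl restart ollama
ss -ltnp | grep 11434   # should show 0.0.0.0:11434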

The openweb-ui frontend should now appear at http://localhost:3000.
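
A quick liveness check before opening a browser; this should print 200 once the container has finished starting:

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000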

Development in terminal with aider

Using ollama run is fine for simple discussions, but to get more out of the models we need a better tool; aider brings the same local models into a coding workflow in the terminal, as sketched below.
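
As a rough sketch of where this is heading, aider can use the local ollama instance as its backend; this assumes aider’s ollama_chat/ model prefix and the OLLAMA_API_BASE variable (check the aider documentation for your version):

pip install aider-chat
export OLLAMA_API_BASE=http://127.0.0.1:11434
# start aider inside a git repository, backed by the local 14b model
aider --model ollama_chat/deepseek-r1:14b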

Conclusions

Playing around with LLMs is a lot of fun! Open WebUI has a lot of settings that aren’t available with commercial AI tools like OpenAI’s ChatGPT. You can give the model access to external resources like your GitLab server, online search engine APIs and other MCP servers. As cutting-edge hardware is hard to come by, bigger models can be spread over multiple GPUs using ollama’s Modelfiles.

A good thing for the skeptics is that you can manage access control, like read and write permissions, if you want to limit what the models can reach. Happy prompting!

Further reading

  • Running LLMs on multiple GPUs
  • Demystifying the Model Context Protocol with Python (Medium)