Self-Hosted LLM Chatbot with Ollama and Open WebUI (No GPU Required)
Transcript
Hello, this is the channel Easy Self Host.
In this video, we are going to run Ollama with Open WebUI, a combo that runs large language models on your private server and provides a chat experience similar to ChatGPT.
We are also going to explore how the LLM can be integrated into other self-hosted apps.
Self-hosted language models have the benefit of privacy because our data will never leave our own computers.
So we can be more comfortable sharing personal data with them.
I’m going to run the latest Llama 3 8B model on my server, which has a Pentium CPU and 8 GB of memory.
This is a very modest setup for running LLMs, but you will see how usable it is.
Ollama is an open-source framework for building and running language models, and it can run as a server for generating responses and chatting.
Open WebUI provides the graphical user interface to interact with the Ollama server, and also manages data like chat history.
We are going to run these two using Docker Compose.
This is the Docker Compose file we are going to run.
We start by defining the docker network for the proxy server.
This will allow our proxy server to connect to both services.
Next we are going to define two docker volumes.
The ollama volume will store data like language models for the Ollama server.
The open-webui volume will store data like chat history for the Open WebUI server.
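As a rough sketch, the top of the compose file looks something like this; I'm assuming the proxy network, proxy-net, is created externally by the Caddy stack, so adjust the names to match your own setup.

```yaml
# sketch of the top-level networks and volumes
networks:
  proxy-net:
    external: true   # assumption: created by the Caddy proxy's own compose project

volumes:
  ollama:        # model files and other Ollama data
  open-webui:    # chat history and other Open WebUI data
```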
For the service section, the first service is the Ollama server.
It uses the official Ollama image.
We are attaching this service to the proxy network, so it can be discovered by the proxy and the Open WebUI server.
Note that we are not using the default network here: since both services are attached to proxy-net, there is no need for another interconnecting network.
Also, we are exposing the Ollama server to the proxy so that clients other than Open WebUI can reach it, as you will see later.
The Ollama server exposes port 11434.
For volumes, we are going to map the ollama volume to the /root/.ollama path in the container.
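Put together, the Ollama service looks something like this sketch; ollama/ollama is the official image, and the rest mirrors what was just described.

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    networks:
      - proxy-net
    expose:
      - "11434"                # Ollama's API port, reachable over proxy-net
    volumes:
      - ollama:/root/.ollama   # models and other server data live here
    restart: unless-stopped    # optional, but convenient on a home server
```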
Moving on to the Open WebUI service.
It uses the official image hosted on GitHub.
We are also attaching this service to the proxy network.
And it exposes port 8080.
For volumes, we map the open-webui volume to the /app/backend/data path in the container.
For environment variables, we set the variable OLLAMA_BASE_URL to the Ollama service we just defined.
With this environment variable set, Open WebUI will connect to our Ollama server at startup.
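Continuing the same sketch, the Open WebUI service looks roughly like this; the image lives on GitHub's registry, and OLLAMA_BASE_URL points at the ollama service name over the shared network, though your exact image tag may differ.

```yaml
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    networks:
      - proxy-net
    expose:
      - "8080"                                  # Open WebUI's web port, reachable by the proxy
    volumes:
      - open-webui:/app/backend/data            # chat history and app data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434     # resolves to the ollama service on the shared network
    restart: unless-stopped
```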
We also need to update our proxy configuration for these services.
I’m using Caddy, so I’ll add a rule to proxy the hostname chat.home.easyselfhost.com to the open-webui service at port 8080.
If you plan to use other clients for the Ollama server, you can create a proxy rule for it too.
I will map ollama.home.easyselfhost.com to the ollama service at port 11434.
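The two Caddy rules come down to roughly this sketch; I'm leaving out any TLS or other directives you may already have for your internal domains.

```
chat.home.easyselfhost.com {
    reverse_proxy open-webui:8080
}

ollama.home.easyselfhost.com {
    reverse_proxy ollama:11434
}
```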
To run these services, let’s go to the server command line and navigate to the directory that has the docker compose file.
Here we can simply run docker compose up -d to start the services.
To refresh the proxy config, I’ll restart my Caddy server by restarting its own Docker Compose stack.
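In other words, something like this, where the directory paths are just placeholders for wherever your compose files live.

```sh
cd ~/services/ollama     # placeholder path: the directory with the compose file above
docker compose up -d     # start Ollama and Open WebUI in the background

cd ~/services/caddy      # placeholder path: the proxy's own compose directory
docker compose restart   # restart Caddy so it picks up the new proxy rules
```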
Now we can go to the Open WebUI app in the browser.
First let’s sign up for an account.
When we are logged in, we can try sending a message, and we will find that we don’t have a language model yet.
Let’s go to the settings; under Models, we can type in the name of the language model we want to download.
To find a model name, we can click this link, which lists all the models offered by Ollama.
I will choose llama3, an open-weight language model released by Meta.
It will take a while to download, as language models are often several gigabytes in size.
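If you prefer the command line, the same model can also be pulled by exec-ing into the running Ollama container, which should end up equivalent to downloading it through the UI.

```sh
# pull Llama 3 inside the ollama container (run from the compose directory)
docker compose exec ollama ollama pull llama3

# check which models are now available
docker compose exec ollama ollama list
```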
After it finishes, we can exit the settings and choose the model at the top.
Then we can send it a message again to test if it works.
Well, let’s try to send it a more complicated message to check the quality and speed of its output.
I’m showing a timer so you can see how long it actually takes, since I’m going to speed up the footage.
Large language models are good at multi-round conversation, so I’ll also send a follow-up question.
As you can see, it will take a while to generate responses on my low-power computer.
But the quality of the responses is very impressive.
For comparison, I also tried running the same model on my laptop with a 9th-gen Core i9 CPU, and it generates responses much faster.
Running language models benefits from a faster CPU, or from a GPU if you have one.
For chat applications on slower CPUs, I recommend using simpler prompts and limiting the length of the output for a better user experience.
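When calling the Ollama API directly, for example through the ollama.home.easyselfhost.com rule we added, the output length can be capped with the num_predict option; this is a quick sketch of such a request, with the token limit and the scheme chosen to fit my setup, so adjust them to yours.

```sh
# generate a short, non-streamed completion with a capped output length
curl https://ollama.home.easyselfhost.com/api/generate -d '{
  "model": "llama3",
  "prompt": "Give me three tips for faster home Wi-Fi.",
  "stream": false,
  "options": { "num_predict": 200 }
}'
```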
For Open WebUI, there are some other features to explore.
For example, we can upload models that are not hosted on Ollama.
We can also configure speech-to-text and text-to-speech to interact with language models using voice.
It can also be hooked up with an image generation model to generate images.
For self-hosted language models, I’ve been thinking about use cases other than chat.
One of the use cases I found useful is generating summaries for documents or notes.
Generating summaries can be done in the background so the speed of running language models doesn’t matter that much.
To verify this idea, I tried to implement this feature in Memos, a self-hosted note app I love.
I did a video about Memos before.
In my modified version of Memos, there’s a new setting for the large language model endpoint.
Here we can configure the language model URL, which is the Ollama endpoint we just set up.
And we can specify the model to use; in our case, llama3.
We can also configure the template prompt used to generate summaries from the memo content, though there is a default prompt.
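Just to illustrate the shape of such an integration, and not the actual Memos code, a background summary job could boil down to one call to the Ollama endpoint with the note content substituted into the template prompt; the note file name here is a placeholder.

```sh
# hypothetical illustration: build a summary prompt from a note file and send it to Ollama
jq -n --rawfile note meeting-notes.txt '{
  model: "llama3",
  prompt: ("Summarize the following note in a few sentences:\n\n" + $note),
  stream: false
}' | curl https://ollama.home.easyselfhost.com/api/generate -d @-
```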
After this, each memo has a new button that displays the generated summary.
If the summary hasn’t been generated in the background, we can also click the ‘generate now’ button.
A more common scenario I imagine is that we have a long memo like a meeting note.
When we just write it down, we probably don’t need a summary since our memory is fresh.
A day or two later, when we don’t have a clear memory, it’s nice to have a summary to help us quickly recall.
And that duration is sufficient for the language model to generate the summaries in the background.
I hope self-hosted language models can be integrated in more projects.
That’s all I want to show you today.
Please consider subscribing for content like this.
You can find the configuration files used in this video on GitHub; the link is in the description below.
Thank you for watching.