Running Local LLMs on a Docker Server: A Hands‑On Guide
- Nishadil
- June 23, 2026
How I Hooked Up My Docker Host to Open‑Source Large Language Models
A step‑by‑step walk‑through of installing Docker, pulling a local LLM container, and accessing it via API – all without relying on cloud services.
Ever felt a little uneasy about sending every prompt to a cloud‑hosted AI? Me too. I wanted a private, offline playground where I could tinker with large language models without the usual latency, data‑privacy concerns, or surprise billing. The solution? Spin up a Docker server on my home workstation and point a handful of tools at it.
First things first – Docker. If you haven’t already installed it, grab the latest Community Edition for your OS. The installer walks you through the basics, but the real magic happens once the daemon is humming. I ran docker version to confirm everything was in order, and the output showed both client and server versions nicely aligned.
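If you'd rather script that check than eyeball it, here is a minimal sketch in Python. It assumes the docker CLI is on your PATH and that the JSON printed by docker version --format '{{json .}}' carries the versions under Client.Version and Server.Version; the helper names are mine, not part of any Docker tooling.

```python
import json
import subprocess
from typing import Optional, Tuple


def parse_versions(raw_json: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (client_version, server_version) from `docker version` JSON output."""
    if not raw_json.strip():
        return None, None
    info = json.loads(raw_json)
    client = (info.get("Client") or {}).get("Version")
    # "Server" may be missing or null when the CLI cannot reach the daemon.
    server = (info.get("Server") or {}).get("Version")
    return client, server


def docker_versions() -> Tuple[Optional[str], Optional[str]]:
    """Ask the Docker CLI for both versions (assumes `docker` is on PATH)."""
    result = subprocess.run(
        ["docker", "version", "--format", "{{json .}}"],
        capture_output=True,
        text=True,
        check=False,  # inspect whatever was printed instead of raising
    )
    return parse_versions(result.stdout)


if __name__ == "__main__":
    client, server = docker_versions()
    print(f"client={client} server={server or 'daemon not reachable'}")
```

A missing server version is the useful signal here: it usually means the daemon isn't running yet.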
Next up, choosing a model. I settled on LLaMA‑2‑7B‑Chat because it offers a decent trade‑off between capability and resource demand. Thanks to the open‑source community, a ready‑to‑run container image is already published; the reference I used points at GitHub Container Registry (ghcr.io) rather than Docker Hub. Pulling it is as simple as:
docker pull ghcr.io/ollama/llama2:7b-chat
That command may take a while – the image is a few gigabytes – but once it’s on disk you’re ready to launch the container. I used a modest command that maps port 11434 on the host to the same port inside the container, caps the container at two CPU cores’ worth of processing time, and gives the container a friendly name:
docker run -d \
  --name local-llm \
  -p 11434:11434 \
  --cpus="2.0" \
  ghcr.io/ollama/llama2:7b-chat
Notice the -d flag? That tells Docker to run the container in the background – just the way I like it. You can verify it’s up with docker ps; you should see something like “0.0.0.0:11434->11434/tcp”. At this point the model is loading, which can take a minute or two depending on your hardware.
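Rather than guessing when the container is ready, you can poll it. The sketch below assumes the server answers plain GET requests at http://localhost:11434/ once it is accepting connections (Ollama's server does this); the function names and the timeout values are illustrative choices of mine.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if the URL answers with HTTP 200 within two seconds."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(
    url: str = "http://localhost:11434/",
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    probe: Callable[[str], bool] = http_probe,
) -> bool:
    """Poll `url` until `probe` succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    print("ready" if wait_until_ready() else "gave up waiting")
```

Keep in mind that a successful probe only shows the server is listening. The first real prompt can still be slow while the model weights are read into memory.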
Now for the fun part: talking to the model. The container ships with a tiny REST‑like endpoint that accepts JSON payloads. A quick curl test proves it works:
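# Ollama's /api/generate needs a model name; "stream": false returns one JSON object instead of a stream.
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"llama2:7b-chat","prompt":"Hello, LLM!","stream":false}'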
The reply is JSON, with the model’s generated text in its response field. If you see something like “Hello, human! How can I assist you today?” you’re good to go. From here you can integrate the endpoint into any app – a Python script, an Android client, or even a simple web UI.
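For the Python route, a standard-library client is enough to get started. This is a sketch under a few assumptions: the container speaks the Ollama API, the model is registered under the name llama2:7b-chat, and a non-streaming request returns a single JSON object whose response field holds the text. The function names are hypothetical.

```python
import json
import urllib.request


def build_payload(prompt: str, model: str = "llama2:7b-chat") -> bytes:
    """Encode a non-streaming generate request as JSON bytes."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def extract_text(raw: bytes) -> str:
    """Pull the generated text out of a single (non-streamed) JSON reply."""
    return json.loads(raw.decode("utf-8"))["response"]


def generate(prompt: str, base_url: str = "http://localhost:11434") -> str:
    """POST the prompt to the local endpoint and return the model's text."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Generous timeout: CPU-only inference can take a while.
    with urllib.request.urlopen(request, timeout=300) as resp:
        return extract_text(resp.read())


if __name__ == "__main__":
    print(generate("Hello, LLM!"))
```

Splitting the payload building and the reply parsing into their own functions keeps the network call thin, which makes the client easy to unit-test without a running container.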
One hiccup I ran into was memory pressure on a laptop without a dedicated GPU. The 7B‑parameter model squeezes into about 8 GB of RAM, but if you plan to scale up to the 13B or 70B variants, you’ll need either a beefier machine or a GPU‑enabled Docker runtime. Adding --gpus all to the docker run command (with the NVIDIA drivers and the NVIDIA Container Toolkit installed on the host) solves that for most modern setups.
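A quick back-of-the-envelope calculation shows why model size matters so much. The sketch below estimates the memory taken by the weights alone as parameters × bits per weight ÷ 8. It deliberately ignores the context cache and runtime overhead, so treat the numbers as a lower bound; the function name and the choice of 4-, 8-, and 16-bit precisions are my own assumptions for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes).

    One billion parameters at 8 bits each occupy one GB, so the result is
    simply billions of parameters times bytes per parameter.
    """
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Llama 2 was published in 7B, 13B and 70B sizes.
    for size in (7, 13, 70):
        for bits in (4, 8, 16):
            print(f"{size}B @ {bits}-bit: ~{weight_memory_gb(size, bits):.1f} GB")
```

By this estimate a 7B model quantized to 4 bits needs roughly 3.5 GB for its weights, which fits comfortably in 8 GB of RAM, while the same model at 16 bits needs about 14 GB.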
To keep things tidy, I wrote a tiny docker-compose.yml file. It abstracts the run command, lets me spin the container up with docker compose up -d, and even adds a volume so the model cache survives container restarts:
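version: "3.9"  # optional; recent Docker Compose releases ignore this field
services:
  llm:
    image: ghcr.io/ollama/llama2:7b-chat
    container_name: local-llm
    ports:
      - "11434:11434"
    deploy:
      resources:
        limits:
          cpus: "2"
    volumes:
      # Ollama's official image keeps models under /root/.ollama; adjust if this image differs.
      - ./model-cache:/root/.ollama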
With that in place, managing the server feels like flipping a switch. Stop it with docker compose down, start it again later, and the model is ready exactly where you left it.
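To confirm the cache really did survive a restart, you can ask the server which models it has on disk. This sketch assumes the Ollama API's /api/tags endpoint, which returns a JSON object with a models list whose entries carry a name field; the helper names are hypothetical.

```python
import json
import urllib.request
from typing import List


def model_names(tags_json: str) -> List[str]:
    """Extract model names from an /api/tags style JSON document."""
    data = json.loads(tags_json)
    # Treat a missing or null "models" key as an empty list.
    return [entry["name"] for entry in (data.get("models") or [])]


def list_local_models(base_url: str = "http://localhost:11434") -> List[str]:
    """Fetch the list of locally stored models from the running container."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return model_names(resp.read().decode("utf-8"))


if __name__ == "__main__":
    names = list_local_models()
    print("cached models:", ", ".join(names) or "(none)")
```

If the list comes back empty after a docker compose down and up cycle, the volume is probably mounted at a path the server doesn't use for its model store.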
All told, hooking a Docker server up to a local LLM is surprisingly straightforward. The biggest payoff is control – you decide when to upgrade, which model to run, and how much hardware you allocate. No more “my prompt disappeared into the cloud” worries, just a solid, offline AI sandbox you can tweak to your heart’s content.
Editorial note: Nishadil may use AI assistance for news drafting and formatting. Readers can report issues from this page, and material corrections are reviewed under our editorial standards.