Running Local LLMs on a Docker Server: A Hands‑On Guide

How I Hooked Up My Docker Host to Open‑Source Large Language Models

A step‑by‑step walk‑through of installing Docker, pulling a local LLM container, and accessing it via API – all without relying on cloud services.

Ever felt a little uneasy about sending every prompt to a cloud‑hosted AI? Me too. I wanted a private, offline playground where I could tinker with large language models without the usual latency, data‑privacy concerns, or surprise billing. The solution? Spin up a Docker server on my home workstation and point a handful of tools at it.

First things first – Docker. If you haven’t already installed it, grab the latest Community Edition for your OS. The installer walks you through the basics, but the real magic happens once the daemon is humming. I ran docker version to confirm everything was in order, and the output showed both client and server versions nicely aligned.
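If you would rather script that sanity check than eyeball it, here is a minimal Python sketch. It assumes the docker CLI is on your PATH; the function name docker_server_version is my own and is not part of any Docker tooling. It asks the CLI for the daemon's version and returns None when the CLI is missing or the daemon does not answer.

```python
import shutil
import subprocess


def docker_server_version(timeout=10.0):
    """Return the Docker daemon version string, or None if it cannot be read."""
    # No docker binary on PATH means there is nothing to ask.
    if shutil.which("docker") is None:
        return None
    try:
        result = subprocess.run(
            ["docker", "version", "--format", "{{.Server.Version}}"],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    version = result.stdout.strip()
    # A non-zero exit code usually means the daemon is not reachable.
    if result.returncode != 0 or not version:
        return None
    return version
```

A None result is the cue to start the daemon (or finish the install) before going any further.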

Next up, choosing a model. I settled on LLaMA‑2‑7B‑Chat because it offers a decent trade‑off between capability and resource demand. Thanks to the open‑source community, a ready‑to‑run Docker image is already published in a public container registry (GitHub Container Registry, going by the ghcr.io prefix). Pulling it is as simple as:

docker pull ghcr.io/ollama/llama2:7b-chat

That command may take a while – the image is a few gigabytes – but once it’s on disk you’re ready to launch the container. I used a modest command that maps port 11434 on the host to the same port inside the container, allocates a couple of CPU cores, and gives the container a friendly name:

docker run -d \
  --name local-llm \
  -p 11434:11434 \
  --cpus="2.0" \
  ghcr.io/ollama/llama2:7b-chat

Notice the -d flag? That tells Docker to run the container in the background – just the way I like it. You can verify it’s up with docker ps; you should see something like “0.0.0.0:11434->11434/tcp”. At this point the model is loading, which can take a minute or two depending on your hardware.
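Rather than guessing when the model has finished loading, you can poll the server until it answers. The sketch below is a minimal example under one loud assumption: that the server replies to a plain GET on its base URL with a 2xx status once it is up, which I have not confirmed for this particular image. The function name wait_for_http is hypothetical.

```python
import http.client
import time
import urllib.request


def wait_for_http(url, timeout=120.0, interval=1.0, request_timeout=5.0):
    """Poll url until it returns a 2xx status; give up after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=request_timeout) as resp:
                if 200 <= resp.status < 300:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, reset, timed out, or a non-2xx HTTP error:
            # treat all of these as "not ready yet" and retry.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling wait_for_http("http://localhost:11434/") before sending the first prompt avoids confusing "still loading" with "broken". A successful reply only shows that the server process is answering, not that the model weights are fully in memory.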

Now for the fun part: talking to the model. The container ships with a tiny REST‑like endpoint that accepts JSON payloads. A quick curl test proves it works:

curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"llama2:7b-chat","prompt":"Hello, LLM!","stream":false}'

The response contains the model’s generated text, wrapped in a JSON object. If you see something like “Hello, human! How can I assist you today?” you’re good to go. From here you can integrate the endpoint into any app – a Python script, an Android client, or even a simple web UI.
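For the Python route, here is a minimal client sketch that uses only the standard library. It assumes an Ollama‑style API: the request carries "model", "prompt", and "stream" fields, and the generated text comes back under a "response" key. The helper names (build_generate_request, extract_text, generate) and the default model tag are my own choices for illustration, so adjust them to whatever your container actually serves.

```python
import json
import urllib.request


def build_generate_request(prompt, model="llama2:7b-chat",
                           base_url="http://localhost:11434"):
    """Build (but do not send) a POST request for the /api/generate endpoint."""
    # "stream": False asks for one JSON object instead of a stream of chunks
    # (assumed Ollama-style behaviour).
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(raw_body):
    """Pull the generated text out of the JSON body (assumed 'response' key)."""
    return json.loads(raw_body).get("response", "")


def generate(prompt, timeout=300.0, **kwargs):
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_text(resp.read())
```

With the container running, generate("Hello, LLM!") should return the reply as a plain string. Building the request separately from sending it makes the payload easy to inspect or log before anything goes over the wire.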

One hiccup I ran into was memory pressure on a laptop without a dedicated GPU. The 7B‑parameter model squeezes into about 8 GB of RAM, but if you plan to scale up to the 13B or 70B variants, you’ll need either a beefier machine or a GPU‑enabled Docker runtime. Adding --gpus all to the docker run command (with the NVIDIA drivers and the NVIDIA Container Toolkit installed on the host) covers that for most modern NVIDIA setups.
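The "about 8 GB" figure makes sense once you do the arithmetic on the weights alone. The sketch below is a back‑of‑envelope estimate, not a measurement: it multiplies the parameter count by the bits stored per weight and ignores the context cache and runtime overhead. The 4‑bit and 16‑bit precisions are my assumptions, since the image's quantization is not something I checked, and the function name is hypothetical.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Rough size of the model weights in GB (10**9 bytes), weights only."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Compare a 4-bit quantized build with a 16-bit one for three model sizes.
for size in (7, 13, 70):
    print(f"{size}B: 4-bit ~ {weight_memory_gb(size, 4):.1f} GB, "
          f"16-bit ~ {weight_memory_gb(size, 16):.1f} GB")
```

By this estimate a 7B model needs roughly 3.5 GB for 4‑bit weights and 14 GB at 16 bits, so an 8 GB machine only has headroom if the build is quantized. The larger variants outgrow a typical laptop quickly.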

To keep things tidy, I wrote a tiny docker-compose.yml file. It abstracts the run command, lets me spin the container up with docker compose up -d, and even adds a volume so the model cache survives container restarts:

version: "3.9"
services:
  llm:
    image: ghcr.io/ollama/llama2:7b-chat
    container_name: local-llm
    ports:
      - "11434:11434"
    deploy:
      resources:
        limits:
          cpus: "2"
    volumes:
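      # Assumes this image keeps models under /root/.cache/ollama;
      # the official ollama/ollama image stores them in /root/.ollama.
      - ./model-cache:/root/.cache/ollama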

With that in place, managing the server feels like flipping a switch. Stop it with docker compose down, start it again later, and the model is ready exactly where you left it.

All told, hooking a Docker server up to a local LLM is surprisingly straightforward. The biggest payoff is control – you decide when to upgrade, which model to run, and how much hardware you allocate. No more “my prompt disappeared into the cloud” worries, just a solid, offline AI sandbox you can tweak to your heart’s content.


Editorial note: Nishadil may use AI assistance for news drafting and formatting. Readers can report issues from this page, and material corrections are reviewed under our editorial standards.
