This repository packages one practical Gemma 4 inference path that users can start from either Docker Compose or Kubernetes.
The default experience is CPU-first and uses llama.cpp + GGUF, because that is the most realistic way to offer a one-command inference setup across laptops, Docker hosts, and Kubernetes clusters.
_Example response from the default local chat UI (screenshot omitted)._
All rows below use the same benchmark profile unless noted otherwise:
- Endpoint: `/completion`
- Method: 1 warm-up request, then 5 measured requests
- Request shape: default prompt from `scripts/benchmark_completion.py`, 19 prompt tokens on this model, `n_predict=128`, `temperature=0.1`, `ignore_eos=true`
- Image repository: `ghcr.io/wilsonwu/run-gemma-4`
| Date | Host CPU | Deployment | Image tag | Model | Avg gen tokens/s | Gen range | Avg prompt tokens/s | Avg gen time | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 2026-04-09 | Apple M4 Pro | Docker Compose | sha-d987db5 | gemma-4-E2B-it-Q4_K_M.gguf | 48.89 | 48.50-49.65 | 82.45 | 2618.5 ms | Local baseline |
Treat this as a machine-specific reference point, not a universal guarantee. Throughput will move with CPU model, Docker resource allocation, prompt length, output length, and concurrent load.
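As a quick plausibility check on a row like the one above, average generation throughput should roughly equal `n_predict` divided by the average generation time. A minimal sketch using the snapshot values (machine-specific numbers, not guarantees):

```python
# Sanity-check the snapshot row: ~n_predict tokens over the average generation time.
n_predict = 128          # tokens requested per measured run
avg_gen_time_s = 2.6185  # 2618.5 ms from the table
throughput = n_predict / avg_gen_time_s
print(round(throughput, 2))  # lands close to the reported 48.89 tokens/s
```

Small deviations are expected because the reported figure is averaged per run rather than derived from the averaged time.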
To reproduce or append a new row, run:
```bash
python3 scripts/benchmark_completion.py \
  --host-cpu "Apple M4 Pro" \
  --deployment "Docker Compose" \
  --image-tag sha-d987db5 \
  --model-file gemma-4-E2B-it-Q4_K_M.gguf \
  --notes "Local baseline"
```

The script prints per-run prompt and generation throughput, then emits one Markdown table row you can paste back into the snapshot table.
- A published image on GHCR.
- A ready-to-run `compose.yaml` for local validation.
- A ready-to-run standard Kubernetes manifest set.
- Resumable model downloads with SHA256 verification.
- Configurable model download URLs and proxy variables.
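The actual resume and checksum logic lives in docker/prepare-model.sh; as an illustration of the pattern only (not that script), a minimal Python sketch might look like this. `resume_download` asks the server for a `Range` starting at the current file size, and `verify_sha256` decides whether a finished file should be kept or deleted and re-fetched:

```python
import hashlib
import os
import urllib.request

def verify_sha256(path: str, expected: str) -> bool:
    """Stream the file and compare its SHA256 against the expected hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()

def resume_download(url: str, path: str) -> None:
    """Append the missing bytes via an HTTP Range request from the current size."""
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    request = urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})
    with urllib.request.urlopen(request) as response, open(path, "ab") as f:
        while chunk := response.read(1 << 20):
            f.write(chunk)
```

A failed verification maps to the repository's behavior of deleting the corrupt file so the next start downloads it again.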
The default published image is intentionally focused on the practical path:
- Runtime: `llama.cpp`
- Model format: GGUF
- Default model source: ModelScope
- Default model file: `gemma-4-E2B-it-Q4_K_M.gguf`
The repository now intentionally keeps only this runtime path, so there is no secondary transformers or ollama branch to maintain.
Image pulling and model downloading are intentionally separated:
- Container image source: `ghcr.io/wilsonwu/run-gemma-4`
- Model file source: whatever URL you set in `MODEL_URL`
For users in mainland China:
- The default `MODEL_URL` already points to ModelScope because it is usually easier to reach and faster than global model hubs.
- GHCR can still be slow or unstable on some China networks. For Compose, override `IMAGE_REPO` in `.env`. For Kubernetes, replace both image references in k8s/deployment.yaml with your mirrored or private registry.
- If you must keep using GHCR directly, set `HTTP_PROXY`, `HTTPS_PROXY`, and `NO_PROXY` to match your network environment.
For users outside China:
- Pulling the image directly from GHCR is usually the simplest option.
- If ModelScope is not the fastest model source in your region, replace `MODEL_URL` in `.env` or k8s/configmap.yaml with a closer GGUF download URL.
- Image source and model source can be mixed freely. For example, you can keep GHCR for the image and use another public object store for the model.
- Run the guided installer:

  ```bash
  bash install.sh
  ```

  On Windows PowerShell, you can launch the same flow with:

  ```powershell
  .\install.ps1
  ```

  If you prefer a shell environment, run `bash install.sh` from Git Bash or WSL after Docker Desktop is already running.

- The script checks Docker, creates or updates `.env`, prompts for the values that usually need operator input, and starts Docker Compose for you.
- Before prompting, the installer can probe GitHub, GHCR, and ModelScope. On mainland-China-like networks it will recommend keeping the ModelScope model URL, importing proxy values from the current shell when available, and prompting earlier for a mirrored `IMAGE_REPO` if GHCR looks restricted.
- If you prefer the manual path, copy `.env.example` to `.env`, review `MODEL_URL`, `MODEL_SHA256`, and `IMAGE_TAG`, then start the stack:

  ```bash
  docker compose up -d
  ```

- Watch the model preparation phase if this is the first run:

  ```bash
  docker compose logs -f prepare-model
  ```

- Send a smoke test request:

  ```bash
  curl http://127.0.0.1:8080/completion \
    -H "Content-Type: application/json" \
    -d '{
      "prompt": "Please answer in one short sentence. What is Kubernetes Service?\nAnswer:",
      "n_predict": 96,
      "temperature": 0.1,
      "stop": ["\n\n"]
    }'
  ```

Notes:

- Compose defaults to `ghcr.io/wilsonwu/run-gemma-4:latest`.
- The compose file bind-mounts the local `docker/entrypoint.sh` and `docker/prepare-model.sh`, so local script updates take effect immediately.
- Runtime proxy variables are supported through `.env.example`.
- If the installer detects mainland-China-like network conditions, it will surface a GHCR-specific recommendation before you confirm `.env`.
- If a model download is interrupted, restarting Compose will resume the download.
- If a downloaded GGUF file is corrupt, it will be deleted and downloaded again automatically.
- `install.sh` also supports `bash install.sh --yes` for default values and `bash install.sh --no-start` if you only want to prepare `.env`.
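If you prefer scripting the smoke test over raw curl, a small Python equivalent could look like the sketch below. It assumes the stack is already serving on 127.0.0.1:8080; the `/completion` path and the `content` response field follow llama.cpp server conventions:

```python
import json
import urllib.request

# Same request shape as the curl smoke test above.
PAYLOAD = {
    "prompt": "Please answer in one short sentence. What is Kubernetes Service?\nAnswer:",
    "n_predict": 96,
    "temperature": 0.1,
    "stop": ["\n\n"],
}

def smoke_test(base_url: str = "http://127.0.0.1:8080") -> str:
    """POST the payload to /completion and return the generated text."""
    request = urllib.request.Request(
        base_url + "/completion",
        data=json.dumps(PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response).get("content", "")

# Example (with the stack running): print(smoke_test().strip())
```

Because the same Service shape is used later for Kubernetes, the identical function works against a port-forwarded cluster endpoint.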
The Kubernetes entry point is the k8s directory.
- Review and edit k8s/configmap.yaml.
- Create the namespace first:

  ```bash
  kubectl apply -f k8s/namespace.yaml
  ```

- If your GHCR package is private, create an image pull secret and uncomment the `imagePullSecrets` block in k8s/deployment.yaml:

  ```bash
  kubectl -n gemma-cpu create secret docker-registry ghcr-creds \
    --docker-server=ghcr.io \
    --docker-username=YOUR_GITHUB_USERNAME \
    --docker-password=YOUR_GHCR_TOKEN
  ```

- If you want to pin a release image, replace `ghcr.io/wilsonwu/run-gemma-4:latest` in both image fields in k8s/deployment.yaml.
- Apply the manifests:

  ```bash
  kubectl apply -f k8s/
  ```

- Forward the service locally:

  ```bash
  kubectl -n gemma-cpu port-forward svc/gemma-inference 8080:80
  ```

- Send the same smoke test request:

  ```bash
  curl http://127.0.0.1:8080/completion \
    -H "Content-Type: application/json" \
    -d '{
      "prompt": "Please answer in one short sentence. What is Kubernetes Service?\nAnswer:",
      "n_predict": 96,
      "temperature": 0.1,
      "stop": ["\n\n"]
    }'
  ```

Notes:

- The standard Kubernetes path no longer depends on Kustomize.
- All namespaced resources now explicitly target `gemma-cpu`, so the YAMLs can be applied directly.
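If you want the Service to route traffic only after the model has finished loading, a readiness probe can be added to the inference container in k8s/deployment.yaml. The fragment below is illustrative, assuming the container listens on port 8080 and that your llama.cpp server build exposes `GET /health` (recent builds do); adjust both to your manifests:

```yaml
# Illustrative readiness probe; match the port to the container's actual listen port.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 5
  failureThreshold: 30   # large GGUF files can take a while to load on first start
```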
The project is designed so users can point model preparation to different download sources depending on network conditions.
Recommended knobs:
- `MODEL_URL`: direct GGUF file URL for `llama.cpp`
- `MODEL_SHA256`: optional but recommended integrity check
- `HTTP_PROXY` / `HTTPS_PROXY` / `NO_PROXY`: host or cluster level proxy settings
Current defaults:
- GGUF direct download: ModelScope
This split exists for a reason:
- Direct GGUF files are the simplest and most stable distribution format for this repository.
- Using a full `MODEL_URL` lets you switch to any mirror, object storage endpoint, or internal artifact server without changing the code.
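For example, pointing model preparation at an internal mirror only requires editing `.env` (or k8s/configmap.yaml). The host and checksum below are placeholders, not real values:

```
# .env: model source override (illustrative values)
MODEL_URL=https://mirror.internal.example.com/gguf/gemma-4-E2B-it-Q4_K_M.gguf
MODEL_SHA256=<sha256 of the file you actually host>
```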
Image builds are handled by GitHub Actions in .github/workflows/build-image.yml.
Publishing rules:
- Push to `main`: publish `ghcr.io/wilsonwu/run-gemma-4:latest`
- Push to `main`: publish `ghcr.io/wilsonwu/run-gemma-4:sha-<short-sha>`
- Push a Git tag such as `v0.2.0`: publish `ghcr.io/wilsonwu/run-gemma-4:v0.2.0`
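These three rules are the kind of tag mapping commonly expressed with `docker/metadata-action`. The sketch below is illustrative and not necessarily the exact configuration in .github/workflows/build-image.yml:

```yaml
# Illustrative metadata-action tag rules matching the publishing behavior above:
# latest on the default branch, sha-<short-sha> per push, and the Git tag name on tag pushes.
- uses: docker/metadata-action@v5
  with:
    images: ghcr.io/wilsonwu/run-gemma-4
    tags: |
      type=raw,value=latest,enable={{is_default_branch}}
      type=sha
      type=ref,event=tag
```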
Default published platforms:
- `linux/amd64`
- `linux/arm64`
After a successful push, the image is already stored in GitHub Packages because GHCR is GitHub Packages for container images. The workflow also writes the exact published tags into the GitHub Actions job summary, so you can quickly see which tag is the newest one from that run.
Optional workflow_dispatch parameters:
- `http_proxy`
- `https_proxy`
- `no_proxy`
- `platforms`
The workflow keeps editable defaults near the top of .github/workflows/build-image.yml, so automatic main and tag builds stay simple while proxy and platform overrides remain available.
The default multi-arch build means Docker will normally pull the correct image variant automatically on Apple Silicon, ARM servers, and x86_64 hosts. The local Compose file no longer forces linux/amd64 by default for that reason.
If the Build and push image step fails with a GHCR 403 Forbidden even though the login step succeeded, that usually means authentication worked but the current token is not allowed to write to the existing package. This often happens when the package was first created by a local PAT push instead of by this repository's Actions workflow.
First check the package permission model on GitHub:
- Open the existing GHCR package settings for `ghcr.io/wilsonwu/run-gemma-4`.
- Make sure this repository has Actions access to that package.
- If the package was created outside this repository workflow, relink it or grant repository access there.
Recommended fix:
- Add repository secret `GHCR_TOKEN`: a classic personal access token with at least `write:packages` and `read:packages`.
- Add repository secret `GHCR_USERNAME`: the GitHub username that owns that token.
The current workflow is attached to the GitHub Environment `run-gemma-4`. If that environment contains `GHCR_TOKEN`, the login step will prefer that PAT automatically; otherwise it falls back to the built-in `GITHUB_TOKEN`.
If your PAT belongs to the repository owner account, `GHCR_USERNAME` is not needed. Only add it if you later decide to customize the login logic further.
Use GitHub Actions as the default publishing path. The fallback script docker/publish-ghcr.sh is still available for local publishing and accepts the same categories of parameters through environment variables.
- Dockerfile: container image definition
- compose.yaml: local one-command entry point
- install.sh: guided Docker Compose launcher for macOS, Linux, and Windows shells such as Git Bash or WSL
- install.ps1: Windows PowerShell wrapper that launches the same guided installer flow
- .env.example: Compose environment template
- docker/prepare-model.sh: model download logic with resume and checksum verification
- docker/entrypoint.sh: runtime dispatch logic
- k8s: standard Kubernetes manifests
- .github/workflows/build-image.yml: CI image publishing workflow
See LICENSE.
