This repository packages one practical Gemma 4 inference path that users can start from either Docker Compose or Kubernetes.
The default experience is CPU-first and uses llama.cpp + GGUF, because that is the most realistic way to offer a one-command inference setup across laptops, Docker hosts, and Kubernetes clusters.
_Example response from the default local chat UI (screenshot omitted)._
All rows below use the same benchmark profile unless noted otherwise:
- Endpoint: `/completion`
- Method: 1 warm-up request, then 5 measured requests
- Request shape: default prompt from `scripts/benchmark_completion.py`, 19 prompt tokens on this model, `n_predict=128`, `temperature=0.1`, `ignore_eos=true`
- Image repository: `ghcr.io/wilsonwu/run-gemma-4`
| Date | Host CPU | Deployment | Image tag | Model | Avg gen tokens/s | Gen range | Avg prompt tokens/s | Avg gen time | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 2026-04-09 | Apple M4 Pro | Docker Compose | sha-d987db5 | gemma-4-E2B-it-Q4_K_M.gguf | 48.89 | 48.50-49.65 | 82.45 | 2618.5 ms | Local baseline |
Treat this as a machine-specific reference point, not a universal guarantee. Throughput will move with CPU model, Docker resource allocation, prompt length, output length, and concurrent load.
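As a quick plausibility check on a row like the one above, average generation throughput should roughly equal `n_predict` divided by the average generation time. A minimal sketch using the snapshot values (machine-specific numbers, not guarantees):

```python
# Sanity-check the snapshot row: ~n_predict tokens over the average generation time.
n_predict = 128          # tokens requested per measured run
avg_gen_time_s = 2.6185  # 2618.5 ms from the table
throughput = n_predict / avg_gen_time_s
print(round(throughput, 2))  # lands close to the reported 48.89 tokens/s
```

Small deviations are expected because the reported figure is averaged per run rather than derived from the averaged time.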
To reproduce or append a new row, run:
```bash
python3 scripts/benchmark_completion.py \
  --host-cpu "Apple M4 Pro" \
  --deployment "Docker Compose" \
  --image-tag sha-d987db5 \
  --model-file gemma-4-E2B-it-Q4_K_M.gguf \
  --notes "Local baseline"
```

The script prints per-run prompt and generation throughput, then emits one Markdown table row you can paste back into the snapshot table.
- A published image on GHCR.
- A ready-to-run `compose.yaml` for local validation.
- A ready-to-run standard Kubernetes manifest set.
- Resumable model downloads with SHA256 verification.
- Configurable model download URLs and proxy variables.
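The actual resume and checksum logic lives in docker/prepare-model.sh; as an illustration of the pattern only (not that script), a minimal Python sketch might look like this. `resume_download` asks the server for a `Range` starting at the current file size, and `verify_sha256` decides whether a finished file should be kept or deleted and re-fetched:

```python
import hashlib
import os
import urllib.request

def verify_sha256(path: str, expected: str) -> bool:
    """Stream the file and compare its SHA256 against the expected hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()

def resume_download(url: str, path: str) -> None:
    """Append the missing bytes via an HTTP Range request from the current size."""
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    request = urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})
    with urllib.request.urlopen(request) as response, open(path, "ab") as f:
        while chunk := response.read(1 << 20):
            f.write(chunk)
```

A failed verification maps to the repository's behavior of deleting the corrupt file so the next start downloads it again.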
The default published image is intentionally focused on the practical path:
- Runtime: `llama.cpp`
- Model format: GGUF
- Default model source: ModelScope
- Default model file: `gemma-4-E2B-it-Q4_K_M.gguf`
The repository now intentionally keeps only this runtime path, so there is no secondary transformers or ollama branch to maintain.
Image pulling and model downloading are intentionally separated:
- Container image source: `ghcr.io/wilsonwu/run-gemma-4`
- Model file source: whatever URL you set in `MODEL_URL`
For users in mainland China:
- The default `MODEL_URL` already points to ModelScope because it is usually easier to reach and faster than global model hubs.
- GHCR can still be slow or unstable on some China networks. For Compose, override `IMAGE_REPO` in `.env`. For Kubernetes, replace both image references in k8s/deployment.yaml with your mirrored or private registry.
- If you must keep using GHCR directly, set `HTTP_PROXY`, `HTTPS_PROXY`, and `NO_PROXY` to match your network environment.
For users outside China:
- Pulling the image directly from GHCR is usually the simplest option.
- If ModelScope is not the fastest model source in your region, replace `MODEL_URL` in `.env` or k8s/configmap.yaml with a closer GGUF download URL.
- Image source and model source can be mixed freely. For example, you can keep GHCR for the image and use another public object store for the model.
- Run the guided installer:

  ```bash
  bash install.sh
  ```

  On Windows PowerShell, you can launch the same flow with:

  ```powershell
  .\install.ps1
  ```

  If you prefer a shell environment, run `bash install.sh` from Git Bash or WSL after Docker Desktop is already running.

- The script checks Docker, creates or updates `.env`, prompts for the values that usually need operator input, and starts Docker Compose for you.
- Before prompting, the installer can probe GitHub, GHCR, and ModelScope. On mainland-China-like networks it will recommend keeping the ModelScope model URL, importing proxy values from the current shell when available, and prompting earlier for a mirrored `IMAGE_REPO` if GHCR looks restricted.
- If you prefer the manual path, copy `.env.example` to `.env`, review `MODEL_URL`, `MODEL_SHA256`, and `IMAGE_TAG`, then start the stack:

  ```bash
  docker compose up -d
  ```

- Watch the model preparation phase if this is the first run:

  ```bash
  docker compose logs -f prepare-model
  ```

- Send a smoke test request:

  ```bash
  curl http://127.0.0.1:8080/completion \
    -H "Content-Type: application/json" \
    -d '{
      "prompt": "Please answer in one short sentence. What is Kubernetes Service?\nAnswer:",
      "n_predict": 96,
      "temperature": 0.1,
      "stop": ["\n\n"]
    }'
  ```

Notes:

- Compose defaults to `ghcr.io/wilsonwu/run-gemma-4:latest`.
- The compose file bind-mounts the local `docker/entrypoint.sh` and `docker/prepare-model.sh`, so local script updates take effect immediately.
- Runtime proxy variables are supported through `.env.example`.
- If the installer detects mainland-China-like network conditions, it will surface a GHCR-specific recommendation before you confirm `.env`.
- If a model download is interrupted, restarting Compose will resume the download.
- If a downloaded GGUF file is corrupt, it will be deleted and downloaded again automatically.
- `install.sh` also supports `bash install.sh --yes` for default values and `bash install.sh --no-start` if you only want to prepare `.env`.
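If you prefer scripting the smoke test over raw curl, a small Python equivalent could look like the sketch below. It assumes the stack is already serving on 127.0.0.1:8080; the `/completion` path and the `content` response field follow llama.cpp server conventions:

```python
import json
import urllib.request

# Same request shape as the curl smoke test above.
PAYLOAD = {
    "prompt": "Please answer in one short sentence. What is Kubernetes Service?\nAnswer:",
    "n_predict": 96,
    "temperature": 0.1,
    "stop": ["\n\n"],
}

def smoke_test(base_url: str = "http://127.0.0.1:8080") -> str:
    """POST the payload to /completion and return the generated text."""
    request = urllib.request.Request(
        base_url + "/completion",
        data=json.dumps(PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response).get("content", "")

# Example (with the stack running): print(smoke_test().strip())
```

Because the same Service shape is used later for Kubernetes, the identical function works against a port-forwarded cluster endpoint.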
The Kubernetes entry point is the k8s directory.
- Review and edit k8s/configmap.yaml.
- Create the namespace first:

  ```bash
  kubectl apply -f k8s/namespace.yaml
  ```

- If your GHCR package is private, create an image pull secret and uncomment the `imagePullSecrets` block in k8s/deployment.yaml:

  ```bash
  kubectl -n gemma-cpu create secret docker-registry ghcr-creds \
    --docker-server=ghcr.io \
    --docker-username=YOUR_GITHUB_USERNAME \
    --docker-password=YOUR_GHCR_TOKEN
  ```

- If you want to pin a release image, replace `ghcr.io/wilsonwu/run-gemma-4:latest` in both image fields in k8s/deployment.yaml.
- Apply the manifests:

  ```bash
  kubectl apply -f k8s/
  ```

- Forward the service locally:

  ```bash
  kubectl -n gemma-cpu port-forward svc/gemma-inference 8080:80
  ```

- Send the same smoke test request:

  ```bash
  curl http://127.0.0.1:8080/completion \
    -H "Content-Type: application/json" \
    -d '{
      "prompt": "Please answer in one short sentence. What is Kubernetes Service?\nAnswer:",
      "n_predict": 96,
      "temperature": 0.1,
      "stop": ["\n\n"]
    }'
  ```

Notes:

- The standard Kubernetes path no longer depends on Kustomize.
- All namespaced resources now explicitly target `gemma-cpu`, so the YAMLs can be applied directly.
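If you want the Service to route traffic only after the model has finished loading, a readiness probe can be added to the inference container in k8s/deployment.yaml. The fragment below is illustrative, assuming the container listens on port 8080 and that your llama.cpp server build exposes `GET /health` (recent builds do); adjust both to your manifests:

```yaml
# Illustrative readiness probe; match the port to the container's actual listen port.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 5
  failureThreshold: 30   # large GGUF files can take a while to load on first start
```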
The project is designed so users can point model preparation to different download sources depending on network conditions.
Recommended knobs:
- `MODEL_URL`: direct GGUF file URL for `llama.cpp`
- `MODEL_SHA256`: optional but recommended integrity check
- `HTTP_PROXY` / `HTTPS_PROXY` / `NO_PROXY`: host or cluster level proxy settings
Current defaults:
- GGUF direct download: ModelScope
This split exists for a reason:
- Direct GGUF files are the simplest and most stable distribution format for this repository.
- Using a full `MODEL_URL` lets you switch to any mirror, object storage endpoint, or internal artifact server without changing the code.
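For example, pointing model preparation at an internal mirror only requires editing `.env` (or k8s/configmap.yaml). The host and checksum below are placeholders, not real values:

```
# .env: model source override (illustrative values)
MODEL_URL=https://mirror.internal.example.com/gguf/gemma-4-E2B-it-Q4_K_M.gguf
MODEL_SHA256=<sha256 of the file you actually host>
```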
Image builds are handled by GitHub Actions in .github/workflows/build-image.yml.
Publishing rules:
- Push to `main`: publish `ghcr.io/wilsonwu/run-gemma-4:latest`
- Push to `main`: publish `ghcr.io/wilsonwu/run-gemma-4:sha-<short-sha>`
- Push a Git tag such as `v0.2.0`: publish `ghcr.io/wilsonwu/run-gemma-4:v0.2.0`
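These three rules are the kind of tag mapping commonly expressed with `docker/metadata-action`. The sketch below is illustrative and not necessarily the exact configuration in .github/workflows/build-image.yml:

```yaml
# Illustrative metadata-action tag rules matching the publishing behavior above:
# latest on the default branch, sha-<short-sha> per push, and the Git tag name on tag pushes.
- uses: docker/metadata-action@v5
  with:
    images: ghcr.io/wilsonwu/run-gemma-4
    tags: |
      type=raw,value=latest,enable={{is_default_branch}}
      type=sha
      type=ref,event=tag
```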
Default published platforms:
- `linux/amd64`
- `linux/arm64`
After a successful push, the image is already stored in GitHub Packages because GHCR is GitHub Packages for container images. The workflow also writes the exact published tags into the GitHub Actions job summary, so you can quickly see which tag is the newest one from that run.
Optional workflow_dispatch parameters:
- `http_proxy`
- `https_proxy`
- `no_proxy`
- `platforms`
The workflow keeps editable defaults near the top of .github/workflows/build-image.yml, so automatic main and tag builds stay simple while proxy and platform overrides remain available.
The default multi-arch build means Docker will normally pull the correct image variant automatically on Apple Silicon, ARM servers, and x86_64 hosts. The local Compose file no longer forces linux/amd64 by default for that reason.
If the Build and push image step fails with a GHCR 403 Forbidden even though the login step succeeded, that usually means authentication worked but the current token is not allowed to write to the existing package. This often happens when the package was first created by a local PAT push instead of by this repository's Actions workflow.
First check the package permission model on GitHub:
- Open the existing GHCR package settings for `ghcr.io/wilsonwu/run-gemma-4`.
- Make sure this repository has Actions access to that package.
- If the package was created outside this repository workflow, relink it or grant repository access there.
Recommended fix:
- Add repository secret `GHCR_TOKEN`: a classic personal access token with at least `write:packages` and `read:packages`.
- Add repository secret `GHCR_USERNAME`: the GitHub username that owns that token.
The current workflow is attached to the GitHub Environment `run-gemma-4`. If that environment contains `GHCR_TOKEN`, the login step will prefer that PAT automatically; otherwise it falls back to the built-in `GITHUB_TOKEN`.
If your PAT belongs to the repository owner account, `GHCR_USERNAME` is not needed. Only add it if you later decide to customize the login logic further.
Use GitHub Actions as the default publishing path. The fallback script docker/publish-ghcr.sh is still available for local publishing and accepts the same categories of parameters through environment variables.
- Dockerfile: container image definition
- compose.yaml: local one-command entry point
- install.sh: guided Docker Compose launcher for macOS, Linux, and Windows shells such as Git Bash or WSL
- install.ps1: Windows PowerShell wrapper that launches the same guided installer flow
- .env.example: Compose environment template
- docker/prepare-model.sh: model download logic with resume and checksum verification
- docker/entrypoint.sh: runtime dispatch logic
- k8s: standard Kubernetes manifests
- .github/workflows/build-image.yml: CI image publishing workflow
See LICENSE.
