
Run Gemma 4 Anywhere

中文说明 (Chinese documentation)

This repository packages one practical Gemma 4 inference path that users can start from either Docker Compose or Kubernetes.

The default experience is CPU-first and uses llama.cpp + GGUF, because that is the most realistic way to offer a one-command inference setup across laptops, Docker hosts, and Kubernetes clusters.

Preview

Example response from the default local chat UI:

[Screenshot: example inference response in the English UI]

Performance Snapshot

All rows below use the same benchmark profile unless noted otherwise:

  • Endpoint: /completion
  • Method: 1 warm-up request, then 5 measured requests
  • Request shape: default prompt from scripts/benchmark_completion.py, 19 prompt tokens on this model, n_predict=128, temperature=0.1, ignore_eos=true
  • Image repository: ghcr.io/wilsonwu/run-gemma-4
| Date | Host CPU | Deployment | Image tag | Model | Avg gen tokens/s | Gen range | Avg prompt tokens/s | Avg gen time | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2026-04-09 | Apple M4 Pro | Docker Compose | sha-d987db5 | gemma-4-E2B-it-Q4_K_M.gguf | 48.89 | 48.50-49.65 | 82.45 | 2618.5 ms | Local baseline |

Treat this as a machine-specific reference point, not a universal guarantee. Throughput will move with CPU model, Docker resource allocation, prompt length, output length, and concurrent load.
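As a quick arithmetic cross-check, the snapshot row is internally consistent: average generation throughput multiplied by average generation time should land very close to the fixed n_predict=128 output length.

```python
# Cross-check the snapshot row: tokens/s * seconds should roughly equal
# the number of generated tokens (n_predict=128 with ignore_eos=true).
avg_gen_tokens_per_s = 48.89
avg_gen_time_s = 2618.5 / 1000  # 2618.5 ms

generated_tokens = avg_gen_tokens_per_s * avg_gen_time_s
print(round(generated_tokens))  # prints 128
```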

To reproduce or append a new row, run:

python3 scripts/benchmark_completion.py \
  --host-cpu "Apple M4 Pro" \
  --deployment "Docker Compose" \
  --image-tag sha-d987db5 \
  --model-file gemma-4-E2B-it-Q4_K_M.gguf \
  --notes "Local baseline"

The script prints per-run prompt and generation throughput, then emits one Markdown table row you can paste back into the snapshot table.

What Users Get

  • A published image on GHCR.
  • A ready-to-run compose.yaml for local validation.
  • A ready-to-run standard Kubernetes manifest set.
  • Resumable model downloads with SHA256 verification.
  • Configurable model download URLs and proxy variables.

Default Runtime

The default published image is intentionally focused on the practical path:

  • Runtime: llama.cpp
  • Model format: GGUF
  • Default model source: ModelScope
  • Default model file: gemma-4-E2B-it-Q4_K_M.gguf

The repository intentionally keeps only this runtime path, so there is no secondary Transformers or Ollama branch to maintain.

Network Notes

Image pulling and model downloading are intentionally separated:

  • Container image source: ghcr.io/wilsonwu/run-gemma-4
  • Model file source: whatever URL you set in MODEL_URL

For users in mainland China:

  • The default MODEL_URL already points to ModelScope because it is usually easier to reach and faster than global model hubs.
  • GHCR can still be slow or unstable on some China networks. For Compose, override IMAGE_REPO in .env. For Kubernetes, replace both image references in k8s/deployment.yaml with your mirrored or private registry.
  • If you must keep using GHCR directly, set HTTP_PROXY, HTTPS_PROXY, and NO_PROXY to match your network environment.
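For Compose, all of these overrides live in .env. A hedged example of what that can look like (the mirror hostname and proxy address are illustrative placeholders, not recommendations):

```shell
# .env overrides for restricted networks (all values are illustrative placeholders)
IMAGE_REPO=registry.example.com/mirrors/run-gemma-4
HTTP_PROXY=http://127.0.0.1:7890
HTTPS_PROXY=http://127.0.0.1:7890
NO_PROXY=localhost,127.0.0.1
```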

For users outside China:

  • Pulling the image directly from GHCR is usually the simplest option.
  • If ModelScope is not the fastest model source in your region, replace MODEL_URL in .env or k8s/configmap.yaml with a closer GGUF download URL.
  • Image source and model source can be mixed freely. For example, you can keep GHCR for the image and use another public object store for the model.

Docker Compose Quick Start

  1. Run the guided installer:
bash install.sh

On Windows PowerShell, you can launch the same flow with:

.\install.ps1

If you prefer a shell environment, run bash install.sh from Git Bash or WSL after Docker Desktop is already running.

  2. The script checks Docker, creates or updates .env, prompts for the values that usually need operator input, and starts Docker Compose for you.

  3. Before prompting, the installer can probe GitHub, GHCR, and ModelScope. On mainland-China-like networks it will recommend keeping the ModelScope model URL, importing proxy values from the current shell when available, and prompting earlier for a mirrored IMAGE_REPO if GHCR looks restricted.

  4. If you prefer the manual path, copy .env.example to .env, review MODEL_URL, MODEL_SHA256, and IMAGE_TAG, then start the stack:

docker compose up -d

  5. Watch the model preparation phase if this is the first run:

docker compose logs -f prepare-model

  6. Send a smoke test request:
curl http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Please answer in one short sentence. What is Kubernetes Service?\nAnswer:",
    "n_predict": 96,
    "temperature": 0.1,
    "stop": ["\n\n"]
  }'

Notes:

  • Compose defaults to ghcr.io/wilsonwu/run-gemma-4:latest.
  • The compose file bind-mounts the local docker/entrypoint.sh and docker/prepare-model.sh, so local script updates take effect immediately.
  • Runtime proxy variables are supported through .env.example.
  • If the installer detects mainland-China-like network conditions, it will surface a GHCR-specific recommendation before you confirm .env.
  • If a model download is interrupted, restarting Compose will resume the download.
  • If a downloaded GGUF file is corrupt, it will be deleted and downloaded again automatically.
  • install.sh also supports bash install.sh --yes for default values and bash install.sh --no-start if you only want to prepare .env.
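The resume behavior described above follows the standard HTTP pattern: when a partial file already exists on disk, only the remaining bytes are requested via a Range header. A minimal sketch of that logic (the helper name is illustrative; the actual work is done in docker/prepare-model.sh):

```python
import os

def resume_headers(path: str) -> dict:
    """Build HTTP headers that resume a partial download, if one exists."""
    # A partial file of N bytes means only bytes N and onward are still needed.
    if os.path.exists(path):
        offset = os.path.getsize(path)
        if offset > 0:
            return {"Range": f"bytes={offset}-"}
    return {}
```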

Kubernetes Quick Start

The Kubernetes entry point is the k8s/ directory.

  1. Review and edit k8s/configmap.yaml.
  2. Create the namespace first:
kubectl apply -f k8s/namespace.yaml
  3. If your GHCR package is private, create an image pull secret and uncomment the imagePullSecrets block in k8s/deployment.yaml:

kubectl -n gemma-cpu create secret docker-registry ghcr-creds \
  --docker-server=ghcr.io \
  --docker-username=YOUR_GITHUB_USERNAME \
  --docker-password=YOUR_GHCR_TOKEN

  4. If you want to pin a release image, replace ghcr.io/wilsonwu/run-gemma-4:latest in both image fields in k8s/deployment.yaml.

  5. Apply the manifests:

kubectl apply -f k8s/

  6. Forward the service locally:

kubectl -n gemma-cpu port-forward svc/gemma-inference 8080:80

  7. Send the same smoke test request:
curl http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Please answer in one short sentence. What is Kubernetes Service?\nAnswer:",
    "n_predict": 96,
    "temperature": 0.1,
    "stop": ["\n\n"]
  }'

Notes:

  • The standard Kubernetes path no longer depends on Kustomize.
  • All namespaced resources now explicitly target gemma-cpu, so the YAMLs can be applied directly.

Model Source Strategy

The project is designed so users can point model preparation to different download sources depending on network conditions.

Recommended knobs:

  • MODEL_URL: direct GGUF file URL for llama.cpp
  • MODEL_SHA256: optional but recommended integrity check
  • HTTP_PROXY / HTTPS_PROXY / NO_PROXY: host or cluster level proxy settings
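A minimal sketch of the integrity check that MODEL_SHA256 enables (the function name is illustrative; the shipped scripts implement the equivalent in shell):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 in chunks, so multi-GB GGUF files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

# A mismatch against MODEL_SHA256 means the file is corrupt and should be
# deleted and downloaded again.
```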

Current defaults:

  • GGUF direct download: ModelScope

This split exists for a reason:

  • direct GGUF files are the simplest and most stable distribution format for this repository
  • using a full MODEL_URL lets you switch to any mirror, object storage endpoint, or internal artifact server without changing the code

Image Publishing

Image builds are handled by GitHub Actions in .github/workflows/build-image.yml.

Publishing rules:

  • Push to main: publish ghcr.io/wilsonwu/run-gemma-4:latest
  • Push to main: publish ghcr.io/wilsonwu/run-gemma-4:sha-<short-sha>
  • Push a Git tag such as v0.2.0: publish ghcr.io/wilsonwu/run-gemma-4:v0.2.0
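The rules above can be expressed as a small function (inferred from this list, not copied from the actual workflow code):

```python
def published_tags(event: str, ref: str, sha: str) -> list[str]:
    """Return the GHCR image tags published for a push, per the rules above."""
    if event == "branch" and ref == "main":
        # Pushes to main publish a moving 'latest' plus an immutable short-SHA tag.
        return ["latest", f"sha-{sha[:7]}"]
    if event == "tag":
        # Git tags such as v0.2.0 publish a matching release tag.
        return [ref]
    return []
```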

Default published platforms:

  • linux/amd64
  • linux/arm64

After a successful push, the image is automatically available in GitHub Packages, since GHCR is the container-image registry behind GitHub Packages. The workflow also writes the exact published tags into the GitHub Actions job summary, so you can quickly see which tag is the newest one from that run.

Optional workflow_dispatch parameters:

  • http_proxy
  • https_proxy
  • no_proxy
  • platforms

The workflow keeps editable defaults near the top of .github/workflows/build-image.yml, so automatic main and tag builds stay simple while proxy and platform overrides remain available.

The default multi-arch build means Docker will normally pull the correct image variant automatically on Apple Silicon, ARM servers, and x86_64 hosts. The local Compose file no longer forces linux/amd64 by default for that reason.

If the Build and push image step fails with a GHCR 403 Forbidden even though the login step succeeded, that usually means authentication worked but the current token is not allowed to write to the existing package. This often happens when the package was first created by a local PAT push instead of by this repository's Actions workflow.

First check the package permission model on GitHub:

  • Open the existing GHCR package settings for ghcr.io/wilsonwu/run-gemma-4
  • Make sure this repository has Actions access to that package
  • If the package was created outside this repository workflow, relink it or grant repository access there

Recommended fix:

  • Add repository secret GHCR_TOKEN: a classic personal access token with at least write:packages and read:packages
  • Add repository secret GHCR_USERNAME: the GitHub username that owns that token

The current workflow is attached to the GitHub Environment run-gemma-4. If that environment contains GHCR_TOKEN, the login step will prefer that PAT automatically; otherwise it falls back to the built-in GITHUB_TOKEN.

If your PAT belongs to the repository owner account, GHCR_USERNAME is not needed. Only add it if you later decide to customize the login logic further.

Use GitHub Actions as the default publishing path. The fallback script docker/publish-ghcr.sh is still available for local publishing and accepts the same categories of parameters through environment variables.

Repository Layout

License

See LICENSE.

About

Run Gemma 4 Model Inference with a single click in Kubernetes and Docker without requiring a GPU.
