Autonomous detection and remediation for container-style failures — proven on a deterministic, audit-friendly loop.
Executive snapshot: External replay recall improved from `6/11` to `11/11` on the same captured lab run, while holding `0` false positives.
Repository · Specification · Help & FAQ · Governance template · Quick start
What has been delivered so far, with published evidence:
- Core deterministic loop is operational and validated across five harness experiments.
- Optional local Kubernetes lab pipeline is operational (`bootstrap -> inject -> collect/normalize/replay`).
- Published external replay report shows concrete improvement:
  - before normalization refinement: `6/11` detected/resolved, `0` false positives
  - after normalization refinement: `11/11` detected/resolved, `0` false positives
- Full report: `docs/LAB_RUN_REPORT_20260331.md`
```mermaid
flowchart LR
    subgraph then["Start (initial PoC scope)"]
        A1[data/seed.py synthetic JSON]
        A2[Watcher -> Healer]
        A3[harness.py experiments 1-3]
        A4[metrics/results.db]
        A1 --> A2 --> A3 --> A4
    end
    subgraph now["Current state (expanded scope)"]
        B1[seed + k8s_clean_signals + near_real_stream]
        B2[harness.py experiments 1-5 + integrations/validate.py gate]
        B3[Optional lab pipeline: bootstrap -> inject -> collect]
        B4[tools/normalize_external_capture.py]
        B5[external replay scoring]
        B6[published report + metrics]
        B1 --> B2 --> B6
        B3 --> B4 --> B5 --> B6
    end
```
- Progress snapshot (front and center)
- Data flow: where we started -> where we are now
- Overview
- Why GHOST exists
- What we built
- How it works
- Detection design (reducing bias)
- Kubernetes-style structured signals
- Near-real stream & local adapters
- Validation & results
- Published reports
- Command reference
- Use cases
- Data: synthetic vs real logs
- Production & mission-critical systems
- Research: layered failures & learning
- Quick start
- Quick start by persona
- Help, FAQ & troubleshooting
- References & credits
- Project structure
- Documentation index
- License
GHOST is a reference implementation of a closed control loop:
log signal → structured event → policy lookup → corrective action → measured outcome
It targets workload-agnostic container runtime failure modes (OOM-style kills, crash loops, probe failures, latency thresholds) using explicit patterns and decision tables — not an LLM and not a third-party agent framework. Phase 1 runs entirely on your machine: synthetic logs, an in-memory service model, and a reproducible harness with SQLite metrics.
| Capability | Phase 1 |
|---|---|
| Real cloud / cluster APIs | No (simulated state) |
| LLM reasoning | No (deterministic matching) |
| External Python packages | No (standard library) |
| Repeatable experiment suite | Yes (harness.py — five experiments + SQLite metrics) |
| Policy separated from agent code | Yes (skills/ modules) |
| Integration contract gate | Yes (integrations/validate.py at start of harness.py) |
| Local file adapters (observe / lab) | Optional (adapters/ — not in CI) |
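To make the loop contract concrete, here is a minimal self-contained sketch of the same shape. The pattern lists, decision table, and action names below are illustrative stand-ins, not the repository's actual policy, which lives in `skills/`.

```python
# Illustrative sketch of the closed loop (log -> event -> policy -> action -> outcome).
# Names here are hypothetical; the real patterns and decision table live in skills/.
PATTERNS = {"oom_kill": ["out of memory", "oom"], "crash_loop": ["crashloopbackoff"]}
DECISIONS = {"oom_kill": ("raise_memory_limit", {"increment_mb": 256}),
             "crash_loop": ("restart_container", {})}

state = {"memory_limit_mb": 512, "restarts": 0}

def detect(line: str):
    """Return a structured event for the first matching failure phrase, else None."""
    msg = line.casefold()
    for failure_type, phrases in PATTERNS.items():
        if any(p in msg for p in phrases):
            return {"failure_type": failure_type, "message": line}
    return None

def heal(event: dict, state: dict) -> dict:
    """Look up the corrective action for an event and apply it to the simulated state."""
    action, params = DECISIONS.get(event["failure_type"], ("log_unknown", {}))
    if action == "raise_memory_limit":
        state["memory_limit_mb"] += params["increment_mb"]
    elif action == "restart_container":
        state["restarts"] += 1
    return {"failure_type": event["failure_type"], "action": action, "success": True}

event = detect("ERROR: container killed: Out of memory")
outcome = heal(event, state) if event else None
print(outcome, state)
```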
Containers fail when operators are not staring at dashboards. Logs often already contain the diagnosis; runbooks describe the fix. The weak link is frequently the latency and variance of the human chain: page → wake → context switch → manual execution.
GHOST answers one precise question from our engineering specification:
Can a lightweight system detect a known container runtime failure from a log stream and execute the correct corrective action faster and more reliably than a human — with zero human input after start?
We care because MTTR under automation is measurable. This repository isolates the autonomous loop so we can prove behavior and regression-test it before attaching real infrastructure, identity systems, or richer reasoning layers.
Concretely, this repository delivers:
| Layer | Implementation |
|---|---|
| Detection policy | skills/watcher_skills.py — substring sets per failure type, watched severities, event schema, explicit CANNOT_DO boundaries. |
| Remediation policy | skills/healer_skills.py — decision table (failure_type → action, params), timeouts, default unknown handler, outcome schema. |
| Watcher agent | agents/watcher.py — imports patterns only from watcher skills; emits validated events on ERROR / WARNING lines. |
| K8s signal policy | skills/k8s_signal_skills.py — ordered declarative rules on a signal object (record_type, phase, reason, etc.). |
| K8s signal agent | agents/k8s_watcher.py — imports only k8s_signal_skills; same event envelope as the log Watcher so the Healer stays unified. |
| Healer agent | agents/healer.py — imports the decision table only from healer skills; executes registered actions against shared state. |
| Event fabric | blackboard/event_bus.py — asyncio.Queue with schema validation (typed handoff between agents). |
| Simulated platform | simulator/infra_state.py — app-service baseline dict; container actions plus K8s-shaped fields (image, replicas_*, scheduling_blocked, node_ready) and matching heal actions. |
| Synthetic data | data/seed.py — log datasets, k8s_clean_signals.json, plus near_real_stream.json (200 noisy multi-line / kube-prefixed lines, 20 failures); outputs are gitignored. |
| Streaming | data/generator.py — async replay of JSON records for experiments. |
| Experiments | run_experiment1.py … run_experiment5.py — through mixed stream, K8s signals, and near-real noisy stream stress test. |
| Adapters (optional) | adapters/observe.py (Watcher-only file tail), adapters/lab_run.py (--dry-run or full loop on simulator). Not run in CI. |
| Lab data pipeline (optional) | lab/ + tools/ scripts: bootstrap/inject/collect/normalize/replay for external datasets; local only, not in CI. |
| Harness & metrics | harness.py + metrics/recorder.py — orchestrates all scenarios, prints a summary, persists rows to metrics/results.db. |
Design rule: agents never duplicate patterns or decision tables inline — skills are the single source of truth for review, diff, and compliance-style audits.
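To show that boundary in code, here is an illustrative fragment of the same shape (names simplified; consult `skills/healer_skills.py` and `agents/healer.py` for the real definitions):

```python
# Illustrative shape of the skills-as-policy boundary; actual names and contents differ.

# --- policy side (the kind of data that lives in skills/healer_skills.py) ---
DECISION_TABLE = {
    "oom_kill":   ("raise_memory_limit", {"increment_mb": 256}),
    "crash_loop": ("restart_container",  {"max_restarts": 3}),
}
DEFAULT_ACTION = ("log_unknown", {})

# --- agent side (the kind of lookup that lives in agents/healer.py, importing the table) ---
def resolve(failure_type: str):
    """Look up (action, params); unknown failure types fall back to the default handler."""
    return DECISION_TABLE.get(failure_type, DEFAULT_ACTION)

assert resolve("oom_kill")[0] == "raise_memory_limit"
assert resolve("disk_full")[0] == "log_unknown"
```

Because the table is plain data in one module, a reviewer can diff exactly what the automation is allowed to do without reading agent code.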
- Watcher scans each log record (optionally tagged with a stream index). If severity is in scope, it walks `DETECTABLE_PATTERNS` in order and publishes one event on the first substring hit in `message`.
- Healer awaits an event, resolves `(action, params)` via `DECISION_TABLE` (or `DEFAULT_ACTION`), runs the matching function in `ACTION_REGISTRY` on `infra_state`, then runs `POST_HEAL_VERIFIERS` from `healer_skills.py` on the updated state (unless `dry_run` or `log_unknown`). If the predicate fails, `success` is false even when the action raised no exception. Timing uses `asyncio.wait_for` per skill timeouts. For shadow / lab, `heal_once(..., dry_run=True)` skips mutation and skips verification. A sketch of this flow follows the diagram below.
- Harness resets the metrics DB, runs `integrations/validate.py` (required paths + Hermes policy shape), then drives five experiments: log detection, log full loop, mixed stream (100/10), structured K8s-style signals (`k8s_clean_signals.json`), and near-real noisy stream (200/20, `near_real_stream.json`). On failure it exits non-zero (CI uses the same path).
```mermaid
flowchart TB
    subgraph policy [Policy layer]
        WSK[skills/watcher_skills.py]
        HSK[skills/healer_skills.py]
    end
    subgraph runtime [Runtime loop]
        JSON[Generated JSON logs]
        W[Watcher]
        Q[asyncio Queue]
        H[Healer]
        INFRA[infra_state]
        DB[(metrics/results.db)]
    end
    WSK -.-> W
    HSK -.-> H
    JSON --> W
    W --> Q
    Q --> H
    H --> INFRA
    H --> DB
```
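A simplified sketch of the Healer path described above, assuming hypothetical stand-ins for the skills tables; the timeout, post-heal verification, and `dry_run` behavior mirror the description, not the exact repository code:

```python
import asyncio

# Hypothetical stand-ins for the skills-layer tables described above.
ACTION_REGISTRY = {"restart_container": lambda state, **p: state.update(restarts=state["restarts"] + 1)}
DECISION_TABLE = {"crash_loop": ("restart_container", {})}
DEFAULT_ACTION = ("log_unknown", {})
POST_HEAL_VERIFIERS = {"restart_container": lambda state: state["restarts"] >= 1}
TIMEOUT_S = {"restart_container": 5.0}

async def heal_once(event: dict, state: dict, dry_run: bool = False) -> dict:
    action, params = DECISION_TABLE.get(event["failure_type"], DEFAULT_ACTION)
    if dry_run or action == "log_unknown":
        # Shadow / lab mode: no mutation, no verification.
        return {"action": action, "success": True, "dry_run": dry_run}
    # Run the registered action under a per-skill timeout.
    await asyncio.wait_for(
        asyncio.to_thread(ACTION_REGISTRY[action], state, **params),
        timeout=TIMEOUT_S.get(action, 10.0),
    )
    # The outcome is successful only if the post-heal predicate holds on the updated state.
    verified = POST_HEAL_VERIFIERS.get(action, lambda s: True)(state)
    return {"action": action, "success": verified, "dry_run": False}

state = {"restarts": 0}
print(asyncio.run(heal_once({"failure_type": "crash_loop"}, state)))
```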
- Case-insensitive matching — Log lines are matched with Unicode casefold, and severities accept any casing (e.g. `error` / `ERROR`). That avoids favoring one vendor’s capitalization (Kubernetes vs Docker vs PaaS logs).
- Vendor-neutral phrases — `DETECTABLE_PATTERNS` includes multiple paraphrases per class (OOM / cgroup wording, crash-loop and backoff wording, probe and health-check failures, latency and timeout phrasing) so the PoC is not tuned to a single message shape.
- Diverse synthetic failures — `data/seed.py` picks among several templates per failure type for clean and mixed datasets, so experiments are not overfit to four fixed strings.
- Shared healthy check — The seed script uses the same `any_pattern_matches_message()` helper as policy in `watcher_skills.py`, so “no false patterns in healthy logs” is evaluated with the same rules as the Watcher (healthy lines were adjusted so phrases like “response time … within threshold” do not collide with latency rules once matching is case-insensitive).
The first matching failure type in `DETECTABLE_PATTERNS` iteration order wins; patterns are ordered so that higher-signal phrases are checked first, giving a stable priority.
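A minimal sketch of that matching rule, with illustrative phrase lists (the authoritative patterns and their ordering live in `skills/watcher_skills.py`):

```python
# Ordered, case-insensitive, first-match-wins classification (illustrative patterns only).
DETECTABLE_PATTERNS = [
    ("oom_kill",      ["out of memory", "oom-killed", "memory cgroup"]),
    ("crash_loop",    ["crashloopbackoff", "back-off restarting failed container"]),
    ("probe_failure", ["liveness probe failed", "readiness probe failed", "health check failed"]),
    ("high_latency",  ["response time exceeded", "request timed out"]),
]

def any_pattern_matches_message(message: str):
    """Return the first failure type whose phrase appears in the casefolded message, else None."""
    msg = message.casefold()
    for failure_type, phrases in DETECTABLE_PATTERNS:
        if any(phrase in msg for phrase in phrases):
            return failure_type
    return None

assert any_pattern_matches_message("ERROR: OOM-killed by the memory cgroup controller") == "oom_kill"
assert any_pattern_matches_message("response time 40 ms within threshold") is None
```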
This is not a live cluster client: it is the same Watcher → Healer loop fed by JSON that resembles what you would derive from kube-apiserver watches (Pod/Node/Deployment-shaped objects).
| Synthetic class | Typical real-world analogue | Simulated heal |
|---|---|---|
| `ImagePullBackOff` / `ErrImagePull` | Bad image tag, registry auth | Roll back to `image_previous` |
| `SchedulingBlocked` | `FailedScheduling` (resources, taints) | Clear `scheduling_blocked` |
| `NodeNotReady` | Node condition `NotReady` | Set `node_ready` |
| `ReplicaMismatch` | Deployment ready ≠ desired | `sync_replicas` |
| `PodDown` (Evicted) | Pod `Failed` + evicted / node pressure | `restore_workload` |
Why this matters: log substring matching alone is biased toward whatever format your app prints. Production agents usually combine typed API objects + events + metrics. Experiment 4 is a stdlib-only stepping stone: swap signal ingestion for an informer later without changing the Healer contract.
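As a sketch of how the structured-signal path differs from substring matching, ordered predicates can be evaluated over a signal object. Field and rule names below are illustrative; the real rules live in `skills/k8s_signal_skills.py`.

```python
# Ordered declarative rules over a Kubernetes-shaped signal record (illustrative only).
K8S_SIGNAL_RULES = [
    ("ImagePullBackOff",  lambda s: s.get("record_type") == "pod" and s.get("reason") in ("ImagePullBackOff", "ErrImagePull")),
    ("SchedulingBlocked", lambda s: s.get("reason") == "FailedScheduling"),
    ("NodeNotReady",      lambda s: s.get("record_type") == "node" and s.get("ready") is False),
    ("PodDown",           lambda s: s.get("phase") == "Failed"),
]

def classify_signal(signal: dict):
    """First rule whose predicate holds wins; unmatched signals return None and are ignored."""
    for failure_type, predicate in K8S_SIGNAL_RULES:
        if predicate(signal):
            return failure_type
    return None

print(classify_signal({"record_type": "pod", "reason": "ErrImagePull", "phase": "Pending"}))
```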
Experiment 5 replays near_real_stream.json (from seed.py): 200 records with kube-style timestamps, optional multi-line / stack-ish prefixes, and sometimes JSON-shaped log lines; 20 failures are shuffled among 180 healthy records. It applies the same scoring rules as Experiment 3 (detect / false positives / resolve vs near_real_ground_truth.json). This is still synthetic text — it stress-tests the current substring policy, not your production corpus.
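For intuition, one noisy record in that stream looks roughly like the dictionaries below; field names are illustrative, and the authoritative shape is whatever `data/seed.py` writes to `near_real_stream.json`.

```python
# Illustrative noisy near-real records; the Watcher still reads only severity and message.
example_failure = {
    "timestamp": "2026-03-31T08:14:22.531Z",
    "severity": "ERROR",
    "message": ('E0331 08:14:22.531 pod/app-service-7d4f '
                '"Back-off restarting failed container"\n'
                '  at handler.process (worker.py:42)'),
}
example_healthy = {
    "timestamp": "2026-03-31T08:14:23.002Z",
    "severity": "INFO",
    "message": '{"event": "request_served", "response_time_ms": 41, "status": "within threshold"}',
}
```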
Adapters (under adapters/) are optional tools for local workflows and are not executed in CI:
| Script | Purpose |
|---|---|
| `adapters/observe.py` | JSON array file → Watcher only → JSONL detection lines (no Healer). |
| `adapters/lab_run.py` | Same file → Watcher + Healer on the simulator; use `--dry-run` to skip `ACTION_REGISTRY` side effects. |
For rollout tiers, charter, and game-day checklist (process only), see docs/GOVERNANCE.md.
For higher-fidelity local data without expanding CI scope, this repo includes a minimal pipeline:
- Bootstrap the lab and deploy a test workload (`lab/bootstrap_lab.ps1`).
- Inject deterministic failures (`lab/inject_failures.ps1`).
- Collect events/logs (`tools/collect_k8s_lab_data.py`).
- Normalize to GHOST replay shape + ground truth (`tools/normalize_external_capture.py`).
- Score with the same Watcher/Healer loop (`tools/run_external_replay.py` → `experiments/run_experiment_external.py`).
One-command wrapper (PowerShell): lab/collect_and_normalize.ps1.
This path is local-only and not wired into harness.py or CI.
Latest published local lab run report: docs/LAB_RUN_REPORT_20260331.md.
Continuous integration: every push and pull request to main runs seed.py and harness.py on Python 3.11 via GitHub Actions (see the CI badge at the top). harness.py first runs integrations/validate.py (stdlib check for contract files and core paths).
Locally, the same checks run with `python data/seed.py` followed by `python harness.py`. The harness drives five experiments:
| Experiment | What it proves | Expected outcome |
|---|---|---|
| 1 — Detection | Watcher finds all four failure types on clean logs | 4 / 4 scenarios PASS |
| 2 — Full loop | Healer applies correct mutations after each clean failure (infra reset per scenario) | 4 / 4 assertions PASS (memory, port, instances, restart semantics) |
| 3 — Mixed stream | 100 lines: 90 healthy + 10 injected failures | 10 / 10 detected, 0 false positives on healthy lines, 10 / 10 resolved vs ground truth |
| 4 — K8s signals | 6 structured signal records (2× image pull paths + scheduling + node + replicas + evicted pod) | 6 / 6 PASS |
| 5 — Near-real noisy stream | 200 lines: 180 healthy + 20 injected failures (noisy envelopes) | 20 / 20 detected, 0 false positives on healthy lines, 20 / 20 resolved vs ground truth |
Timing: On fast local hardware, reported detect/decide/act milliseconds may round to 0 ms; correctness is enforced by assertions, not wall-clock drama. Add delays in the generator or real I/O when you need representative latency distributions.
All runs append structured rows to metrics/results.db for downstream reporting or dashboards. Each successful harness run also appends a JSON summary to feedback_rows (policy versions, per-experiment pass flags, Experiment 3 and 5 counts) via metrics/feedback.py.
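If you want to inspect that ledger directly, a stdlib query sketch follows; only the `feedback_rows` table name is taken from this README, so check the actual schema before relying on column names.

```python
import sqlite3

# Read the most recent harness feedback summary from metrics/results.db.
# Print the table definition first, since column names here are not guaranteed.
conn = sqlite3.connect("metrics/results.db")
print(conn.execute("SELECT sql FROM sqlite_master WHERE name = 'feedback_rows'").fetchone())
print(conn.execute("SELECT * FROM feedback_rows ORDER BY rowid DESC LIMIT 1").fetchone())
conn.close()
```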
The repo now includes an executed lab report with concrete artifact paths and replay metrics:
Latest published highlights from that report:
- Initial external replay on captured lab data: `detected 6/11`, `resolved 6/11`, `false_positives 0`.
- Follow-up normalization fix (`BackOff` pull-image mapping) on the same run: `detected 11/11`, `resolved 11/11`, `false_positives 0`.
- Production meaning: the recall bottleneck was in normalization semantics, not in the core Watcher/Healer execution path.
Run all commands from repository root.
| Goal | Command |
|---|---|
| Generate all synthetic datasets | python data/seed.py |
| Run full CI-equivalent harness | python harness.py |
| Watcher-only on a file stream | python adapters/observe.py data/mixed_stream.json |
| Full loop in dry-run mode | python adapters/lab_run.py --dry-run data/near_real_stream.json |
| Full loop with simulator mutation | python adapters/lab_run.py data/mixed_stream.json |
| Bootstrap local K8s lab | ./lab/bootstrap_lab.ps1 |
| Inject deterministic lab failures | ./lab/inject_failures.ps1 |
| Collect + normalize + replay lab data | ./lab/collect_and_normalize.ps1 |
| Manual external replay | python tools/run_external_replay.py --data data/external/runs/<run-id>/normalized.json --ground-truth data/external/runs/<run-id>/ground_truth.json --record |
| Validate integration contract files only | python integrations/validate.py |
| Use case | What to run | Output / decision value |
|---|---|---|
| Validate policy correctness before any infra work | `python data/seed.py` then `python harness.py` | Reproducible pass/fail across five experiments; blocks policy regressions early. |
| Observe-only triage on captured logs | `python adapters/observe.py <path-to-json-array>` | Detection events only; no state mutation; safe for shadow analysis. |
| Dry-run autonomous response rehearsal | `python adapters/lab_run.py --dry-run <path-to-json-array>` | End-to-end detect/decide trace without applying actions. |
| Evaluate action correctness in simulator | `python adapters/lab_run.py <path-to-json-array>` | Simulated state transitions + post-heal verification outcomes. |
| Reproduce Kubernetes-style incidents locally | `./lab/bootstrap_lab.ps1`, `./lab/inject_failures.ps1`, `./lab/collect_and_normalize.ps1` | Captured artifacts + replay score on near-real local signals. |
| Measure external replay quality over time | `python tools/run_external_replay.py ... --record` | Detection/precision/resolution metrics appended for trend tracking. |
| Author policy updates safely | Edit `skills/` + run `python harness.py` + `python integrations/validate.py` | Enforces skills-as-policy boundary and integration contract completeness. |
| Prepare production rollout process | Fill `docs/GOVERNANCE.md` | Defines autonomy tiers, blast radius, and change control before live execution. |
Nothing stops you from using real or open-source log data — the project ships synthetic JSON by default for four practical reasons:
| Reason | Detail |
|---|---|
| Reproducibility | CI and contributors need identical inputs; pinned synthetic output from seed.py guarantees that. |
| Safety | Production logs routinely contain secrets, PII, and internal hostnames — they must not land in a public git history. |
| Licensing | Public “open” log corpora still carry terms (attribution, research-only, no redistribution). Compliance is your obligation when you import them. |
| Schema & labels | GHOST experiments expect structured records and (for scoring) known failure classes. Raw downloads need ETL and often manual or semi-automatic labeling. |
“Training” in this PoC does not mean neural-network training. The agents are explicit policies (substring / ordered rules + decision tables). Improving them means engineering: extend skills/watcher_skills.py and skills/k8s_signal_skills.py, validate with harness.py. A future ML layer would be a separate pipeline with its own data governance.
Where to put optional real or redacted samples locally: data/external/README.md — files there stay out of git by default (except that README). See the full operational guide in docs/HELP.md (FAQ: Can we download real scenarios from open-source log providers?).
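If you do drop an external corpus into `data/external/` locally, the practical step is a small ETL pass into the replay-style record shape plus a ground-truth file. The field mapping below is a hedged illustration, not the contract; the authoritative shapes are whatever `data/seed.py` generates and `tools/normalize_external_capture.py` emits.

```python
import json

def normalize_line(raw: dict) -> dict:
    """Map one external log record into a replay-style shape (illustrative field names)."""
    return {
        "timestamp": raw.get("time") or raw.get("ts"),
        "severity": (raw.get("level") or "INFO").upper(),
        "message": raw.get("msg") or raw.get("log", ""),
    }

raw_records = [{"time": "2026-03-31T08:14:22Z", "level": "error",
                "msg": 'Back-off pulling image "app:v2"'}]
print(json.dumps([normalize_line(r) for r in raw_records], indent=2))
```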
GHOST Phase 1 is a laboratory instrument, not a production controller. The ideas it embodies, however, map directly to how serious teams introduce automation safely.
What transfers well
- Explicit policy (versioned patterns + action tables) with separation from execution code — supports review, RBAC on changes, and post-incident audit (“what could the robot do?”).
- Closed-loop tests before prod: the same structure you see in Exp 2–3 should eventually run against staging APIs with frozen golden logs and expected state transitions.
- Fast, bounded remediation for known classes: restarts within caps, scale-out within limits, cache clears — actions that are reversible and idempotent when designed well.
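As an illustration of what "bounded and reversible" can mean in code (not taken from this repository), an action can carry its own hard cap and refuse to act beyond it:

```python
# Illustrative only: a remediation action with a hard cap, so repeated triggers stay bounded.
MAX_RESTARTS_PER_HOUR = 3

def restart_container(state: dict) -> dict:
    """Restart within a cap; beyond the cap, escalate instead of acting again."""
    if state.get("restarts_this_hour", 0) >= MAX_RESTARTS_PER_HOUR:
        return {"action": "escalate_to_human", "success": False, "reason": "restart cap reached"}
    state["restarts_this_hour"] = state.get("restarts_this_hour", 0) + 1
    return {"action": "restart_container", "success": True}

state = {"restarts_this_hour": 2}
print(restart_container(state))  # acts within the cap
print(restart_container(state))  # cap reached, escalates instead
```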
What production must add
| Risk in naive automation | Mitigation in mission-critical environments |
|---|---|
| Log substring false positives | Structured signals, alert correlation, rate limits, dry-run / canary, human approval for destructive classes |
| Blast radius | Hard quotas, multi-account isolation, circuit breakers, automatic rollback hooks |
| Unknown / correlated failures | Escalation paths, SLO-based policy, runbook coverage; LLM/heuristics after guardrails and retrieval — not instead of deterministic paths |
| Governance | IAM-bound actions, change windows, immutable audit trail, integration with ticketing and postmortems |
Practical tiers (how organizations usually evolve)
- Assisted ops — automation gathers context and proposes steps; humans execute risky changes.
- Guardrailed autonomy — small set of low-blast, reversible actions with hard caps and shadow mode first.
- Expanded policy — broader coverage only where harnesses and game days prove safety.
A fill-in template aligned to these ideas (charter, tier definitions, blast radius, drills) lives in docs/GOVERNANCE.md.
Bottom line: GHOST demonstrates that a deterministic autonomous loop can be built clearly and tested. For mission-critical workloads, the long-term value is shorter MTTR on known paths and less cognitive load on operators — provided automation is constrained, observable, and never the only line of defense.
Today’s PoC is intentionally small. The next step toward human-like troubleshooting under incomplete information is to reason across layers (logs, manifests, network, APIs, data) with specialist agents and a coordinator, not a single log grep.
- `docs/VISION_LAYERED_LEARNING.md` — layered failure model, partial observability, swarm-style roles (Hermes-like orchestration without claiming a product), topology-aware bias, an honest taxonomy of feedback loops, and how external development tooling (e.g. gstack) fits next to GHOST as policy/code authoring support — not unguarded prod operators.
- `metrics/feedback.py` — after each `harness.py` run, an append-only `feedback_rows` record is stored in `metrics/results.db` with pass/fail flags for all five experiments, layer tags (including `log_near_real_noisy`), and policy (skills) versions so batch jobs can correlate outcomes with policy state (hook for offline policy improvement — not online learning in agents).
Agents here do not perform online gradient descent; “learning” means closing the loop from verified outcomes into policy updates you promote through tests.
```bash
git clone https://github.com/beejak/GHOST-PoC.git
cd GHOST-PoC
python data/seed.py
python harness.py
```

Important: Generated JSON under `data/` is not committed (see `.gitignore`). Always run `seed.py` after a fresh clone before `harness.py`.
Optional: python data/seed.py --seed 123 — different shuffle of failures inside the mixed stream.
Runtime: Python 3.11+ recommended; 3.9+ may work with the current codebase. Phase 1 requires no pip install.
| Persona | Fastest path | Why this path |
|---|---|---|
| Operator / SRE evaluator | `python data/seed.py` -> `python harness.py` | Confirms baseline policy correctness before touching any lab tooling. |
| Policy author (skills editor) | Edit `skills/` -> `python data/seed.py` -> `python harness.py` | Ensures every rule change is validated across all five experiments. |
| Shadow-mode reviewer | `python adapters/observe.py data/mixed_stream.json` | Lets you inspect detections without mutation side effects. |
| Autonomy rehearsal owner | `python adapters/lab_run.py --dry-run data/near_real_stream.json` | Exercises full detect/decide flow while staying non-destructive. |
| Lab pipeline engineer | `./lab/bootstrap_lab.ps1` -> `./lab/inject_failures.ps1` -> `./lab/collect_and_normalize.ps1` | Produces replayable local K8s-derived data with measurable outcomes. |
| Governance / risk lead | Read `docs/GOVERNANCE.md` + `docs/LAB_RUN_REPORT_20260331.md` | Maps technical results to rollout tiers, blast radius, and controls. |
| Question | Short answer |
|---|---|
| Why no real logs in the repo? | Reproducibility, CI, licensing, and secret/PII risk — see Data: synthetic vs real above. |
| Can we download open-source log datasets to “train” the agents? | Yes, locally, if the license fits your use case. Today’s agents are rule-based; you refine skills and re-run the harness, not a model trainer. Normalize into the same JSON shape as generated clean_failures.json. |
| Harness failed on experiment N | Re-run python data/seed.py. If it persists, open docs/HELP.md → Quick troubleshooting and match the error pattern. Integration contract validation failed means integrations/validate.py exited non-zero (missing contract file or expected path). |
| Where is detailed help? | docs/HELP.md — troubleshooting table, extended FAQ, extension patterns, support pointers. |
| How do I change detection or healing? | Only via skills/ and simulator/infra_state.py; never duplicate tables inside agents/. |
Common fixes
- `FileNotFoundError` on `data/*.json` → run `python data/seed.py` from the repository root.
- Healthy baseline assertion failed → template overlap with patterns; adjust `data/seed.py` or `skills/watcher_skills.py`.
- All timings `0 ms` → expected on fast CPUs; assertions still prove correctness.
For incident-style walkthroughs, licensing notes, and a path to optional data/external/ workflows, read docs/HELP.md.
Attribution for external repositories and upstream projects referenced by this PoC:
| Project | Link | How it is used here |
|---|---|---|
| GHOST-PoC (this repository) | beejak/GHOST-PoC | Primary implementation, experiments, docs, and CI. |
| gstack | garrytan/gstack | Referenced for skill-oriented AI development workflows; integrated via compatibility docs and maintainer skill patterns under integrations/gstack/. |
| Hermes Agent (Nous Research) | NousResearch/hermes-agent | Referenced as an optional external agent runtime; this repo ships only integration contracts/policies in integrations/hermes/, not Hermes runtime code. |
Notes on scope and credit:
- This repo does not vendor third-party runtime source from gstack or Hermes.
- Integration is contract-based (policy files, prompts, maintainer guidance), with explicit upstream links for install and licensing.
- If additional external repos are adopted later, add them here with link, license, and exact usage boundary.
```
GHOST-PoC/
├── docs/
│   ├── HELP.md              # In-depth help, FAQ, real-log guidance, troubleshooting
│   └── GOVERNANCE.md        # Template: tiers, charter, game days (org process; not enforced in code)
├── adapters/                # Optional observe / lab_run (local files → agents)
├── lab/                     # Optional K8s lab scripts/manifests (bootstrap/inject/pipeline wrapper)
├── tools/                   # External data pipeline scripts (collect/normalize/replay)
├── integrations/            # Hermes + gstack-compatible contracts; validate.py (no LLM in CI)
├── skills/                  # Policy: log patterns, K8s signal rules, decision table
├── agents/                  # Watcher, K8s watcher & Healer (import skills only)
├── blackboard/              # Event bus (asyncio queue + validation)
├── simulator/               # Fake infra state + action implementations
├── data/
│   ├── seed.py              # Synthetic dataset generator
│   ├── generator.py         # Async JSON stream for harness
│   ├── scenarios.json       # Scenario metadata
│   └── external/            # Gitignored drops for redacted real samples (README only in git)
├── experiments/             # Experiment 1–5 runners
├── metrics/                 # SQLite recorder, reporter, harness feedback ledger
├── harness.py               # Single entrypoint: all experiments
├── Ghost PoC.md.txt         # Full build specification
├── README.md
├── LICENSE
└── requirements.txt         # Phase 2 placeholders only
```
| Document | When to read it |
|---|---|
| README.md (this file) | First-time orientation, architecture, validation summary, quick start. |
| docs/HELP.md | World-class operational help: troubleshooting matrix, full FAQ (including real vs synthetic logs), extension guide, support. |
| docs/VISION_LAYERED_LEARNING.md | Research architecture: layered failures, partial info, runtime swarm pattern, feedback roadmap, virtual dev team vs GHOST boundary. |
| docs/GOVERNANCE.md | Rollout template: autonomy tiers, policy change control, blast radius, game days (fill in for your org). |
| docs/LAB_RUN_REPORT_20260331.md | First executed local lab pipeline report (artifacts + replay metrics). |
| Ghost PoC.md.txt | Formal specification, definition of done, build order, synthetic vs real appendix. |
| data/external/README.md | Where optional local / redacted corpora go and what not to commit. |
| lab/README.md | Minimal lab workflow to generate external data and replay it locally. |
| tools/README.md | Collect / normalize / replay scripts for external datasets. |
| integrations/README.md | Hermes (Nous) tool policy + maintainer skill aligned with gstack; validate.py runs inside harness.py. |
| integrations/hermes/README.md | Installing Hermes upstream; mapping TOOL_POLICY.json to your tool config. |
| integrations/gstack/README.md | Vendoring / using the gstack-compatible maintainer skill next to upstream gstack. |
Licensed under the Apache License 2.0 — see LICENSE.
GHOST · Prove the loop in the lab. Earn the right to run it in production.
If you extend this work, preserve the skills-as-policy pattern — it is the primary maintainability and auditability lesson from Phase 1.