Open source harness for building and evaluating AI agents using the CUBE Standard.
CUBE Standard defines the benchmark protocol. cube-harness is the evaluation runtime: it runs agents against any CUBE-compatible benchmark, records trajectories, and scales execution with Ray.
> [!NOTE]
> cube-harness is in active development (alpha). Interfaces may change. We welcome early adopters and contributors who want to shape the framework, not just use it. See our Roadmap and Contributing Guide.
Have a benchmark to contribute? Fill out this short form — no commitment required. Want to go deeper? Apply to join the core team.
```bash
# Clone the repository
git clone https://github.com/The-AI-Alliance/cube-harness.git
cd cube-harness

# Install dependencies
make install
```

Set your OpenAI API key:
```bash
export OPENAI_API_KEY=your-key-here
```

Any LiteLLM-supported provider works — just change `model_name` in the recipe.
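Before launching anything, you can sanity-check that your key and model string work by calling LiteLLM directly. This is just an illustrative check, assuming `litellm` is importable in your environment; the model string is the same kind of value a recipe's `model_name` expects:

```python
# Sanity check: confirm LiteLLM can reach your provider with the same
# model string you would put in a recipe's model_name field.
import litellm

response = litellm.completion(
    model="gpt-5-mini",  # or any other LiteLLM provider string
    messages=[{"role": "user", "content": "Reply with 'ok'."}],
)
print(response.choices[0].message.content)
```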
Run the test suite to verify your setup:

```bash
make test
```

The `hello_miniwob` recipe demonstrates running a ReAct agent on the MiniWob benchmark.
Start here — 2 tasks, sequential (fast, no Ray required):
```bash
make debug
```

Full benchmark (all 125 tasks, parallel via Ray):
```bash
make hello
```

This will:
- Launch a headless browser environment
- Run a ReAct agent powered by gpt-5-mini on MiniWob tasks
- Save trajectories and results to `~/cube_harness_results/{YYYYMMDD_HHMMSS}_react_miniwob/`
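Each run gets its own timestamped directory, so past experiments are easy to enumerate. A small sketch, assuming only the path pattern shown above (the internal layout of each run directory is not specified here):

```python
# List past MiniWob experiment runs by their timestamped directories.
from pathlib import Path

results_root = Path.home() / "cube_harness_results"
for run_dir in sorted(results_root.glob("*_react_miniwob")):
    print(run_dir.name)
```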
Recipes are the configuration. Copy one from `recipes/`, edit what you need, and run it. Config objects are typed Pydantic models — serialized to disk with every experiment so results are always reproducible.
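For a feel of what a typed recipe looks like, here is a minimal sketch. The class and field names are hypothetical (check `recipes/` for the real ones); the point is that a recipe is plain Python built on Pydantic models:

```python
# Hypothetical recipe sketch -- class and field names are illustrative,
# not the actual cube-harness API. See recipes/ for real examples.
from pydantic import BaseModel

class AgentConfig(BaseModel):
    model_name: str = "gpt-5-mini"  # any LiteLLM provider string
    max_steps: int = 20

class ExperimentConfig(BaseModel):
    benchmark: str = "miniwob"
    agent: AgentConfig = AgentConfig()
    parallel: bool = False  # True -> run episodes via Ray

config = ExperimentConfig()
# Typed configs serialize cleanly; the copy written next to the
# results is what makes every experiment reproducible.
print(config.model_dump_json(indent=2))
```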
See docs/configuration.md for the full philosophy, a comparison with Hydra/YAML/CLI approaches, and how to run sweeps.
cube-harness includes a Gradio-based XRay UI for exploring experiment results, trajectories, and OpenTelemetry spans:
```bash
make xray
# or: uv run ch-xray
```

The viewer displays:
- Trajectory list — all runs with task ID, steps, reward, and duration
- Visual timeline — color-coded steps (blue=environment, green=agent) with duration-based widths
- Screenshots — environment state at each step
- Step details — observations, agent actions, and LLM reasoning
- Debug data — raw JSON, LLM calls, and tool configurations
cube-harness is a universal evaluation platform for agentic benchmarks and an RL data generation framework built on top of the CUBE Standard. Its core abstractions (a minimal code sketch follows the list):
- Agent — LLM-powered decision maker that receives observations and produces actions
- Environment — Executes actions, provides observations and rewards (tool + task composition)
- Tool — Modular action provider that exposes an action space, reusable across benchmarks
- ActionSpace — Defines the set of possible actions a tool can execute
- Task — Defines goals, validation logic, and action subsets
- Trajectory — Stores interaction history (observations, actions, rewards)
- Episode — Single agent-environment loop for one task; records a trajectory
- Benchmark — Collection of tasks; produces env configs for episodes
- Experiment — Coordinates execution of multiple episodes across a benchmark
- ExpRunner — Execution runtime (sequential or parallel via Ray)
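To make the relationships concrete, here is a minimal sketch of how an episode ties an agent, an environment, and a trajectory together. All names are simplified stand-ins, not the real cube-harness interfaces:

```python
# Minimal episode loop sketch (hypothetical names, not the real API):
# the agent maps observations to actions, the environment executes them,
# and the trajectory records every step for analysis or RL training.
from dataclasses import dataclass, field
from typing import Any, Protocol

class Agent(Protocol):
    def act(self, observation: Any) -> Any: ...

class Environment(Protocol):
    def reset(self) -> Any: ...
    def step(self, action: Any) -> tuple[Any, float, bool]: ...

@dataclass
class Trajectory:
    steps: list[dict] = field(default_factory=list)

def run_episode(agent: Agent, env: Environment) -> Trajectory:
    trajectory = Trajectory()
    observation = env.reset()
    done = False
    while not done:
        action = agent.act(observation)
        observation, reward, done = env.step(action)
        trajectory.steps.append(
            {"action": action, "observation": observation, "reward": reward}
        )
    return trajectory
```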
- Benchmark Agnostic — Plug in any CUBE-standard benchmark (MiniWob, WebArena, OSWorld, …) via the `Benchmark` interface
- Agent Agnostic — Support any agent architecture by implementing the `Agent` protocol (see the sketch after this list)
- RL-Ready — Trajectory format designed for training data generation with full LLM call logging
- Scalable — Ray integration for parallel episode execution across multiple workers
- Observable — Structured trajectory output for analysis and debugging
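As a concrete illustration of the Agent-agnostic point, anything with the right method surface can plug in. Continuing the hypothetical sketch above (again, not the real protocol), a trivial scripted agent needs no LLM at all:

```python
# A scripted agent satisfying the hypothetical Agent protocol sketched
# earlier -- real agents would call a model (e.g. via LiteLLM) in act().
class ScriptedAgent:
    def __init__(self, script: list):
        self.script = list(script)

    def act(self, observation):
        # Ignore the observation; replay a fixed action sequence.
        return self.script.pop(0) if self.script else "noop"
```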
```bash
make format    # Format code
make lint      # Lint and auto-fix
make help      # Show all commands
make test      # Run tests
make coverage  # Run tests with coverage report
```

Install once after cloning to get ruff lint/format, trailing-whitespace checks, and DCO sign-off enforcement on every commit:

```bash
pre-commit install --hook-type pre-commit --hook-type commit-msg --hook-type prepare-commit-msg
```

The `prepare-commit-msg` hook automatically appends `Signed-off-by: Your Name <email>` to every commit message (required by the DCO). You can also sign off manually with `git commit -s`.
```
cube-harness/
├── src/cube_harness/   # Source code for the framework
├── tests/              # Test suite
├── recipes/            # Example recipes and configurations
├── docs/               # Project documentation
└── Makefile            # Common task shortcuts
```
All contributions are welcome — open an issue, submit a PR, or wrap a new benchmark. See CONTRIBUTING.md for the development guide, DCO requirements, and RFC process.
Want deeper involvement? Join the core team, shape the roadmap, and get credit for what you build. Apply here.
For general AI Alliance contribution guidelines, see the community repo.


