This roadmap reflects current priorities and is updated as the project evolves. Items are roughly ordered by priority within each phase. For detailed proposals, see the
`design/` folder and the RFC process in CONTRIBUTING.md.
Goal: stable core protocol, first wave of cubes, compliance tooling.
- Core protocol: `Tool`, `Task`, `Benchmark`, `Observation`, `Action`
- `cube init` / `cube test` CLI
- Reference implementation: `counter-cube`
- Container backends (Docker, Modal, Daytona)
- First cubes landing:
  - Web agents: MiniWob ✅, WebArena-Verified ✅ (cube-harness#214), WorkArena ✅
  - Computer use (CUA): OSWorld ✅
  - SWE: SWE-bench Verified + Live ✅, TerminalBench 2 ✅, LiveCodeBench ✅
- Benchmark metadata schema: `BenchmarkMetadata` fields for homepage, citation, license, task count, and modality (`benchmark.py`)
- CUBE Stress Test: compliance checks and latency suite (`cube test cube-name`); nearly complete, see PR #22
- Unified resource backend: `VMBackend`/`VM` abstraction for cloud and local VM provisioning (`design/vm_backend.md`)
- Stable `v0.1` API: freeze core interfaces, tag release
- PyPI publication (`cube-standard`)
- Published documentation site
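
To make the benchmark metadata item above more concrete, here is a minimal sketch of what a record with those five fields could look like. It is an assumption for illustration only: the class name `BenchmarkMetadataSketch`, the field names and types, and the example values are hypothetical, and the actual `BenchmarkMetadata` in `benchmark.py` may be defined differently.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BenchmarkMetadataSketch:
    """Hypothetical stand-in for the roadmap's metadata fields.

    Not the real BenchmarkMetadata class; names and types are assumed.
    """

    homepage: str
    citation: str
    license: str
    task_count: int
    modality: str

    def __post_init__(self) -> None:
        # A benchmark cannot contain a negative number of tasks.
        if self.task_count < 0:
            raise ValueError("task_count must be non-negative")


# Placeholder values, not taken from any real benchmark.
example = BenchmarkMetadataSketch(
    homepage="https://example.org/my-benchmark",
    citation="Placeholder citation string",
    license="MIT",
    task_count=3,
    modality="web",
)

# Convert to a plain dict, e.g. for writing to JSON or a registry index.
record = asdict(example)
```

Keeping the record immutable and convertible to a plain dictionary is one possible design choice; it would make metadata easy to serialize and compare, but the project may prefer a different representation.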
Goal: integrate with major agent frameworks, grow to ~50 cubes.
- NemoGym integration — bidirectional: run CUBE cubes from NemoGym, expose NemoGym envs as cubes
- AgentBeats integration — leaderboard and evaluation pipeline connected to CUBE
- Other platform integrations — ongoing discussions with framework maintainers
- ~50 cubes, growing across categories
- RFC: streaming observations (`design/rfc_core_extensions.md`)
- RFC: better async task execution (`design/rfc_core_extensions.md`)
- RFC: multi-agent support (`design/rfc_core_extensions.md`)
- RFC: multi-dimensional rewards (`design/rfc_core_extensions.md`)
Goal: CUBE becomes the default interoperability layer for agent benchmarks. Exact scope TBD — to be discussed with the community.
- Large-scale cube registry — community-maintained index of CUBE-compatible benchmarks
- Cube discovery and install (`cube add <benchmark>`)
- Broader platform integrations (beyond Phase 2)
- Number of cubes: open-ended, driven by community adoption
Phase 3 priorities will be shaped by what the community builds in Phase 2. Join the discussions to help define it.
Have an idea that changes the core protocol? Open a GitHub Discussion or file a PR against design/. See CONTRIBUTING.md for the full process.
- Comment on existing GitHub Issues or open a new one
- Start a GitHub Discussion
- Submit an RFC draft in `design/` via PR
- Propose a benchmark for wrapping: flag a benchmark you'd like to see as a CUBE, or contribute one yourself
- Apply as a core contributor to help shape priorities directly