Make provenance and evidence traceability first-class for maestro (maestro multi-model coding tui web) #384

@haasonsaas

Summary

Carry source, decision, and output provenance through the main workflow so downstream agents can audit and cite it.

This issue was generated from an org-wide EvalOps mining pass on 2026-05-10 07:57 UTC. It combines live GitHub repo signals with a per-repo arXiv search. Treat the research links as grounding for a concrete implementation, not as a request for a literature review.

Repo Evidence

  • Repository description: Maestro — multi-model coding agent with TUI, web, IDE, Slack, and GitHub interfaces
  • Tree signals: 79 docs files, 19 workflows, 1 proto file, 737 test-like files.
  • AGENTS.md:31 includes latent-spec language: Critical: Consult .github/workflows/ (evals.yml, nx-ci.yml, release.yml) to mirror CI environments.
  • AGENTS.md:36 includes latent-spec language: * Full Test Suite: npx nx run maestro:test --skip-nx-cache (Builds tui + maestro-web automatically). Run after every code change. * Linting: bun run bun:lint (Biome + Eval Verifier). Run after every code change. * Runtime Commands: Avoid long-lived dev/watch servers (e.g., npm run dev) unless th
  • AGENTS.md:54 includes latent-spec language: 2. Never force-push to main. This rewrites shared history and breaks collaborators. 3. Atomic commits only. Each commit should be one logical change. Don't mix unrelated changes. 4. Never use --force or --force-with-lease on shared branches.
  • AGENTS.md:187 includes latent-spec language: 4. Stop tasks when done - they'll auto-cleanup on Maestro exit, but explicit stops are cleaner 5. Use restart policies for resilient services - ideal for dev servers that should recover from crashes 6. Direct execution is safer - omit shell parameter for simple commands without pipes
  • AGENTS.md:374 includes latent-spec language: 4. Wire handler in src/cli-tui/tui-renderer.ts: ```typescript
  • CLAUDE.md:31 includes latent-spec language: Critical: Consult .github/workflows/ (evals.yml, nx-ci.yml, release.yml) to mirror CI environments.

Research Grounding

Repo axes: tooling, evaluation, security, desktop

Search keywords: run, maestro, use, dev, bun, never, servers, commands, test, build, command, npx

  • arXiv:2508.07575v1 MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark (Shiqing Fan, Xichen Ding, Liang Zhang, Linjian Mo), 2025.
  • arXiv:2603.24943v1 FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol (Jie Zhu, Yimin Tian, Boyang Li, Kehao Wu, Zhongzhi Liang, Junhui Li), 2026.
  • arXiv:2508.12566v1 Help or Hurdle? Rethinking Model Context Protocol-Augmented Large Language Models (Wei Song, Haonan Zhong, Ziqi Ding, Jingling Xue, Yuekang Li), 2025.
  • arXiv:2602.01129v1 SMCP: Secure Model Context Protocol (Xinyi Hou, Shenao Wang, Yifan Zhang, Ziluo Xue, Yanjie Zhao, Cai Fu), 2026.
  • arXiv:2603.00123v1 CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers (Yannian Gu, Xizhuo Zhang, Linjie Mu, Yongrui Yu, Zhongzhen Huang, Shaoting Zhang), 2026.
  • arXiv:2507.19570v1 MCP4EDA: LLM-Powered Model Context Protocol RTL-to-GDSII Automation with Backend Aware Synthesis Optimization (Yiting Wang, Wanghao Ye, Yexiao He, Yiran Chen, Gang Qu, Ang Li), 2025.
  • arXiv:2604.13849v1 MCPThreatHive: Automated Threat Intelligence for Model Context Protocol Ecosystems (Yi Ting Shen, Kentaroh Toyoda, Alex Leung), 2026.
  • arXiv:2506.14683v2 Unified Software Engineering Agent as AI Software Engineer (Leonhard Applis, Yuntong Zhang, Shanchao Liang, Nan Jiang, Lin Tan, Abhik Roychoudhury), 2025.
  • arXiv:2503.23803v2 Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute (Yingwei Ma, Yongbin Li, Yihong Dong, Xue Jiang, Rongyu Cao, Jue Chen), 2025.
  • arXiv:2506.19998v1 Doc2Agent: Scalable Generation of Tool-Using Agents from API Documentation (Xinyi Ni, Haonan Jian, Qiuyang Wang, Vedanshi Chetan Shah, Pengyu Hong), 2025.

What To Build

  • Add stable identifiers for source records, derived decisions, and emitted outputs.
  • Thread those identifiers through logs/events/API responses without leaking secrets.
  • Provide a query or debug surface that reconstructs the chain for one completed workflow.
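A minimal sketch of the three items above, under stated assumptions: the names `ProvenanceLog`, `stableId`, `redact`, and the `SECRET_KEYS` list are hypothetical and do not exist in maestro today. IDs are derived from record content after redaction, so the same input yields the same ID and secret values never influence or appear in it. The FNV-1a hash (computed here over UTF-16 code units) is used only to keep the example dependency-free; a real implementation would more likely use a cryptographic digest.

```typescript
type Kind = "source" | "decision" | "output";

interface ProvenanceRecord {
  id: string;
  kind: Kind;
  parents: string[]; // IDs of the records this one was derived from
  fields: Record<string, string>; // already redacted
}

// Hypothetical deny-list; the real set of sensitive keys is repo-specific.
const SECRET_KEYS = new Set(["apiKey", "token", "authorization"]);

// 32-bit FNV-1a over UTF-16 code units (non-cryptographic, illustration only).
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16).padStart(8, "0");
}

function redact(fields: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(fields)) {
    out[key] = SECRET_KEYS.has(key) ? "[redacted]" : value;
  }
  return out;
}

// Canonicalize (sorted parents, sorted keys) so the ID is order-independent.
function stableId(kind: Kind, parents: string[], fields: Record<string, string>): string {
  const canonical = JSON.stringify([
    kind,
    [...parents].sort(),
    Object.entries(fields).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)),
  ]);
  return `${kind}_${fnv1a(canonical)}`;
}

class ProvenanceLog {
  private readonly records = new Map<string, ProvenanceRecord>();

  add(kind: Kind, parents: string[], fields: Record<string, string>): ProvenanceRecord {
    for (const parent of parents) {
      if (!this.records.has(parent)) throw new Error(`unknown parent id: ${parent}`);
    }
    const safe = redact(fields);
    const record: ProvenanceRecord = {
      id: stableId(kind, parents, safe),
      kind,
      parents: [...parents],
      fields: safe,
    };
    this.records.set(record.id, record);
    return record;
  }

  // Reconstructs the chain for one record, ancestors first.
  chain(id: string): ProvenanceRecord[] {
    const ordered: ProvenanceRecord[] = [];
    const seen = new Set<string>();
    const visit = (current: string): void => {
      if (seen.has(current)) return;
      seen.add(current);
      const record = this.records.get(current);
      if (!record) throw new Error(`unknown record id: ${current}`);
      record.parents.forEach(visit);
      ordered.push(record);
    };
    visit(id);
    return ordered;
  }
}

// Usage: one source, one derived decision, one emitted output.
const log = new ProvenanceLog();
const src = log.add("source", [], { path: "AGENTS.md", token: "s3cret" });
const dec = log.add("decision", [src.id], { action: "run-lint" });
const out = log.add("output", [dec.id], { status: "passed" });
console.log(log.chain(out.id).map((r) => `${r.kind}:${r.id}`));
```

Hashing after redaction is a deliberate choice in this sketch: two runs that differ only in a credential value produce the same source ID, which keeps IDs comparable across environments.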

Acceptance Criteria

  • A short design note names the repo-specific workflow, threat or correctness model, and the research assumptions being adopted.
  • A runnable check, fixture, or verifier exercises the new contract in CI or an equivalent local command documented in the repo.
  • The implementation emits or stores enough evidence for a downstream agent/operator to cite inputs, decisions, and outputs.
  • At least one negative/degraded-mode case is covered so failures are observable rather than silently accepted.
  • Documentation links the new behavior to the relevant EvalOps platform primitive or explicitly records why this repo remains standalone.
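One way the runnable check and the negative/degraded-mode criterion could look is sketched below. It assumes a hypothetical `TraceRecord` shape and a hypothetical `verifyTrace` function, neither of which is part of maestro or its Eval Verifier; the point is only that a broken chain produces explicit findings instead of passing silently.

```typescript
interface TraceRecord {
  id: string;
  kind: "source" | "decision" | "output";
  parents: string[];
}

interface Finding {
  id: string;
  problem: string;
}

// Reports dangling parent references and outputs that cannot be traced
// back to any source record. An empty result means the trace is citable.
function verifyTrace(records: TraceRecord[]): Finding[] {
  const byId = new Map<string, TraceRecord>(
    records.map((r): [string, TraceRecord] => [r.id, r]),
  );
  const findings: Finding[] = [];

  // Depth-first search towards a source; `seen` guards against cycles.
  const reachesSource = (id: string, seen: Set<string>): boolean => {
    const record = byId.get(id);
    if (!record || seen.has(id)) return false;
    if (record.kind === "source") return true;
    seen.add(id);
    return record.parents.some((parent) => reachesSource(parent, seen));
  };

  for (const record of records) {
    for (const parent of record.parents) {
      if (!byId.has(parent)) {
        findings.push({ id: record.id, problem: `dangling parent ${parent}` });
      }
    }
    if (record.kind === "output" && !reachesSource(record.id, new Set<string>())) {
      findings.push({ id: record.id, problem: "output has no source ancestor" });
    }
  }
  return findings;
}

// Healthy fixture: every output traces back to a source.
const healthy = verifyTrace([
  { id: "s1", kind: "source", parents: [] },
  { id: "d1", kind: "decision", parents: ["s1"] },
  { id: "o1", kind: "output", parents: ["d1"] },
]);

// Degraded fixture: the decision record was lost, so the output is unciteable.
const degraded = verifyTrace([
  { id: "s1", kind: "source", parents: [] },
  { id: "o1", kind: "output", parents: ["d1"] },
]);

console.log(healthy.length, degraded.map((f) => f.problem));
```

A check like this could be run from a test file or a CI step; the degraded fixture is the part that satisfies the "observable rather than silently accepted" criterion.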

Notes

  • Generated issue 2/5 for evalops/maestro by evalops_org_miner.py.
  • Before implementation, confirm the sampled latent-spec snippets still match main; this issue intentionally cites exact file paths/lines where the mining pass saw them.
