Recorded auditability and logging for expert and non-expert communication for processed flow #1793

### Is this a new feature, an improvement, or a change to existing functionality?

Improvement

### How would you describe the priority of this feature request

High

### Please provide a clear description of problem this feature solves

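This is a framework proposal, not code. The toolkit currently has no auditability or task tracking, both of which are essential for agent testing.

The proposal covers the management of agents that execute in parallel. The components below describe how h-cli handles this today.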

  Task Bus (bus.py) — State Machine

  queued → running → completed/failed/aborted/timed_out
  Every state transition:
  1. SET hcli:task:{id}:state with TTL
  2. PUBLISH hcli:task:{id}:notify for subscribers
  3. Sign results with HMAC-SHA256 to prevent spoofing
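
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of `bus.py`: the `FakeRedis` class, the `transition`, `sign`, and `verify` helpers, the 3600-second TTL, and the JSON payload shape are all assumptions made for the example. Only the key names and the state names come from the description above. With a real redis-py client, `get` returns bytes unless `decode_responses=True` is set.

```python
import hashlib
import hmac
import json

# Legal transitions, taken from the state diagram above.
# None -> "queued" models a task that has no state key yet (assumption).
TRANSITIONS = {
    None: {"queued"},
    "queued": {"running"},
    "running": {"completed", "failed", "aborted", "timed_out"},
}


class FakeRedis:
    """Hypothetical in-memory stand-in for the three client calls used below."""

    def __init__(self):
        self.store = {}
        self.published = []

    def set(self, key, value, ex=None):
        self.store[key] = (value, ex)

    def get(self, key):
        entry = self.store.get(key)
        return entry[0] if entry else None

    def publish(self, channel, message):
        self.published.append((channel, message))


def sign(secret: bytes, payload: dict) -> str:
    """Return a hex HMAC-SHA256 over a canonical JSON encoding of payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, payload: dict, signature: str) -> bool:
    """Compare in constant time so a forged or altered result is rejected."""
    return hmac.compare_digest(sign(secret, payload), signature)


def transition(client, task_id: str, new_state: str, ttl: int = 3600) -> None:
    """Validate the move, persist the state with a TTL, then notify subscribers."""
    key = f"hcli:task:{task_id}:state"
    current = client.get(key)
    if new_state not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new_state}")
    client.set(key, new_state, ex=ttl)
    client.publish(f"hcli:task:{task_id}:notify", new_state)
```

Rejecting illegal transitions at the bus is what makes the recorded state usable as an audit trail: a terminal state such as `completed` can never be overwritten by a late `running` update.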

  Dispatcher (dispatcher.py) — Concurrent Execution

  - ThreadPoolExecutor with semaphore gating (MAX_CONCURRENT_TASKS)
  - Per-chat serialization — tasks from the same chat are serialized via locks to protect session history
  - BLPOP on Redis queue with backpressure (semaphore acquired before popping)
  - Heartbeat file for health monitoring
  - Graceful SIGTERM shutdown with in-flight task drain
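
A rough sketch of how semaphore gating, per-chat locks, and drain-on-shutdown might fit together. It is not the code in `dispatcher.py`: the `Dispatcher` class and its attribute names are hypothetical, an in-process `queue.Queue` stands in for the Redis list and `BLPOP`, a `threading.Event` stands in for the SIGTERM handler, and the value of `MAX_CONCURRENT_TASKS` is assumed.

```python
import queue
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_TASKS = 4  # assumed value; the text names the setting only


class Dispatcher:
    """Hypothetical dispatcher: bounded concurrency, one task per chat at a time."""

    def __init__(self, handler, max_tasks=MAX_CONCURRENT_TASKS):
        self.handler = handler
        self.tasks = queue.Queue()  # stand-in for the Redis queue
        self.slots = threading.Semaphore(max_tasks)
        self.pool = ThreadPoolExecutor(max_workers=max_tasks)
        self.chat_locks = defaultdict(threading.Lock)
        self.locks_guard = threading.Lock()  # protects chat_locks itself
        self.stop = threading.Event()  # set by a SIGTERM handler in practice

    def _run(self, chat_id, payload):
        with self.locks_guard:
            lock = self.chat_locks[chat_id]
        try:
            with lock:  # tasks from the same chat never overlap
                self.handler(chat_id, payload)
        finally:
            self.slots.release()

    def serve(self):
        while not self.stop.is_set():
            self.slots.acquire()  # backpressure: take a slot before popping
            try:
                chat_id, payload = self.tasks.get(timeout=0.1)
            except queue.Empty:
                self.slots.release()
                continue
            self.pool.submit(self._run, chat_id, payload)
        self.pool.shutdown(wait=True)  # drain in-flight tasks before exiting
```

Acquiring the slot before popping means a task stays in the queue, where another dispatcher or a restart can still see it, until there is capacity to run it. Note that a plain lock gives mutual exclusion per chat but does not by itself guarantee arrival order.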

  Worker (worker.py) — Task Execution with Full Context

  - Builds per-task system prompts with session memory + skill injection
  - Spawns Claude as a subprocess with start_new_session=True (own process group for clean kill)
  - Abort mechanism: Redis pub/sub control channel per task, listener thread that os.killpg() on abort
  - Timeout enforcement (TASK_TIMEOUT, default 600s)
  - Session chunking to disk when size exceeds MAX_SESSION_BYTES
  - Conversation history tracking in Redis with idle sweep
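
The process-group, abort, and timeout behaviour described above could look something like the sketch below. It is an illustration under stated assumptions rather than the code in `worker.py`: `run_task` is a hypothetical helper, a `threading.Event` stands in for the Redis pub/sub control channel, and the polling interval and grace period are invented for the example. It relies on POSIX process groups, so it will not run on Windows.

```python
import os
import signal
import subprocess
import threading
import time

TASK_TIMEOUT = 600  # seconds; the default named in the text above


def run_task(cmd, timeout=TASK_TIMEOUT, abort=None, poll=0.05, grace=2.0):
    """Run cmd in its own process group and return (state, returncode)."""
    abort = abort or threading.Event()
    # start_new_session=True makes the child a session and group leader,
    # so its process group id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    deadline = time.monotonic() + timeout
    while True:
        code = proc.poll()
        if code is not None:
            return ("completed" if code == 0 else "failed", code)
        if abort.is_set():
            state = "aborted"
            break
        if time.monotonic() >= deadline:
            state = "timed_out"
            break
        time.sleep(poll)
    # Signal the whole group so helpers spawned by the child die with it.
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(proc.pid, sig)
        except ProcessLookupError:
            break  # the group is already gone
        try:
            proc.wait(timeout=grace)
            break
        except subprocess.TimeoutExpired:
            continue  # escalate from SIGTERM to SIGKILL
    proc.wait()
    return (state, proc.returncode)
```

The returned state maps directly onto the terminal states of the task bus, so the worker can report `aborted` and `timed_out` separately from an ordinary `failed` exit.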

  The ANNOUNCEMENT/REPLY Protocol (development process)

  This is separate from the runtime — it's how the codebase itself was built. The docs/decisions/ directory contains the actual
  artifacts:

  - architect-report-redis.md — The architect agent's report after 3 discussion rounds with 4 expert teams
  - core-reply.md — Core team's analysis of how the dispatcher split affects their MCP contract
  - orchestration-reply.md — Orchestration team's analysis of why single-container is correct
  - interface-reply.md, llm-reply.md — Other teams' responses

  These show the actual protocol in action:
  1. Architect pushes ANNOUNCEMENT.md to each team's branch
  2. Each team analyzes independently, pushes REPLY.md
  3. Architect synthesizes into architectural decisions (AD-1 through AD-12)
  4. Contracts between teams are documented explicitly

  For example, the architect-report shows how 4 AI expert teams debated MCP-over-Redis across 3 rounds and ultimately rejected it
  (AD-12), with clear reasoning from each team about risks, call chains, and container topology.

  ---
  The Fundamental Gap

| Capability | NeMo Agent Toolkit | h-cli |
| --- | --- | --- |
| Parallel execution | asyncio.gather() on tool calls | ThreadPoolExecutor + semaphore + per-chat locks |
| Task state tracking | None (fire and forget) | Full state machine with Redis persistence + TTLs |
| Result integrity | Plain string concatenation | HMAC-SHA256 signed results |
| Abort/cancel | Not supported | Redis control channel + os.killpg() |
| Crash recovery | Not supported | Startup scan marks orphaned running → failed |
| Agent coordination protocol | LLM chooses next tool via ReAct prompt | Git branches + Redis pub/sub + ANNOUNCEMENT/REPLY docs |
| Session continuity | None between agent calls | Redis session history + disk chunking + idle sweep |
| Health monitoring | None | Heartbeat file + Docker healthcheck |
| Observability | logger.info() with duration | Redis counters + TimescaleDB + Grafana |
| Notification | None | PUBLISH on every state transition |


### Describe your ideal solution

## The Setup

One tmux session. One operator. One architect agent. Eight expert teams — each an independent Claude instance in its own tmux pane.

```mermaid
flowchart LR
    OP["Operator\n(human)"] -->|"direction"| AR["Architect\n(Claude)"]
    AR -->|"tasks via\ngit + Redis"| teams
    teams -->|"branches +\ndone signals"| AR
    AR -->|"results"| OP

    subgraph teams ["Expert Teams — each a separate Claude instance"]
        T1["orchestration"] ~~~ T2["interface"] ~~~ T3["core"] ~~~ T4["llm"]
        T5["monitor"] ~~~ T6["hssh"] ~~~ T7["knowledge"] ~~~ T8["security"]
    end

    style OP fill:#c62828,color:#fff,stroke:#b71c1c
    style AR fill:#1565c0,color:#fff,stroke:#0d47a1
    style teams fill:#1a1a2e,color:#e0e0e0,stroke:#4a4a6a
```

## How It Works

The operator tells the architect what to build. The architect breaks it into scoped tasks, writes an `ANNOUNCEMENT.md` for each team, pushes branches, and notifies teams via Redis. Each expert reads its task, implements it in its own directory, pushes a branch with a `REPLY.md`, and signals done. The architect reviews, merges, and reports back.

No expert ever talks to another expert. All coordination flows through the architect.

```mermaid
sequenceDiagram
    participant O as Operator
    participant A as Architect
    participant R as Redis
    participant E1 as Core Team
    participant E2 as Interface Team

    O->>A: "Add output sanitization"
    A->>A: Create branches with ANNOUNCEMENT.md
    A->>R: PUBLISH round "core interface"
    A->>R: PUBLISH msg:core "pull branch, check ANNOUNCEMENT.md"
    A->>R: PUBLISH msg:interface "pull branch, check ANNOUNCEMENT.md"
    R->>E1: Task notification
    R->>E2: Task notification

    par Parallel execution
        E1->>E1: Read task, implement, push branch
        E2->>E2: Read task, implement, push branch
    end

    E1->>R: PUBLISH done "core"
    E2->>R: PUBLISH done "interface"
    R->>A: "All teams done: core interface"
    A->>A: Review branches, merge to main
    A->>O: Done — here's what changed
```

## The tmux Layout

```
┌─────────────────────────────────────────────────────┐
│ h-cli-development:architect    ← Architect agent    │
├─────────────────────────────────────────────────────┤
│ h-cli-development:orchestration                     │
│ h-cli-development:interface                         │
│ h-cli-development:core                              │
│ h-cli-development:llm          ← Expert agents      │
│ h-cli-development:monitor        (one per pane)     │
│ h-cli-development:hssh                              │
│ h-cli-development:knowledge                         │
│ h-cli-development:security                          │
├─────────────────────────────────────────────────────┤
│ h-cli-development:redis        ← Conductor          │
└─────────────────────────────────────────────────────┘
```

Each pane is a separate Claude Code instance. They share only the git repo and a Redis instance. The conductor (a small shell script on the Redis pane) tracks which teams have signaled done and notifies the architect when a round is complete.
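
The conductor's bookkeeping is small enough to sketch. The actual conductor is described as a shell script; the Python below only illustrates the logic, and the `RoundTracker` class and its method names are hypothetical. The message formats (`round "core interface"`, `done "core"`, and the "All teams done" notice) follow the sequence diagram above.

```python
class RoundTracker:
    """Tracks one round: which teams were asked and which have signaled done."""

    def __init__(self):
        self.expected = set()
        self.done = set()

    def start_round(self, teams: str) -> None:
        """Handle a `round` message such as "core interface"."""
        self.expected = set(teams.split())
        self.done = set()

    def signal_done(self, team: str):
        """Handle a `done` message; return the architect notice once complete."""
        if team not in self.expected:
            return None  # ignore signals from teams outside this round
        self.done.add(team)
        if self.done == self.expected:
            return "All teams done: " + " ".join(sorted(self.done))
        return None


tracker = RoundTracker()
tracker.start_round("core interface")
print(tracker.signal_done("core"))       # None: interface is still working
print(tracker.signal_done("interface"))  # All teams done: core interface
```

Because completion is judged against the set declared at the start of the round, a duplicate or stray `done` signal cannot end a round early.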

## The Rules

Strict conventions prevent chaos:

- **Experts stay in scope.** Each team owns one directory. Edits outside it are forbidden unless the task explicitly allows it.
- **Communication is async.** Tasks go out via git branches + Redis. Results come back via git branches + Redis. No shared state, no direct messaging.
- **Rounds are atomic.** The architect declares a round, all teams execute in parallel, all signal done, then the architect merges. No partial merges mid-round.
- **Main stays clean.** No communication artifacts (`ANNOUNCEMENT.md`, `REPLY.md`) reach the main branch. The architect strips them during merge.
- **Pull before work.** Every team pulls main before starting its task branch — prevents divergence.
- **Push before signal.** A "done" signal without a pushed branch is useless. Push first, signal second.
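
Two of these rules lend themselves to a mechanical check at merge time. The sketch below is one possible way to do it and is not taken from the h-cli repository: `scope_violations` and `mergeable_paths` are hypothetical helpers, and it assumes the list of changed paths has already been obtained from git and that the communication artifacts may sit anywhere in the tree.

```python
from pathlib import PurePosixPath

# Communication artifacts that must never reach main.
ARTIFACTS = {"ANNOUNCEMENT.md", "REPLY.md"}


def scope_violations(changed_paths, team_dir):
    """Return the changed paths that fall outside the team's directory."""
    root = PurePosixPath(team_dir)
    bad = []
    for p in changed_paths:
        path = PurePosixPath(p)
        if path.name in ARTIFACTS:
            continue  # artifacts are allowed on the task branch
        if root not in path.parents:
            bad.append(p)
    return bad


def mergeable_paths(changed_paths):
    """Drop the communication artifacts before merging to main."""
    return [p for p in changed_paths if PurePosixPath(p).name not in ARTIFACTS]


changed = ["core/sanitize.py", "interface/bot.py", "REPLY.md"]
print(scope_violations(changed, "core"))  # ['interface/bot.py']
print(mergeable_paths(changed))           # ['core/sanitize.py', 'interface/bot.py']
```

Comparing path components rather than string prefixes avoids treating a sibling directory such as `core2/` as part of `core/`.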

## What This Means

The entire codebase — 12 Docker services, 45 security hardening items, two network topologies, an Asimov-inspired AI firewall, session management, skill teaching, vector memory, and monitoring — was built through this process. One human steering, AI agents executing in parallel, strict protocols preventing them from stepping on each other.

The operator never wrote code. The architect never read implementation details. The experts never coordinated directly. Each role stayed in its lane, and the system grew commit by commit.

(private) 670+ commits. Zero merge conflicts from scope violations (after the first week).

### Additional context

https://github.com/h-network/h-cli/blob/main/docs/H-CLI-DEVELOPMENT-EXPLAINED.md

### Code of Conduct

- [x] I agree to follow this project's Code of Conduct
- [x] I have searched the [open feature requests](https://github.com/NVIDIA/NeMo-Agent-Toolkit/issues?q=is%3Aopen+is%3Aissue+label%3A%22feature+request%22%2Cimprovement%2Cenhancement) and have found no duplicates for this feature request
