| 1 | +# Enterprise Architecture |
| 2 | + |
| 3 | +## Purpose |
| 4 | + |
| 5 | +This document defines the target architecture for deploying `codex-telegram-claws` as an engineering assistant for a financial enterprise with multiple subsidiary CTO teams. The current repository is a strong single-host beta; the enterprise target is a controlled multi-host platform. |
| 6 | + |
| 7 | +## Target Operating Model |
| 8 | + |
| 9 | +- One central Telegram control plane, owned by the platform team. |
| 10 | +- One worker agent per company, business unit, or regulated environment. |
| 11 | +- Each worker runs close to its own repositories, Codex CLI, MCP servers, and secrets. |
| 12 | +- The control plane never executes local shell or git actions directly against remote business units. |
| 13 | + |
| 14 | +## Logical Architecture |
| 15 | + |
| 16 | +```text |
| 17 | +Telegram User |
| 18 | +  -> Control Plane API / Bot Gateway |
| 19 | +    -> Identity + RBAC + Policy Engine |
| 20 | +    -> Audit Log + Event Bus |
| 21 | +    -> Worker Registry |
| 22 | +      -> Subsidiary Worker A |
| 23 | +        -> Codex CLI |
| 24 | +        -> MCP Servers |
| 25 | +        -> Git / CI / Safe Shell |
| 26 | +      -> Subsidiary Worker B |
| 27 | +      -> Subsidiary Worker C |
| 28 | +``` |
| 29 | + |
| 30 | +## Core Components |
| 31 | + |
| 32 | +### Control Plane |
| 33 | + |
| 34 | +- Terminates Telegram traffic and normalizes commands. |
| 35 | +- Resolves tenant, user role, target worker, and policy set. |
| 36 | +- Stores chat state, project selection, model override, and approval state. |
| 37 | +- Emits immutable audit events for every privileged operation. |
| 38 | + |
| 39 | +### Worker Agent |
| 40 | + |
| 41 | +- Runs on the subsidiary-owned host or VPC. |
| 42 | +- Owns local `node-pty`, Codex CLI, repo checkout, shell allowlist, MCP clients, and GitHub access. |
| 43 | +- Accepts signed task requests from the control plane. |
| 44 | +- Returns structured task events, output chunks, status, and final result. |
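The request and event flow above could be typed roughly as follows. This is a minimal sketch, not the repository's actual contract: `TaskRequest`, `TaskEvent`, `TaskSummary`, and `summarize` are hypothetical names, and signature verification is left out.

```typescript
// Hypothetical wire types for illustration; none of these names come from the repository.
interface TaskRequest {
  taskId: string;
  tenant: string;
  subagent: "codex" | "github" | "mcp";
  prompt: string;
  signature: string; // produced by the control plane; verification is not shown here
}

// Discriminated union: the `kind` field tells the receiver which fields exist.
type TaskEvent =
  | { kind: "status"; taskId: string; status: "queued" | "running" }
  | { kind: "output"; taskId: string; chunk: string }
  | { kind: "result"; taskId: string; ok: boolean };

interface TaskSummary {
  output: string;
  finished: boolean;
  ok: boolean;
}

// Fold a worker's event stream into a summary the control plane could persist.
// Events that belong to a different task are ignored.
function summarize(request: TaskRequest, events: TaskEvent[]): TaskSummary {
  const summary: TaskSummary = { output: "", finished: false, ok: false };
  for (const event of events) {
    if (event.taskId !== request.taskId) continue;
    switch (event.kind) {
      case "output":
        summary.output += event.chunk;
        break;
      case "result":
        summary.finished = true;
        summary.ok = event.ok;
        break;
      case "status":
        break; // progress only; nothing to accumulate
    }
  }
  return summary;
}
```

Typing events as a union keyed on `kind` lets the compiler check that each event variant is handled with the right fields, which is one way to keep the worker contract stable across releases.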
| 45 | + |
| 46 | +### Policy Engine |
| 47 | + |
| 48 | +- Controls who can use which model, worker, repo, MCP server, shell command, and GitHub operation. |
| 49 | +- Enforces read-only vs write permissions. |
| 50 | +- Requires approval for dangerous actions such as `git push`, repo creation, or production release actions. |
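A default-deny evaluator is one way to express these rules. The sketch below is illustrative only: the role names come from the Security Baseline section, while the action strings (`git.push`, `repo.create`, and so on), the `PolicyRule` shape, and the rule set itself are assumptions.

```typescript
type Role = "platform_admin" | "subsidiary_cto" | "reviewer" | "auditor";
type Decision = "allow" | "deny" | "needs_approval";

// Hypothetical rule shape; action names are made up for this example.
interface PolicyRule {
  roles: Role[];
  actions: string[];
  requiresApproval: boolean;
}

const exampleRules: PolicyRule[] = [
  // Dangerous write operations: permitted roles still need an approval.
  { roles: ["platform_admin", "subsidiary_cto"], actions: ["git.push", "repo.create"], requiresApproval: true },
  // Read-only operations: allowed without approval.
  { roles: ["platform_admin", "subsidiary_cto", "reviewer"], actions: ["git.status", "git.diff"], requiresApproval: false },
  { roles: ["auditor"], actions: ["audit.read"], requiresApproval: false },
];

// The first matching rule wins; anything without a matching rule is denied.
function evaluate(role: Role, action: string, rules: PolicyRule[] = exampleRules): Decision {
  for (const rule of rules) {
    if (rule.roles.indexOf(role) !== -1 && rule.actions.indexOf(action) !== -1) {
      return rule.requiresApproval ? "needs_approval" : "allow";
    }
  }
  return "deny";
}
```

Returning a three-valued decision rather than a boolean keeps the approval step explicit, so the caller cannot mistake "needs approval" for "allowed".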
| 51 | + |
| 52 | +### Audit And Observability |
| 53 | + |
| 54 | +- Append-only audit trail for commands, approvals, model usage, and output status. |
| 55 | +- Structured logs, metrics, and health endpoints per worker. |
| 56 | +- Export path to SIEM or internal compliance tooling. |
| 57 | + |
| 58 | +## Security Baseline |
| 59 | + |
| 60 | +- Replace `ALLOWED_USER_IDS`-only trust with SSO/OIDC-backed identity. |
| 61 | +- Add RBAC roles such as `platform_admin`, `subsidiary_cto`, `reviewer`, and `auditor`. |
| 62 | +- Store tokens in Vault, KMS, or another enterprise secret manager. |
| 63 | +- Require one service account per worker host. |
| 64 | +- Enforce one polling instance per bot token, or move the control plane to webhooks. |
| 65 | +- Keep shell disabled by default. Enable it only through per-worker policy. |
| 66 | + |
| 67 | +## Subagent Strategy |
| 68 | + |
| 69 | +Subagents remain the control-plane execution units. In the enterprise model: |
| 70 | + |
| 71 | +- `codex` stays the coding execution surface. |
| 72 | +- `github` becomes a governed change-management subagent. |
| 73 | +- `mcp` becomes a governed enterprise context subagent. |
| 74 | +- New subagents should cover architecture review, security control review, release governance, and dependency risk review. |
| 75 | + |
| 76 | +Subagents should be triggered only after: |
| 77 | + |
| 78 | +- policy validation |
| 79 | +- worker selection |
| 80 | +- tenant and repo authorization |
| 81 | +- optional approval checks for high-risk actions |
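The preconditions above can be modelled as an ordered list of gates, where the first failure stops dispatch. This is a sketch under assumptions: `TaskContext`, `Gate`, `allowedRepos`, and `firstFailedGate` are hypothetical names, the tenant and repo values are invented, and the policy gate is a placeholder for a real policy check.

```typescript
// Hypothetical context assembled by the control plane before dispatch.
interface TaskContext {
  tenant: string;
  repo: string;
  workerId: string | null; // null when no worker could be selected
  policyOk: boolean;       // placeholder for the outcome of policy validation
  highRisk: boolean;
  approved: boolean;
}

interface Gate {
  name: string;
  check: (ctx: TaskContext) => boolean;
}

// Illustrative tenant-to-repo authorization table.
const allowedRepos: Record<string, string[]> = {
  "tenant-a": ["payments-api"],
};

// Gates run in the order listed in the text.
const gates: Gate[] = [
  { name: "policy validation", check: (ctx) => ctx.policyOk },
  { name: "worker selection", check: (ctx) => ctx.workerId !== null },
  {
    name: "tenant and repo authorization",
    check: (ctx) => (allowedRepos[ctx.tenant] || []).indexOf(ctx.repo) !== -1,
  },
  // Approval is only required for high-risk actions.
  { name: "approval", check: (ctx) => !ctx.highRisk || ctx.approved },
];

// Returns the name of the first gate that fails, or null if the subagent may run.
function firstFailedGate(ctx: TaskContext): string | null {
  for (const gate of gates) {
    if (!gate.check(ctx)) return gate.name;
  }
  return null;
}
```

Reporting the name of the failing gate, rather than a bare `false`, gives the audit trail a concrete reason for every rejected task.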
| 82 | + |
| 83 | +## Recommended Deployment Phases |
| 84 | + |
| 85 | +### Phase 1: Harden Current Single-Host Beta |
| 86 | + |
| 87 | +- Migrate core runtime modules to TypeScript. |
| 88 | +- Add structured logs and machine-readable health output. |
| 89 | +- Add real Telegram regression checks beyond `getMe`. |
| 90 | +- Introduce approval gates for dangerous shell and GitHub operations. |
| 91 | + |
| 92 | +### Phase 2: Introduce Control Plane + Worker Split |
| 93 | + |
| 94 | +- Move Telegram bot logic into a central service. |
| 95 | +- Convert the current runtime into a worker daemon with a signed RPC interface. |
| 96 | +- Persist chat state and audit events in a database instead of a local JSON file. |
| 97 | + |
| 98 | +### Phase 3: Enterprise Governance |
| 99 | + |
| 100 | +- Integrate SSO/OIDC, RBAC, and centralized policy. |
| 101 | +- Add multi-tenant worker registry and tenant-scoped routing. |
| 102 | +- Add formal release, rollback, and disaster recovery procedures. |
| 103 | + |
| 104 | +## TypeScript Recommendation |
| 105 | + |
| 106 | +For enterprise rollout, migrate the following first: |
| 107 | + |
| 108 | +- `src/config.js` |
| 109 | +- `src/orchestrator/router.js` |
| 110 | +- `src/orchestrator/skillRegistry.js` |
| 111 | +- `src/orchestrator/mcpClient.js` |
| 112 | +- `src/runner/ptyManager.js` |
| 113 | +- `src/runner/shellManager.js` |
| 114 | + |
| 115 | +TypeScript matters here because config shape, skill contracts, worker RPC payloads, and audit event schemas must remain stable across teams and releases. |
| 116 | + |
| 117 | +## First-Time Installation Guidance For Subsidiaries |
| 118 | + |
| 119 | +- Install Node.js 20+ and Codex CLI. |
| 120 | +- Complete `codex login` on the worker host before starting the bot. |
| 121 | +- Use a dedicated service account and a dedicated bot token per environment. |
| 122 | +- Set `WORKSPACE_ROOT`, `CODEX_WORKDIR`, and `GITHUB_DEFAULT_WORKDIR` to controlled directories only. |
| 123 | +- Start with `SHELL_ENABLED=false`. |
| 124 | +- Run: |
| 125 | + |
| 126 | +```bash |
| 127 | +npm install |
| 128 | +npm run ci |
| 129 | +npm run healthcheck:strict |
| 130 | +npm run telegram:smoke |
| 131 | +``` |
| 132 | + |
| 133 | +- Deploy with PM2 or another formal supervisor, not an ad hoc terminal session. |
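A worker could validate these settings at startup and fail fast when one is missing. The sketch below is not the repository's `src/config.js`: `WorkerConfig` and `loadConfig` are hypothetical, and treating only the literal string `true` as enabling the shell is an assumption chosen so that anything else keeps it off.

```typescript
// Hypothetical typed view of the settings named above.
interface WorkerConfig {
  workspaceRoot: string;
  codexWorkdir: string;
  githubDefaultWorkdir: string;
  shellEnabled: boolean;
}

type Env = Record<string, string | undefined>;

// Build a config from an environment-like object, throwing on missing values.
function loadConfig(env: Env): WorkerConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error("Missing required setting: " + name);
    return value;
  };
  return {
    workspaceRoot: required("WORKSPACE_ROOT"),
    codexWorkdir: required("CODEX_WORKDIR"),
    githubDefaultWorkdir: required("GITHUB_DEFAULT_WORKDIR"),
    // Assumption: shell stays disabled unless explicitly set to "true".
    shellEnabled: env["SHELL_ENABLED"] === "true",
  };
}
```

In a Node.js process this could be called as `loadConfig(process.env)`; passing the environment in as a parameter keeps the function easy to test.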
| 134 | + |
| 135 | +## Current Gap Summary |
| 136 | + |
| 137 | +The current repository already has: |
| 138 | + |
| 139 | +- PTY fallback and PTY preflight repair |
| 140 | +- per-project chat context |
| 141 | +- MCP and GitHub subagents |
| 142 | +- local health checks, CI, smoke checks, and release workflow |
| 143 | + |
| 144 | +It still lacks: |
| 145 | + |
| 146 | +- multi-worker control plane |
| 147 | +- enterprise identity and RBAC |
| 148 | +- approvals and policy enforcement |
| 149 | +- centralized audit storage |
| 150 | +- tenant isolation |
| 151 | +- TypeScript contracts for long-term maintainability |