19 changes: 19 additions & 0 deletions examples/openclaw-agent/openclaw-cron-timeout.example.json
@@ -0,0 +1,19 @@
{
  "$schema": "https://openclaw.dev/schemas/config.json",
  "name": "aegis-cron-timeout-shim",
  "description": "Reference OpenClaw config snippet demonstrating the models.providers.<provider>.timeoutSeconds knob that #4808 uses to raise the per-provider timeout ceiling for non-trivial isolated agentTurn cron payloads. Apply with scripts/devops/add-cron-timeout-overrides.sh.",
  "models": {
    "mode": "merge",
    "providers": {
      "minimax-portal": {
        "timeoutSeconds": 600
      },
      "kimi": {
        "timeoutSeconds": 600
      },
      "zai": {
        "timeoutSeconds": 600
      }
    }
  }
}
115 changes: 115 additions & 0 deletions scripts/devops/README.md
@@ -0,0 +1,115 @@
# Cron Timeout Override Shim

**Issue:** [#4808](https://github.com/OneStepAt4time/aegis/issues/4808) — Lane B of [#4755](https://github.com/OneStepAt4time/aegis/issues/4755).

## Problem

Non-trivial `isolated agentTurn` cron payloads time out per-provider during
the OpenClaw sequential fallback chain. Observed: on complex multi-step
workloads (release-please pre-flight), each of the 5 providers in the chain
hits its ~2.5min per-call timeout, so the run burns 5 × ~2.5min ≈ 13min
before the cron fails with `FallbackSummaryError: All models failed (5)`.

## Why this script exists

The root fix is upstream ([openclaw/openclaw#95408](https://github.com/openclaw/openclaw/issues/95408) —
per-agent `model.requestTimeoutSeconds`, Lane C, Hermes). Until that
merges + ships + this host upgrades, we need a workaround on the Aegis
side.

The workaround: bump `models.providers.<provider>.timeoutSeconds` for the
3 unique providers used by `ag-hermes` (the agent that runs the
release-please cron). OpenClaw 2026.5.7 reads this knob at
`model-f6pqrkVH.js:348` (`applyConfiguredProviderOverrides`).

This script applies the override idempotently.
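
As an illustration of what "idempotently" means here, the following is a minimal TypeScript sketch of the raise-only rule that the tests in this PR describe (update when below target, skip when at or above, report missing providers without aborting). It is not the script itself, which is bash + `jq`; the function name `applyTimeoutOverrides` and the `OverrideResult` shape are hypothetical.

```typescript
// Hypothetical names and types: the real implementation is a bash + jq script.
type ProviderConfig = Record<string, unknown>;

interface OverrideResult {
  providers: Record<string, ProviderConfig>;
  updated: string[]; // timeoutSeconds was raised to the target
  skipped: string[]; // already at or above the target
  missing: string[]; // target name not present in models.providers
}

function applyTimeoutOverrides(
  providers: Record<string, ProviderConfig>,
  targets: string[],
  timeoutSeconds: number,
): OverrideResult {
  if (!Number.isInteger(timeoutSeconds) || timeoutSeconds <= 0) {
    throw new Error('TIMEOUT_SECONDS must be a positive integer');
  }
  // Shallow copy so the input object is never mutated.
  const next: Record<string, ProviderConfig> = { ...providers };
  const result: OverrideResult = { providers: next, updated: [], skipped: [], missing: [] };

  for (const name of targets) {
    const current = providers[name];
    if (current === undefined) {
      result.missing.push(name); // report, but keep processing other targets
      continue;
    }
    const existing = current['timeoutSeconds'];
    if (typeof existing === 'number' && existing >= timeoutSeconds) {
      result.skipped.push(name); // never lower an existing, higher value
      continue;
    }
    next[name] = { ...current, timeoutSeconds };
    result.updated.push(name);
  }
  return result;
}

// Usage: one provider is raised, one is already higher, one is absent.
const first = applyTimeoutOverrides(
  { 'minimax-portal': {}, zai: { timeoutSeconds: 900 } },
  ['minimax-portal', 'kimi', 'zai'],
  600,
);
// A second pass over the result changes nothing, which is the idempotency property.
const second = applyTimeoutOverrides(first.providers, ['minimax-portal', 'kimi', 'zai'], 600);
```

Because the rule only ever raises a value, a second run finds every present target at or above the threshold and leaves the config untouched.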

## Why it's safe (global per-provider, not per-agent)

The OpenClaw 2026.5.7 schema only honors `timeoutSeconds` at the
`models.providers.<provider>` level, not per-agent. Setting it raises the
ceiling for every agent that uses those providers. This is acceptable:

- **Simple-payload crons** (watchdog, qa-scan, sentinel) complete in ~30s,
well under any reasonable `timeoutSeconds` value. The bump is invisible.
- **Outer cron-level bound** (`payload.timeoutSeconds`) is unchanged.
Each cron still has its own outer timeout (e.g., 120s for watchdog,
900s for release-please). Bumping the inner per-provider timeout
doesn't extend those.
- **Cost ceiling** is the same — the LLM call still pays per token, just
gets more wall-clock before giving up.

The shim is documented as a workaround. Once Lane C merges + ships +
this host upgrades, the override can be reverted by deleting the
`timeoutSeconds` field from each provider in `~/.openclaw/openclaw.json`.
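
The interaction between the two timeouts can be sketched as a simple `min()`. This assumes the outer cron bound (`payload.timeoutSeconds`) is enforced independently of the fallback chain, which this README states but does not show; the function name is hypothetical, and 150 stands in for the observed ~2.5min per-provider timeout.

```typescript
// Assumption: the outer cron bound always wins, so total wall-clock time
// can never exceed it regardless of the per-provider timeout.
function worstCaseWallClockSeconds(
  outerTimeoutSeconds: number,
  perProviderTimeoutSeconds: number,
  chainLength: number,
): number {
  const chainBudget = perProviderTimeoutSeconds * chainLength;
  return Math.min(outerTimeoutSeconds, chainBudget);
}

// Watchdog (120s outer bound): the bump from ~150s to 600s changes nothing.
const watchdogBefore = worstCaseWallClockSeconds(120, 150, 5);
const watchdogAfter = worstCaseWallClockSeconds(120, 600, 5);

// release-please (900s outer bound): still capped at 900s after the bump.
const releasePleaseAfter = worstCaseWallClockSeconds(900, 600, 5);
```

Under this model the per-provider bump only changes how the outer budget is spent, not how large it is.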

## Usage

```bash
# DRY-RUN (default) — show what would change
bash scripts/devops/add-cron-timeout-overrides.sh

# Apply the default 600s (10min) override
APPLY=1 bash scripts/devops/add-cron-timeout-overrides.sh

# Apply a custom timeout
TIMEOUT_SECONDS=900 APPLY=1 bash scripts/devops/add-cron-timeout-overrides.sh

# Apply to a subset of providers
TARGET_PROVIDERS="minimax-portal zai" APPLY=1 bash scripts/devops/add-cron-timeout-overrides.sh

# Non-default install path
OPENCLAW_CONFIG=/path/to/openclaw.json APPLY=1 bash scripts/devops/add-cron-timeout-overrides.sh
```

Default timeout: **600s (10min)** — 4× the observed ~2.5min per-provider
ceiling, giving headroom for ~2× LLM round-trip variance.

## Re-enabling the release-please cron

After applying the override, the `ad1ab50a-dba8-40e2-a3de-ca2d2d09dba5`
cron (release-please dispatch) can be re-enabled. The current state has
it disabled with `sessionTarget: "session:agent:ag-hermes:..."` (named
session, from Hephaestus's prior failed workaround on the named-session
lock-in bug).

The cron config update is manual at the `~/.openclaw/cron/jobs.json`
level. Three changes are required:

1. Set `enabled: true`
2. Change `sessionTarget` back to `"isolated"`
3. Update the prompt to a current release-please dispatch (the current
one references issue #4708 and a 2026-06-16 memory file)

The cron daemon picks up the change on its next read cycle (< 60s).
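
The first two changes can be pictured as a small pure transformation. This is a sketch only: the `CronJob` shape and the function name are hypothetical, and the real layout of `jobs.json` is not shown in this PR. Only the `enabled` and `sessionTarget` field names come from the steps above; the prompt update stays a manual edit.

```typescript
// Hypothetical job shape: only `enabled` and `sessionTarget` are taken from the README.
interface CronJob {
  enabled: boolean;
  sessionTarget: string;
  [key: string]: unknown; // any other fields are passed through untouched
}

// Returns a copy of the job that is enabled and targets an isolated session.
function reenableAsIsolated(job: CronJob): CronJob {
  return { ...job, enabled: true, sessionTarget: 'isolated' };
}

// Usage with a placeholder job; the `note` field is illustrative.
const disabledJob: CronJob = {
  enabled: false,
  sessionTarget: 'session:agent:ag-hermes:...',
  note: 'placeholder field',
};
const reenabledJob = reenableAsIsolated(disabledJob);
```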

## Restart the OpenClaw gateway

The new `timeoutSeconds` takes effect on the next gateway reload. To
apply it immediately:

```bash
openclaw gateway restart
```

Then trigger one manual isolated agentTurn run on `ad1ab50a` to verify
the new ceiling holds.

## Tests

`scripts/devops/__tests__/add-cron-timeout-overrides.test.ts` covers:

1. DRY-RUN does not modify the config
2. APPLY=1 sets `timeoutSeconds` on each target provider
3. `TIMEOUT_SECONDS` env var overrides the default
4. Idempotency (re-running is a no-op)
5. Skip semantics (providers already at-or-above target)
6. Scope (`TARGET_PROVIDERS` env var)
7. Error paths (missing config, invalid timeout, malformed config)
8. Partial success (missing target provider doesn't abort other updates)

Run with:

```bash
npx vitest run scripts/devops/__tests__/add-cron-timeout-overrides.test.ts
```
258 changes: 258 additions & 0 deletions scripts/devops/__tests__/add-cron-timeout-overrides.test.ts
@@ -0,0 +1,258 @@
/**
* Regression tests for scripts/devops/add-cron-timeout-overrides.sh
*
* Covers #4808 (Lane B of #4755). The script applies a per-provider
* `timeoutSeconds` override to the OpenClaw config so non-trivial isolated
* agentTurn cron payloads don't time out per-provider during the
* sequential fallback chain.
*
* These tests run the actual bash script against fixture OpenClaw config
* files in a temp directory. They verify:
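* 1. DRY-RUN mode does NOT modify the config
* 2. APPLY=1 mode sets the timeoutSeconds on each target provider
* 3. TIMEOUT_SECONDS env var overrides the default
* 4. Idempotency: re-running with the same target leaves the config unchanged
* 5. Skip semantics: providers already at-or-above target are skipped
* 6. Scope: TARGET_PROVIDERS env var limits which providers are patched
* 7. Error paths: missing config, invalid timeout value, malformed config
* 8. Partial success: a missing target provider doesn't abort other updates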
*
* Requires `bash` and `jq` on PATH (same as the script itself).
*/
import { execFileSync } from 'node:child_process';
import { mkdtempSync, writeFileSync, readFileSync, rmSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join, resolve } from 'node:path';
import { describe, it, expect, beforeEach, afterEach } from 'vitest';

const REPO_ROOT = resolve(__dirname, '../../..');
const SCRIPT_PATH = join(REPO_ROOT, 'scripts/devops/add-cron-timeout-overrides.sh');

interface OpenClawConfigFixture {
models: {
mode: string;
providers: Record<string, Record<string, unknown>>;
};
}

function makeFixture(
overrides: Record<string, Record<string, unknown>> = {},
): OpenClawConfigFixture {
return {
models: {
mode: 'merge',
providers: {
'minimax-portal': { baseUrl: 'https://example.test' },
kimi: { baseUrl: 'https://example.test' },
zai: { baseUrl: 'https://example.test' },
'unrelated-provider': { baseUrl: 'https://example.test' },
...overrides,
},
},
};
}

function runScript(params: {
configPath: string;
env?: Record<string, string>;
apply?: boolean;
}): { stdout: string; stderr: string; status: number } {
// process.env values are typed string | undefined, so use NodeJS.ProcessEnv here.
const env: NodeJS.ProcessEnv = {
...process.env,
OPENCLAW_CONFIG: params.configPath,
...(params.apply ? { APPLY: '1' } : {}),
...(params.env ?? {}),
};
try {
const stdout = execFileSync('bash', [SCRIPT_PATH], {
env,
encoding: 'utf8',
stdio: ['ignore', 'pipe', 'pipe'],
});
return { stdout, stderr: '', status: 0 };
} catch (err) {
const e = err as { stdout?: string; stderr?: string; status?: number };
return {
stdout: e.stdout ?? '',
stderr: e.stderr ?? '',
status: e.status ?? 1,
};
}
}

describe('add-cron-timeout-overrides.sh', () => {
let workDir: string;

beforeEach(() => {
workDir = mkdtempSync(join(tmpdir(), 'cron-timeout-shim-test-'));
});

afterEach(() => {
rmSync(workDir, { recursive: true, force: true });
});

function writeFixture(config: OpenClawConfigFixture): string {
const path = join(workDir, 'openclaw.json');
writeFileSync(path, JSON.stringify(config, null, 2));
return path;
}

function readConfig(path: string): OpenClawConfigFixture {
return JSON.parse(readFileSync(path, 'utf8')) as OpenClawConfigFixture;
}

it('DRY-RUN mode does not modify the config', () => {
const configPath = writeFixture(makeFixture());

const { stdout, status } = runScript({ configPath });

expect(status).toBe(0);
expect(stdout).toContain('DRY-RUN');

const config = readConfig(configPath);
expect(config.models.providers['minimax-portal'].timeoutSeconds).toBeUndefined();
expect(config.models.providers.kimi.timeoutSeconds).toBeUndefined();
expect(config.models.providers.zai.timeoutSeconds).toBeUndefined();
});

it('APPLY=1 sets timeoutSeconds on each target provider', () => {
const configPath = writeFixture(makeFixture());

const { stdout, status } = runScript({ configPath, apply: true });

expect(status).toBe(0);
expect(stdout).toContain('APPLY');

const config = readConfig(configPath);
expect(config.models.providers['minimax-portal'].timeoutSeconds).toBe(600);
expect(config.models.providers.kimi.timeoutSeconds).toBe(600);
expect(config.models.providers.zai.timeoutSeconds).toBe(600);
});

it('APPLY=1 with TIMEOUT_SECONDS uses the override value', () => {
const configPath = writeFixture(makeFixture());

const { status } = runScript({
configPath,
apply: true,
env: { TIMEOUT_SECONDS: '900' },
});

expect(status).toBe(0);
const config = readConfig(configPath);
expect(config.models.providers['minimax-portal'].timeoutSeconds).toBe(900);
expect(config.models.providers.zai.timeoutSeconds).toBe(900);
});

it('idempotent: re-running leaves the config unchanged after first apply', () => {
const configPath = writeFixture(makeFixture());

const first = runScript({ configPath, apply: true });
expect(first.status).toBe(0);

const afterFirst = readFileSync(configPath, 'utf8');

const second = runScript({ configPath, apply: true });
expect(second.status).toBe(0);
expect(second.stdout).toContain('Already at or above target (skipped): 3');

const afterSecond = readFileSync(configPath, 'utf8');
expect(afterSecond).toBe(afterFirst);
});

it('skips providers already at or above the target timeout', () => {
const configPath = writeFixture(
makeFixture({
'minimax-portal': { timeoutSeconds: 900 },
}),
);

const { stdout, status } = runScript({ configPath, apply: true });

expect(status).toBe(0);
const config = readConfig(configPath);
expect(config.models.providers['minimax-portal'].timeoutSeconds).toBe(900);
expect(config.models.providers.kimi.timeoutSeconds).toBe(600);
expect(config.models.providers.zai.timeoutSeconds).toBe(600);

expect(stdout).toContain('already has timeoutSeconds=900');
});

it('does not touch providers outside TARGET_PROVIDERS', () => {
const configPath = writeFixture(makeFixture());

const { status } = runScript({ configPath, apply: true });

expect(status).toBe(0);
const config = readConfig(configPath);
expect(config.models.providers['unrelated-provider'].timeoutSeconds).toBeUndefined();
});

it('TARGET_PROVIDERS env var scopes the patch', () => {
const configPath = writeFixture(makeFixture());

const { status } = runScript({
configPath,
apply: true,
env: { TARGET_PROVIDERS: 'zai' },
});

expect(status).toBe(0);
const config = readConfig(configPath);
expect(config.models.providers.zai.timeoutSeconds).toBe(600);
expect(config.models.providers['minimax-portal'].timeoutSeconds).toBeUndefined();
expect(config.models.providers.kimi.timeoutSeconds).toBeUndefined();
});

it('exits non-zero when config file is missing', () => {
const missing = join(workDir, 'does-not-exist.json');
const { status, stderr } = runScript({ configPath: missing });

expect(status).not.toBe(0);
expect(stderr).toContain('not found');
});

it('exits non-zero when TIMEOUT_SECONDS is invalid', () => {
const configPath = writeFixture(makeFixture());

const { status, stderr } = runScript({
configPath,
apply: true,
env: { TIMEOUT_SECONDS: 'not-a-number' },
});

expect(status).not.toBe(0);
expect(stderr).toContain('TIMEOUT_SECONDS must be a positive integer');
});

it('exits non-zero when config lacks models.providers object', () => {
const bogus = join(workDir, 'bogus.json');
writeFileSync(bogus, JSON.stringify({ meta: { foo: 'bar' } }));

const { status, stderr } = runScript({ configPath: bogus });

expect(status).not.toBe(0);
expect(stderr).toContain('does not have a models.providers object');
});

it('reports missing target provider in summary without aborting other updates', () => {
const fixture: OpenClawConfigFixture = {
models: {
mode: 'merge',
providers: {
'minimax-portal': { baseUrl: 'https://example.test' },
zai: { baseUrl: 'https://example.test' },
},
},
};
const configPath = writeFixture(fixture);

const { stdout, status } = runScript({ configPath, apply: true });

expect(status).toBe(0);
// The script uses an em-dash and 'not found' marker; assert on the stable parts.
expect(stdout).toMatch(/kimi\s+\S+\s+not found in models\.providers/);
expect(stdout).toContain('Provider not found in config: 1');

const config = readConfig(configPath);
expect(config.models.providers['minimax-portal'].timeoutSeconds).toBe(600);
expect(config.models.providers.zai.timeoutSeconds).toBe(600);
});
});