Commit 9b6c282

Merge branch 'main' into gabriel/sd-2468-bug-image-within-cells-not-resizing-still
2 parents fcd6053 + 4ba8992 commit 9b6c282

214 files changed

Lines changed: 109981 additions & 6345 deletions

.github/scripts/package-lock.json

Lines changed: 11 additions & 11 deletions
Some generated files are not rendered by default.

.github/workflows/ci-superdoc.yml

Lines changed: 3 additions & 3 deletions
@@ -40,7 +40,7 @@ jobs:
 
 - uses: oven-sh/setup-bun@v2
   with:
-    bun-version: 1.3.11
+    bun-version: 1.3.12
 
 - name: Install canvas system dependencies
   run: |
@@ -108,7 +108,7 @@ jobs:
 
 - uses: oven-sh/setup-bun@v2
   with:
-    bun-version: 1.3.11
+    bun-version: 1.3.12
 
 - name: Install canvas system dependencies
   run: |
@@ -195,7 +195,7 @@ jobs:
 
 - uses: oven-sh/setup-bun@v2
   with:
-    bun-version: 1.3.11
+    bun-version: 1.3.12
 
 - name: Install dependencies
   run: pnpm install

.github/workflows/release-cli.yml

Lines changed: 2 additions & 0 deletions
@@ -63,6 +63,8 @@ jobs:
     registry-url: 'https://registry.npmjs.org'
 
 - uses: oven-sh/setup-bun@v2
+  with:
+    bun-version: 1.3.12
 
 - name: Cache apt packages
   uses: actions/cache@v5

.github/workflows/release-sdk.yml

Lines changed: 4 additions & 0 deletions
@@ -88,6 +88,8 @@ jobs:
     registry-url: "https://registry.npmjs.org"
 
 - uses: oven-sh/setup-bun@v2
+  with:
+    bun-version: 1.3.12
 
 - uses: actions/setup-python@v5
   with:
@@ -234,6 +236,8 @@ jobs:
     registry-url: "https://registry.npmjs.org"
 
 - uses: oven-sh/setup-bun@v2
+  with:
+    bun-version: 1.3.12
 
 - uses: actions/setup-python@v5
   with:
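
The workflow hunks above all move `oven-sh/setup-bun@v2` to an explicit `bun-version: 1.3.12`, including release steps that previously had no pin. A minimal sketch of how that invariant could be checked is below. It is not part of this commit: the helper name `audit_bun_pins` and the text-matching heuristic (count `setup-bun` steps, count `bun-version` pins, compare) are assumptions made for illustration, and a real check would parse the YAML properly.

```python
import re

# Matches a step that uses the setup-bun action (any tag).
SETUP_BUN_RE = re.compile(r"uses:\s*oven-sh/setup-bun@")
# Captures the value of a bun-version key, with or without quotes.
BUN_VERSION_RE = re.compile(r"bun-version:\s*['\"]?([0-9][\w.\-]*)")


def audit_bun_pins(workflows: dict[str, str]) -> dict:
    """Hypothetical helper: report pinned Bun versions and unpinned workflows.

    `workflows` maps a file name to the raw workflow text. A workflow is
    reported as unpinned when it has more setup-bun steps than bun-version
    pins (a heuristic, not a YAML-aware check).
    """
    versions: set[str] = set()
    unpinned: list[str] = []
    for name, text in workflows.items():
        uses = len(SETUP_BUN_RE.findall(text))
        pins = BUN_VERSION_RE.findall(text)
        versions.update(pins)
        if len(pins) < uses:
            unpinned.append(name)
    return {"versions": versions, "unpinned": sorted(unpinned)}


if __name__ == "__main__":
    sample = {
        "ci.yml": "- uses: oven-sh/setup-bun@v2\n  with:\n    bun-version: 1.3.12\n",
        "release.yml": "- uses: oven-sh/setup-bun@v2\n",
    }
    print(audit_bun_pins(sample))
```

More than one entry in `versions`, or any entry in `unpinned`, would flag the kind of drift this commit removes.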

AGENTS.md

Lines changed: 51 additions & 3 deletions
@@ -118,22 +118,70 @@ Many packages use `.js` files with JSDoc `@typedef` for type definitions (e.g.,
 
 ## AI Eval Suite
 
-The `evals/` directory contains a Promptfoo-based evaluation suite for validating AI tool call quality.
+The `evals/` directory contains a Promptfoo-based evaluation suite with three levels of evaluation.
+
+### Level 1: Deterministic Evals (tool selection + argument accuracy)
 
 | Command | What it does | Cost |
 |---------|-------------|------|
 | `pnpm --filter @superdoc-testing/evals run eval` | Run deterministic evals (reading + argument tests) | ~$0.30 |
 | `pnpm --filter @superdoc-testing/evals run eval:reading` | Run reading tool tests only | ~$0.15 |
-| `pnpm --filter @superdoc-testing/evals run eval:gdpval` | Run GDPval benchmark (Model+SuperDoc vs Model-Only) | ~$1-2 |
 | `pnpm --filter @superdoc-testing/evals run eval:view` | Open Promptfoo web UI with results | Free |
 | `pnpm --filter @superdoc-testing/evals run baseline:save <label>` | Save versioned results snapshot | Free |
 
 Tool definitions are extracted from `packages/sdk/tools/` via `evals/tools/extract.mjs`. Run `pnpm run generate:all` first if SDK artifacts are missing.
 
-Test files are YAML in `evals/tests/`. Each test has a `vars.task` prompt and JavaScript assertions that check tool call structure (Level 1: tool selection + argument accuracy, not execution).
+Test files are YAML in `evals/tests/`. Each test has a `vars.task` prompt and JavaScript assertions that check tool call structure (tool selection + argument accuracy, not execution).
 
 The system prompt at `evals/prompts/agent.txt` is a copy of the proven prompt from `examples/eval-demo/lib/agent.ts`. Update both when changing the prompt.
 
+### Level 2: GDPval Benchmark (Model+SuperDoc vs Model-Only)
+
+| Command | What it does | Cost |
+|---------|-------------|------|
+| `pnpm --filter @superdoc-testing/evals run eval:gdpval` | Run GDPval benchmark | ~$1-2 |
+
+### Level 3: DOCX Agent Benchmark (real agents, real documents)
+
+Runs actual Claude Code and Codex CLIs against DOCX tasks, comparing their performance with and without SuperDoc tools. 4 conditions x 2 agents x N tasks.
+
+**Conditions:**
+
+| Condition | What the agent gets |
+|-----------|-------------------|
+| baseline | No skill, agent figures out DOCX on its own |
+| baseline-with-docx-skill | Anthropic's DOCX skill (unzip + XML editing) |
+| superdoc-mcp | SuperDoc MCP server (`superdoc_open`, `superdoc_get_content`, etc.) |
+| superdoc-cli | SuperDoc CLI on PATH |
+
+**Tasks:** 3 reading (extract headings, entity names, financial figures) + 3 editing (replace entity name, insert section, fill placeholders).
+
+**Metrics per task:** correctness (pass/fail), collateral (no unintended changes), steps (agent turn count), latency (seconds), tokens (input + output), path (which DOCX approach was used).
+
+| Command | What it does | Cost |
+|---------|-------------|------|
+| `pnpm --filter @superdoc-testing/evals run eval:benchmark` | Run full benchmark | ~15 min |
+| `pnpm --filter @superdoc-testing/evals run eval:benchmark:codex` | Run Codex conditions only | ~8 min |
+| `pnpm --filter @superdoc-testing/evals run eval:benchmark:claude` | Run Claude Code conditions only | ~8 min |
+| `pnpm --filter @superdoc-testing/evals run eval:benchmark:report` | Generate comparison report (Markdown + CSV) | Free |
+
+**Prerequisites:**
+- `OPENAI_API_KEY` in `evals/.env` (for Codex; use `codex login --with-api-key` for API key auth)
+- Claude Code installed locally (uses local auth, no API key needed in `.env`)
+- MCP server built: `cd apps/mcp && pnpm run build`
+- CLI built: check `apps/cli/dist/index.js` exists
+
+**Key files:**
+
+| File | Purpose |
+|------|---------|
+| `evals/config/benchmark.promptfoo.yaml` | Level 3 Promptfoo config (8 providers) |
+| `evals/suites/benchmark/tests/agent-benchmark-v2.yaml` | Benchmark tasks with assertions |
+| `evals/providers/claude-code-agent.mjs` | Claude Agent SDK provider |
+| `evals/providers/codex-agent.mjs` | Codex SDK provider |
+| `evals/suites/benchmark/reports/benchmark-report.mjs` | Markdown + CSV report generator |
+| `evals/fixtures/vendor/vendor-docx-skill.md` | Anthropic's DOCX skill for baseline-with-docx-skill condition |
+
 ## Generated Artifacts
 
 These directories are produced by `pnpm run generate:all`:
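
The new Level 3 text describes the benchmark as "4 conditions x 2 agents x N tasks" and the Promptfoo config as having 8 providers. The sketch below only illustrates that arithmetic. The condition names are taken from the table in the hunk, but the agent labels (`claude-code`, `codex`), the `agent:condition` naming scheme, and the helper names are assumptions, not the identifiers used in `benchmark.promptfoo.yaml`.

```python
from itertools import product

# Condition names as listed in the AGENTS.md table.
CONDITIONS = [
    "baseline",
    "baseline-with-docx-skill",
    "superdoc-mcp",
    "superdoc-cli",
]
# Agent labels are assumptions made for this illustration.
AGENTS = ["claude-code", "codex"]


def provider_matrix(conditions: list[str], agents: list[str]) -> list[str]:
    """Hypothetical helper: one provider label per (agent, condition) pair."""
    return [f"{agent}:{condition}" for agent, condition in product(agents, conditions)]


def total_runs(n_tasks: int) -> int:
    """Number of agent runs for a full benchmark over n_tasks tasks."""
    return len(CONDITIONS) * len(AGENTS) * n_tasks


if __name__ == "__main__":
    for label in provider_matrix(CONDITIONS, AGENTS):
        print(label)
    # Assumes the 3 reading + 3 editing tasks listed above are the whole suite.
    print(total_runs(6))
```

If the six listed tasks are the full suite, a complete run is 4 x 2 x 6 = 48 agent sessions.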

README.md

Lines changed: 2 additions & 0 deletions
@@ -162,6 +162,8 @@ Special thanks to these community members who have contributed code to SuperDoc:
 <a href="https://github.com/iguit0"><img src="https://github.com/iguit0.png" width="50" height="50" alt="iguit0" title="Igor Alves" /></a>
 <a href="https://github.com/PeterHollens"><img src="https://github.com/PeterHollens.png" width="50" height="50" alt="PeterHollens" title="Peter Hollens" /></a>
 <a href="https://github.com/baristaGeek"><img src="https://github.com/baristaGeek.png" width="50" height="50" alt="baristaGeek" title="Esteban Vargas" /></a>
+<a href="https://github.com/Anuj52"><img src="https://github.com/Anuj52.png" width="50" height="50" alt="Anuj52" title="Anuj Chaudhary" /></a>
+<a href="https://github.com/Abdeltoto"><img src="https://github.com/Abdeltoto.png" width="50" height="50" alt="Abdeltoto" title="Abdel ATIA" /></a>
 
 Want to see your avatar here? Check the [Contributing Guide](CONTRIBUTING.md) to get started.
 

apps/cli/platforms/cli-darwin-arm64/package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@superdoc-dev/cli-darwin-arm64",
-  "version": "0.5.0",
+  "version": "0.6.0",
   "os": [
     "darwin"
   ],

apps/cli/platforms/cli-darwin-x64/package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@superdoc-dev/cli-darwin-x64",
-  "version": "0.5.0",
+  "version": "0.6.0",
   "os": [
     "darwin"
   ],

apps/cli/platforms/cli-linux-arm64/package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@superdoc-dev/cli-linux-arm64",
-  "version": "0.5.0",
+  "version": "0.6.0",
   "os": [
     "linux"
   ],

apps/cli/platforms/cli-linux-x64/package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@superdoc-dev/cli-linux-x64",
-  "version": "0.5.0",
+  "version": "0.6.0",
   "os": [
     "linux"
   ],
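
The four platform manifests above are bumped together from 0.5.0 to 0.6.0. As a rough illustration of keeping them in lockstep, the sketch below reads `name` and `version` from raw `package.json` text and checks that every platform package carries the same version. The helper names are hypothetical and the repository may already enforce this through its release tooling.

```python
import json


def platform_versions(manifests: dict[str, str]) -> dict[str, str]:
    """Hypothetical helper: map package name to version.

    `manifests` maps a file path to the raw package.json text.
    """
    versions: dict[str, str] = {}
    for raw in manifests.values():
        data = json.loads(raw)
        versions[data["name"]] = data["version"]
    return versions


def versions_in_sync(versions: dict[str, str]) -> bool:
    """True when every package reports the same version string."""
    return len(set(versions.values())) <= 1


if __name__ == "__main__":
    sample = {
        "cli-darwin-arm64/package.json": '{"name": "@superdoc-dev/cli-darwin-arm64", "version": "0.6.0"}',
        "cli-linux-x64/package.json": '{"name": "@superdoc-dev/cli-linux-x64", "version": "0.6.0"}',
    }
    found = platform_versions(sample)
    print(found, versions_in_sync(found))
```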
