Commit c6d68f1

Merge pull request #160 from DenisValeev/codex/implement-autonomous-mode-functionality
Add external onboarding documentation
2 parents 01baa41 + 38b14ff commit c6d68f1

3 files changed

Lines changed: 82 additions & 5 deletions

docs/docs/data-maintenance.md

Lines changed: 8 additions & 5 deletions
@@ -15,25 +15,28 @@ Content-heavy apps rely on shared datasets stored under `data/` plus machine-gen
 ## Updating deck datasets

 1. Edit the dataset files under `apps/jokes/jokes.js`, `apps/quotes/quotes-data.js`, or `apps/slang/slang.js`.
-2. Run the syntax and shape checks documented in each app’s `AGENTS.md` file (`node --check ...` and the `node -e` globals check).
+2. When evaluating a batch of external jokes, run `node tools/onboard-external-jokes.js --input=path/to/candidates.json` to
+surface high-similarity overlaps before touching the curated deck. See
+[External content onboarding](external-content-onboarding.md) for a complete walkthrough and output interpretation.
+3. Run the syntax and shape checks documented in each app’s `AGENTS.md` file (`node --check ...` and the `node -e` globals check).
 - For jokes specifically, ensure every record includes a `sourceId` that matches one of the curated assets surfaced in `apps/asset-observatory/asset-data.js`. New sources should land under `data/` with capture notes so provenance survives future refreshes.
-3. Regenerate metadata and manifests:
+4. Regenerate metadata and manifests:
 ```bash
 node tools/update-content-metadata.js --dataset=jokes
 node tools/update-content-metadata.js --dataset=quotes
 node tools/update-content-metadata.js --dataset=slang
 ```
 Pass `--dataset=<name>` to target a single collection when needed.
-4. Refresh embeddings for duplicate detection. Typical commands:
+5. Refresh embeddings for duplicate detection. Typical commands:
 ```bash
 node tools/review-content-similarity.js --dataset=jokes --provider=synthetic --write --update-manifest
 node tools/review-content-similarity.js --dataset=jokes --provider=hfspace --model=bienkieu/sentence-embedding --batch-size=8 --report=data/similarity-report-jokes.json
 node tools/review-content-similarity.js --dataset=quotes --provider=openai --write --update-manifest
 node tools/review-content-similarity.js --dataset=slang --provider=synthetic --threshold-slang=0.8 --write --update-manifest
 ```
 Adjust providers and options according to your API access and the thresholds documented in `apps/slang/AGENTS.md` and the repository README.
-5. Review high-similarity pairs and move intentional overlaps into `data/similarity-overrides.json` so the similarity lab highlights them as protected.
-6. Commit refreshed datasets, manifests, embeddings, and similarity reports together to keep the bundle consistent.
+6. Review high-similarity pairs and move intentional overlaps into `data/similarity-overrides.json` so the similarity lab highlights them as protected.
+7. Commit refreshed datasets, manifests, embeddings, and similarity reports together to keep the bundle consistent.

 ## Regenerating similarity reports

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
# External content onboarding

The toolbox uses deterministic embeddings and cosine comparisons to reject
near-duplicate jokes before they ever land in `apps/jokes/jokes.js`. The
`tools/onboard-external-jokes.js` helper ingests a candidate JSON file,
embeds each entry across one or more providers, and reports which jokes can
join the curated deck.
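
The helper's actual embedding scheme is not shown in this document. As a rough illustration of what "deterministic" means here, the sketch below hashes tokens into a fixed-length vector, so the same text always yields the same embedding without calling an external provider. The function names, the 64-dimension default, and the FNV-1a-style hash are all assumptions made for the example.

```javascript
// FNV-1a-style 32-bit hash over the UTF-16 code units of a string.
// Illustrative only; the repository's real hashing is not documented here.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value unsigned
  }
  return hash;
}

// Hypothetical deterministic embedding: each token adds +1 or -1 to one
// bucket chosen by its hash, then the vector is scaled to unit length.
function deterministicEmbedding(text, dimensions = 64) {
  const vector = new Array(dimensions).fill(0);
  const tokens = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  for (const token of tokens) {
    const hash = fnv1a(token);
    const index = hash % dimensions;
    const sign = hash >>> 31 === 1 ? -1 : 1; // top bit picks the sign
    vector[index] += sign;
  }
  const norm = Math.hypot(...vector);
  return norm === 0 ? vector : vector.map((value) => value / norm);
}

const first = deterministicEmbedding('Why did the build fail?');
const second = deterministicEmbedding('Why did the build fail?');
console.log(first.length, JSON.stringify(first) === JSON.stringify(second));
```

Because nothing in the function depends on time, randomness, or network state, repeated runs over the same deck produce identical vectors, which is what makes stored embeddings comparable across sessions.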

## Candidate file shape

Provide an array of objects where every joke includes at least a setup (stored
as `joke` or `setup`) and optionally a punchline (`punchline` or `answer`).
`id` and `label` fields are optional; missing values are replaced with
`candidate-###` identifiers so the summary stays readable.

```json
[
  {
    "id": "joke-001",
    "label": "Unexpected semicolons",
    "joke": "Why did the build fail?",
    "punchline": "The compiler thought the semicolon was sus."
  }
]
```

Save the array to disk (for example `data/new-jokes.json`) and feed the path to
`--input` or its alias `--candidates`.
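
The field fallbacks described above can be sketched as a small normalisation step. This is a minimal illustration, not the helper's actual code: `normaliseCandidates` is a hypothetical name, and numbering the generated identifiers from the entry's position with three-digit zero padding is an assumption about how `candidate-###` is filled in.

```javascript
// Hypothetical helper: maps raw candidate records onto one common shape.
// Accepts `joke` or `setup` for the setup and `punchline` or `answer` for
// the optional punchline, as described in the prose above.
function normaliseCandidates(rawCandidates) {
  if (!Array.isArray(rawCandidates)) {
    throw new TypeError('Candidate file must contain a JSON array');
  }
  return rawCandidates.map((entry, index) => {
    const setup = entry.joke ?? entry.setup;
    if (typeof setup !== 'string' || setup.trim() === '') {
      throw new Error(`Candidate at index ${index} has no "joke" or "setup" text`);
    }
    // Assumption: fallback ids are 1-based and zero-padded to three digits.
    const fallbackId = `candidate-${String(index + 1).padStart(3, '0')}`;
    return {
      id: entry.id ?? fallbackId,
      label: entry.label ?? fallbackId,
      setup: setup.trim(),
      punchline: String(entry.punchline ?? entry.answer ?? '').trim(),
    };
  });
}

const sample = normaliseCandidates([
  {
    id: 'joke-001',
    label: 'Unexpected semicolons',
    joke: 'Why did the build fail?',
    punchline: 'The compiler thought the semicolon was sus.',
  },
  { setup: 'Knock knock.', answer: "Who's there?" },
]);
console.log(sample.map((candidate) => candidate.id)); // [ 'joke-001', 'candidate-002' ]
```

Rejecting entries without setup text early keeps later stages from embedding empty strings.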

## Running the triage script

Evaluate a batch with the default providers and a cosine threshold of `0.8`:

```bash
node tools/onboard-external-jokes.js --input=data/new-jokes.json \
  --output=reports/new-jokes-summary.json \
  --accepted-output=reports/new-jokes-accepted.json
```

The command:

- Loads the active jokes deck and its stored embeddings.
- Generates deterministic embeddings for any provider missing from disk.
- Compares each candidate against the curated deck using cosine similarity.
- Writes a machine-readable summary plus an optional accepted-only export.

Review the terminal output to spot high-similarity overlaps. The summary JSON
captures metadata such as the candidate file, evaluated providers, rejection
reasons, and the strongest match per joke.
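
The comparison step can be pictured roughly as follows. This sketch assumes a candidate is rejected when its strongest cosine match against the deck reaches the threshold; the exact boundary handling and the names `cosineSimilarity` and `triageCandidate` are assumptions, not taken from the script.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vector dimensions differ');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as unrelated
  return dot / Math.sqrt(normA * normB);
}

// Hypothetical triage: find the strongest deck match for one candidate and
// reject it when that score reaches the threshold (assumed semantics).
function triageCandidate(candidateVector, deck, threshold = 0.8) {
  let strongestMatch = { id: null, score: -1 };
  for (const { id, vector } of deck) {
    const score = cosineSimilarity(candidateVector, vector);
    if (score > strongestMatch.score) strongestMatch = { id, score };
  }
  return { accepted: strongestMatch.score < threshold, strongestMatch };
}

// Toy three-dimensional vectors standing in for real embeddings.
const deck = [
  { id: 'deck-1', vector: [1, 0, 0] },
  { id: 'deck-2', vector: [0, 1, 0] },
];
const nearDuplicate = triageCandidate([0.9, 0.1, 0], deck);
const fresh = triageCandidate([0, 0, 1], deck);
console.log(nearDuplicate.accepted, nearDuplicate.strongestMatch.id); // false deck-1
console.log(fresh.accepted); // true
```

Recording the strongest match alongside the verdict mirrors what the summary JSON is described as capturing, and makes each rejection easy to audit by hand.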

## Customising the evaluation

- `--threshold=<value>` changes the cosine similarity cutoff (defaults to
  `0.8`). Candidates that are too similar to the deck are rejected, so lower
  values enforce stricter deduplication and higher values admit more jokes.
- `--providers=synthetic,openai,cohere` restricts which embedding stores to use.
  Deterministic fallback vectors kick in automatically when a provider’s store
  is missing or incompatible.
- `--existing-embeddings=provider:path` lets you point at non-standard stores
  (for example, a freshly generated HF Space batch).
- `--candidate-embeddings=provider:path` reuses pre-computed vectors for the
  candidates, skipping deterministic generation when dimensions match.
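
To make the value formats of the options above concrete, the sketch below parses a comma-separated provider list and `provider:path` pairs. Both function names and the example path are hypothetical, and the script's real argument handling may differ.

```javascript
// Hypothetical parser for a value such as `synthetic,openai,cohere`.
function parseProviders(value) {
  return value
    .split(',')
    .map((name) => name.trim())
    .filter(Boolean);
}

// Hypothetical parser for `provider:path` values. Splitting on the first
// colon only keeps any later colons inside the path intact.
function parseProviderPaths(values) {
  const stores = new Map();
  for (const value of values) {
    const separator = value.indexOf(':');
    if (separator <= 0 || separator === value.length - 1) {
      throw new Error(`Expected provider:path, received "${value}"`);
    }
    stores.set(value.slice(0, separator), value.slice(separator + 1));
  }
  return stores;
}

const providers = parseProviders('synthetic, openai,cohere');
// The path below is a made-up example, not a file the repository ships.
const stores = parseProviderPaths(['hfspace:data/hfspace-embeddings.json']);
console.log(providers, stores.get('hfspace'));
```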

## After accepting new jokes

1. Inspect `reports/new-jokes-accepted.json` and manually fold the approved
   entries into `apps/jokes/jokes.js`.
2. Run the usual dataset maintenance scripts (`node tools/update-content-metadata.js`,
   `node tools/review-content-similarity.js`, and
   `node tools/generate-asset-report.js`) so manifests, embeddings, and the asset
   observatory stay aligned.
3. Commit the refreshed datasets, manifests, similarity reports, and summary
   artifacts alongside the onboarding report for traceability.

docs/index.html

Lines changed: 1 addition & 0 deletions
@@ -375,6 +375,7 @@ <h2 data-doc-title>Loading…</h2>
 { id: 'wiki-authoring', title: 'Wiki authoring', file: 'docs/wiki-authoring.md', summary: 'Front matter & commit logs' },
 { id: 'deploy-logs', title: 'Deploy logs', file: 'docs/deploy-logs.md', summary: 'Service worker navigation + Pages deploy log' },
 { id: 'data-maintenance', title: 'Data maintenance', file: 'docs/data-maintenance.md', summary: 'Datasets, manifests, and embeddings' },
+{ id: 'external-onboarding', title: 'External content onboarding', file: 'docs/external-content-onboarding.md', summary: 'Triage outside joke decks before import' },
 { id: 'testing-automation', title: 'Testing & automation', file: 'docs/testing-and-automation.md', summary: 'Playwright suite and CI workflows' },
 { id: 'docs-maintenance', title: 'Documentation maintenance', file: 'docs/docs-maintenance.md', summary: 'Keep the knowledge base accurate and linked' },
 { id: 'development-guide', title: 'Development guide', file: 'docs/development-guide.md', summary: 'Coding conventions and contributor tasks' }
