# External content onboarding

The toolbox uses deterministic embeddings and cosine comparisons to reject
near-duplicate jokes before they ever land in `apps/jokes/jokes.js`. The
`tools/onboard-external-jokes.js` helper ingests a candidate JSON file,
embeds each entry across one or more providers, and reports which jokes can
join the curated deck.

## Candidate file shape

Provide an array of objects in which every entry includes at least a setup
(stored as `joke` or `setup`) and optionally a punchline (`punchline` or
`answer`). The `id` and `label` fields are optional; missing values are
replaced with `candidate-###` identifiers so the summary stays readable.

```json
[
  {
    "id": "joke-001",
    "label": "Unexpected semicolons",
    "joke": "Why did the build fail?",
    "punchline": "The compiler thought the semicolon was sus."
  }
]
```

Save the array to disk (for example `data/new-jokes.json`) and feed the path to
`--input` or its alias `--candidates`.

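The field aliases and fallback identifiers described above can be sketched as a small normalisation step. This is an illustration only, not the helper's actual code: the function name `normalizeCandidates`, the output field names, and the three-digit, 1-based reading of `candidate-###` are all assumptions.

```javascript
// Hypothetical sketch: normalise raw candidate entries into one shape.
// Assumes `candidate-###` means a zero-padded, 1-based position in the file.
function normalizeCandidates(rawCandidates) {
  if (!Array.isArray(rawCandidates)) {
    throw new TypeError('Candidate file must contain a JSON array');
  }
  return rawCandidates.map((entry, index) => {
    // The setup may be stored under either `joke` or `setup`.
    const setup = entry.joke ?? entry.setup;
    if (typeof setup !== 'string' || setup.trim() === '') {
      throw new Error(`Candidate at index ${index} has no setup text`);
    }
    const fallback = `candidate-${String(index + 1).padStart(3, '0')}`;
    return {
      id: entry.id ?? fallback,
      label: entry.label ?? fallback,
      setup: setup.trim(),
      // The punchline is optional and may be stored as `punchline` or `answer`.
      punchline: String(entry.punchline ?? entry.answer ?? '').trim(),
    };
  });
}

const normalized = normalizeCandidates([
  { setup: 'Why did the build fail?', answer: 'The compiler thought the semicolon was sus.' },
]);
console.log(normalized[0].id); // candidate-001
```

Rejecting entries without a setup up front keeps malformed rows out of the similarity pass, where they would otherwise produce meaningless scores.
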
## Running the triage script

Evaluate a batch with the default providers and a cosine threshold of `0.8`:

```bash
node tools/onboard-external-jokes.js --input=data/new-jokes.json \
  --output=reports/new-jokes-summary.json \
  --accepted-output=reports/new-jokes-accepted.json
```

The command:

- Loads the active jokes deck and its stored embeddings.
- Generates deterministic embeddings for any provider missing from disk.
- Compares each candidate against the curated deck using cosine similarity.
- Writes a machine-readable summary plus an optional accepted-only export.

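To make "deterministic embeddings" concrete, here is one way such a fallback vector could be produced: hash the text into a seed, drive a seeded pseudo-random generator, and normalise the result. The script's real scheme is not documented here, so treat the hashing choice (FNV-1a), the generator (mulberry32), the default of 8 dimensions, and the name `deterministicEmbedding` as assumptions made for illustration.

```javascript
// FNV-1a: a small 32-bit string hash, used here only to derive a seed.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// mulberry32: a tiny seeded PRNG that returns floats in [0, 1).
function mulberry32(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), 1 | state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical fallback: the same text always yields the same unit vector.
function deterministicEmbedding(text, dimensions = 8) {
  const next = mulberry32(fnv1a(text));
  const vector = Array.from({ length: dimensions }, () => next() * 2 - 1);
  const norm = Math.hypot(...vector) || 1; // guard against an all-zero vector
  return vector.map((value) => value / norm);
}

const first = deterministicEmbedding('Why did the build fail?');
const second = deterministicEmbedding('Why did the build fail?');
console.log(first.every((value, i) => value === second[i])); // true
```

Vectors built this way are reproducible across runs but carry no semantic meaning, so they only catch jokes whose text hashes identically.
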
Review the terminal output to spot high-similarity overlaps. The summary JSON
captures metadata such as the candidate file, evaluated providers, rejection
reasons, and the strongest match per joke.

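The comparison step and the "strongest match per joke" field can be sketched as follows. The cosine formula is standard; everything else is an assumption made for illustration, including the names `cosineSimilarity` and `triage`, the result shape, and the rule that a candidate is rejected when its strongest match meets or exceeds the threshold.

```javascript
// Standard cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Hypothetical triage: keep the highest-scoring deck entry per candidate and
// reject the candidate when that score reaches the threshold (assumed rule).
function triage(candidates, deck, threshold = 0.8) {
  return candidates.map((candidate) => {
    let strongestMatch = null;
    for (const existing of deck) {
      const score = cosineSimilarity(candidate.vector, existing.vector);
      if (strongestMatch === null || score > strongestMatch.score) {
        strongestMatch = { id: existing.id, score };
      }
    }
    const accepted = strongestMatch === null || strongestMatch.score < threshold;
    return { id: candidate.id, accepted, strongestMatch };
  });
}

// Toy two-dimensional vectors; real embeddings have many more dimensions.
const deck = [
  { id: 'deck-a', vector: [1, 0] },
  { id: 'deck-b', vector: [0, 1] },
];
const report = triage(
  [
    { id: 'candidate-001', vector: [0.9, 0.1] }, // nearly parallel to deck-a
    { id: 'candidate-002', vector: [1, 1] },     // equally far from both
  ],
  deck,
);
console.log(report.map((row) => `${row.id}: ${row.accepted ? 'accepted' : 'rejected'}`));
```

Here `candidate-001` scores about 0.99 against `deck-a` and is rejected, while `candidate-002` scores about 0.71 against both deck entries and is accepted.
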
## Customising the evaluation

- `--threshold=<value>` changes the cosine similarity cutoff (defaults to
  `0.8`). Candidates whose strongest match reaches the cutoff are rejected, so
  lower values enforce stricter deduplication and higher values admit more
  jokes.
- `--providers=synthetic,openai,cohere` restricts which embedding stores to use.
  Deterministic fallback vectors kick in automatically when a provider’s store
  is missing or incompatible.
- `--existing-embeddings=provider:path` lets you point at non-standard stores
  (for example, a freshly generated HF Space batch).
- `--candidate-embeddings=provider:path` reuses pre-computed vectors for the
  candidates, skipping deterministic generation when dimensions match.

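The `provider:path` value format and the "when dimensions match" condition can be illustrated with two small helpers. Both are hypothetical sketches rather than the script's own code: `parseProviderPath`, `canReuseVectors`, and the example path are invented for this illustration.

```javascript
// Hypothetical parser for `provider:path` values. Splitting on the first
// colon only keeps paths that themselves contain colons intact.
function parseProviderPath(value) {
  const separator = value.indexOf(':');
  if (separator <= 0 || separator === value.length - 1) {
    throw new Error(`Expected provider:path, received "${value}"`);
  }
  return {
    provider: value.slice(0, separator),
    path: value.slice(separator + 1),
  };
}

// Hypothetical check: pre-computed vectors are reusable only when every one
// has the dimensionality the existing store expects.
function canReuseVectors(vectors, expectedDimensions) {
  return (
    vectors.length > 0 &&
    vectors.every((vector) => Array.isArray(vector) && vector.length === expectedDimensions)
  );
}

// The path below is an invented example, not a location used by the toolbox.
const parsed = parseProviderPath('openai:data/embeddings/openai-candidates.json');
console.log(parsed.provider, parsed.path);
console.log(canReuseVectors([[0.1, 0.2, 0.3]], 3)); // true
```

When the check fails, a caller would fall back to generating deterministic vectors, as the option descriptions above state.
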
## After accepting new jokes

1. Inspect `reports/new-jokes-accepted.json` and manually fold the approved
   entries into `apps/jokes/jokes.js`.
2. Run the usual dataset maintenance scripts (`node tools/update-content-metadata.js`,
   `node tools/review-content-similarity.js`, and
   `node tools/generate-asset-report.js`) so manifests, embeddings, and the asset
   observatory stay aligned.
3. Commit the refreshed datasets, manifests, similarity reports, and summary
   artifacts alongside the onboarding report for traceability.
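
If you prefer to run the three maintenance scripts from step 2 in one go, a small wrapper along these lines could work. It is a convenience sketch, not part of the toolbox: `runMaintenance` and `runWithNode` are hypothetical names, and the sketch assumes it is executed as a CommonJS script from the repository root, with the scripts run in the order listed above.

```javascript
// Script paths taken from step 2 above, in the order they are listed.
const MAINTENANCE_SCRIPTS = [
  'tools/update-content-metadata.js',
  'tools/review-content-similarity.js',
  'tools/generate-asset-report.js',
];

// Default runner: executes one script with the current Node binary.
// execFileSync throws on a non-zero exit code, which stops the sequence.
function runWithNode(script) {
  const { execFileSync } = require('node:child_process');
  execFileSync(process.execPath, [script], { stdio: 'inherit' });
}

// Runs every maintenance script in order. The runner is injectable so the
// sequence can be dry-run or tested without touching the datasets.
function runMaintenance(run = runWithNode) {
  for (const script of MAINTENANCE_SCRIPTS) {
    run(script);
  }
  return MAINTENANCE_SCRIPTS.length;
}

// Dry run: print the commands instead of executing them.
runMaintenance((script) => console.log(`node ${script}`));
```

Calling `runMaintenance()` with no argument executes the scripts for real; because a failing script throws, later steps never run against half-updated data.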