docs: add anonymizer Claude Code skill and supporting concept docs #153

lipikaramaswamy wants to merge 5 commits into
Conversation
- skills/anonymizer/ — Claude Code skill (SKILL.md + interactive workflow) that walks users through configuring Anonymizer.
- docs/concepts/choosing-a-strategy.md — decision guide for mode (Replace vs Rewrite), strategy, privacy goal phrasing, and detection knobs. Doubles as the primary agent reference.
- docs/troubleshooting.md — symptom-first guide for dropped rows, leakage, low utility, and pipeline failures.
- mkdocs.yml — add the two new docs to navigation.
- README.md — add "Using with Claude Code" section pointing at the skills.sh installer.
- src/anonymizer/__init__.py — export `PrivacyGoal` at the top level (referenced from the skill's output template).
- docs/concepts/detection.md — minor wording polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
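For reference, the `__init__.py` change makes the import used by the skill's output template work from the package root. A minimal sketch — the previous deep-import path shown in the comment is an assumption, not taken from the repo:

```python
# Previously, PrivacyGoal required a deep import (exact module path assumed):
# from anonymizer.interface.config import PrivacyGoal
# With this PR it is a top-level export, matching the skill's output template:
from anonymizer import PrivacyGoal
```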
Surfaced during manual skill testing: the workflow jumped from "verify install" straight to data inspection and only mentioned provider / API-key setup in the reactive Troubleshooting section. The agent would discover a missing provider only after the user had spent time on data inspection and clarification and then watched the preview fail.

- Extend step 1 to also verify provider config exists (API key env var + providers.yaml) and STOP with a pointer at docs/concepts/models.md if either is missing.
- Add a Clarify-step question that asks whether to use shipped defaults or a custom providers.yaml, so the generated script can pass the path via `Anonymizer(model_providers=...)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
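A minimal sketch of what the extended step-1 check might look like. The env var name and the exact STOP wording are assumptions based on this commit message, not the actual skill text:

```python
import os
from pathlib import Path

API_KEY_ENV = "ANONYMIZER_API_KEY"  # assumption: the real variable depends on the provider
PROVIDERS_YAML = Path("providers.yaml")

def verify_provider_setup() -> bool:
    """Step 1: fail fast, before the user invests time in data inspection."""
    if not os.environ.get(API_KEY_ENV) or not PROVIDERS_YAML.exists():
        # Mirrors the skill's STOP instruction: point at the provider docs and halt.
        print("STOP: provider not configured — see docs/concepts/models.md")
        return False
    return True
```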
Surfaced during manual skill testing: the script the agent generates prints summary stats but doesn't persist any trace data. Investigating "why was this entity kept/dropped?" or "why did this row never converge during repair?" required re-running an 8-minute preview, paying tokens again.

- Output template now writes result.trace_dataframe to preview.parquet on every preview run. trace_dataframe is a superset of the user-facing dataframe (it includes all internal columns).
- Single parquet file (not CSV + parquet) for format consistency. trace columns include dict/list values that don't round-trip through CSV cleanly.
- interactive.md step 6 deeper-inspection line updated to point at the saved file instead of suggesting an interactive Python re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
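The template change itself is small; a sketch of the preview branch after this commit (the preview slice size is illustrative, and the save is subject to the `to_parquet` problem tracked as #152):

```python
# Preview run: persist the full trace so "why was this entity kept/dropped?"
# can be answered later without re-running the 8-minute preview.
result = anonymizer.run(config=config, data=data.head(20))  # slice size illustrative
result.trace_dataframe.to_parquet("preview.parquet")  # superset of result.dataframe
```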
Greptile Summary

This PR adds a Claude Code skill (`skills/anonymizer/`) plus supporting concept docs and a troubleshooting guide.
Confidence Score: 4/5

Safe to merge for the docs and the `__init__.py` export; the generated script template in SKILL.md has two broken lines that will surface as runtime errors on every user's first full run. The `result.output_path` access in the `--full` branch will raise `AttributeError` on `AnonymizerResult` for every user who follows the skill's step 8, since that attribute does not exist on the result object. Combined with the already-tracked `to_parquet` issue (#152), the template fails on both the preview save and the full-run save path before any useful output reaches the user. `skills/anonymizer/SKILL.md` — the Output Template's full-run branch and preview-save line both need fixing before users can complete the end-to-end workflow.

Important Files Changed
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[User invokes /anonymizer] --> B[Step 1: Verify env\ncheck import + API key + providers.yaml]
B -->|Missing| Z[STOP: walk user through setup]
B -->|OK| C[Step 2: Inspect data\npandas read first rows]
C --> D[Step 3: Clarify\ngoal · mode · domain labels · risk tolerance]
D --> E[Step 4: Plan\nstate config intent, confirm with user]
E --> F[Step 5: Build\nwrite anonymize_*.py from Output Template]
F --> G[Step 6: Preview\npython script.py]
G --> H{failed_records?}
H -->|Yes| I[STOP: fix infra/rate-limit issue\ndocs/troubleshooting.md]
I --> G
H -->|No| J[Read leakage / utility summary\nload preview.parquet for trace]
J --> K{Quality OK?}
K -->|No| L[Step 7: Iterate\napply knob change, re-preview]
L --> G
K -->|Yes| M[Step 8: Finalize\npython script.py --full]
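Putting the flowchart's build/preview/finalize steps together, the generated script's overall shape is roughly the following sketch. Argument handling and I/O paths are illustrative; only `Anonymizer(model_providers=...)`, `run(config=..., data=...)`, and the `dataframe`/`trace_dataframe`/`failed_records` result attributes are taken from this PR's discussion:

```python
import argparse

import pandas as pd

from anonymizer import Anonymizer

parser = argparse.ArgumentParser()
parser.add_argument("--full", action="store_true", help="process all rows, not the preview slice")
args = parser.parse_args()

data = pd.read_parquet("input.parquet")  # illustrative input path
anonymizer = Anonymizer(model_providers="providers.yaml")  # or Anonymizer() for shipped defaults
config = ...  # mode/strategy settings agreed on in the Clarify step

if args.full:
    # Step 8: finalize on the full dataset.
    result = anonymizer.run(config=config, data=data)
    result.dataframe.to_parquet("output.parquet")
else:
    # Step 6: preview on a small slice and persist the trace for inspection.
    result = anonymizer.run(config=config, data=data.head(20))
    result.trace_dataframe.to_parquet("preview.parquet")
    if len(result.failed_records):
        print("failed_records present — stop and see docs/troubleshooting.md")
```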
Reviews (3): Last reviewed commit: "docs(skill): address review on data_summ..."
Once #149 merges, this PR needs a follow-up edit to AGENTS.md:
- The domain (clinical, legal, financial, customer support, etc.)
- The genre (notes, transcripts, opinions, biographies)
- Anything about the source the engine couldn't infer from a single record (e.g. "transcribed phone calls — expect disfluencies")
- *(Augmenter-only)* When you leave `entity_labels=None`, this is currently also the only place to nudge the augmenter LLM **away** from inventing labels you don't care about (e.g. "do not tag generic anatomical terms, medication class names, or job titles as PII"). Treat it as a soft do-not-tag list.
I'm not following — I thought `entity_labels=None` means use the default list, and the augmenter can add whatever other entities/labels it deems necessary.
What I was trying to say here is that when `entity_labels=None`, yes, the augmenter goes and finds whatever, but the only way to tell it what *not* to find is by saying it in `data_summary`. This is what I did for the Nemotron logs.
Ok — rewrote the bullet to lead with what `data_summary` is actually for.
- Healthcare: `mrn`, `clinical_facility`, `diagnosis_code`, `medication_name`
- Legal: `case_number`, `court_name`, `docket_number`, `judge_name`
- Customer support: `ticket_id`, `internal_user_id`, `transaction_id`
- Internal: `employee_id`, `cost_center`, `internal_project_codename`
`mrn` is in the default list, as is `employee_id`, and possibly `case_number` and `court_name`.
Good catch — fixed in de0ab26. Cross-checked all examples against the actual `DEFAULT_ENTITY_LABELS` list and dropped redundant ones (`mrn`, `court_name`, `employee_id`). Also switched all examples to snake_case to match the convention — the validator only strips/lowercases, so `medical record number` and `medical_record_number` would have been treated as different labels.

Note on `case_number`: verified it's not in `DEFAULT_ENTITY_LABELS`, so I kept that one.
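To make the validator point concrete — a sketch of the assumed normalization (strip + lowercase only, no space-to-underscore conversion):

```python
def normalize_label(label: str) -> str:
    # Assumed validator behavior per the comment above: strip and lowercase, nothing else.
    return label.strip().lower()

assert normalize_label("Medical Record Number ") == "medical record number"
assert normalize_label("medical_record_number") == "medical_record_number"
# The normalized forms differ, so the space-separated spelling is a distinct label.
```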
| | Replace | Rewrite |
|---|---|---|
| Is the goal "produce a privacy-safe version of this text that downstream models can train on"? | — | ✅ |
| Are there inferable / latent identifiers that aren't explicitly stated (e.g. "ringing the bell" → cancer treatment)? | ❌ leaves them | ✅ removes them |
| Cost per record | ~1 LLM call (Substitute) or 0 LLM calls (Redact/Annotate/Hash) | Many LLM calls (domain → disposition → QA → rewrite → evaluate → repair → judge) |
| Output text length | ≈ same as input | Often shorter / restructured |
There's still an LLM call for redact/annotate/hash.
I had meant that to be the number of LLM calls over and above detection. Rephrased to clarify.
- **Mode**: Replace (entities only) vs Rewrite (full text transformation that also removes inferable identifiers like "ringing the bell" → cancer)? Default per the rule in `SKILL.md`.
- **For Replace**: which strategy? (`Substitute` for realistic-looking, `Redact` for explicit `[REDACTED_…]` markers, `Hash` for stable cross-row identifiers, `Annotate` for inspection only).
- **For Rewrite**: what must be protected? what must be preserved? how strict (`risk_tolerance`)? Read [`docs/concepts/choosing-a-strategy.md`](../../../docs/concepts/choosing-a-strategy.md) sections 5–6 with the user's answers in mind.
- **Domain-specific entity labels** the defaults won't cover (e.g. `"medical record number"`, `"case number"`, `"internal project codename"`). If yes, read [`docs/concepts/choosing-a-strategy.md`](../../../docs/concepts/choosing-a-strategy.md) section 2.
Again, medical record number and case number are bad examples
# Usage Tips and Common Pitfalls

- **`Detect.entity_labels=None` (the default) is permissive** — the augmenter LLM may invent labels not in `DEFAULT_ENTITY_LABELS`. Setting an explicit list switches to **strict mode** where *only* the listed labels are detected. To add domain labels, *extend* the default, don't replace it: `entity_labels=DEFAULT_ENTITY_LABELS + ["medical record number", ...]`.
- **GLiNER is zero-shot** — entity labels are natural-language strings (`"medical record number"`, `"internal project codename"`), not codes or enum values. Any label you can describe in English is a label GLiNER can detect.
Again, medical record number is in the default list
Address review feedback from @asteier2026 on PR #153:

- `mrn`, `court_name`, and `employee_id` were listed as examples of domain-specific labels to *extend* the default list with, but all three are already in DEFAULT_ENTITY_LABELS (as `medical_record_number`, `court_name`, and `employee_id`).
- Some examples used space-separated natural-language strings ("medical record number", "internal project codename") while the default list uses snake_case. Spaces are not normalized by the validator, so the two forms would map to different labels.

Swap in non-default snake_case examples (e.g. `clinical_facility`, `diagnosis_code`, `case_number`, `internal_project_codename`) and add one sentence noting the snake_case convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
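A sketch of the resulting extend-don't-replace pattern with the corrected examples — the import locations of `Detect` and `DEFAULT_ENTITY_LABELS` and the constructor shape are assumptions:

```python
from anonymizer import DEFAULT_ENTITY_LABELS, Detect  # import locations assumed

detect = Detect(
    entity_labels=DEFAULT_ENTITY_LABELS + [
        "clinical_facility",          # healthcare
        "diagnosis_code",
        "case_number",                # legal; verified not in the default list
        "internal_project_codename",  # internal
    ]
)
```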
- Rewrite the augmenter-steering bullet in active voice, leading with what `data_summary` actually is for. Drops the "(Augmenter-only)" parenthetical that was hiding the point.
- Rename the "Cost per record" comparison row to "Additional LLM calls (beyond shared detection)" to make explicit that detection runs in both modes and these numbers exclude it — so the 0 for Redact/Annotate/Hash is correct (no additional LLM work beyond detection).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
    if args.full:
        result = anonymizer.run(config=config, data=data)
        out_path = result.output_path
        print(f"Wrote {len(result.dataframe)} rows to {out_path}")
`AnonymizerResult` has no `output_path` attribute — it only exposes `dataframe`, `trace_dataframe`, and `failed_records` (see `src/anonymizer/interface/results.py`). Accessing `result.output_path` on the full-run path will raise `AttributeError` every time the user runs with `--full`, before any useful output is printed or saved.
Suggested change:

    - if args.full:
    -     result = anonymizer.run(config=config, data=data)
    -     out_path = result.output_path
    -     print(f"Wrote {len(result.dataframe)} rows to {out_path}")
    + if args.full:
    +     result = anonymizer.run(config=config, data=data)
    +     out_path = "output.parquet"  # TODO: change path/format (.csv, .jsonl) as needed
    +     result.dataframe.to_parquet(out_path)
    +     print(f"Wrote {len(result.dataframe)} rows to {out_path}")
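For completeness — the preview-save line flagged in the score summary is the other failing path. It can't be fully fixed in the template alone (it waits on #152), but a sketch of the symmetric branch, reusing the context variables from the template above:

```python
if args.full:
    result = anonymizer.run(config=config, data=data)
    result.dataframe.to_parquet("output.parquet")
else:
    # Preview branch: persist the trace for later inspection.
    # Readable output depends on #152 (trace columns contain dict/list values).
    result = anonymizer.run(config=config, data=data.head(20))
    result.trace_dataframe.to_parquet("preview.parquet")
```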
Summary
- `skills/anonymizer/` — elicits dataset context, recommends mode + strategy, drafts a runnable Python script, iterates with the user via the workflow in `interactive.md`.
- `docs/concepts/choosing-a-strategy.md` (215L) — decision guide for mode (Replace vs Rewrite), strategy choice, privacy goal phrasing, and detection knobs. Doubles as the primary agent reference.
- `docs/troubleshooting.md` (210L) — symptom-first guide for dropped rows, leakage, low utility, and pipeline failures.
- `README.md` — new "Using with Claude Code" section pointing at the `skills.sh` installer.
- `mkdocs.yml` — adds the two new docs to navigation.
- `src/anonymizer/__init__.py` — exports `PrivacyGoal` at the top level (previously only accessible via deep import despite being referenced in docs and the skill's output template).
- `docs/concepts/detection.md` — minor wording polish.

Testing notes
Tested end-to-end with 20 PMC-Patients clinical case reports. The full interactive workflow was exercised: install verify → provider check → data inspect → clarify → plan → build → preview → results inspection. Two real workflow gaps surfaced during testing and were fixed:
- Step 1 now verifies provider config (API key env var + providers.yaml) before any data work (commit 8ccd277)
- Preview runs now persist `result.trace_dataframe` to `preview.parquet` for inspection (commit 5c5d5a0)

The skill loader was also verified:
- `npx skills add` (skills.sh) discovers the skill correctly via the SKILL.md frontmatter.

Related issues filed during this work
- #148 — replace the hard-coded intermediate column names in `llm_replace_workflow.py:53-55` with `COL_*` constants (existing STYLEGUIDE rule violation)
- #152 — `result.trace_dataframe` not persistable via `to_parquet`. Blocks the skill's `preview.parquet` save from being readable until fixed. Same 3 lines as #148 (chore: use COL_* constants for intermediate columns in llm_replace_workflow.py).
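Until #152 lands, one plausible stopgap is JSON-encoding the offending columns before the save — a sketch, assuming the failure is the parquet writer rejecting object columns holding dicts/lists (not verified against the actual trace schema):

```python
import json

import pandas as pd

def save_trace(trace: pd.DataFrame, path: str = "preview.parquet") -> None:
    safe = trace.copy()
    for col in safe.columns:
        # JSON-encode columns holding dict/list values so the parquet writer sees strings.
        if safe[col].map(lambda v: isinstance(v, (dict, list))).any():
            safe[col] = safe[col].map(json.dumps)
    safe.to_parquet(path)
```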
Test plan

- Local checks pass (`format-check`, `copyright-check`, `mkdocs build --strict`)
- `from anonymizer import PrivacyGoal` succeeds
- `npx skills add` against this branch