docs: add anonymizer Claude Code skill and supporting concept docs #153

lipikaramaswamy wants to merge 5 commits into
Conversation
- skills/anonymizer/ — Claude Code skill (SKILL.md + interactive workflow) that walks users through configuring Anonymizer.
- docs/concepts/choosing-a-strategy.md — decision guide for mode (Replace vs Rewrite), strategy, privacy goal phrasing, and detection knobs. Doubles as the primary agent reference.
- docs/troubleshooting.md — symptom-first guide for dropped rows, leakage, low utility, and pipeline failures.
- mkdocs.yml — add the two new docs to navigation.
- README.md — add "Using with Claude Code" section pointing at the skills.sh installer.
- src/anonymizer/__init__.py — export `PrivacyGoal` at the top level (referenced from the skill's output template).
- docs/concepts/detection.md — minor wording polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
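For reference, the `__init__.py` change makes the import used by the skill's output template work from the package root. A minimal sketch — the previous deep-import path shown in the comment is an assumption, not taken from the repo:

```python
# Previously, PrivacyGoal required a deep import (exact module path assumed):
# from anonymizer.interface.config import PrivacyGoal
# With this PR it is a top-level export, matching the skill's output template:
from anonymizer import PrivacyGoal
```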
Surfaced during manual skill testing: the workflow jumped from "verify install" straight to data inspection and only mentioned provider / API-key setup in the reactive Troubleshooting section. The agent would discover a missing provider only after the user had spent time on data inspection and clarification and then watched the preview fail.

- Extend step 1 to also verify provider config exists (API key env var + providers.yaml) and STOP with a pointer at docs/concepts/models.md if either is missing.
- Add a Clarify-step question that asks whether to use shipped defaults or a custom providers.yaml, so the generated script can pass the path via `Anonymizer(model_providers=...)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
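A minimal sketch of what the extended step-1 check might look like. The env var name and the exact STOP wording are assumptions based on this commit message, not the actual skill text:

```python
import os
from pathlib import Path

API_KEY_ENV = "ANONYMIZER_API_KEY"  # assumption: the real variable depends on the provider
PROVIDERS_YAML = Path("providers.yaml")

def verify_provider_setup() -> bool:
    """Step 1: fail fast, before the user invests time in data inspection."""
    if not os.environ.get(API_KEY_ENV) or not PROVIDERS_YAML.exists():
        # Mirrors the skill's STOP instruction: point at the provider docs and halt.
        print("STOP: provider not configured — see docs/concepts/models.md")
        return False
    return True
```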
Surfaced during manual skill testing: the script the agent generates prints summary stats but doesn't persist any trace data. Investigating "why was this entity kept/dropped?" or "why did this row never converge during repair?" required re-running an 8-minute preview, paying tokens again.

- Output template now writes result.trace_dataframe to preview.parquet on every preview run. trace_dataframe is a superset of the user-facing dataframe (it includes all internal columns).
- Single parquet file (not CSV + parquet) for format consistency. trace columns include dict/list values that don't round-trip through CSV cleanly.
- interactive.md step 6 deeper-inspection line updated to point at the saved file instead of suggesting an interactive Python re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
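The template change itself is small; a sketch of the preview branch after this commit (the preview slice size is illustrative, and the save is subject to the `to_parquet` problem tracked as #152):

```python
# Preview run: persist the full trace so "why was this entity kept/dropped?"
# can be answered later without re-running the 8-minute preview.
result = anonymizer.run(config=config, data=data.head(20))  # slice size illustrative
result.trace_dataframe.to_parquet("preview.parquet")  # superset of result.dataframe
```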
Greptile Summary

This PR adds a Claude Code skill (`skills/anonymizer/`) plus supporting concept docs and a troubleshooting guide.
Confidence Score: 4/5

Safe to merge for the docs and the `__init__.py` export; the generated script template in SKILL.md has two broken lines that will surface as runtime errors on every user's first full run. The `result.output_path` access in the `--full` branch will raise `AttributeError` on `AnonymizerResult` for every user who follows the skill's step 8, since that attribute does not exist on the result object. Combined with the already-tracked `to_parquet` issue (#152), the template fails on both the preview save and the full-run save path before any useful output reaches the user. `skills/anonymizer/SKILL.md` — the Output Template's full-run branch and preview-save line both need fixing before users can complete the end-to-end workflow.

Important Files Changed
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[User invokes /anonymizer] --> B[Step 1: Verify env\ncheck import + API key + providers.yaml]
B -->|Missing| Z[STOP: walk user through setup]
B -->|OK| C[Step 2: Inspect data\npandas read first rows]
C --> D[Step 3: Clarify\ngoal · mode · domain labels · risk tolerance]
D --> E[Step 4: Plan\nstate config intent, confirm with user]
E --> F[Step 5: Build\nwrite anonymize_*.py from Output Template]
F --> G[Step 6: Preview\npython script.py]
G --> H{failed_records?}
H -->|Yes| I[STOP: fix infra/rate-limit issue\ndocs/troubleshooting.md]
I --> G
H -->|No| J[Read leakage / utility summary\nload preview.parquet for trace]
J --> K{Quality OK?}
K -->|No| L[Step 7: Iterate\napply knob change, re-preview]
L --> G
K -->|Yes| M[Step 8: Finalize\npython script.py --full]
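Putting the flowchart's build/preview/finalize steps together, the generated script's overall shape is roughly the following sketch. Argument handling and I/O paths are illustrative; only `Anonymizer(model_providers=...)`, `run(config=..., data=...)`, and the `dataframe`/`trace_dataframe`/`failed_records` result attributes are taken from this PR's discussion:

```python
import argparse

import pandas as pd

from anonymizer import Anonymizer

parser = argparse.ArgumentParser()
parser.add_argument("--full", action="store_true", help="process all rows, not the preview slice")
args = parser.parse_args()

data = pd.read_parquet("input.parquet")  # illustrative input path
anonymizer = Anonymizer(model_providers="providers.yaml")  # or Anonymizer() for shipped defaults
config = ...  # mode/strategy settings agreed on in the Clarify step

if args.full:
    # Step 8: finalize on the full dataset.
    result = anonymizer.run(config=config, data=data)
    result.dataframe.to_parquet("output.parquet")
else:
    # Step 6: preview on a small slice and persist the trace for inspection.
    result = anonymizer.run(config=config, data=data.head(20))
    result.trace_dataframe.to_parquet("preview.parquet")
    if len(result.failed_records):
        print("failed_records present — stop and see docs/troubleshooting.md")
```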
Reviews (3): Last reviewed commit: "docs(skill): address review on data_summ..."
Once #149 merges, this PR needs a follow-up edit to AGENTS.md:
- The domain (clinical, legal, financial, customer support, etc.)
- The genre (notes, transcripts, opinions, biographies)
- Anything about the source the engine couldn't infer from a single record (e.g. "transcribed phone calls — expect disfluencies")
- *(Augmenter-only)* When you leave `entity_labels=None`, this is currently also the only place to nudge the augmenter LLM **away** from inventing labels you don't care about (e.g. "do not tag generic anatomical terms, medication class names, or job titles as PII"). Treat it as a soft do-not-tag list.
I'm not following — I thought `entity_labels=None` means use the default list, and the augmenter can add whatever other entities/labels it deems necessary.
What I was trying to say here is that when `entity_labels=None`, yes, the augmenter goes and finds whatever, but the only way to tell it what *not* to find is by saying it in `data_summary`. This is what I did for the Nemotron logs.
Ok — rewrote the bullet to lead with what `data_summary` is actually for.
- Healthcare: `mrn`, `clinical_facility`, `diagnosis_code`, `medication_name`
- Legal: `case_number`, `court_name`, `docket_number`, `judge_name`
- Customer support: `ticket_id`, `internal_user_id`, `transaction_id`
- Internal: `employee_id`, `cost_center`, `internal_project_codename`
`mrn` is in the default list, as is `employee_id`, and possibly `case_number` and `court_name`.
Good catch — fixed in de0ab26. Cross-checked all examples against the actual `DEFAULT_ENTITY_LABELS` list and dropped redundant ones (`mrn`, `court_name`, `employee_id`). Also switched all examples to snake_case to match the convention — the validator only strips/lowercases, so `medical record number` and `medical_record_number` would have been treated as different labels.

Note on `case_number`: verified it's not in `DEFAULT_ENTITY_LABELS`, so I kept that one.
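To make the validator point concrete — a sketch of the assumed normalization (strip + lowercase only, no space-to-underscore conversion):

```python
def normalize_label(label: str) -> str:
    # Assumed validator behavior per the comment above: strip and lowercase, nothing else.
    return label.strip().lower()

assert normalize_label("Medical Record Number ") == "medical record number"
assert normalize_label("medical_record_number") == "medical_record_number"
# The normalized forms differ, so the space-separated spelling is a distinct label.
```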
| | Replace | Rewrite |
|---|---|---|
| Is the goal "produce a privacy-safe version of this text that downstream models can train on"? | — | ✅ |
| Are there inferable / latent identifiers that aren't explicitly stated (e.g. "ringing the bell" → cancer treatment)? | ❌ leaves them | ✅ removes them |
| Cost per record | ~1 LLM call (Substitute) or 0 LLM calls (Redact/Annotate/Hash) | Many LLM calls (domain → disposition → QA → rewrite → evaluate → repair → judge) |
| Output text length | ≈ same as input | Often shorter / restructured |
There's still an LLM call for redact/annotate/hash.
I had meant that to be the number of LLM calls over and above detection. Rephrased to clarify.
- **Mode**: Replace (entities only) vs Rewrite (full text transformation that also removes inferable identifiers like "ringing the bell" → cancer)? Default per the rule in `SKILL.md`.
- **For Replace**: which strategy? (`Substitute` for realistic-looking, `Redact` for explicit `[REDACTED_…]` markers, `Hash` for stable cross-row identifiers, `Annotate` for inspection only).
- **For Rewrite**: what must be protected? what must be preserved? how strict (`risk_tolerance`)? Read [`docs/concepts/choosing-a-strategy.md`](../../../docs/concepts/choosing-a-strategy.md) sections 5–6 with the user's answers in mind.
- **Domain-specific entity labels** the defaults won't cover (e.g. `"medical record number"`, `"case number"`, `"internal project codename"`). If yes, read [`docs/concepts/choosing-a-strategy.md`](../../../docs/concepts/choosing-a-strategy.md) section 2.
Again, medical record number and case number are bad examples
# Usage Tips and Common Pitfalls

- **`Detect.entity_labels=None` (the default) is permissive** — the augmenter LLM may invent labels not in `DEFAULT_ENTITY_LABELS`. Setting an explicit list switches to **strict mode** where *only* the listed labels are detected. To add domain labels, *extend* the default, don't replace it: `entity_labels=DEFAULT_ENTITY_LABELS + ["medical record number", ...]`.
- **GLiNER is zero-shot** — entity labels are natural-language strings (`"medical record number"`, `"internal project codename"`), not codes or enum values. Any label you can describe in English is a label GLiNER can detect.
Again, medical record number is in the default list
Address review feedback from @asteier2026 on PR #153:

- `mrn`, `court_name`, and `employee_id` were listed as examples of domain-specific labels to *extend* the default list with, but all three are already in DEFAULT_ENTITY_LABELS (as `medical_record_number`, `court_name`, and `employee_id`).
- Some examples used space-separated natural-language strings ("medical record number", "internal project codename") while the default list uses snake_case. Spaces are not normalized by the validator, so the two forms would map to different labels.

Swap in non-default snake_case examples (e.g. `clinical_facility`, `diagnosis_code`, `case_number`, `internal_project_codename`) and add one sentence noting the snake_case convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
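A sketch of the resulting extend-don't-replace pattern with the corrected examples — the import locations of `Detect` and `DEFAULT_ENTITY_LABELS` and the constructor shape are assumptions:

```python
from anonymizer import DEFAULT_ENTITY_LABELS, Detect  # import locations assumed

detect = Detect(
    entity_labels=DEFAULT_ENTITY_LABELS + [
        "clinical_facility",          # healthcare
        "diagnosis_code",
        "case_number",                # legal; verified not in the default list
        "internal_project_codename",  # internal
    ]
)
```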
- Rewrite the augmenter-steering bullet in active voice, leading with what `data_summary` actually is for. Drops the "(Augmenter-only)" parenthetical that was hiding the point.
- Rename the "Cost per record" comparison row to "Additional LLM calls (beyond shared detection)" to make explicit that detection runs in both modes and these numbers exclude it — so the 0 for Redact/Annotate/Hash is correct (no additional LLM work beyond detection).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
    if args.full:
        result = anonymizer.run(config=config, data=data)
        out_path = result.output_path
        print(f"Wrote {len(result.dataframe)} rows to {out_path}")
`AnonymizerResult` has no `output_path` attribute — it only exposes `dataframe`, `trace_dataframe`, and `failed_records` (see `src/anonymizer/interface/results.py`). Accessing `result.output_path` on the full-run path will raise `AttributeError` every time the user runs with `--full`, before any useful output is printed or saved.
Suggested change:

    - if args.full:
    -     result = anonymizer.run(config=config, data=data)
    -     out_path = result.output_path
    -     print(f"Wrote {len(result.dataframe)} rows to {out_path}")
    + if args.full:
    +     result = anonymizer.run(config=config, data=data)
    +     out_path = "output.parquet"  # TODO: change path/format (.csv, .jsonl) as needed
    +     result.dataframe.to_parquet(out_path)
    +     print(f"Wrote {len(result.dataframe)} rows to {out_path}")
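For completeness — the preview-save line flagged in the score summary is the other failing path. It can't be fully fixed in the template alone (it waits on #152), but a sketch of the symmetric branch, reusing the context variables from the template above:

```python
if args.full:
    result = anonymizer.run(config=config, data=data)
    result.dataframe.to_parquet("output.parquet")
else:
    # Preview branch: persist the trace for later inspection.
    # Readable output depends on #152 (trace columns contain dict/list values).
    result = anonymizer.run(config=config, data=data.head(20))
    result.trace_dataframe.to_parquet("preview.parquet")
```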
Summary
- `skills/anonymizer/` — elicits dataset context, recommends mode + strategy, drafts a runnable Python script, iterates with the user via the workflow in `interactive.md`.
- `docs/concepts/choosing-a-strategy.md` (215L) — decision guide for mode (Replace vs Rewrite), strategy choice, privacy goal phrasing, and detection knobs. Doubles as the primary agent reference.
- `docs/troubleshooting.md` (210L) — symptom-first guide for dropped rows, leakage, low utility, and pipeline failures.
- `README.md` — new "Using with Claude Code" section pointing at the `skills.sh` installer.
- `mkdocs.yml` — adds the two new docs to navigation.
- `src/anonymizer/__init__.py` — exports `PrivacyGoal` at the top level (previously only accessible via deep import despite being referenced in docs and the skill's output template).
- `docs/concepts/detection.md` — minor wording polish.

Testing notes
Tested end-to-end with 20 PMC-Patients clinical case reports. The full interactive workflow was exercised: install verify → provider check → data inspect → clarify → plan → build → preview → results inspection. Two real workflow gaps surfaced during testing and were fixed:
- Step 1 now verifies provider config (API key env var + providers.yaml) before any data work (commit 8ccd277)
- Preview runs now persist `result.trace_dataframe` to `preview.parquet` for inspection (commit 5c5d5a0)

The skill loader was also verified:
- `npx skills add` (skills.sh) discovers the skill correctly via the SKILL.md frontmatter.

Related issues filed during this work
- #148 — replace the hard-coded intermediate column names in `llm_replace_workflow.py:53-55` with `COL_*` constants (existing STYLEGUIDE rule violation)
- #152 — `result.trace_dataframe` not persistable via `to_parquet`. Blocks the skill's `preview.parquet` save from being readable until fixed. Same 3 lines as #148 (chore: use COL_* constants for intermediate columns in llm_replace_workflow.py).
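Until #152 lands, one plausible stopgap is JSON-encoding the offending columns before the save — a sketch, assuming the failure is the parquet writer rejecting object columns holding dicts/lists (not verified against the actual trace schema):

```python
import json

import pandas as pd

def save_trace(trace: pd.DataFrame, path: str = "preview.parquet") -> None:
    safe = trace.copy()
    for col in safe.columns:
        # JSON-encode columns holding dict/list values so the parquet writer sees strings.
        if safe[col].map(lambda v: isinstance(v, (dict, list))).any():
            safe[col] = safe[col].map(json.dumps)
    safe.to_parquet(path)
```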
Test plan

- Local checks pass (`format-check`, `copyright-check`, `mkdocs build --strict`)
- `from anonymizer import PrivacyGoal` succeeds
- `npx skills add` against this branch