
docs: add anonymizer Claude Code skill and supporting concept docs#153

Open
lipikaramaswamy wants to merge 5 commits into main from lipikaramaswamy/docs/anonymizer-skill

Conversation

@lipikaramaswamy
Collaborator

Summary

  • Claude Code skill at skills/anonymizer/ — elicits dataset context, recommends mode + strategy, drafts a runnable Python script, iterates with the user via the workflow in interactive.md.
  • docs/concepts/choosing-a-strategy.md (215L) — decision guide for mode (Replace vs Rewrite), strategy choice, privacy goal phrasing, and detection knobs. Doubles as the primary agent reference.
  • docs/troubleshooting.md (210L) — symptom-first guide for dropped rows, leakage, low utility, and pipeline failures.
  • README.md — new "Using with Claude Code" section pointing at the skills.sh installer.
  • mkdocs.yml — adds the two new docs to navigation.
  • src/anonymizer/__init__.py — exports PrivacyGoal at the top level (previously only accessible via deep import despite being referenced in docs and the skill's output template).
  • docs/concepts/detection.md — minor wording polish.
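The `PrivacyGoal` re-export described above can be sketched with a stand-in package built in memory; the module names here (`pkg`, `pkg.config.rewrite`) are illustrative, not the real anonymizer layout:

```python
import sys
import types

# Build a stand-in package in memory to illustrate the re-export pattern
# this PR applies to PrivacyGoal. All names below are illustrative.
deep = types.ModuleType("pkg.config.rewrite")

class PrivacyGoal:  # stand-in for the real Pydantic model
    pass

deep.PrivacyGoal = PrivacyGoal
sys.modules["pkg.config.rewrite"] = deep

# Before: users needed the deep import path above.
# After: the package __init__ re-exports the symbol at the top level,
# i.e. `from pkg.config.rewrite import PrivacyGoal` inside __init__.py.
top = types.ModuleType("pkg")
top.PrivacyGoal = deep.PrivacyGoal
top.__all__ = ["PrivacyGoal"]
sys.modules["pkg"] = top

from pkg import PrivacyGoal as TopLevelGoal  # resolves without a deep import
assert TopLevelGoal is PrivacyGoal
```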

Testing notes

Tested end-to-end with 20 PMC-Patients clinical case reports. The interactive workflow was exercised: install verify → provider check → data inspect → clarify → plan → build → preview → results inspection. Two real workflow gaps surfaced and were fixed during testing:

  • Provider configuration wasn't checked proactively → added step 1 environment verification + step 3 provider question (commit 8ccd277)
  • Trace data wasn't persisted by default → script now saves preview.parquet for inspection (commit 5c5d5a0)

The skill loader was also verified: npx skills add (skills.sh) discovers the skill correctly via the SKILL.md frontmatter.

Related issues filed during this work

Test plan

  • CI passes (format-check, copyright-check, mkdocs build --strict)
  • from anonymizer import PrivacyGoal succeeds
  • Skill discoverable via npx skills add against this branch
  • Skill workflow produces a sensible script for a sample dataset (verified manually with PMC-Patients data)

lipikaramaswamy and others added 3 commits May 11, 2026 10:22
- skills/anonymizer/ — Claude Code skill (SKILL.md + interactive
  workflow) that walks users through configuring Anonymizer.
- docs/concepts/choosing-a-strategy.md — decision guide for mode
  (Replace vs Rewrite), strategy, privacy goal phrasing, and
  detection knobs. Doubles as the primary agent reference.
- docs/troubleshooting.md — symptom-first guide for dropped rows,
  leakage, low utility, and pipeline failures.
- mkdocs.yml — add the two new docs to navigation.
- README.md — add "Using with Claude Code" section pointing at the
  skills.sh installer.
- src/anonymizer/__init__.py — export PrivacyGoal at the top level
  (referenced from the skill's output template).
- docs/concepts/detection.md — minor wording polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
Surfaced during manual skill testing: the workflow jumped from "verify
install" straight to data inspection and only mentioned provider /
API-key setup in the reactive Troubleshooting section. The agent would
discover a missing provider only after the user had spent time on data
inspection and clarification, and then watch the preview fail.

- Extend step 1 to also verify provider config exists (API key env var
  + providers.yaml) and STOP with a pointer at docs/concepts/models.md
  if either is missing.
- Add a Clarify-step question that asks whether to use shipped defaults
  or a custom providers.yaml, so the generated script can pass the path
  via Anonymizer(model_providers=...).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
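A minimal sketch of the step-1 check that commit describes: verify an API key and a providers.yaml exist before any data work. The env-var name, default path, and function name are assumptions for illustration, not the skill's actual values:

```python
import os
from pathlib import Path

def provider_configured(api_key_var: str = "ANONYMIZER_API_KEY",
                        providers_path: str = "providers.yaml") -> tuple[bool, str]:
    """Return (ok, reason). All names here are illustrative stand-ins."""
    if not os.environ.get(api_key_var):
        return False, f"missing env var {api_key_var} -- see docs/concepts/models.md"
    if not Path(providers_path).exists():
        return False, f"missing {providers_path} -- see docs/concepts/models.md"
    return True, "ok"
```

With a check like this, the agent can STOP and surface the reason before step 2, instead of letting the user reach preview and watch it fail.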
Surfaced during manual skill testing: the script the agent generates
prints summary stats but doesn't persist any trace data. Investigating
'why was this entity kept/dropped?' or 'why did this row never converge
during repair?' required re-running an 8-minute preview, paying for
tokens again.

- Output template now writes result.trace_dataframe to preview.parquet
  on every preview run. trace_dataframe is a superset of the user-facing
  dataframe (it includes all internal columns).
- Single parquet file (not CSV + parquet) for format consistency.
  trace columns include dict/list values that don't round-trip through
  CSV cleanly.
- interactive.md step 6 deeper-inspection line updated to point at the
  saved file instead of suggesting an interactive Python re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
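The CSV round-trip problem that commit cites can be shown with the stdlib alone: a cell holding a dict is written as its string repr and read back as a plain string, so the structure is lost (pandas CSV I/O behaves the same way for object columns):

```python
import csv
import io

row = {"text": "hello", "trace": {"entity": "name", "kept": True}}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["text", "trace"])
writer.writeheader()
writer.writerow(row)  # the dict value is stringified on write

buf.seek(0)
restored = next(csv.DictReader(buf))
assert restored["trace"] == "{'entity': 'name', 'kept': True}"
assert not isinstance(restored["trace"], dict)  # structure is lost
```

This asymmetry is the format-consistency argument above for a single parquet file.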
@lipikaramaswamy lipikaramaswamy requested review from a team as code owners May 11, 2026 16:58
@greptile-apps
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR adds a Claude Code skill (skills/anonymizer/) with an interactive workflow, two new concept docs (choosing-a-strategy.md, troubleshooting.md), and promotes PrivacyGoal to the top-level public API. The docs are thorough, threshold tables are verified against source constants, and the interactive workflow's failure-first protocol is correctly sequenced.

  • skills/anonymizer/SKILL.md — New skill with a Python output template. Two runtime issues exist in the template: result.output_path (does not exist on AnonymizerResult) causes AttributeError on every --full run, and result.trace_dataframe.to_parquet is blocked by issue #152 ("result.trace_dataframe is not persistable via to_parquet", already tracked).
  • docs/concepts/choosing-a-strategy.md + docs/troubleshooting.md — New decision guide and symptom-first troubleshooting reference; all doc cross-links resolve and numeric thresholds match the implementation.
  • src/anonymizer/__init__.py — Clean export of PrivacyGoal at the package level.

Confidence Score: 4/5

Safe to merge for docs and the __init__.py export; the generated script template in SKILL.md has two broken lines that will surface as runtime errors on every user's first full run.

The result.output_path access in the --full branch will raise AttributeError on AnonymizerResult for every user who follows the skill's step 8, since that attribute does not exist on the result object. Combined with the already-tracked to_parquet issue (#152), the template fails on both the preview save and the full-run save path before any useful output reaches the user.

skills/anonymizer/SKILL.md — the Output Template's full-run branch and preview-save line both need fixing before users can complete the end-to-end workflow.

Important Files Changed

Filename Overview
skills/anonymizer/SKILL.md New Claude Code skill file with a Python output template. Template references result.output_path (doesn't exist on AnonymizerResult) and result.trace_dataframe.to_parquet (blocked by issue #152), both of which will raise runtime errors on first use.
skills/anonymizer/workflows/interactive.md New 8-step interactive workflow guide. Steps are logically sequenced; failure-first protocol is correctly emphasized; provider check added as step 1. No issues found.
src/anonymizer/__init__.py Adds PrivacyGoal to the public API surface. Import resolves correctly to anonymizer.config.rewrite.PrivacyGoal (a Pydantic BaseModel), and __all__ is updated. No issues.
docs/concepts/choosing-a-strategy.md New 215-line decision guide. Leakage/repair threshold tables verified against _RISK_TOLERANCE_BUNDLES in source — all values match. Relative doc links resolve correctly.
docs/troubleshooting.md New 210-line symptom-first troubleshooting guide. Threshold tables match source constants; relative links to concepts/models.md, concepts/choosing-a-strategy.md all resolve.
README.md Adds "Using with Claude Code" section with install command and usage note. No issues.
mkdocs.yml Adds two new docs to nav. File paths match the newly added files. No issues.
docs/concepts/detection.md Minor wording change — replaces "exactly the outcome you're trying to avoid" with "an undesired outcome". No functional impact.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User invokes /anonymizer] --> B[Step 1: Verify env\ncheck import + API key + providers.yaml]
    B -->|Missing| Z[STOP: walk user through setup]
    B -->|OK| C[Step 2: Inspect data\npandas read first rows]
    C --> D[Step 3: Clarify\ngoal · mode · domain labels · risk tolerance]
    D --> E[Step 4: Plan\nstate config intent, confirm with user]
    E --> F[Step 5: Build\nwrite anonymize_*.py from Output Template]
    F --> G[Step 6: Preview\npython script.py]
    G --> H{failed_records?}
    H -->|Yes| I[STOP: fix infra/rate-limit issue\ndocs/troubleshooting.md]
    I --> G
    H -->|No| J[Read leakage / utility summary\nload preview.parquet for trace]
    J --> K{Quality OK?}
    K -->|No| L[Step 7: Iterate\napply knob change, re-preview]
    L --> G
    K -->|Yes| M[Step 8: Finalize\npython script.py --full]


@lipikaramaswamy
Collaborator Author

Once #149 merges, this PR needs a follow-up edit to AGENTS.md:

  1. The dev-vs-user redirect at the top should also mention the bundled skill at skills/anonymizer/
  2. A note that public-API changes (especially imports referenced in skills/anonymizer/SKILL.md's output template) may require corresponding skill updates.

Comment thread docs/concepts/choosing-a-strategy.md Outdated
- The domain (clinical, legal, financial, customer support, etc.)
- The genre (notes, transcripts, opinions, biographies)
- Anything about the source the engine couldn't infer from a single record (e.g. "transcribed phone calls — expect disfluencies")
- *(Augmenter-only)* When you leave `entity_labels=None`, this is currently also the only place to nudge the augmenter LLM **away** from inventing labels you don't care about (e.g. "do not tag generic anatomical terms, medication class names, or job titles as PII"). Treat it as a soft do-not-tag list.
Contributor


I'm not following. I thought entity_labels=None means use the default list and the augmenter can add whatever other entities/labels it deems necessary.

Collaborator Author


What I was trying to say here is that when entity_labels=None, yes, the augmenter goes and finds whatever it deems necessary, but the only way to tell it what not to find is by saying so in data_summary. This is what I did for the nemotron logs.

Collaborator Author


Ok, rewrote the bullet to lead with what data_summary is for.

Comment thread docs/concepts/choosing-a-strategy.md Outdated
- Healthcare: `mrn`, `clinical_facility`, `diagnosis_code`, `medication_name`
- Legal: `case_number`, `court_name`, `docket_number`, `judge_name`
- Customer support: `ticket_id`, `internal_user_id`, `transaction_id`
- Internal: `employee_id`, `cost_center`, `internal_project_codename`
Contributor


mrn is in the default list, as is employee_id, and possibly case_number and court_name.

Collaborator Author


Good catch — fixed in de0ab26. Cross-checked all examples against the actual DEFAULT_ENTITY_LABELS list and dropped redundant ones (mrn, court_name, employee_id). Also switched all examples to snake_case to match the convention — the validator only strips/lowercases, so medical record number and medical_record_number would have been treated as different labels.

Note on case_number: verified it's not in DEFAULT_ENTITY_LABELS, so I kept that one
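A sketch of the normalization behavior described above; normalize_label is a hypothetical stand-in that mimics a strip-and-lowercase-only validator, not the project's actual code:

```python
def normalize_label(label: str) -> str:
    # Mimics the described validator: strip whitespace and lowercase --
    # and nothing else (spaces are NOT converted to underscores).
    return label.strip().lower()

assert normalize_label("  Case_Number ") == "case_number"
# The two spellings stay distinct, so they would register as different labels:
assert normalize_label("medical record number") != normalize_label("medical_record_number")
```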

| Is the goal "produce a privacy-safe version of this text that downstream models can train on"? | — | ✅ |
| Are there inferable / latent identifiers that aren't explicitly stated (e.g. "ringing the bell" → cancer treatment)? | ❌ leaves them | ✅ removes them |
| Cost per record | ~1 LLM call (Substitute) or 0 LLM calls (Redact/Annotate/Hash) | Many LLM calls (domain → disposition → QA → rewrite → evaluate → repair → judge) |
| Output text length | ≈ same as input | Often shorter / restructured |
Contributor


There's still an LLM call for redact/annotate/hash.

Collaborator Author


I had meant that to be the number of LLM calls over and above detection here. Rephrased to clarify.

- **Mode**: Replace (entities only) vs Rewrite (full text transformation that also removes inferable identifiers like "ringing the bell" → cancer)? Default per the rule in `SKILL.md`.
- **For Replace**: which strategy? (`Substitute` for realistic-looking, `Redact` for explicit `[REDACTED_…]` markers, `Hash` for stable cross-row identifiers, `Annotate` for inspection only).
- **For Rewrite**: what must be protected? what must be preserved? how strict (`risk_tolerance`)? Read [`docs/concepts/choosing-a-strategy.md`](../../../docs/concepts/choosing-a-strategy.md) sections 5–6 with the user's answers in mind.
- **Domain-specific entity labels** the defaults won't cover (e.g. `"medical record number"`, `"case number"`, `"internal project codename"`). If yes, read [`docs/concepts/choosing-a-strategy.md`](../../../docs/concepts/choosing-a-strategy.md) section 2.
Contributor


Again, medical record number and case number are bad examples

Collaborator Author


fixed in de0ab26

Comment thread skills/anonymizer/SKILL.md Outdated
# Usage Tips and Common Pitfalls

- **`Detect.entity_labels=None` (the default) is permissive** — the augmenter LLM may invent labels not in `DEFAULT_ENTITY_LABELS`. Setting an explicit list switches to **strict mode** where *only* the listed labels are detected. To add domain labels, *extend* the default, don't replace it: `entity_labels=DEFAULT_ENTITY_LABELS + ["medical record number", ...]`.
- **GLiNER is zero-shot** — entity labels are natural-language strings (`"medical record number"`, `"internal project codename"`), not codes or enum values. Any label you can describe in English is a label GLiNER can detect.
Contributor


Again, medical record number is in the default list
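The extend-don't-replace tip from the quoted SKILL.md excerpt, as a runnable sketch; the default-list contents here are invented placeholders, not the real DEFAULT_ENTITY_LABELS:

```python
DEFAULT_ENTITY_LABELS = ["person_name", "email_address", "phone_number"]  # placeholder values

# Extending keeps the defaults and adds domain labels ...
extended = DEFAULT_ENTITY_LABELS + ["internal_project_codename"]

# ... whereas replacing switches to strict mode with ONLY the listed
# labels, silently dropping detection of everything in the default list.
replaced = ["internal_project_codename"]

assert set(DEFAULT_ENTITY_LABELS) <= set(extended)
assert "person_name" not in replaced  # strict mode would miss person names
```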

lipikaramaswamy and others added 2 commits May 11, 2026 20:25
Address review feedback from @asteier2026 on PR #153:

- `mrn`, `court_name`, and `employee_id` were listed as examples of
  domain-specific labels to *extend* the default list with, but all
  three are already in DEFAULT_ENTITY_LABELS (as
  `medical_record_number`, `court_name`, and `employee_id`).
- Some examples used space-separated natural-language strings
  ("medical record number", "internal project codename") while the
  default list uses snake_case. Spaces are not normalized by the
  validator, so the two forms would map to different labels.

Swap in non-default snake_case examples (e.g. `clinical_facility`,
`diagnosis_code`, `case_number`, `internal_project_codename`) and add
one sentence noting the snake_case convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
- Rewrite the augmenter-steering bullet in active voice, leading with
  what data_summary actually is for. Drops the "(Augmenter-only)"
  parenthetical that was hiding the point.
- Rename the "Cost per record" comparison row to "Additional LLM calls
  (beyond shared detection)" to make explicit that detection runs in
  both modes and these numbers exclude it — so the 0 for
  Redact/Annotate/Hash is correct (no additional LLM work beyond
  detection).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lipikaramaswamy <lramaswamy@nvidia.com>
Comment on lines +144 to +147
if args.full:
    result = anonymizer.run(config=config, data=data)
    out_path = result.output_path
    print(f"Wrote {len(result.dataframe)} rows to {out_path}")
Contributor


P1 AnonymizerResult has no output_path attribute — it only exposes dataframe, trace_dataframe, and failed_records (see src/anonymizer/interface/results.py). Accessing result.output_path on the full-run path will raise AttributeError every time the user runs with --full, before any useful output is printed or saved.

Suggested change
-if args.full:
-    result = anonymizer.run(config=config, data=data)
-    out_path = result.output_path
-    print(f"Wrote {len(result.dataframe)} rows to {out_path}")
+if args.full:
+    result = anonymizer.run(config=config, data=data)
+    out_path = "output.parquet"  # TODO: change path/format (.csv, .jsonl) as needed
+    result.dataframe.to_parquet(out_path)
+    print(f"Wrote {len(result.dataframe)} rows to {out_path}")
