docs: define sensitivity and protection method #151
Greptile Summary
This PR adds a "Key concepts" section to
Confidence Score: 4/5
Documentation-only change with no functional impact; safe to merge after addressing the missing sensitivity level details.
The added section is accurate and coherent, but the Sensitivity definition doesn't name the discrete levels (high/medium/low) that already appear in the leakage mass formula and output columns table. A reader following the document top-to-bottom will encounter those terms without having been introduced to them.
docs/concepts/rewrite.md, specifically the Sensitivity definition and the dense Protection method paragraph.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Entity detected] --> B{Sensitivity assigned\nhigh / medium / low}
B --> C{Requires protection?}
C -- No --> D[Left as-is]
C -- Yes --> E{Entity type}
E -- Direct identifier --> F[Replace with synthetic alternative]
E -- Quasi-identifier --> G[Generalize to broader form\ne.g. date → quarter, city → region]
E -- Latent entity --> H[suppress_inference:\nrewrite surrounding text]
E -- Cannot preserve meaning --> I[Remove outright]
F & G & H & I --> J[Leakage scoring system]
B --> J
Reviews (1): Last reviewed commit: "feature: add docs for sensitivity and pr..."
## Key concepts
**Sensitivity** measures the intrinsic re-identification damage an entity causes if it appears in the output, independently of what else is retained. It is not the protection decision; it feeds the downstream leakage scoring system.
Sensitivity definition omits discrete levels used throughout the document
The Sensitivity definition describes the concept in general terms but doesn't mention the concrete levels (high, medium, low) that the rest of the document already relies on. The leakage mass formula further down explicitly references `high=1.0`, `medium=0.6`, `low=0.3`, and the Output columns table references `any_high_leaked`. A reader encountering those terms after only having read this definition will not know what discrete values exist or how they are distinguished.
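To make the point concrete, here is a minimal Python sketch of the three levels and the weights quoted above. The names `Sensitivity`, `SENSITIVITY_WEIGHT`, and the helper function are hypothetical, and the helper is only one plausible reading of the `any_high_leaked` column. The leakage mass formula itself is not shown in this thread, so the sketch does not attempt to reproduce it.

```python
from enum import Enum


class Sensitivity(Enum):
    """Discrete sensitivity levels named in the review comment."""
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Weights as quoted from the leakage mass formula (high=1.0, medium=0.6, low=0.3).
# How the weights are combined into a score is not shown here.
SENSITIVITY_WEIGHT = {
    Sensitivity.HIGH: 1.0,
    Sensitivity.MEDIUM: 0.6,
    Sensitivity.LOW: 0.3,
}


def any_high_leaked(leaked_levels):
    """Assumed meaning of the any_high_leaked column:
    True if at least one leaked entity has high sensitivity."""
    return any(level is Sensitivity.HIGH for level in leaked_levels)
```

Naming the levels once in the definition, in whatever form, would let the formula and the output column refer back to them.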
**Protection method** describes how a sensitive entity is transformed. The choice reflects a holistic view of the document: what other entities are being protected, and how, shapes what each individual entity needs. The general defaults are: direct identifiers are replaced with plausible synthetic alternatives, quasi-identifiers are generalized to a broader form (e.g., an exact date becomes a quarter, a city becomes a region), and latent entities receive `suppress_inference`, meaning the surrounding text is rewritten to remove the cues that enable the inference rather than replacing a stated value. Entities that do not require protection are left as-is. Occasionally an entity is removed outright when neither replacement nor generalization can preserve meaning without retaining the identifying detail.
The "Protection method" paragraph bundles five distinct transformation rules into a single long run-on sentence. Splitting the default behaviors into a bullet list makes each rule scannable and easier to compare at a glance.
**Protection method** describes how a sensitive entity is transformed. The choice reflects a holistic view of the document: what other entities are being protected, and how, shapes what each individual entity needs. The general defaults are:

- **Direct identifiers** are replaced with plausible synthetic alternatives.
- **Quasi-identifiers** are generalized to a broader form (e.g., an exact date becomes a quarter, a city becomes a region).
- **Latent entities** receive `suppress_inference`: the surrounding text is rewritten to remove the cues that enable the inference rather than replacing a stated value.
- **Low-risk entities** that do not require protection are left as-is.
- Occasionally an entity is **removed outright** when neither replacement nor generalization can preserve meaning without retaining the identifying detail.
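The defaults listed above can also be read as a small decision function. The sketch below is illustrative only: `suppress_inference` is the one label taken from the documentation, while `EntityKind`, `default_protection`, and the other returned strings are hypothetical names. It covers only the general defaults; the document-level judgment described in the paragraph may override them.

```python
from enum import Enum


class EntityKind(Enum):
    """Entity categories named in the Protection method paragraph."""
    DIRECT_IDENTIFIER = "direct_identifier"
    QUASI_IDENTIFIER = "quasi_identifier"
    LATENT = "latent"


def default_protection(kind, requires_protection, meaning_preservable=True):
    """Return the default protection method for one entity.

    meaning_preservable is an assumed flag: False when neither replacement
    nor generalization can keep the meaning without the identifying detail.
    """
    if not requires_protection:
        return "leave_as_is"           # hypothetical label
    if kind is EntityKind.LATENT:
        return "suppress_inference"    # label used in the docs
    if not meaning_preservable:
        return "remove"                # hypothetical label
    if kind is EntityKind.DIRECT_IDENTIFIER:
        return "replace"               # hypothetical label
    return "generalize"                # hypothetical label


# Example: a quasi-identifier such as an exact date is generalized by default.
print(default_protection(EntityKind.QUASI_IDENTIFIER, requires_protection=True))
```

Each branch corresponds to one bullet, which is the same scannability argument made for the list form.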
Changes include: