Skip to content

Latest commit

 

History

History
192 lines (135 loc) · 5.46 KB

File metadata and controls

192 lines (135 loc) · 5.46 KB

Anonymization

Cosmian KMS exposes a set of data anonymization methods through a dedicated REST endpoint. These methods are stateless, require no cryptographic key, and are available under /tokenize/{method} on any KMS instance built with the non-fips feature.

All endpoints accept a JSON body and return a JSON object with a single result field. Errors come back as HTTP 422 with { "code": 422, "message": "..." }.

The ckms tokenize CLI command wraps each of these endpoints if you prefer working from the command line.

[TOC]

Endpoints

Hash

POST /tokenize/hash

Produces an irreversible, base64-encoded hash of a string. Three algorithms are available: SHA2 (SHA-256), SHA3 (SHA3-256), and Argon2. A salt can be passed as a base64-encoded byte string; for Argon2 it is required.

{
  "data": "test sha2",
  "method": "SHA2"
}

Response:

{ "result": "Px0txVYqBePXWF5K4xFn0Pa2mhnYA/jfsLtpIF70vJ8=" }

Use Argon2 when the input has low entropy (passwords, short identifiers) and you need resistance to brute-force lookup. Use SHA2 or SHA3 for high-entropy values or when throughput matters.

Noise

POST /tokenize/noise

Adds random noise to a numeric or date value. Three distributions are supported: Gaussian, Laplace, and Uniform. You can parameterize the distribution either by mean and standard deviation, or by lower and upper bounds.

The data_type field controls what kind of input to expect: float, integer, or date. Dates must be in RFC 3339 format (2023-04-07T12:34:56Z); the timezone is preserved in the output.

{
  "data": 42000.0,
  "data_type": "float",
  "method": "Laplace",
  "mean": 0.0,
  "std_dev": 500.0
}
{
  "data": "2023-04-07T12:34:56+02:00",
  "data_type": "date",
  "method": "Uniform",
  "min_bound": -3600.0,
  "max_bound": 3600.0
}

Laplace noise is the standard choice for differential privacy budgets. Uniform noise is appropriate when you need a hard bound on the perturbation. Gaussian is a reasonable default for datasets where outliers are acceptable.

Word Mask

POST /tokenize/word-mask

Replaces every occurrence of a listed word with XXXX, case-insensitively, matching only on word boundaries.

{
  "data": "Confidential: contains -secret- documents",
  "words": ["confidential", "secret"]
}
{ "result": "XXXX: contains -XXXX- documents" }

Non-word characters surrounding a target word (dashes, punctuation, spaces) are preserved.

Word Tokenize

POST /tokenize/word-tokenize

Replaces listed words with random 16-byte hex tokens. All occurrences of the same word receive the same token within a single request, so joins across a single document remain consistent. Across separate requests, the tokens are different.

{
  "data": "confidential meeting with confidential sources",
  "words": ["confidential"]
}

The output might be:

{ "result": "3A9F1C8B2E7D4A60 meeting with 3A9F1C8B2E7D4A60 sources" }

Word Pattern Mask

POST /tokenize/word-pattern-mask

Replaces all substrings matching a regular expression with a fixed replacement string. The pattern is limited to 1024 characters.

{
  "data": "Call me at +33 6 12 34 56 78 or +1-555-123-4567",
  "pattern": "\\+[\\d\\s\\-]+",
  "replace": "[PHONE]"
}
{ "result": "Call me at [PHONE] or [PHONE]" }

Aggregate Number

POST /tokenize/aggregate-number

Rounds a number to the nearest power of ten. power_of_ten: 2 rounds to the nearest hundred; power_of_ten: -1 keeps one decimal place.

{
  "data": 1234567,
  "data_type": "integer",
  "power_of_ten": 3
}
{ "result": "1235000" }

data_type is either float or integer.

Aggregate Date

POST /tokenize/aggregate-date

Truncates a date to the specified precision. The time_unit field accepts Second, Minute, Hour, Day, Month, or Year. The timezone offset is preserved.

{
  "data": "2023-04-07T12:34:56+02:00",
  "time_unit": "Month"
}
{ "result": "2023-04-01T00:00:00+02:00" }

Scale Number

POST /tokenize/scale-number

Applies a z-score normalization followed by a linear transformation. Given the mean and standard deviation of the original distribution, each value is standardized to zero mean and unit variance, then multiplied by scale and shifted by translate.

{
  "data": 75.0,
  "data_type": "float",
  "mean": 50.0,
  "std_deviation": 15.0,
  "scale": 10.0,
  "translate": 100.0
}
{ "result": 116.6666666666667 }

data_type is either float or integer. For integers the result is rounded to the nearest integer.

Relation to Format-Preserving Encryption

The methods above are one-way or statistically approximate transformations. They do not require a key and cannot be reversed to recover the original value.

For reversible, key-based field-level encryption that preserves the format of the original value (digit strings stay digit strings, alphanumeric stays alphanumeric), see Format-Preserving Encryption (FPE FF1). FPE is accessed through the KMIP Encrypt and Decrypt operations, not through the /tokenize endpoint.

Property Anonymization (/tokenize) FPE (KMIP)
Reversible No Yes
Requires a key No Yes (256-bit AES)
Format-preserving Partial (noise, rounding) Exact
FIPS mode No No
Use case Analytics, data sharing Tokenized storage, round-trip