From 05ffdbda66aba1e6436ff2bddb8751bfc82a6e5b Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 18 Oct 2025 14:12:27 +0000 Subject: [PATCH 1/5] Initial plan From 6fd59279fa1a52c927647ee7f8a77c50eda8a1a0 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 18 Oct 2025 14:19:48 +0000 Subject: [PATCH 2/5] Improve README English: intro, capabilities, config, flags, and mask descriptions Co-authored-by: adrienaury <44274230+adrienaury@users.noreply.github.com> --- README.md | 197 +++++++++++++++++++++++++++--------------------------- 1 file changed, 98 insertions(+), 99 deletions(-) diff --git a/README.md b/README.md index c2bb7af3..88f1cf08 100755 --- a/README.md +++ b/README.md @@ -8,27 +8,27 @@ # PIMO : Private Input, Masked Output -PIMO is a tool for data masking. It can mask data from a JSONline stream and return another JSONline stream thanks to a masking configuration contained in a yaml file. +PIMO is a tool for data masking. It can mask data from a JSONline stream and return another JSONline stream using a masking configuration contained in a YAML file. ![pimo](doc/img/pimo.png) -You can use [LINO](https://github.com/CGI-FR/LINO) to extract sample data from a database, which you can then use as input data for PIMO's data masking. -You can also generate data with a simple yaml configuration file. +You can use [LINO](https://github.com/CGI-FR/LINO) to extract sample data from a database and then use it as input for PIMO's data masking. +You can also generate data using a simple YAML configuration file. 
**Capabilities** -- credibility : generated data is not distinguishable from real data -- data synthesis : generate data from nothing -- data masking, including - - randomization : protect personal or sensitive data by writing over it - - pseudonymization, on 3 levels - - consistent pseudonymisation : real value A is always replaced by pseudo-value X but X can be attributed to other values than A - - identifiant pseudonymisation : real value A is always replaced by pseudo-value X and X *CANNOT* be attributed to other values than A - - reversible pseudonymisation : real value A can be generated from pseudo-value X +- credibility: generated data is indistinguishable from real data +- data synthesis: generate data from scratch +- data masking, including: + - randomization: protect personal or sensitive data by overwriting it + - pseudonymization, at 3 levels: + - consistent pseudonymization: real value A is always replaced by pseudo-value X, but X can be attributed to other values besides A + - identifier pseudonymization: real value A is always replaced by pseudo-value X, and X *CANNOT* be attributed to any values other than A + - reversible pseudonymization: real value A can be generated from pseudo-value X -## Configuration file needed +## Configuration file required -PIMO requires a yaml configuration file to works. By default, the file is named `masking.yml` and is placed in the working directory. The file must respect the following format : +PIMO requires a YAML configuration file to work. By default, the file is named `masking.yml` and is located in the working directory. The file must follow this format: ```yaml version: "1" @@ -63,18 +63,17 @@ caches: ``` `version` is the version of the masking file. -`seed` is to give every random mask the same seed, it is optional and if it is not defined, the seed is derived from the current time to increase randomness. 
-`functions` is used to define the functions that can be used in the te mask `template`, `template-each`, `add`, and `add-transient`. -`masking` is used to define the pipeline of masks that is going to be applied. -`selector` is made of a jsonpath and a mask. -`jsonpath` defines the path of the entry that has to be masked in the json file. -`mask` defines the mask that will be used for the entry defined by `selector`. -`cache` is optional, if the current entry is already in the cache as key the associated value is returned without executing the mask. Otherwise the mask is executed and a new entry is added in the cache with the orignal content as `key` and the masked result as `value`. The cache have to be declared in the `caches` section of the YAML file. -`preserve` is optional, and is used to keep some values unmasked in the json file. Allowed `preserve` options are: `"null"` (null values), `"empty"` (empty string `""`), and -`"blank"` (both `empty` and `null` values). Additionally, `preserve` can be used with mask [`fromCache`](#fromCache) to preserve uncached values. (usage: `preserve: "notInCache"`) +`seed` provides a seed value for all random masks. It is optional; if not defined, the seed is derived from the current time to increase randomness. +`functions` defines custom functions that can be used in the `template`, `template-each`, `add`, and `add-transient` masks. +`masking` defines the pipeline of masks to be applied. +`selector` consists of a jsonpath and a mask. +`jsonpath` defines the path of the entry to be masked in the JSON file. +`mask` defines the mask to be used for the entry specified by `selector`. +`cache` is optional. If the current entry already exists in the cache as a key, the associated value is returned without executing the mask. Otherwise, the mask is executed and a new entry is added to the cache with the original content as the `key` and the masked result as the `value`. 
The cache must be declared in the `caches` section of the YAML file. +`preserve` is optional and is used to keep certain values unmasked in the JSON file. Allowed `preserve` options are: `"null"` (null values), `"empty"` (empty string `""`), and `"blank"` (both `empty` and `null` values). Additionally, `preserve` can be used with the [`fromCache`](#fromCache) mask to preserve uncached values. (usage: `preserve: "notInCache"`) `preserve-list` is optional and is used to ignore specific values -Multiple masks can be applied on the same jsonpath location, like in this example : +Multiple masks can be applied to the same jsonpath location, as shown in this example: ```yaml - selector: @@ -85,7 +84,7 @@ Multiple masks can be applied on the same jsonpath location, like in this exampl - remove: true ``` -Masks can be applied on multiple selectors, like in this example: +Masks can be applied to multiple selectors, as shown in this example: ```yaml - selectors: @@ -95,7 +94,7 @@ Masks can be applied on multiple selectors, like in this example: add: "hello" ``` -It is possible to define functions and reuse them later in the masks, like in this example: +It is possible to define functions and reuse them later in masks, as shown in this example: ```yaml functions: @@ -123,88 +122,88 @@ masking: ## Possible masks -The following types of masks can be used : +The following types of masks can be used: * Pure randomization masks - * [`regex`](#regex) is to mask using a regular expression given in argument. - * [`randomInt`](#randomint) is to mask with a random int from a range with arguments min and max. - * [`randomDecimal`](#randomdecimal) is to mask with a random decimal from a range with arguments min, max and precision. - * [`randDate`](#randdate) is to mask a date with a random date between `dateMin` and `dateMax`. - * [`randomDuration`](#randomduration) is to mask a date by adding or removing a random time between `Min` and `Max`. 
- * [`randomChoice`](#randomchoice) is to mask with a random value from a list in argument. - * [`weightedChoice`](#weightedchoice) is to mask with a random value from a list with probability, both given with the arguments `choice` and `weight`. - * [`randomChoiceInUri`](#randomchoiceinuri) is to mask with a random value from an external resource. - * [`randomChoiceInCSV`](#randomchoiceincsv) is to mask with a random value from an external CSV resource. - * [`transcode`](#transcode) is to mask a value randomly with character class preservation. - * [`timeline`](#timeline) to generate a set of dates related to each other (by rules and constraints) + * [`regex`](#regex) masks using a regular expression provided as an argument. + * [`randomInt`](#randomint) masks with a random integer from a range with arguments min and max. + * [`randomDecimal`](#randomdecimal) masks with a random decimal from a range with arguments min, max, and precision. + * [`randDate`](#randdate) masks a date with a random date between `dateMin` and `dateMax`. + * [`randomDuration`](#randomduration) masks a date by adding or removing a random time between `Min` and `Max`. + * [`randomChoice`](#randomchoice) masks with a random value from a list provided as an argument. + * [`weightedChoice`](#weightedchoice) masks with a random value from a list with probabilities, both provided with the arguments `choice` and `weight`. + * [`randomChoiceInUri`](#randomchoiceinuri) masks with a random value from an external resource. + * [`randomChoiceInCSV`](#randomchoiceincsv) masks with a random value from an external CSV resource. + * [`transcode`](#transcode) masks a value randomly while preserving character classes. + * [`timeline`](#timeline) generates a set of dates related to each other (by rules and constraints) * K-Anonymization - * [`range`](#range) is to mask a integer value by a range of value (e.g. replace `5` by `[0,10]`). 
- * [`duration`](#duration) is to mask a date by adding or removing a certain number of days. + * [`range`](#range) masks an integer value with a range of values (e.g., replace `5` with `[0,10]`). + * [`duration`](#duration) masks a date by adding or removing a certain number of days. * Re-identification and coherence preservation - * [`hash`](#hash) is to mask with a value from a list by matching the original value, allowing to mask a value the same way every time. - * [`hashInUri`](#hashinuri) is to mask with a value from an external resource, by matching the original value, allowing to mask a value the same way every time. - * [`hashInCSV`](#hashincsv) is to mask with a value from an external CSV resource, by matching the original value, allowing to mask a value the same way every time. - * [`fromCache`](#fromcache) is a mask to obtain a value from a cache. - * [`ff1`](#ff1) mask allows the use of FPE which enable private-key based re-identification. - * [`sha3`](#sha3) masks will apply a variable length cryptographic hash (SHAKE variable-output-length hash function defined by FIPS-202) and then apply a base-conversion to the output. + * [`hash`](#hash) masks with a value from a list by matching the original value, allowing the same value to be masked consistently. + * [`hashInUri`](#hashinuri) masks with a value from an external resource by matching the original value, allowing the same value to be masked consistently. + * [`hashInCSV`](#hashincsv) masks with a value from an external CSV resource by matching the original value, allowing the same value to be masked consistently. + * [`fromCache`](#fromcache) obtains a value from a cache. + * [`ff1`](#ff1) enables the use of FPE, which enables private-key based re-identification. + * [`sha3`](#sha3) applies a variable-length cryptographic hash (SHAKE variable-output-length hash function defined by FIPS-202) and then applies a base conversion to the output. 
* Formatting - * [`dateParser`](#dateparser) is to change a date format. - * [`template`](#template) is to mask a data with a template using other values from the jsonline. - * [`template-each`](#template-each) is like template but will apply on each value of an array. - * [`fromjson`](#fromjson) is to convert string field values to parsed JSON, e.g. "[1,2,3]" -> [1,2,3]. + * [`dateParser`](#dateparser) changes a date format. + * [`template`](#template) masks data using a template with other values from the jsonline. + * [`template-each`](#template-each) is like template but applies to each value of an array. + * [`fromjson`](#fromjson) converts string field values to parsed JSON, e.g., "[1,2,3]" -> [1,2,3]. * Data structure manipulation - * [`remove`](#remove) is to mask a field by completely removing it. - * [`add`](#add) is a mask to add a field to the jsonline. - * [`add-transient`](#add-transient) same as `add` but the field is not exported in the output jsonline. + * [`remove`](#remove) masks a field by completely removing it. + * [`add`](#add) adds a field to the jsonline. + * [`add-transient`](#add-transient) is the same as `add`, but the field is not exported in the output jsonline. * Others - * [`constant`](#constant) is to mask the value by a constant value given in argument. - * [`command`](#command) is to mask with the output of a console command given in argument. - * [`incremental`](#incremental) is to mask data with incremental value starting from `start` with a step of `increment`. - * [`sequence`](#sequence) generate sequenced IDs of any format. - * [`fluxUri`](#fluxuri) is to replace by a sequence of values defined in an external resource. - * [`replacement`](#replacement) is to mask a data with another data from the jsonline. - * [`pipe`](#pipe) is a mask to handle complex nested array structures, it can read an array as an object stream and process it with a sub-pipeline. - * [`apply`](#apply) process selected data with a sub-pipeline. 
- * [`partitions`](#partitions) will rely on conditions to identify specific cases. - * [`segments`](#segments) allow transformations on specific parts of a field's value using regular expressions subgroups captures. - * [`luhn`](#luhn) can generate valid numbers using the Luhn algorithm (e.g. french SIRET or SIREN). - * [`markov`](#markov) can generate pseudo text based on a sample text. - * [`findInCSV`](#findincsv) get one or multiple csv lines which matched with Json entry value from CSV files. - * [`xml`](#xml) can manipulate XML content within JSON values. - * [`log`](#log) will output a log. - -A full `masking.yml` file example, using every kind of mask, is given with the source code. - -In case two types of mask are entered with the same selector, the program can't extract the masking configuration and will return an error. The file `wrongMasking.yml` provided with the source illustrate that error. + * [`constant`](#constant) masks the value with a constant value provided as an argument. + * [`command`](#command) masks with the output of a console command provided as an argument. + * [`incremental`](#incremental) masks data with an incremental value starting from `start` with a step of `increment`. + * [`sequence`](#sequence) generates sequenced IDs of any format. + * [`fluxUri`](#fluxuri) replaces with a sequence of values defined in an external resource. + * [`replacement`](#replacement) masks data with another data field from the jsonline. + * [`pipe`](#pipe) handles complex nested array structures; it can read an array as an object stream and process it with a sub-pipeline. + * [`apply`](#apply) processes selected data with a sub-pipeline. + * [`partitions`](#partitions) relies on conditions to identify specific cases. + * [`segments`](#segments) allows transformations on specific parts of a field's value using regular expression subgroup captures. + * [`luhn`](#luhn) generates valid numbers using the Luhn algorithm (e.g., French SIRET or SIREN). 
+ * [`markov`](#markov) generates pseudo text based on a sample text. + * [`findInCSV`](#findincsv) retrieves one or multiple CSV lines that match the JSON entry value from CSV files. + * [`xml`](#xml) manipulates XML content within JSON values. + * [`log`](#log) outputs a log. + +A full `masking.yml` file example, using every kind of mask, is provided with the source code. + +If two mask types are entered with the same selector, the program cannot extract the masking configuration and will return an error. The file `wrongMasking.yml` provided with the source code illustrates this error. ## Usage -To use PIMO to mask a `data.json`, use in the following way : +To use PIMO to mask a `data.json` file, use the following command: ```bash ./pimo maskedData.json ``` -This takes the `data.json` file, masks the data contained inside it and put the result in a `maskedData.json` file. If data are in a table (for example multiple names), then each field of this table will be masked using the given mask. The following flags can be used: +This takes the `data.json` file, masks the data contained in it, and outputs the result to a `maskedData.json` file. If the data is in an array (for example, multiple names), then each field in the array will be masked using the specified mask. The following flags can be used: -* `--repeat=N` This flag will make pimo mask every input N-times (useful for dataset generation). -* `--skip-line-on-error` This flag will totally skip a line if an error occurs masking a field. -* `--skip-field-on-error` This flag will return output without a field if an error occurs masking this field. +* `--repeat=N` This flag makes PIMO mask every input N times (useful for dataset generation). +* `--skip-line-on-error` This flag skips a line entirely if an error occurs while masking a field. +* `--skip-field-on-error` This flag returns output without a field if an error occurs while masking that field. * `--skip-log-file ` Skipped lines will be written to ``. 
* `--catch-errors ` or `-e ` Equivalent to `--skip-line-on-error --skip-log-file `. -* `--empty-input` This flag will give PIMO a `{}` input, usable with `--repeat` flag. -* `--config=filename.yml` This flag allow to use another file for config than the default `masking.yml`. -* `--load-cache cacheName=filename.json` This flag load an initial cache content from a file (json line format `{"key":"a", "value":"b"}`). -* `--dump-cache cacheName=filename.json` This flag dump final cache content to a file (json line format `{"key":"a", "value":"b"}`). -* `--verbosity ` or `-v` This flag increase verbosity on the stderr output, possible values: none (0), error (1), warn (2), info (3), debug (4), trace (5). -* `--debug` This flag complete the logs with debug information (source file, line number). -* `--log-json` Set this flag to produce JSON formatted logs ([demo9](demo/demo9) goes deeper into logging and structured logging) -* `--seed ` Set this flage to declare seed in command line. -* `--mask` Declare a simple masking definition in command line (minified YAML format: `--mask "value={fluxUri: 'pimo://nameFR'}"`, or `--mask "value=[{add: ''},{fluxUri: 'pimo://nameFR'}]"` for multiple masks). For advanced use case (e.g. if caches needed) `masking.yml` file definition will be preferred. -* `--repeat-until ` This flag will make PIMO keep masking every input until the condition is met. Condition format is using [Template](https://pkg.go.dev/text/template). Last output verifies the condition. -* `--repeat-while ` This flag will make PIMO keep masking every input while the condition is met. Condition format is using [Template](https://pkg.go.dev/text/template). -* `--stats ` This flag either outputs run statistics to the specified file or send them to specified url (has to start with `http` or `https`). -* `--statsTemplate ` This flag will have PIMO use the value as a template to generate statistics. Please use go templating format to include statistics. 
To include them you have to specify them as `{{ .Stats }}`. (i.e. `{"software":"PIMO","stats":{{ .Stats }}}`) +* `--empty-input` This flag provides PIMO with a `{}` input, usable with the `--repeat` flag. +* `--config=filename.yml` This flag allows using another file for configuration instead of the default `masking.yml`. +* `--load-cache cacheName=filename.json` This flag loads an initial cache content from a file (JSON line format `{"key":"a", "value":"b"}`). +* `--dump-cache cacheName=filename.json` This flag dumps the final cache content to a file (JSON line format `{"key":"a", "value":"b"}`). +* `--verbosity ` or `-v` This flag increases verbosity on the stderr output. Possible values: none (0), error (1), warn (2), info (3), debug (4), trace (5). +* `--debug` This flag completes the logs with debug information (source file, line number). +* `--log-json` Set this flag to produce JSON-formatted logs ([demo9](demo/demo9) goes deeper into logging and structured logging). +* `--seed ` Set this flag to declare a seed on the command line. +* `--mask` Declare a simple masking definition on the command line (minified YAML format: `--mask "value={fluxUri: 'pimo://nameFR'}"`, or `--mask "value=[{add: ''},{fluxUri: 'pimo://nameFR'}]"` for multiple masks). For advanced use cases (e.g., if caches are needed), the `masking.yml` file definition is preferred. +* `--repeat-until ` This flag makes PIMO keep masking every input until the condition is met. Condition format uses [Template](https://pkg.go.dev/text/template). The last output verifies the condition. +* `--repeat-while ` This flag makes PIMO keep masking every input while the condition is met. Condition format uses [Template](https://pkg.go.dev/text/template). +* `--stats ` This flag either outputs run statistics to the specified file or sends them to the specified URL (must start with `http` or `https`). +* `--statsTemplate ` This flag makes PIMO use the value as a template to generate statistics. 
Use Go templating format to include statistics. To include them, specify them as `{{ .Stats }}`. (e.g., `{"software":"PIMO","stats":{{ .Stats }}}`) ### PIMO Play @@ -236,7 +235,7 @@ Please check the [demo folder](demo) for more advanced examples. regex: "0[1-7]( ([0-9]){2}){4}" ``` -This example will mask the `phone` field of the input jsonlines with a random string respecting the regular expression. +This example will mask the `phone` field of the input jsonlines with a random string that matches the regular expression. [Return to list of masks](#possible-masks) @@ -251,7 +250,7 @@ This example will mask the `phone` field of the input jsonlines with a random st constant: "Bill" ``` -This example will mask the `name` field of the input jsonlines with the value of the `constant` field. +This example will mask the `name` field of the input jsonlines with the value specified in the `constant` field. [Return to list of masks](#possible-masks) @@ -284,9 +283,9 @@ This example will mask the `name` field of the input jsonlines with random value randomChoiceInUri: "file://names.txt" ``` -This example will mask the `name` field of the input jsonlines with random values from the list contained in the name.txt file. The different URI usable with this selector are : `pimo`, `file` and `http`/`https`. +This example will mask the `name` field of the input jsonlines with random values from the list contained in the names.txt file. The different URI schemes supported by this selector are: `pimo`, `file`, and `http`/`https`. -A value can be injected in URI with the template syntax. For example, `file://name{{.gender}}.txt` select a line in `name_F.txt` if the current jsonline is `{gender : "F"}`. +A value can be injected into the URI using template syntax. For example, `file://name{{.gender}}.txt` selects a line from `nameF.txt` if the current jsonline is `{gender : "F"}`.
[Return to list of masks](#possible-masks) @@ -339,7 +338,7 @@ Here is a detailed breakdown of the example configuration: max: 32 ``` -This example will mask the `age` field of the input jsonlines with a random number between `min` and `max` included. +This example will mask the `age` field of the input jsonlines with a random number between `min` and `max`, inclusive. [Return to list of masks](#possible-masks) @@ -357,7 +356,7 @@ This example will mask the `age` field of the input jsonlines with a random numb precision: 2 ``` -This example will mask the `score` field of the input jsonlines with a random float between `min` and `max`, with the number of decimal chosen in the `precision` field. +This example will mask the `score` field of the input jsonlines with a random float between `min` and `max`, with the number of decimal places specified in the `precision` field. [Return to list of masks](#possible-masks) @@ -389,7 +388,7 @@ This example will mask the `name` field of the input jsonlines with the output o weight: 1 ``` -This example will mask the `surname` field of the input jsonlines with a random value in the `weightedChoice` list with a probability proportional at the `weight` field. +This example will mask the `surname` field of the input jsonlines with a random value from the `weightedChoice` list, with a probability proportional to the `weight` field. [Return to list of masks](#possible-masks) @@ -407,7 +406,7 @@ This example will mask the `surname` field of the input jsonlines with a random - "Sapphire City" ``` -This example will mask the `town` field of the input jsonlines with a value from the `hash` list. The value will be chosen thanks to a hashing of the original value, allowing the output to be always the same in case of identical inputs. +This example will mask the `town` field of the input jsonlines with a value from the `hash` list. 
The value will be chosen based on a hash of the original value, ensuring that identical inputs always produce the same output. [Return to list of masks](#possible-masks) @@ -422,7 +421,7 @@ This example will mask the `town` field of the input jsonlines with a value from hashInUri: "pimo://nameFR" ``` -This example will mask the `name` field of the input jsonlines with a value from the list nameFR contained in pimo, the same way as for `hash` mask. The different URI usable with this selector are : `pimo`, `file` and `http`/`https`. +This example will mask the `name` field of the input jsonlines with a value from the nameFR list contained in PIMO, in the same way as the `hash` mask. The different URI schemes supported by this selector are: `pimo`, `file`, and `http`/`https`. [Return to list of masks](#possible-masks) @@ -446,7 +445,7 @@ masking: trim: true # optional: trim space in values and headers, default: false ``` -The selected field's data will be masked with random values selected from a CSV file available at the specified URL (a GitHub gist in this case). The value will be chosen thanks to a hashing of the original value, allowing the output to be always the same in case of identical inputs. +The selected field's data will be masked with random values selected from a CSV file available at the specified URL (a GitHub gist in this case). The value will be chosen based on a hash of the original value, ensuring that identical inputs always produce the same output. See [RandomChoiceInCSV](#randomchoiceincsv) for a detailed breakdown of the example configuration. 
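
The hash-based masks described above (`hash`, `hashInUri`, `hashInCSV`) all rely on the same idea: derive an index from a hash of the original value, so identical inputs always map to the same replacement. The following is a minimal Python sketch of that idea only. The function name `consistent_choice` and the use of SHA-256 are assumptions made for illustration; the README does not state which hash function PIMO uses internally.

```python
import hashlib


def consistent_choice(original: str, candidates: list[str]) -> str:
    """Pick a candidate deterministically from a hash of the original value.

    Illustrative only: SHA-256 is an assumption, not PIMO's documented choice.
    """
    digest = hashlib.sha256(original.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a list index.
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]


if __name__ == "__main__":
    # Hypothetical replacement list; only "Sapphire City" appears in the README excerpt.
    towns = ["Emerald City", "Ruby City", "Sapphire City"]
    # The same input yields the same masked value on every call.
    print(consistent_choice("Paris", towns))
    print(consistent_choice("Paris", towns))
```

Because the mapping depends only on the input and the candidate list, two different inputs may receive the same replacement. This matches the "consistent pseudonymization" level listed under Capabilities, not the "identifier" level.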
From bd9cffe6b2111c45d9f06a5578da5a6d41d6ef87 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 18 Oct 2025 14:23:20 +0000 Subject: [PATCH 3/5] Improve README English: mask examples and descriptions (duration through sha3) Co-authored-by: adrienaury <44274230+adrienaury@users.noreply.github.com> --- README.md | 68 +++++++++++++++++++++++++++---------------------------- 1 file changed, 34 insertions(+), 34 deletions(-) diff --git a/README.md b/README.md index 88f1cf08..4e448f09 100755 --- a/README.md +++ b/README.md @@ -479,7 +479,7 @@ This example will mask the `date` field of the input jsonlines with a random dat duration: "-P2D" ``` -This example will mask the `last_contact` field of the input jsonlines by decreasing its value by 2 days. The duration field should match the ISO 8601 standard for durations. +This example will mask the `last_contact` field of the input jsonlines by decreasing its value by 2 days. The duration field must conform to the ISO 8601 standard for durations. [Return to list of masks](#possible-masks) @@ -496,7 +496,7 @@ This example will mask the `last_contact` field of the input jsonlines by decrea outputFormat: "01/02/06" ``` -This example will change every date from the date field from the `inputFormat` to the `outputFormat`. The format should always display the following date : `Mon Jan 2 15:04:05 -0700 MST 2006`. Either field is optional and in case a field is not defined, the default format is RFC3339, which is the base format for PIMO, needed for `duration` mask and given by `randDate` mask. It is possible to use the Unix time format by specifying `inputFormat: "unixEpoch"` or `outputFormat: "unixEpoch"`. +This example will convert every date in the date field from the `inputFormat` to the `outputFormat`. The format must always display the following reference date: `Mon Jan 2 15:04:05 -0700 MST 2006`. 
Both fields are optional; if a field is not defined, the default format is RFC3339, which is the base format for PIMO, required by the `duration` mask and produced by the `randDate` mask. You can use the Unix time format by specifying `inputFormat: "unixEpoch"` or `outputFormat: "unixEpoch"`. [Return to list of masks](#possible-masks) @@ -530,7 +530,7 @@ This example will mask the `date` field of the input jsonlines by decreasing its increment: 1 ``` -This example will mask the `id` field of the input jsonlines with incremental values. The first jsonline's `id` will be masked by 1, the second's by 2, etc... +This example will mask the `id` field of the input jsonlines with incremental values. The first jsonline's `id` will be masked to 1, the second to 2, and so on. [Return to list of masks](#possible-masks) @@ -546,9 +546,9 @@ This example will mask the `id` field of the input jsonlines with incremental va format: "ERR-0000" ``` -This example will generate the `id` field of the input jsonlines with sequenced values. The first jsonline's `id` will be masked by `ERR-0000`, the second's by `ERR-0001`, etc... +This example will generate the `id` field of the input jsonlines with sequenced values. The first jsonline's `id` will be masked to `ERR-0000`, the second to `ERR-0001`, and so on. -By default, the varying part of the ID is numbers, but this can be changed : +By default, the varying part of the ID consists of numbers, but this can be changed: ```yaml - selector: @@ -559,7 +559,7 @@ By default, the varying part of the ID is numbers, but this can be changed : varying: "ER" ``` -With this configuration, the first jsonline's `id` will be masked by `EEE-0000`, the second's by `EER-0000`, the third by `ERE-0000` etc... +With this configuration, the first jsonline's `id` will be masked to `EEE-0000`, the second to `EER-0000`, the third to `ERE-0000`, and so on. 
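
The `sequence` behavior described above can be read as a positional counter: every character of `format` that belongs to the `varying` alphabet is a digit position, and the counter is written in base `len(varying)`. The Python sketch below reproduces the documented outputs under that reading. The function name `sequence_id`, the default alphabet constant, and the wrap-around on overflow are assumptions for illustration, not PIMO's actual implementation.

```python
DEFAULT_VARYING = "0123456789"  # assumed default: decimal digits


def sequence_id(n: int, fmt: str, varying: str = DEFAULT_VARYING) -> str:
    """Return the n-th ID (0-based) for a format such as "ERR-0000"."""
    base = len(varying)
    chars = list(fmt)
    # Characters of the format found in the varying alphabet are counter digits.
    positions = [i for i, c in enumerate(chars) if c in varying]
    # Fill digits from the rightmost varying position to the leftmost.
    # Assumption: values beyond the available positions silently wrap around.
    for i in reversed(positions):
        n, digit = divmod(n, base)
        chars[i] = varying[digit]
    return "".join(chars)


if __name__ == "__main__":
    print([sequence_id(i, "ERR-0000") for i in range(2)])
    # ['ERR-0000', 'ERR-0001']
    print([sequence_id(i, "ERR-0000", varying="ER") for i in range(3)])
    # ['EEE-0000', 'EER-0000', 'ERE-0000']
```

With `varying: "ER"`, the digits `0000` are no longer part of the alphabet, so they stay fixed and only the three leading letters change.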
[Return to list of masks](#possible-masks) @@ -574,7 +574,7 @@ With this configuration, the first jsonline's `id` will be masked by `EEE-0000`, replacement: "name" ``` -This example will mask the `name4` field of the input jsonlines with the field `name` of the jsonline. This selector must be placed after the `name` selector to be masked with the new value and it must be placed before the `name` selector to be masked by the previous value. +This example will mask the `name4` field of the input jsonlines with the value of the `name` field from the jsonline. This selector must be placed after the `name` selector to use the masked value, or before the `name` selector to use the original value. [Return to list of masks](#possible-masks) @@ -589,7 +589,7 @@ This example will mask the `name4` field of the input jsonlines with the field ` template: "{{.surname}}.{{.name}}@gmail.com" ``` -This example will mask the `mail` field of the input jsonlines respecting the given template. In the `masking.yml` config file, this selector must be placed after the fields contained in the template to mask with the new values and before the other fields to be masked with the old values. In the case of a nested json, the template must respect the following example : +This example will mask the `mail` field of the input jsonlines using the given template. In the `masking.yml` config file, this selector must be placed after the fields referenced in the template to use the masked values, or before them to use the original values. For nested JSON structures, the template must follow this example: ```yaml - selector: @@ -598,9 +598,9 @@ This example will mask the `mail` field of the input jsonlines respecting the gi template: "{{.user.surname}}.{{.user.name}}@gmail.com" ``` -The format for the template should respect the `text/template` package : +The template format must comply with the `text/template` package: -The template mask can format the fields used. 
The following example will create a mail address without accent or upper case: +The template mask can format the fields used. The following example will create an email address without accents or uppercase letters: ```yaml - selector: @@ -636,12 +636,12 @@ Most masks will be available as functions in template in the form : MaskCapitali item: "value" ``` -This will affect every values in the array field. The field must be an array (`{"array": ["value1", "value2"]}`). -The `item` property is optional and defines the name of the current item in the templating string (defaults to "it"). There is another optional property `index`, if defined then a property with the given name will be available in the templating string (e.g. : `index: "idx"` can be used in template with `{{.idx}}`). +This will apply to every value in the array field. The field must be an array (`{"array": ["value1", "value2"]}`). +The `item` property is optional and defines the name of the current item in the template string (defaults to "it"). There is another optional property, `index`; if defined, a property with the given name will be available in the template string (e.g., `index: "idx"` can be used in the template as `{{.idx}}`). -The format for the template should respect the `text/template` package : +The template format must comply with the `text/template` package: -See also the [Template mask](#template) for other options, all functions are applicable on template-each. +See also the [Template mask](#template) for other options; all functions are applicable to template-each. [Return to list of masks](#possible-masks) @@ -656,7 +656,7 @@ See also the [Template mask](#template) for other options, all functions are app fromjson: "sourcefield" ``` -This example will mask the `targetfield` field of the input jsonlines with the parsed JSON from field `sourcefield` of the jsonline. 
This mask changes the type of the input string (`sourcefield`) : +This example will mask the `targetfield` field of the input jsonlines with the parsed JSON from the `sourcefield` field of the jsonline. This mask changes the type of the input string (`sourcefield`): * null : nil * string: string @@ -678,7 +678,7 @@ This example will mask the `targetfield` field of the input jsonlines with the p remove: true ``` -This field will mask the `useless-field` of the input jsonlines by completely deleting it. +This example will mask the `useless-field` of the input jsonlines by completely deleting it. [Return to list of masks](#possible-masks) @@ -693,11 +693,11 @@ This field will mask the `useless-field` of the input jsonlines by completely de add: "newvalue" ``` -This example will create the field `newField` containing the value `newvalue`. This value can be a string, a number, a boolean... +This example will create the field `newField` containing the value `newvalue`. This value can be a string, a number, a boolean, etc. -The field will be created in every input jsonline that doesn't already contains this field. +The field will be created in every input jsonline that doesn't already contain this field. -Note: add can contains template strings (see the [Template](#template) mask for more information). +Note: add can contain template strings (see the [Template](#template) mask for more information). [Return to list of masks](#possible-masks) @@ -712,13 +712,13 @@ Note: add can contains template strings (see the [Template](#template) mask for add-transient: "newvalue" ``` -This example will create the field `newField` containing the value `newvalue`. This value can be a string, a number, a boolean... It can also be a [template](#template). +This example will create the field `newField` containing the value `newvalue`. This value can be a string, a number, a boolean, etc. It can also be a [template](#template). 
-The field will be created in every input jsonline that doesn't already contains this field, and it will be removed from the final JSONLine output. +The field will be created in every input jsonline that doesn't already contain this field, and it will be removed from the final JSONLine output. -This mask is used for temporary field that is only available to other fields during the execution. +This mask is used for temporary fields that are only available to other fields during execution. -Note: add-transient can contains template strings (see the [Template](#template) mask for more information). +Note: add-transient can contain template strings (see the [Template](#template) mask for more information). [Return to list of masks](#possible-masks) @@ -733,7 +733,7 @@ Note: add-transient can contains template strings (see the [Template](#template) fluxURI: "file://id.csv" ``` -This example will create an `id` field in every output jsonline. The values will be the ones contained in the `id.csv` file in the same order as in the file. If the field already exist on the input jsonline it will be replaced and if every value of the file has already been assigned, the input jsonlines won't be modified. +This example will create an `id` field in every output jsonline. The values will be the ones contained in the `id.csv` file in the same order as in the file. If the field already exists in the input jsonline, it will be replaced. If all values from the file have been assigned, the input jsonlines will not be modified. [Return to list of masks](#possible-masks) @@ -750,9 +750,9 @@ This example will create an `id` field in every output jsonline. The values will reverse: false ``` -This example will replace the content of `id` field by the matching content in the cache `fakeId`. Cache have to be declared in the `caches` section. -Cache content can be loaded from jsonfile with the `--load-cache fakeId=fakeId.jsonl` option or by the `cache` option on another field. 
-If no matching is found in the cache, `fromCache` block the current line and the next lines are processing until a matching content go into the cache. +This example will replace the content of the `id` field with the matching content in the cache `fakeId`. The cache must be declared in the `caches` section. +Cache content can be loaded from a JSON file using the `--load-cache fakeId=fakeId.jsonl` option or via the `cache` option on another field. +If no match is found in the cache, `fromCache` blocks the current line, and subsequent lines are processed until matching content is added to the cache. A `reverse` option is available in the `caches` section to use the reverse cache dictionary. [Return to list of masks](#possible-masks) @@ -771,9 +771,9 @@ A `reverse` option is available in the `caches` section to use the reverse cache onError: "Invalid value = {{ .siret }}" # if set, this template will be executed on error ``` -This example will encrypt the `siret` column with the private key base64-encoded in the FF1_ENCRYPTION_KEY environment variable. Use the same mask with the option `decrypt: true` to re-identify the unmasked value. +This example will encrypt the `siret` column using the private key base64-encoded in the FF1_ENCRYPTION_KEY environment variable. Use the same mask with the option `decrypt: true` to re-identify the original value. -Characters outside of the domain can be preserved with `preserve: true` option. +Characters outside the domain can be preserved with the `preserve: true` option. Be sure to check [the full FPE demo](demo/demo7) to get more details about this mask. 
@@ -783,9 +783,9 @@ Be sure to check [the full FPE demo](demo/demo7) to get more details about this [![Try it](https://img.shields.io/badge/-Try%20it%20in%20PIMO%20Play-brightgreen)](https://cgi-fr.github.io/pimo-play/#c=G4UwTgzglg9gdgLgAQCICMKBQEQgCbJoBMAzJgLYCGEA1lHAOYKZJIC0SOANiAMYAuMMM1aikAKwjwADpX4ALZChBUoXLGKq0RYzvMokdupD0YLCRJAGIk+iPKSCkAOSQAjAJ78QEADRJeGC4uKGh4JDAfUP5KOF4QJFCkIgA9AApnACoAFgBKFmMkPBhVRFQABmISbIBWADYAdgAOAE4UawD4UDB+Rxh3agS0cqQAdygFIqgGCYgkcrYWzCA&i=N4KABGBECmC2CGBLANpAXFAdvW0B0AzgK4BO2uAAgCYD2CimeAxnZCAL5A) -The sha3 mask will apply a variable length cryptographic hash (SHAKE variable-output-length hash function defined by FIPS-202) and then apply a base-conversion to the output. +The sha3 mask applies a variable-length cryptographic hash (SHAKE variable-output-length hash function defined by FIPS-202) and then applies a base conversion to the output. -This is useful to mask any input data into a coherent and collision resistant ID. +This is useful for masking any input data into a coherent and collision-resistant ID. ```yaml version: "1" @@ -799,7 +799,7 @@ masking: domain: "0123456789" # convert to base 10 with digits 0-9 ``` -In this example, the email will be replaced with a 29-digit collision resistant number. The collision resistance will be considered very good if the number of ID generated is less than `2^(12*8/2)`. +In this example, the email will be replaced with a 29-digit collision-resistant number. The collision resistance will be considered very good if the number of IDs generated is less than `2^(12*8/2)`. An alternative configuration to the previous example is : @@ -815,9 +815,9 @@ masking: domain: "0123456789" # convert to base 10 with digits 0-9 ``` -Here the length parameter is not given, but with the `resistance` parameter set to 10M, the mask will calculate the minimum length required (6 bytes in this example because 2^(6*8/2) > 10M). 
+In this case, the length parameter is not given, but with the `resistance` parameter set to 10M, the mask will calculate the minimum required length (6 bytes in this example because 2^(6*8/2) > 10M). -It can be difficult to anticipate what will be the maximum identifier string length (in characters) because it depends to the `domain` and the value of the `length` parameter (which can be invisible in the masking configuration because it is deduced from the `resistance` parameter). Therefore an optional parameter named `maxstrlen` was created, it's only purpose is to inform with an error if the maximum length (in characters) of identifier that can be produced is greater than a threshold. +It can be difficult to anticipate the maximum identifier string length (in characters) because it depends on the `domain` and the value of the `length` parameter (which may be invisible in the masking configuration because it is deduced from the `resistance` parameter). Therefore, an optional parameter named `maxstrlen` was created; its only purpose is to raise an error if the maximum length (in characters) of identifiers that can be produced exceeds a threshold. 
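The arithmetic behind `length`, `resistance`, and `maxstrlen` can be checked with a short Python sketch. The helper names `min_length_bytes` and `max_id_chars` are hypothetical and this is not PIMO's code; it only reproduces the `2^(length*8/2)` rule and the base conversion described above, assuming the hash output is read as one unsigned integer.

```python
def min_length_bytes(resistance):
    """Smallest byte length L such that 2^(L*8/2) > resistance."""
    length = 1
    while 2 ** (length * 8 // 2) <= resistance:
        length += 1
    return length


def max_id_chars(length_bytes, domain):
    """Characters needed to write the largest `length_bytes`-byte value in base len(domain)."""
    largest = 2 ** (8 * length_bytes) - 1
    base = len(domain)
    digits = 0
    while largest > 0:
        largest //= base
        digits += 1
    return digits


print(min_length_bytes(10_000_000))      # 6, because 2^24 > 10M but 2^20 is not
print(max_id_chars(12, "0123456789"))    # 29, the 29-digit number from the first example
print(max_id_chars(6, "0123456789"))     # 15
```

Under these assumptions, a `maxstrlen` threshold below 15 would be too small for the `resistance: 10000000` example with a decimal domain.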
[Return to list of masks](#possible-masks) From bec6869b8d3e8130e1a02a1c193e69805b842f79 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 18 Oct 2025 14:27:22 +0000 Subject: [PATCH 4/5] Improve README English: remaining mask examples and timeline constraints Co-authored-by: adrienaury <44274230+adrienaury@users.noreply.github.com> --- README.md | 60 +++++++++++++++++++++++++++---------------------------- 1 file changed, 30 insertions(+), 30 deletions(-) diff --git a/README.md b/README.md index 4e448f09..ef8ebbe6 100755 --- a/README.md +++ b/README.md @@ -840,7 +840,7 @@ This mask will replace an integer value `{"age": 27}` with a range like this `{" [![Try it](https://img.shields.io/badge/-Try%20it%20in%20PIMO%20Play-brightgreen)](https://cgi-fr.github.io/pimo-play/#c=G4UwTgzglg9gdgLgAQCICMKBQEQgCbIAsATJgLYCGEA1lHAOYKZJIC0SOANiAMYAuMMExYikAYiR8AFlAhIADhWkKYdPnIFIKcLWDAUAnkhgAzBeAjwIzUUgBWluIunIUg+tqgAvJbDgQAOnkLKyxRShphWwkoPiQAdyhOTiQAIxAFKGC8SRhJKQyI2gZM4M46DLwQEzpYvzSQThh4m1F5LJAo23F8jMUwEDg4mFS7XjiACgokPBhKOgBKBKSU9K0eHhAIaFTuZeVpDJQAfRQkYAowKApdjLgKMhBW2zox-gAFS8G+V1Pn8KoxUY-1s7C440EXW6tgc8GcUlc90eYWhAMiINREhqjTwcjo0CqvXMkHgxlG4yQPG0DXWm22+BmUAG-E4Bgx0L4IDI8k4Sk6qAA3gLJLE9gEkRkAL6SlGosGNCFCdndWFOJQI1AQACuYAlstRSCKUINki5PL5riFIr4Yu1uoeUplyrYHAV-EhztEqvhri5FCS+tRRs9Igkh3OFE4WoyEAEAxy+KghPD-W+ZLecVktK2Oz20jAMC19CkqFO+W01A0eRTXyGSFe4z8IZYnO5vM5luFTXi4CQ4odSGlASt3d7ATtEsHkoAAlaAscArN5jppVggA&i=N4KABGBED2BODmBDAdgSwF6IC6usgzpAFxgDa4EYollkAJtALaKrLFQDGTADigJ4A6Lo0gAaCjUjcAprHx5CJcjRrUVk5IkbT2kADbS8iMRPVR8AV1ibtuxqj0HYJs7WnMHuyKZoBfca5qrpA2OiSQAFbQOgGu5lahunSIAG6ohLHB7ix6Xj6UvvkAuqb+pkG0DB5s4cK8yIIAZs6ZtDJyCuzKZhXqIVphUIh6LGytfZbWA3ayHKiyLnGQ2Z7h3q5lgfka0+GNenCoKIvBk4nhBvCwKHQnZsseuWvbYIXqJX4UJb5AA) -If the data structure contains arrays of object like in the example below, this mask can pipe the objects into a sub pipeline definition. 
+If the data structure contains arrays of objects as shown in the example below, this mask can pipe the objects into a sub-pipeline definition. **`data.jsonl`** @@ -912,9 +912,9 @@ masking: template: "{{ lower .name }}.{{ lower .surname }}@{{ ._.domain }}" ``` -In addition to the `injectParent` property, this mask also provide the `injectRoot` property to inject the whole structure of data. +In addition to the `injectParent` property, this mask also provides the `injectRoot` property to inject the entire data structure. -It is possible to simplify the `masking.yml` file by referencing an external yaml definition : +You can simplify the `masking.yml` file by referencing an external YAML definition: ```yaml version: "1" @@ -928,7 +928,7 @@ masking: file: "./masking-person.yml" ``` -Be sure to check [demo](demo/demo8) to get more details about this mask. +Be sure to check [demo8](demo/demo8) for more details about this mask. [Return to list of masks](#possible-masks) @@ -936,7 +936,7 @@ Be sure to check [demo](demo/demo8) to get more details about this mask. [![Try it](https://img.shields.io/badge/-Try%20it%20in%20PIMO%20Play-brightgreen)](https://cgi-fr.github.io/pimo-play/#c=G4UwTgzglg9gdgLgAQCICMKBQBbAhhAayjgHMFMkkBaJCEAGxAGMAXGMcyrpAKwngAOuFgAtkKKACNccLNzyFO3JLgED6ATyXKkAVzBRxAOgD09KWFxgNJhUVJUpMoxuz0sQA&i=N4KABGBECWBGCGA7SAuKkQF8g) -This mask helps you organize your masking configuration in different files, enablig reuse and mutualisation of masks. +This mask helps you organize your masking configuration across different files, enabling reuse and sharing of masks. ```yaml version: "1" @@ -967,7 +967,7 @@ The `luhn` mask can calculate the checksum for any value. In this example, the `siret` value will be appended with the correct checksum, to create a valid SIRET number (french business identifier). 
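For readers unfamiliar with the checksum itself, here is a minimal Python sketch of the standard decimal Luhn algorithm. It is an illustration of the algorithm, not PIMO's implementation; the function names are hypothetical, and the sample number is the usual textbook example rather than a real SIRET.

```python
def luhn_check_digit(payload):
    """Return the decimal Luhn check digit for a string of digits."""
    total = 0
    # Walk the payload from right to left. The rightmost payload digit is doubled
    # because the check digit will occupy the position to its right.
    for index, ch in enumerate(reversed(payload)):
        value = int(ch)
        if index % 2 == 0:
            value *= 2
            if value > 9:
                value -= 9  # same as adding the two digits of the doubled value
        total += value
    return str((10 - total % 10) % 10)


def luhn_is_valid(number):
    """Check that the last digit of `number` is the Luhn check digit of the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]


print(luhn_check_digit("7992739871"))   # 3
print(luhn_is_valid("79927398713"))     # True
```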
-The mask can be parametered to use a different universe of valid characters, internally using the [Luhn mod N](https://en.wikipedia.org/wiki/Luhn_mod_N_algorithm) algorithm. +The mask can be configured to use a different universe of valid characters, internally using the [Luhn mod N](https://en.wikipedia.org/wiki/Luhn_mod_N_algorithm) algorithm. ```yaml - selector: @@ -984,7 +984,7 @@ The mask can be parametered to use a different universe of valid characters, int [![Try it](https://img.shields.io/badge/-Try%20it%20in%20PIMO%20Play-brightgreen)](https://cgi-fr.github.io/pimo-play/#c=G4UwTgzglg9gdgLgAQCICMKBQBbAhhAayjgHMFMkkBaJCEAGxAGMAXGMcyrpAKwngAOuFgAtkKOLmwgs3PIU7ckeMARjBFS2lIGNxAqNhgIA9CcnSAYgCVZW5bgAeVaAC8QyNAAYgA&i=N4WAUABBBEB2CGBbAptAXDa4C+Q) -[Markov chains](https://en.wikipedia.org/wiki/Markov_chain#Markov_text_generators) produces pseudo text based on an sample text. +[Markov chains](https://en.wikipedia.org/wiki/Markov_chain#Markov_text_generators) produce pseudo text based on a sample text. **sample.txt** @@ -1005,7 +1005,7 @@ I need a cheese cake separator: " " ``` -This example will mask the surname comment of the input jsonlines with a random value comment generated by the markov mask with an order of `2`. The different possibilities generated from **sample.txt** will be : +This example will mask the comment field of the input jsonlines with a random value generated by the Markov mask with an order of `2`. The different possibilities generated from **sample.txt** will be: ```txt I want a cheese burger @@ -1014,7 +1014,7 @@ I want a cheese cake I need a cheese cake ``` -The `separator` field defines the way the sample text will be split (`""` for splitting into characters, `" "` for splitting into words) +The `separator` field defines how the sample text will be split (`""` for splitting into characters, `" "` for splitting into words). 
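To see why an order of `2` yields exactly those four sentences, the following Python sketch builds an order-2 chain and enumerates every sentence it can produce. It is only an illustration, not PIMO's implementation: the function names are hypothetical, and it assumes the sample consists of the two lines `I want a cheese burger` and `I need a cheese cake`, each treated as a separate sentence.

```python
from collections import defaultdict

START, END = "<start>", "<end>"


def build_chain(sentences, order=2, separator=" "):
    """Map each run of `order` tokens to the set of tokens that may follow it."""
    chain = defaultdict(set)
    for sentence in sentences:
        # An empty separator splits into characters, a space splits into words.
        tokens = sentence.split(separator) if separator else list(sentence)
        padded = [START] * order + tokens + [END]
        for i in range(len(padded) - order):
            chain[tuple(padded[i:i + order])].add(padded[i + order])
    return chain


def enumerate_outputs(chain, order=2, separator=" "):
    """List every sentence the chain can emit (only safe when the chain has no cycle)."""
    results = []

    def walk(state, produced):
        for nxt in sorted(chain[state]):
            if nxt == END:
                results.append(separator.join(produced))
            else:
                walk(state[1:] + (nxt,), produced + [nxt])

    walk((START,) * order, [])
    return sorted(results)


sample = ["I want a cheese burger", "I need a cheese cake"]
for line in enumerate_outputs(build_chain(sample)):
    print(line)
```

Both sample lines share the state `a cheese`, so either ending can follow either beginning. A real generator would pick one successor at random at each step instead of enumerating all of them.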
[Return to list of masks](#possible-masks) @@ -1022,7 +1022,7 @@ The `separator` field defines the way the sample text will be split (`""` for sp [![Try it](https://img.shields.io/badge/-Try%20it%20in%20PIMO%20Play-brightgreen)](https://cgi-fr.github.io/pimo-play/#c=G4UwTgzglg9gdgLgAQCICMKBQBbAhhAayjgHMFNMkkBiJAFQAsQkAXMXOCAYxgBNm8hJAHcoAGzFIwIAA5jcXZrglIuDXOy4twEJMV5RgUXgFdlYgJ6UaSAEIWk-AGa4TYlknJVaAWiQBVAAVAgFEAJQBhAEEAZRCkMRAWbUgRcUkAI2ZpOQUQXiQMh3Y4XhhsAODw6LiEpJSIa18EmGFwLnxmROSdNJUsqVl5RQKiqQ4yirFW9s66nsgmpD8DEigWXVF+7KG80eKJ8scoNY2lv3MkGBYmMFV1TQa+zOYTOBYYEzV86z8IEESWhgkC8VDBfgAVhB4DJcDdkChctoAPpwEzYLJgLBg8FIKEwuEMBH-MBQZTYsGCAignFsDjcPggZAAbwAvhRvPQmPcNAoUqp5BB-roOnBCsweHAnCcTNICh8eaRmDdmM5XO5xeojMDfkh-oCPmAaWD8XBYfDUEwAB7I4DKEwgClUKnGqh0zg8fiusFcQXC71UPzEGQmFgIgAMaAATABmAAsAFYAGwAdgAHABOKK2CIAERCADEnTiwZ8WCGw6hI7HE6nM9m84WsEtGFBdFTVBwkMpoeKkCZ-qwYHqTDJpELHHDcLr9SAgUbrDjTeaiagOmBeKj0Zjiy7F2D3Qyvfucb78P6T7jg6GI9H48n0xniyWrqGKwiAFRYIA&i=N4KABGBEAOA2CGAXApgfQHYFcC2AjZATpAFxQCCAQgLQCMATAMxUDCAIpADThQDOhAlvFgko9BgBYArADZajcXIlVKbTt0gALZAA9UANyGZkIyGRoAOAKKopFMqgBiFyajYB2N2oiQAxvAIAJhg4+ESkkGJSspGKCirsIAC+QA) -This mask produce a random string by preserving character classes from the original value. +This mask produces a random string while preserving character classes from the original value. **masking.yml** @@ -1036,7 +1036,7 @@ This mask produce a random string by preserving character classes from the origi output: "0123456789abcdef" ``` -This example will mask the original id value by replacing every characters from the `input` class by a random character from the `output` class. +This example will mask the original id value by replacing every character from the `input` class with a random character from the `output` class. 
```console $ echo '{"id": "1ef619-90F"}' | pimo @@ -1075,7 +1075,7 @@ By default, if not specified otherwise, these classes will be used (input -> out [![Try it](https://img.shields.io/badge/-Try%20it%20in%20PIMO%20Play-brightgreen)](https://cgi-fr.github.io/pimo-play/#c=G4UwTgzglg9gdgLgAQCICMKBQBbAhhAayjgHMFNMkkBaJCEAGxAGMAXGMcq7pAKwngAHXKwAWyFLgCuYjlh55CXHkmFhWUDfAjKVSAMRJWYXHAgAzDni1mklsEhB4oDJMFwMpICJT00kcLjYIBJOuC7yflQA7qIgiEgA5ADeyUgAdNKyDgA+SAAmIObSDKyoKEh52FIQrABKICQgAB4AsiLMoqgAegDauNQAXgC6ANQAAv1DYwA6M+lTI6MAJBUAvmuJvlFi8bpR-szarKasEqbwAJ7Y4-kwznDpR9iReobGphZWIrC29gVQEiaCBuDxeHxRWiBYISKD5V5+WJ7JKpDJZUQcSoFIolMooCpVGr1RotdqsTo9XoABmoAE4xqskBstgcjHFENsDrQjmYTnAzqgqULhSKETx3iYzPZrL8Qf8YLsHO5PN5OTwoUEQqgFXEwGKVLsOayqNzjqdznAri9MEA&i=N4KABGBECGCuAuALA9gJ0gLigUwLbQEsAbAAQBNl8CA7AOgGNLIQBfIA) -The partition mask will rely on conditions to identify specific cases and apply a defined list of masks for each case. Example configuration: +The partition mask relies on conditions to identify specific cases and applies a defined list of masks for each case. Example configuration: ```yaml - selector: @@ -1104,7 +1104,7 @@ The partition mask will rely on conditions to identify specific cases and apply [![Try it](https://img.shields.io/badge/-Try%20it%20in%20PIMO%20Play-brightgreen)](https://cgi-fr.github.io/pimo-play/#c=G4UwTgzglg9gdgLgAQCICMKBQBbAhhAayjgHMFNMkkBaJCEAGxAGMAXGMcq7pAKwngAHXKwAWyFFAAmWHnkJcedECWwg4rCIqVIwKkAA8JAPQAKACgD8pgDxNWrcBAB8AbQCC1AFoBdAN4AzAC+AJRWtlJQJFCabgAM1ACc-sEhACSyOkh4rMzilFlU9o6Q2oU0SABmlWhl5dwEIACeAGJgMNgAonDAEi0taAD6nQByAMIASgCapgAqAJIA8iODANKdU5n1PFIduMQS7gBCYwAinS0A4gAS8wBSqwAyALIji6YAihMAyrMAqgA1ADqAA0pl4tllItFNHUsrRqrUCtsGs02h1ur1UP0hqNJjMFss1htISjdngDqg4mgAEwBAAsAFYAGwAdgAHIlSdw4Htcvl6rRmPAIKxcBoJMRgLgGNIkNKGABXEBYIA&i=N4KABGBECWAmkC4oAUCCAhAwgRgEwGZIQBfIA) -The segments mask allow transformations on specific parts of a field's value. 
This mask will use regular expressions to capture subgroups and apply transformations to them individually. Example configuration: +The segments mask allows transformations on specific parts of a field's value. This mask uses regular expressions to capture subgroups and applies transformations to them individually. Example configuration: ```yaml - selector: @@ -1167,7 +1167,7 @@ echo '{"id":"hello"}' | pimo -vtrace [![Try it](https://img.shields.io/badge/-Try%20it%20in%20PIMO%20Play-brightgreen)](https://cgi-fr.github.io/pimo-play/#c=G4UwTgzglg9gdgLgAQCICMKBQBbAhhAayjgHMFMkkBaJCEAGxAGMAXGMcyrpAKwngAOuFgAtkKYgDMYWbnkIRO3aklwATNUnGzlNScTUBJOAGEAygDUlyrgFcwUcSJYsBigPTuSUCCwB03qK2AEa2dGBM8CwgcP6R2O64YNje9IwQ7mgAnAAswUySkgDMAKwADGVoIADsIMElRbgATLgAHME5OVXtTUwAbO5guADu7llNTRX5Zbh91UVqJUwgTWhoM7jqOSWdIGp9TDmVZZJ9rdXuAjAEINjwfkwQwDo2lCAAHrisALLCTGIUV7cR7AZAAcgA3hCABQGD5IPyoAAqAE8BCAkBgAJRIAA+SHoMGG4CQAF9SWDAUC3rEwCjxFC-Cw0SAAPpockvV48L5MJJqazUkHgxkAOVw2Ax+MJxLAZIpVOpMRYdIZEL8cAlUpl4E51K4ipsH3RrD24mEVEY+BYVHgIC5NhEIHU4GQKtsIENyhVUGwbrAHswQA&i=N4KABGBEAuCeAOBTA+gRkgLigMwJYCdFIAacKAOwEMBbIrSAY0v1vIBNF9IQBfIA) -This mask compares targeted values or combinations of values from a JSON Entry with values from a CSV file, inserting the matched CSV line into the designated field of the JSON entry. +This mask compares targeted values or combinations of values from a JSON entry with values from a CSV file, inserting the matched CSV line into the designated field of the JSON entry. ```json {"type_1": "fire", "name": "carmender"} @@ -1205,9 +1205,9 @@ masking: trim: true # optional: trim space in values and headers, default: false ``` -In this scenario, the `findInCSV` mask is applied to the "info" field in the JSON entry. The mask utilizes both exact matching and Jaccard similarity. The expected results passes to Jaccard similarity. The configuration `expected: "at-least-one"` will return the most similar CSV line which is then saved in the `info` field. 
If `expected: "many"` is used, Jaccard match will return all expected matched lines in order of similarity.Using `expected: "only-one"` result in an error if the match yields more than one line. Jaccard match offers flexibility in handling variations in the entry, such as differences in accents or letter case, by leveraging the Jaccard similarity metric. +In this scenario, the `findInCSV` mask is applied to the "info" field in the JSON entry. The mask uses both exact matching and Jaccard similarity. The expected results pass to the Jaccard similarity function. The configuration `expected: "at-least-one"` will return the most similar CSV line, which is then saved in the `info` field. If `expected: "many"` is used, Jaccard matching will return all matched lines in order of similarity. Using `expected: "only-one"` results in an error if the match yields more than one line. Jaccard matching offers flexibility in handling variations in the entry, such as differences in accents or letter case, by leveraging the Jaccard similarity metric. -Here is the result of excution: +Here is the result of execution: ```json { @@ -1275,7 +1275,7 @@ $ pimo --empty-input #### Constraints -`before` and `after` constraints can be set to create better timelines, for example : +`before` and `after` constraints can be set to create better timelines, for example: ```yaml - name: "begin" @@ -1290,11 +1290,11 @@ $ pimo --empty-input The dates `begin` and `end` will both be chosen from the same interval, but `end` will always be after `begin`. -To enforce this, the timeline mask will regerate all date until all constraints are met, up to 200 retries. If there is still unsatified contraints after 200 attempts, the mask will set the date to `null`. +To enforce this, the timeline mask will regenerate all dates until all constraints are met, up to 200 retries. If there are still unsatisfied constraints after 200 attempts, the mask will set the date to `null`. 
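The retry behaviour can be pictured with the following Python sketch. It is not PIMO's code: the function name `pick_dates` is hypothetical, both dates are drawn uniformly from one interval for simplicity, and the sketch assumes that only the constrained date (`end`) becomes `null` when the attempts are exhausted.

```python
import random
from datetime import datetime, timedelta


def pick_dates(interval_start, interval_end, rng, retries=200):
    """Draw (begin, end) from the same interval until end is after begin."""
    span_seconds = int((interval_end - interval_start).total_seconds())
    # One initial attempt plus up to `retries` regenerations of all dates,
    # so retries=0 means the constraint gets a single chance.
    for _ in range(retries + 1):
        begin = interval_start + timedelta(seconds=rng.randint(0, span_seconds))
        end = interval_start + timedelta(seconds=rng.randint(0, span_seconds))
        if end > begin:
            return begin, end
    # Constraint still unsatisfied: fall back to a null (None) value.
    return begin, None


rng = random.Random(42)  # seeded so that runs are repeatable
begin, end = pick_dates(datetime(2006, 1, 1), datetime(2010, 1, 1), rng)
print(begin, end)
```

With two independent uniform draws, each attempt succeeds roughly half of the time, so exhausting the attempts is only likely with tighter constraints or a very small `retry` value.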
-This default behavior can be changed with the following parameters : +This default behavior can be changed with the following parameters: -- `retry` sets the maximum number of retry (it can be set to `0` to disable retrying) +- `retry` sets the maximum number of retries (it can be set to `0` to disable retrying) ```yaml - timeline: @@ -1302,14 +1302,14 @@ This default behavior can be changed with the following parameters : name: "start" value: "2006-01-02T15:04:05Z" format: "2006-01-02" - retry: 0 # constraints will fail immediatly if not satisfied + retry: 0 # constraints will fail immediately if not satisfied ``` -- `onError` will change the default behavior that set date to `null` if contraints cannot be satified, following values are accepted : - - `default` : use a default value, this is the standard behavior when `onError` is unset (see next item for how to change the default value) - - `reject` : fail masking of the current line with an error +- `onError` changes the default behavior that sets the date to `null` if constraints cannot be satisfied. The following values are accepted: + - `default`: use a default value; this is the standard behavior when `onError` is unset (see next item for how to change the default value) + - `reject`: fail masking of the current line with an error - `onError` is defined on each constraint, for example : + `onError` is defined on each constraint, for example: ```yaml - name: "begin" @@ -1323,7 +1323,7 @@ This default behavior can be changed with the following parameters : onError: "reject" ``` -- `default` set the default value to use when an error occurs, if not set `null` value is the default +- `default` sets the default value to use when an error occurs; if not set, `null` is the default ```yaml - name: "begin" @@ -1339,9 +1339,9 @@ This default behavior can be changed with the following parameters : #### Epsilon -The `epsilon` parameter is the minimum period of time between two date to validate a constraint. 
+The `epsilon` parameter is the minimum period of time between two dates to validate a constraint. -It can be set globally on the timeline to make sure dates under constraints have a minimum amount of time between them. +It can be set globally on the timeline to ensure that dates under constraints have a minimum amount of time between them. ```yaml - timeline: @@ -1353,7 +1353,7 @@ It can be set globally on the timeline to make sure dates under constraints have epsilon: "P1Y" # minimum 1 year between dates (in constraints) ``` -For example this contraint will fail if begin is 2007-12-20 and end is 2008-05-21 (less than a year between dates). +For example, this constraint will fail if begin is 2007-12-20 and end is 2008-05-21 (less than a year between dates). ```yaml - name: "end" @@ -1363,7 +1363,7 @@ For example this contraint will fail if begin is 2007-12-20 and end is 2008-05-2 - after: "begin" ``` -It can be set locally on a single constraint (override global epsilon parameter). +It can be set locally on a single constraint (overriding the global epsilon parameter). ```yaml constraints: From affd7f96f1b391d524e8c2ddc93490cc57c12bf5 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 18 Oct 2025 14:29:49 +0000 Subject: [PATCH 5/5] Improve README English: XML, Parquet, pimo scheme, VS Code, and license sections Co-authored-by: adrienaury <44274230+adrienaury@users.noreply.github.com> --- README.md | 64 +++++++++++++++++++++++++++---------------------------- 1 file changed, 32 insertions(+), 32 deletions(-) diff --git a/README.md b/README.md index ef8ebbe6..2c38857e 100755 --- a/README.md +++ b/README.md @@ -220,7 +220,7 @@ Then go to [http://localhost:3010/](http://localhost:3010/) in your browser. ## Examples -This section will give examples for every types of mask. +This section provides examples for every type of mask. Please check the [demo folder](demo) for more advanced examples. 
@@ -313,14 +313,14 @@ The selected field's data will be masked with random values selected from a CSV Here is a detailed breakdown of the example configuration: -* selector: The jsonpath: "pokemon" line means that this masking configuration is meant to apply to the field named "pokemon" in the JSON data. +* selector: The `jsonpath: "pokemon"` line means this masking configuration applies to the field named "pokemon" in the JSON data. * mask: This defines the masking operation to be performed on the "pokemon" field. * randomChoiceInCSV: The mask will replace the value in the "pokemon" field with a random choice from the CSV file at the specified URL. -* uri: The location of the CSV file to use for replacement values, `file` and `http`/`https` schemes can be used. This parameter can be a template. +* uri: The location of the CSV file to use for replacement values; `file` and `http`/`https` schemes can be used. This parameter can be a template. * header: This optional parameter is set to true, meaning the CSV file contains a header line that names the fields. * separator: This optional parameter specifies that the CSV values are separated by a comma, which is the default separator in CSV files. -* comment: This optional parameter specifies that the CSV file may contain comments that start with a '#'. -* fieldsPerRecord: This optional parameter is set to 0, meaning the number of fields per record will be set to the number of fields in the first record by default. If negative, no check is made and records may have a variable number of fields. +* comment: This optional parameter specifies that the CSV file may contain comments starting with a '#'. +* fieldsPerRecord: This optional parameter is set to 0, meaning the number of fields per record will be set to the number of fields in the first record by default. If negative, no check is made, and records may have a variable number of fields. 
* trim: This optional parameter is set to true, meaning any spaces in values and headers in the CSV file will be trimmed. [Return to list of masks](#possible-masks) @@ -1415,12 +1415,12 @@ masking: - template: "{{index . \"date\"}}" ``` -This example masks the original attribute value with the specified template value. `jsonpath: "content"` point to the key in json that contains target XML content to be masked. The `masking` section applies all masks to the target attribute or tag in XML. +This example masks the original attribute value with the specified template value. `jsonpath: "content"` points to the key in the JSON that contains the target XML content to be masked. The `masking` section applies all masks to the target attribute or tag in the XML. -the parent object (a domain) will be accessible with the "_" variable name. -To use a parent value in template: `{{. + injectParentName + . + jsonKey}}` +The parent object (a domain) will be accessible using the "_" variable name. +To use a parent value in a template: `{{. + injectParentName + . + jsonKey}}` -For more infomation on pasing XML files. refer to [Parsing-XML-files](#parsing-xml-files) +For more information on parsing XML files, refer to [Parsing-XML-files](#parsing-xml-files). **`Output JSON`** @@ -1435,18 +1435,18 @@ For more infomation on pasing XML files. refer to [Parsing-XML-files](#parsing-x ### Parsing-XML-files -To use PIMO to masking data in an XML file, use in the following way : +To use PIMO to mask data in an XML file, use the following command: ```bash cat data.xml | pimo xml --subscriber parentTagName=MaskName.yml > maskedData.xml ``` -Pimo selects specific tags within a predefined parent tag to replace the text and store the entire data in a new XML file. These specific tags should not contain any other nested tags. +PIMO selects specific tags within a predefined parent tag to replace the text and stores the entire data in a new XML file. 
These specific tags should not contain any other nested tags. -To mask values of attributes, follow the rules to define your choice in jsonpath in masking.yml. +To mask values of attributes, follow these rules to define your selection in jsonpath in masking.yml: -* For attributes of parent tag, we use: `@attributeName` in jsonpath. -* For attributes of child tag, we use: `childTagName@attributeName` in jsonpath. +* For attributes of the parent tag, use: `@attributeName` in jsonpath. +* For attributes of a child tag, use: `childTagName@attributeName` in jsonpath. For example, consider an XML file named data.xml: @@ -1555,9 +1555,9 @@ After executing the command with the correct configuration, here is the expected ### Parsing Parquet files -Warning: parquet support is still an experimental feature, we are currently considering to migrate this feature to a new dataconnector type in LINO (might be dropped from PIMO in future releases) +Warning: Parquet support is still an experimental feature. We are currently considering migrating this feature to a new dataconnector type in LINO (it might be dropped from PIMO in future releases). -To mask data in a Parquet file using PIMO with the correct configuration option, follow this updated approach: +To mask data in a Parquet file using PIMO with the correct configuration option, use the following approach: ```bash pimo parquet data.parquet maskedData.parquet --config masking.yml @@ -1627,18 +1627,18 @@ This example demonstrates how to mask specific columns using PIMO, applying rand ## `pimo://` scheme -Pimo embed a usefule list of fake data. URIs that begin with a pimo:// sheme point to the pseudo files bellow. +PIMO embeds a useful list of fake data. URIs that begin with the pimo:// scheme point to the pseudo files below. 
 name       | description
 -----------|-----------------------------
-`nameEN`   | english female or male names
-`nameENF`  | english female names
-`nameENM`  | english male names
-`nameFR`   | french female or male names
-`nameFRF`  | french female names
-`nameFRM`  | french male names
-`surnameFR`| french surnames
-`townFR`   | french towns names
+`nameEN`   | English female or male names
+`nameENF`  | English female names
+`nameENM`  | English male names
+`nameFR`   | French female or male names
+`nameFRF`  | French female names
+`nameFRM`  | French male names
+`surnameFR`| French surnames
+`townFR`   | French town names

 The content of built-in lists are in [the `maskingdata` package](pkg/maskingdata)

@@ -1646,15 +1646,15 @@ The content of built-in lists are in [the `maskingdata` package](pkg/maskingdata

 PIMO can generate a Mermaid syntax flow chart to visualize the transformation process.

-for example the command `pimo flow masking.yml > masing.mmd` with that [masking.yml](masking.yml) file generate following chart :
+For example, the command `pimo flow masking.yml > masking.mmd` with that [masking.yml](masking.yml) file generates the following chart:

[![](https://mermaid.ink/img/pako:eNqtltFumzAUhl_F9RXRSESArQoXVbVQaZMWbUo77SJElQsOWAMbga02SvNEfYy92GxDUHDoBdoQCvZ_Dieffx9HOcCYJRgGcJez5zhDFQff1hEF8roitBR8Y-nHZAum0xsQi5qzAlezMmMUN3m1eEorVGZG8LFOm7i6-iFV6jWCa5ziF8vZzKfXWwtYG2e62E4O7nFy8I-TCL6aBedNPUyTZmCGNeEVE1zy3nzXj-37S6GoMBegpB62ElrYJaM1R5Rbn0meazqdbTI14ngSdwDFvWBxT84hmrBimTESY2tF4t_oz1turxDPCBbyWcU4z3FH6Q5iumM5UWoaJpUeo5z3CL9Kv1aEBsD9aIMVegmA52oq9aLJpLXxznkDznkXznndNhaFRLNwnDEQsorxbN_55A365I2lqkU10Fyt2iNrtZbtFyZpxnHS7msoSkY5uAULWw8TOZxr1lMpk7bTR-5rklS4rmecPVNzg89C_Z0-C7T4X1CdWXfyPKI8AUvC9_ZaPO2b0T0qy4xUWM-aDjivfNEKveC41SSIm9YrqUevhLNGDeXUUh-6V-eLa2fqzOUNHCfQN_jgyAv8fFjaQOepVnYd9_08vUb9vebaGnF8n_sDfe5f9Lnf_baWOYpxgeUJVHrX4_5gj_tjiQpEcgNIST0eJbQ4D7iQPNLlw2HWdunxOJOTZnSbqtxZzArNqSvNL45Jw_jP1f5Tqc6_RhxnX45q_hjL441ibth4HurZeR5o6UNRIU4YtaY_PjmhRuy9bqL2g0PI0IbyBMslJfI_wUG9FUGeyT6KYCCHCd4hkfMIRvQoU0WpuvkuIZxVMNihvMY2RIKz-z2NYcArgU9JIUFyfUWbdfwLdee3Rg)](https://mermaid.live/edit/#pako:eNqtltFumzAUhl_F9RXRSESArQoXVbVQaZMWbUo77SJElQsOWAMbga02SvNEfYy92GxDUHDoBdoQCvZ_Dieffx9HOcCYJRgGcJez5zhDFQff1hEF8roitBR8Y-nHZAum0xsQi5qzAlezMmMUN3m1eEorVGZG8LFOm7i6-iFV6jWCa5ziF8vZzKfXWwtYG2e62E4O7nFy8I-TCL6aBedNPUyTZmCGNeEVE1zy3nzXj-37S6GoMBegpB62ElrYJaM1R5Rbn0meazqdbTI14ngSdwDFvWBxT84hmrBimTESY2tF4t_oz1turxDPCBbyWcU4z3FH6Q5iumM5UWoaJpUeo5z3CL9Kv1aEBsD9aIMVegmA52oq9aLJpLXxznkDznkXznndNhaFRLNwnDEQsorxbN_55A365I2lqkU10Fyt2iNrtZbtFyZpxnHS7msoSkY5uAULWw8TOZxr1lMpk7bTR-5rklS4rmecPVNzg89C_Z0-C7T4X1CdWXfyPKI8AUvC9_ZaPO2b0T0qy4xUWM-aDjivfNEKveC41SSIm9YrqUevhLNGDeXUUh-6V-eLa2fqzOUNHCfQN_jgyAv8fFjaQOepVnYd9_08vUb9vebaGnF8n_sDfe5f9Lnf_baWOYpxgeUJVHrX4_5gj_tjiQpEcgNIST0eJbQ4D7iQPNLlw2HWdunxOJOTZnSbqtxZzArNqSvNL45Jw_jP1f5Tqc6_RhxnX45q_hjL441ibth4HurZeR5o6UNRIU4YtaY_PjmhRuy9bqL2g0PI0IbyBMslJfI_wUG9FUGeyT6KYCCHCd4hkfMIRvQoU0WpuvkuIZxVMNihvMY2RIKz-z2NYcArgU9JIUFyfUWbdfwLdee3Rg) ## Visual Studio Code -To integrate with Visual Studio Code (opens new window), download the [YAML 
extension](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml).
+To integrate with Visual Studio Code, download the [YAML extension](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml).

-Then, edit your Visual Studio Code settings `yaml.schemas` to containing the following configuration:
+Then, edit your Visual Studio Code settings `yaml.schemas` to contain the following configuration:

 ```json
 {
@@ -1664,7 +1664,7 @@ Then, edit your Visual Studio Code settings `yaml.schemas` to containing the fol
 }
 ```

-Using this configuration, the schema will be applied on every YAML file containing the word `masking`` in their name.
+Using this configuration, the schema will be applied to every YAML file containing the word `masking` in its name.

 ## Contributors

@@ -1672,7 +1672,7 @@ Using this configuration, the schema will be applied on every YAML file containi
 * Pôle Emploi
 * BGPN - Groupe La Poste

-## Licence
+## License

 Copyright (C) 2021 CGI France

@@ -1683,9 +1683,9 @@ the Free Software Foundation, either version 3 of the License, or

 PIMO is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 GNU General Public License for more details.

 You should have received a copy of the GNU General Public License
- along with PIMO. If not, see .
+along with PIMO. If not, see .