Contribution bounding with Group By privacy unit

# Context

**Prerequisites:** [PipelineDP terminology](https://pipelinedp.io/key-definitions/), especially privacy unit and partition key.

**Note**: We're interested in processing large datasets. Performing `group by key` on such a dataset requires sending all the data for a given key to one machine. This is called shuffling, and it is expensive. This task is about implementing a method that does 1 shuffle instead of 2 in one specific case.

One part of the anonymization pipeline is contribution bounding, i.e. limiting the contributions from one privacy unit. A common way to specify the bounds is with `max_partitions_contributed` and `max_contributions_per_partition`. Currently this is done with 2 samplings (each of which performs a `group by`):
1. Sample `max_contributions_per_partition` contributions per `(privacy_id, partition_key)` ([code](https://github.com/OpenMined/PipelineDP/blob/09b7aad5f26171f615f9c4d75a8de50c9e65c1bb/pipeline_dp/contribution_bounders.py#L74)) (i.e. with `group by` per `(privacy_id, partition_key)`)
1. Sample `max_partitions_contributed` partitions per `privacy_id` ([code](https://github.com/OpenMined/PipelineDP/blob/09b7aad5f26171f615f9c4d75a8de50c9e65c1bb/pipeline_dp/contribution_bounders.py#L91)) (i.e. with `group by` per `privacy_id`)
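
To make the cost concrete, here is a minimal pure-Python sketch of the two-stage scheme. It is not the PipelineDP implementation: `sample_fixed_per_key` and `two_stage_bound` are hypothetical helpers written for this illustration, the in-memory dictionary stands in for a distributed shuffle, and the per-partition aggregation step is left out. The point is only that the data is regrouped twice, the second time by privacy unit so that partitions can be sampled.

```python
import collections
import random


def sample_fixed_per_key(pairs, n, rng):
    """Groups (key, value) pairs by key and keeps at most n values per key.

    In a distributed pipeline each call like this costs one shuffle.
    """
    groups = collections.defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {
        key: (rng.sample(values, n) if len(values) > n else values)
        for key, values in groups.items()
    }


def two_stage_bound(rows, max_partitions_contributed,
                    max_contributions_per_partition, seed=0):
    """rows: iterable of (privacy_id, partition_key, value) tuples."""
    rng = random.Random(seed)
    # Shuffle 1: bound the contributions inside each (privacy_id, partition_key).
    stage1 = sample_fixed_per_key(
        (((pid, pk), value) for pid, pk, value in rows),
        max_contributions_per_partition, rng)
    # Shuffle 2: regroup and bound the number of partitions per privacy unit.
    stage2 = sample_fixed_per_key(
        ((pid, (pk, values)) for (pid, pk), values in stage1.items()),
        max_partitions_contributed, rng)
    return {(pid, pk): values
            for pid, partitions in stage2.items()
            for pk, values in partitions}
```
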

Another way is to group by `privacy_id` once and do both samplings in memory (i.e. with only 1 shuffle).

# Goal 
Implement contribution bounding with a single group by `privacy_id`, doing both samplings in memory.
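
As a rough illustration of the idea (not a proposed implementation), the sketch below groups rows once by privacy unit and then applies both bounds to that unit's data in memory. `bound_in_memory` and `one_shuffle_bound` are hypothetical names, and plain Python containers stand in for the pipeline backend.

```python
import collections
import random


def bound_in_memory(contributions, max_partitions_contributed,
                    max_contributions_per_partition, rng):
    """Bounds the contributions of ONE privacy unit, entirely in memory.

    contributions: list of (partition_key, value) for a single privacy unit.
    Returns {partition_key: [values]} that respects both bounds.
    """
    by_partition = collections.defaultdict(list)
    for partition_key, value in contributions:
        by_partition[partition_key].append(value)

    # Cross-partition bound: keep at most max_partitions_contributed partitions.
    partitions = list(by_partition)
    if len(partitions) > max_partitions_contributed:
        partitions = rng.sample(partitions, max_partitions_contributed)

    # Per-partition bound: keep at most max_contributions_per_partition values.
    bounded = {}
    for partition_key in partitions:
        values = by_partition[partition_key]
        if len(values) > max_contributions_per_partition:
            values = rng.sample(values, max_contributions_per_partition)
        bounded[partition_key] = values
    return bounded


def one_shuffle_bound(rows, max_partitions_contributed,
                      max_contributions_per_partition, seed=0):
    """rows: iterable of (privacy_id, partition_key, value) tuples."""
    rng = random.Random(seed)
    # The only group-by: collect everything that belongs to one privacy unit.
    per_unit = collections.defaultdict(list)
    for pid, partition_key, value in rows:
        per_unit[pid].append((partition_key, value))

    result = {}
    for pid, contributions in per_unit.items():
        bounded = bound_in_memory(contributions, max_partitions_contributed,
                                  max_contributions_per_partition, rng)
        for partition_key, values in bounded.items():
            result[(pid, partition_key)] = values
    return result
```

The trade-off is that all rows of a privacy unit have to fit on one worker, which is why some cap on the per-unit data size is needed.
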

**Note:** Since one privacy unit can contain too many data points, we can cap the number of data points per privacy unit at some large constant, for example `10**7`.
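
The issue does not prescribe how to apply this cap. One possible approach, shown here only as a sketch, is reservoir sampling (Algorithm R), which keeps a uniform sample of at most `cap` items while reading the unit's data once. `cap_contributions` and `MAX_ROWS_PER_PRIVACY_UNIT` are hypothetical names used for illustration.

```python
import random

# Example value taken from the note above; the right constant is a tuning choice.
MAX_ROWS_PER_PRIVACY_UNIT = 10**7


def cap_contributions(contributions, cap=MAX_ROWS_PER_PRIVACY_UNIT, rng=None):
    """Returns a uniform random sample of at most `cap` items.

    Reads `contributions` once and never holds more than `cap` items.
    """
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(contributions):
        if index < cap:
            # Fill the reservoir with the first `cap` items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability cap / (index + 1).
            slot = rng.randint(0, index)  # both ends inclusive
            if slot < cap:
                reservoir[slot] = item
    return reservoir
```

With a cap this large, most privacy units would pass through unchanged, and only extreme outliers would be downsampled before the in-memory bounding.
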

# Code pointers
1. [ContributionBounder](https://github.com/OpenMined/PipelineDP/blob/09b7aad5f26171f615f9c4d75a8de50c9e65c1bb/pipeline_dp/contribution_bounders.py#L25) is the abstract base class for ContributionBounders.
1. [SamplingCrossAndPerPartitionContributionBounder](https://github.com/OpenMined/PipelineDP/blob/09b7aad5f26171f615f9c4d75a8de50c9e65c1bb/pipeline_dp/contribution_bounders.py#L56) is the class that does the current 2-stage sampling.
1. [SamplingPerPrivacyIdContributionBounder](https://github.com/OpenMined/PipelineDP/blob/09b7aad5f26171f615f9c4d75a8de50c9e65c1bb/pipeline_dp/contribution_bounders.py#L108C6-L108C46) is a class that samples a fixed number of contributions per privacy unit (it is mostly useful as an example).
1. [Tests for contribution bounders](https://github.com/OpenMined/PipelineDP/blob/main/tests/contribution_bounders_test.py)
1. [Contribution bounder creation](https://github.com/OpenMined/PipelineDP/blob/09b7aad5f26171f615f9c4d75a8de50c9e65c1bb/pipeline_dp/dp_engine.py#L375)
