
Commit e4fac70

Docs update (#3)
* better introduction
* Add ERD to categorisation
* describe new schema
* removed references to iteratively decomposable
* correction of algorithm description in index
* a professional tone because I am a professional
1 parent 3dd99ea commit e4fac70

2 files changed

Lines changed: 224 additions & 103 deletions

File tree

src/Categorisation.md

Lines changed: 207 additions & 102 deletions
@@ -4,107 +4,212 @@ style: entrust-style.css
44
title: Categorisation of Analysis methods
55
---
66
# Categorising analysis types
7-
This is a slightly sensitive work in progress, so in lieu of a proper explanation, I'll describe the parts of the JSON schema used to describe federated analyses.
87

9-
## Name
10-
The top level name is the name or short description of the output of the analysis, e.g. Mean.
8+
The aim of representing possible federated analysis methods is to make it easier for people involved in federated research to reason and communicate about what methods are possible.
9+
This is aimed at three kinds of people, and ultimately we aim to organise the information differently according to their concerns:
10+
11+
1. **Federation administrators**: people responsible for coordinating a federation of TREs. Their main concern is whether the way their federation is organised will allow an analysis.
12+
2. **TRE administrators**: people responsible for managing and operating a TRE. Their main concern is whether an analysis is (i) technically possible and (ii) permissible according to their capabilities and governance.
13+
3. **Researchers**: people who want to submit analyses to TREs within a federation in which they are approved members of a project. Their main concern is whether they can do the analysis they want.
1114

12-
## Description
13-
A slightly longer description of the output.
14-
15-
## Aliases
16-
Some statistics get a few names.
17-
The one we are most familiar with gets top billing, the runners up go here.
18-
19-
## Tags
20-
The output might fall within some higher level categories, so we have them as tags here.
21-
22-
## Output
23-
The actual data that an analysis returns overall, and is what is output by the overall federated analysis needs a bit more description.
24-
25-
### Data type
26-
Each analysis has a different disclosure risk, but the data type can give a shorthand.
27-
This is one of:
28-
29-
- Scalar
30-
- Vector
31-
- Matrix
32-
- Higher-order
33-
- text
34-
- graphic
35-
- table
36-
37-
### Disclosure risk
38-
This is optional, but there's a place in here to put some assessment of the disclosure risk.
39-
Filling these in will require collaboration with statistical disclosure experts.
40-
41-
### Disclosure mitigation
42-
There are methods that can be used to mitigate disclosure risk.
43-
There's a free-text space to describe applicable methods.
44-
45-
## Algorithms
46-
The algorithm used to calculate the analysis is a separate concern to the final output.
47-
Different algorithms you can use to compute the analysis get their own entries.
48-
49-
### Decomposability
50-
Different analyses can be broken down into sub-tasks (decomposed) in different ways by different algorithms.
51-
The categories here are:
52-
53-
- **Fully decomposable** Some summary statistic that can be calculated by each client that can be combined with others to calculate the final statistic exactly
54-
- **Iteratively decomposable** The computation can be decomposed, but not with a single aggregation operation, requiring multiple rounds of computation and communication.
55-
- **Approximately decomposable** An approximation of the desired statistic can be calculated by aggregating summary statistics
56-
- **Non-decomposable** The computation cannot be decomposed into independently computable parts
57-
58-
### Trust requirements
59-
Independent of the output's disclosure risk, TREs may be concerned about the level of trust required to perform some federated analysis and want to know what data need to be shared.
60-
61-
#### Aggregator
62-
We recognise that the level of trust required of a central aggregator might be different to the level of trust required of nodes in the network generally, so the data shared with the aggregator is distinct.
63-
64-
#### Other clients
65-
Other clients in the network may see less data (often none) than the aggregator.
66-
67-
### Communication
68-
The capabilities of a TRE network to do different kinds of communication may limit the kinds of algorithm that can be performed.
69-
70-
#### Rounds
71-
Some kinds of analysis may only require a single round of communication, sending summary stats to an aggregator.
72-
Others may iterate for a fixed, known number of rounds, while others will need to iterate for a number of rounds that cannot be known ahead of time.
73-
This does not describe what kind of computation is happening in the clients at each round, which could be repeating one operation, or doing something different each time.
74-
75-
#### Direction
76-
Some networks might require federation only to have communication that goes from the client to the aggregator.
77-
Others might need communication to go both ways.
78-
79-
### Computation
80-
The computations that a client has to do can define what analysis is possible, independent of how the network can communicate.
81-
82-
#### Execution model
83-
Some execution models do not allow branching computation dependent on some factor that is calculated during analysis.
84-
Others allow workflow-like execution which can branch.
85-
86-
#### Persistent executors
87-
Some analyses require executors that remain active across communication rounds.
88-
For others, an executor can complete upon communication.
89-
90-
### Privacy methods
91-
There are privacy methods that can be applied to different kinds of analysis to make them acceptable for different risk budgets.
92-
93-
#### [Differential Privacy](https://en.wikipedia.org/wiki/Differential_privacy)
94-
Differential privacy can be applied at different levels: applied on the input, required on the intermediate stages, or may not be applicable to an output.
95-
96-
#### Encryption
97-
Homomorphic encryption and secure multiparty computation are widely applied.
98-
These can be compatible with, or required for, an algorithm.
99-
100-
### Performance
101-
An optional rough guess of how fast this is to compute with this algorithm.
102-
103-
### Accuracy
104-
Some algorithms compute the exact statistic, some make some approximation.
105-
106-
### Practical notes
107-
This is where to put anything that isn't captured by the other parts of description.
108-
109-
### References
110-
Academic references for the algorithm.
15+
## Analysis
16+
Publishing the output of an analysis must be acceptable to all parties.
17+
If the output is unacceptable to any of the parties, no algorithm for calculating it will be acceptable.
18+
To help make these decisions, we report the [StatBarn](https://outputchecking.org/statbarn/) for each analysis, where possible.
19+
20+
Assuming an analysis delivers an acceptable output, the algorithm used to compute it must do that in an acceptable way, which is why we describe different federated *algorithms*.
21+
22+
### Aliases
23+
Different fields sometimes call the same statistic by different names.
24+
Federated research should be a broad church, so we have tried to keep a record of alternative names (aliases) for analyses.
25+
26+
### Relationships
27+
Some analyses might be good approximations of another, or be special cases of a wider family.
28+
We have captured some of these types of relationships so you can find the right analysis for a given federation more easily.
29+
30+
## Algorithm
31+
All of these kinds of analysis could be done by sending all of everyone's data to one place, but this is likely to be an unacceptable breach of confidentiality.
32+
Alternative algorithms for calculating these might be more acceptable for your federation, and the descriptions here are designed to make that decision easier.
33+
The basic idea of each algorithm is described first, followed by technical attributes that let you filter out unacceptable algorithms.
34+
35+
### Decomposable analysis
36+
The primary description of an algorithm, which defines many of the other parameters, is whether it is decomposable; that is, whether the computation can be split so that each TRE processes only its local data, producing summary statistics that an aggregator combines to produce the final result.
37+
We record three variants of decomposability:
38+
39+
- **Fully decomposable**: Each TRE can compute some summary statistic that is sufficient to calculate the statistic for the whole population of the federation.
40+
- **Non-decomposable**: Using this algorithm, the only option is to share your data.
41+
- **Iterative**: Algorithms that require multiple rounds of communication between TREs and an aggregator.
42+
43+
#### Examples
44+
You can calculate the mean across a federation by computing the local count and sum and sending those to an aggregator.
45+
46+
```mermaid
47+
sequenceDiagram
48+
TREs ->> Aggregator: sum(x), n
49+
Aggregator ->> Researcher: Mean over population
50+
```
51+
52+
This works because if you have the sum of a value, and the count of a population in each TRE, it does not matter whether you add all of them up in one place or separately.
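
The same idea can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names `local_summary` and `aggregate_mean` and the toy data are assumptions made for the example.

```python
from typing import Iterable, Tuple


def local_summary(values: Iterable[float]) -> Tuple[float, int]:
    """Runs inside a TRE: only (sum, count) leaves, never the raw values."""
    values = list(values)
    return sum(values), len(values)


def aggregate_mean(summaries: Iterable[Tuple[float, int]]) -> float:
    """Runs at the aggregator: combines the per-TRE summaries into one mean."""
    total, n = 0.0, 0
    for local_sum, local_n in summaries:
        total += local_sum
        n += local_n
    if n == 0:
        raise ValueError("no observations in any TRE")
    return total / n


# Hypothetical data held by two TREs.
tre_a = [1.0, 2.0, 3.0]
tre_b = [10.0, 20.0]

federated = aggregate_mean([local_summary(tre_a), local_summary(tre_b)])
pooled = sum(tre_a + tre_b) / len(tre_a + tre_b)
```

Here `federated` and `pooled` agree, which is what makes the mean fully decomposable: the aggregator only ever sees one sum and one count per TRE.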
53+
54+
There are lots of statistics you can't calculate this way though.
55+
For example, there's no summary statistic you can send to an aggregator to exactly calculate the median in one round of communication like this, so, without clever encryption methods, the median is not decomposable.
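
As a small illustration (not a proof), the sketch below tries one tempting candidate, sending each TRE's local median and taking the median of those, and shows that it does not recover the pooled median. The data are made up for the example.

```python
from statistics import median

# Hypothetical data held by two TREs.
tre_a = [1, 2, 3]
tre_b = [4, 5, 6, 7, 100]

# What you would get if all the data were moved to one place.
pooled_median = median(tre_a + tre_b)

# A naive one-round attempt: each TRE sends only its local median.
median_of_medians = median([median(tre_a), median(tre_b)])

# The two values differ, so local medians are not a sufficient summary.
mismatch = pooled_median != median_of_medians
```

Other single-round summaries run into the same problem in general, because the pooled median depends on how the values from different TREs interleave.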
56+
57+
Federated learning requires iteration to be useful.
58+
59+
```mermaid
60+
sequenceDiagram
61+
Aggregator ->> TREs: initial model
62+
loop Federated learning
63+
loop Local training
64+
TREs ->> TREs: train on local data
65+
end
66+
TREs ->> Aggregator: send model updates
67+
Aggregator ->> Aggregator: Combine model updates
68+
Aggregator -->> Aggregator: Evaluate model (maybe)
69+
Aggregator ->> TREs: Redistribute model
70+
end
71+
```
72+
73+
Here, the TREs train the model on their local data, then send these updates back to the aggregator, where they are combined.
74+
The updated model is sent back to TREs, and the process repeats until some criterion is reached.
75+
There are a couple of important points about iterative analyses.
76+
First, some data might then be observed by other TREs, as they can see how the model has been updated each round.
77+
In this example, it's not much of a disclosure risk, but the information that is exposed varies by analysis.
78+
Second, your federation needs to efficiently support multiple rounds of communication.
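
The loop in the diagram can be sketched as follows. This is a toy example under stated assumptions: a one-parameter model `y ≈ w * x`, plain gradient descent inside each TRE, and a sample-size-weighted average at the aggregator. The function names, learning rate, and data are all hypothetical and chosen only to keep the example small.

```python
from typing import List, Tuple

Data = List[Tuple[float, float]]  # (x, y) pairs held inside one TRE


def local_train(w: float, data: Data, lr: float = 0.01, steps: int = 5) -> Tuple[float, int]:
    """Runs inside a TRE: a few gradient steps on local data only."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    # Only the updated weight and the local sample size are sent back.
    return w, len(data)


def combine(updates: List[Tuple[float, int]]) -> float:
    """Runs at the aggregator: sample-size-weighted average of local weights."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def federated_fit(tres: List[Data], rounds: int = 50) -> float:
    w = 0.0  # initial model distributed by the aggregator
    for _ in range(rounds):  # a fixed number of rounds; could also be adaptive
        updates = [local_train(w, data) for data in tres]
        w = combine(updates)  # the combined model is redistributed next round
    return w


# Hypothetical data generated from y = 2x, split across two TREs.
tres = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]
w = federated_fit(tres)
```

Each TRE receives the combined weight every round, which is the "observed by other TREs" point above: the redistributed model carries some information about everyone's updates.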
79+
80+
### Communication rounds
81+
For iterative analyses, the number of rounds that need to be carried out might be known ahead of time, or the number might depend on some metric that can't be known ahead of time.
82+
Each algorithm either reports a number of rounds or is marked as "adaptive".
83+
84+
### Communication directions
85+
Depending on the specifications of the federation, it may be that information can only travel from the TREs to an aggregator.
86+
Some analyses require the aggregator to be able to send something (for example, model updates in federated learning) back to TREs.
87+
This may or may not be compatible with either the technical capabilities or governance of a TRE.
88+
89+
### Execution environment
90+
Some analyses require complex branching of workflows, whereas others can be carried out linearly.
91+
Some require an executor that can persist over multiple rounds, whereas others can be carried out by an executor that carries out a single task and then exits.
92+
These capabilities depend on the execution environment provided by each TRE.
93+
94+
### Privacy-preserving measures
95+
Differential privacy and encryption techniques mean that some analyses and algorithms that previously represented an unacceptable disclosure risk become acceptable.
96+
This depends on the governance decisions of TREs and the federation.
97+
98+
### Observable data
99+
During an analysis, some data will be observable by some other party.
100+
Here, we make it transparent *who* can see *what*.
101+
102+
### Algorithm parameters
103+
Running an algorithm may require some set-up, for example, sharing the initial model in federated learning.
104+
105+
### Implementations
106+
If we can find an implementation of an algorithm that can be used in federated analysis, a link can be provided.
107+
108+
### Reference documents
109+
Peer-reviewed literature can help domain experts to assess the trustworthiness and applicability of an algorithm, so academic sources may be included.
110+
111+
## Entity relationship diagram
112+
113+
To allow the website to represent the information, and to allow filtering, the data are stored in a defined schema, as described by this diagram.
114+
115+
```mermaid
116+
erDiagram
117+
statistics ||--o{ statistic_aliases : "has"
118+
statistics ||--o{ statistic_relationships : "source"
119+
statistics ||--o{ statistic_relationships : "target"
120+
statistics ||--o{ algorithms : "implemented by"
121+
122+
algorithms ||--o{ observable_data : "has"
123+
algorithms ||--o{ algorithm_parameters : "has"
124+
algorithms ||--o{ implementations : "has"
125+
algorithms ||--o{ reference_docs : "cited by"
126+
127+
statistics {
128+
VARCHAR statistic_id PK
129+
TEXT description
130+
output_type output
131+
VARCHAR statbarn_id
132+
TEXT mathjax
133+
TIMESTAMP created_at
134+
TIMESTAMP updated_at
135+
TEXT notes
136+
}
137+
138+
statistic_aliases {
139+
VARCHAR statistic_id FK
140+
VARCHAR alias_name
141+
BOOLEAN is_primary
142+
}
143+
144+
statistic_relationships {
145+
VARCHAR relationship_id PK
146+
VARCHAR source_statistic_id FK
147+
VARCHAR target_statistic_id FK
148+
relationship_type relationship_type
149+
TEXT description
150+
}
151+
152+
algorithms {
153+
VARCHAR algorithm_id PK
154+
VARCHAR statistic_id FK
155+
VARCHAR name
156+
TEXT description
157+
separability_type separability
158+
TEXT mathjax
159+
INTEGER communication_rounds
160+
BOOLEAN adaptive_rounds
161+
communication_direction communication_direction
162+
BOOLEAN requires_branching
163+
BOOLEAN requires_persistence
164+
VARCHAR differential_privacy
165+
privacy_encryption_level homomorphic_encryption
166+
privacy_encryption_level multiparty_computation
167+
VARCHAR computational_complexity
168+
VARCHAR communication_complexity
169+
TIMESTAMP created_at
170+
TIMESTAMP updated_at
171+
TEXT notes
172+
}
173+
174+
observable_data {
175+
VARCHAR observable_id PK
176+
VARCHAR algorithm_id FK
177+
node_type node_type
178+
TEXT item
179+
TEXT description
180+
}
181+
182+
algorithm_parameters {
183+
VARCHAR parameter_id PK
184+
VARCHAR algorithm_id FK
185+
VARCHAR parameter_name
186+
VARCHAR parameter_type
187+
BOOLEAN required
188+
VARCHAR default_value
189+
TEXT description
190+
}
191+
192+
implementations {
193+
VARCHAR implementation_id PK
194+
VARCHAR algorithm_id FK
195+
VARCHAR language
196+
VARCHAR library_name
197+
VARCHAR repository_url
198+
VARCHAR version
199+
implementation_status status
200+
TEXT notes
201+
TIMESTAMP created_at
202+
}
203+
204+
reference_docs {
205+
VARCHAR reference_id PK
206+
VARCHAR algorithm_id FK
207+
VARCHAR name
208+
INTEGER year
209+
VARCHAR doi
210+
VARCHAR url
211+
TEXT bibtex
212+
TEXT abstract
213+
TIMESTAMP created_at
214+
}
215+
```
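
To show how a schema like this supports filtering, here is a minimal sketch in Python. The `Algorithm` fields mirror a subset of the `algorithms` columns above, but everything else is an assumption for illustration: the `TreCapabilities` class, the `"one_way"` and `"two_way"` values for `communication_direction`, and the example rows are hypothetical and not taken from the project's data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Algorithm:
    """A subset of the `algorithms` columns from the diagram above."""
    algorithm_id: str
    statistic_id: str
    name: str
    communication_rounds: Optional[int]
    adaptive_rounds: bool
    communication_direction: str  # hypothetical values: "one_way", "two_way"
    requires_branching: bool
    requires_persistence: bool


@dataclass
class TreCapabilities:
    """Hypothetical description of what one TRE can support."""
    max_rounds: int
    allows_adaptive_rounds: bool
    allows_two_way: bool
    supports_branching: bool
    supports_persistence: bool


def is_compatible(algo: Algorithm, tre: TreCapabilities) -> bool:
    """Return True if the TRE meets every requirement of the algorithm."""
    if algo.adaptive_rounds and not tre.allows_adaptive_rounds:
        return False
    if algo.communication_rounds is not None and algo.communication_rounds > tre.max_rounds:
        return False
    if algo.communication_direction == "two_way" and not tre.allows_two_way:
        return False
    if algo.requires_branching and not tre.supports_branching:
        return False
    if algo.requires_persistence and not tre.supports_persistence:
        return False
    return True


# Hypothetical rows, for illustration only.
algorithms: List[Algorithm] = [
    Algorithm("a1", "mean", "Sum and count", 1, False, "one_way", False, False),
    Algorithm("a2", "model", "Federated averaging", None, True, "two_way", False, True),
]
simple_tre = TreCapabilities(
    max_rounds=1,
    allows_adaptive_rounds=False,
    allows_two_way=False,
    supports_branching=False,
    supports_persistence=False,
)
compatible = [a.name for a in algorithms if is_compatible(a, simple_tre)]
```

With these made-up rows, a TRE that only supports a single one-way round keeps the sum-and-count algorithm and filters out the iterative one.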

src/index.md

Lines changed: 17 additions & 1 deletion
@@ -11,8 +11,24 @@ style: entrust-style.css
1111

1212
---
1313

14-
First we justify how analyses are [categorised](Categorisation), then provide a [dashboard](analysis-breakdown) so you can enter your TRE requirements and see what analyses are compatible.
14+
<div style="margin:1rem 0;">
15+
Trusted research environments (TREs) can take part in federated analytics, where multiple partners in a network collaborate to compute some analysis, providing <em>data access</em> without <em>data sharing</em>.
16+
Which analyses are feasible for TREs depends on technical possibility and the operational acceptability of different stages of analysis.
17+
In theory, any of these analyses can be carried out by moving all the data into one place and running the analysis there.
18+
That might be acceptable for some TREs.
19+
However, there are many cases where this presents an unacceptable breach of confidentiality, and the analysis should be performed by calculating a local result that can be combined with similar results for other TREs.
20+
</div>
21+
22+
<div style="margin:1rem 0;">
23+
This is why an <b>analysis</b> is treated separately to the <b>algorithm(s)</b> used to compute it.
24+
The analysis describes the final output that is published.
25+
Algorithms describe how the analysis is computed.
26+
This includes information such as what data can be observed by which other parties in the federated analysis, and the technical requirements for TREs.
27+
</div>
28+
29+
---
1530

31+
You can read more details on how analyses are [categorised](Categorisation), or use the [dashboard](analysis-breakdown) to see whether your TRE requirements are compatible with different federated analyses.
1632

1733
<style>
1834
