Commit 39f0187

Text/summary statistics example (#7)
* start formatting
* add contingency table
* links
* put visualising page back in
1 parent 6078259 commit 39f0187

5 files changed

Lines changed: 310 additions & 2 deletions

observablehq.config.js

Lines changed: 2 additions & 2 deletions
@@ -26,8 +26,8 @@ export default {
 {name: "Submission layer wizards", path: "/examples-in-five-safes-tes/submission-layer-wizards"},
 {name: "Collecting results", path: "/examples-in-five-safes-tes/collecting-results"},
 {name: "Visualising OMOP metadata", path: "/examples-in-five-safes-tes/Bunny visualisations"},
-// {name: "Aggregating statistics", path: "/examples-in-five-safes-tes/aggregating-statistics"},
-// {name: "Contingency tables", path: "/examples-in-five-safes-tes/contingency-tables"},
+{name: "Aggregating statistics", path: "/examples-in-five-safes-tes/aggregating-statistics"},
+{name: "Contingency tables", path: "/examples-in-five-safes-tes/contingency-tables"},
 {name: "Five Safes TES messages", path: "/examples-in-five-safes-tes/5s-tes-messages"},
 ]
 }

src/examples-in-five-safes-tes/aggregating-statistics.md

Lines changed: 153 additions & 0 deletions
@@ -3,4 +3,157 @@ theme: air
33
style: ../entrust-style.css
44
title: Aggregating basic statistics
55
---
6+
# Aggregating basic statistics
67

8+
This tutorial can be run as a Jupyter notebook from the [5s-TES notebooks repository](https://github.com/Health-Informatics-UoN/5s-TES-notebooks/).
9+
10+
The outputs from TES tasks can be easily used to calculate basic statistics.
11+
12+
This example will use summary statistics from a dataset in the OMOP common data model.
13+
There is a container which, given a SQL query that filters an OMOP table by your criteria, will calculate the necessary summary statistics for your final analysis.
14+
15+
This example data was collected using the [Custom Image wizard](submission-layer-wizards#custom-image) in the submission layer with these settings changed from default:
16+
17+
| Field | value |
18+
| ------- | ------------------------------------------------------------------------------ |
19+
| image | ghcr.io/health-informatics-uon/five-safes-tes-analytics-dev:sha-57c3950 |
20+
| workdir | /app |
21+
| command | --user-query=SELECT value_as_number FROM public.measurement \\nWHERE measurement_concept_id = 3000905\\nAND value_as_number IS NOT NULL<br>--analysis=variance<br>--output-filename=/outputs/output<br>--output-format=json<br> |
22+
23+
24+
![Screenshot of the custom image wizard](wizard-desc-stats.png)
25+
26+
<details>
27+
<summary>Expand to view generated JSON</summary>
28+
29+
30+
```json
{
  "id": "450",
  "state": 0,
  "name": "test variance",
  "description": "Federated analysis task",
  "inputs": null,
  "outputs": [
    {
      "name": "Query Results",
      "description": "Results from the requested query execution",
      "url": "s3://",
      "path": "/outputs",
      "type": "DIRECTORY"
    }
  ],
  "resources": null,
  "executors": [
    {
      "image": "ghcr.io/health-informatics-uon/five-safes-tes-analytics-dev:sha-57c3950",
      "command": [
        "--user-query=SELECT value_as_number FROM public.measurement \nWHERE measurement_concept_id = 3000905\nAND value_as_number IS NOT NULL",
        "--analysis=variance",
        "--output-filename=/outputs/output",
        "--output-format=json"
      ],
      "workdir": "/app",
      "stdin": null,
      "stdout": null,
      "stderr": null,
      "env": {}
    }
  ],
  "volumes": null,
  "tags": {
    "Project": "NottinghamDemo",
    "tres": "Nottingham TRE 01|Nottingham TRE 02"
  },
  "logs": null,
  "creation_time": null
}
```
72+
</details>
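
The `command` array in the JSON above is a list of `--flag=value` strings. As a rough sketch of how such a list could be assembled programmatically, the helper below builds it from a multi-line SQL query. `build_command` is a hypothetical name and not part of any 5s-TES tooling; the flag names and values are copied from the example above.

```python
import json


def build_command(
    query: str,
    analysis: str,
    output_filename: str = "/outputs/output",
    output_format: str = "json",
) -> list[str]:
    """Assemble executor arguments in the shape used by the example task.

    Hypothetical helper for illustration only.
    """
    return [
        f"--user-query={query}",
        f"--analysis={analysis}",
        f"--output-filename={output_filename}",
        f"--output-format={output_format}",
    ]


query = (
    "SELECT value_as_number FROM public.measurement \n"
    "WHERE measurement_concept_id = 3000905\n"
    "AND value_as_number IS NOT NULL"
)
command = build_command(query, analysis="variance")

# json.dumps escapes the embedded newlines as \n, as in the task JSON.
print(json.dumps(command, indent=2))
```

Keeping the query as an ordinary Python string and letting `json.dumps` handle the escaping avoids hand-editing `\n` sequences.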
73+
74+
The `aggregate_utils` module provided with this notebook allows you to calculate statistics for the overall population by aggregating intermediate results.
75+
76+
```python
from pathlib import Path

import numpy as np
from aggregate_utils import (
    TTestIntermediate,
    VarianceIntermediate,
    aggregate_variance_intermediates,
    make_variance_intermediate_from_json,
)
```
81+
The example data are held in `./data`.
82+
83+
```python
paths = {
    "tre1": "./data/variance-tre-1.json",
    "tre2": "./data/variance-tre-2.json",
}
```
89+
90+
The `make_variance_intermediate_from_json` function reads the data from a JSON file and returns a `VarianceIntermediate`, which provides methods for aggregation.
91+
Each returned value holds the count (`n`), the sum (`total`), and the sum of squares (`sum_x2`) of the values read from the original table.
92+
These three pieces of information are sufficient to calculate several other summary statistics.
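
As a rough illustration of why these three values suffice, the sketch below recovers the mean and variance from `n`, `total` and `sum_x2` alone. The function name `mean_and_variance` is hypothetical and is not part of `aggregate_utils`. Dividing by `n` (the population variance) is an assumption, chosen because it reproduces the variance printed later in this tutorial for the aggregated example; dividing by `n - 1` would give a slightly larger value.

```python
def mean_and_variance(n: int, total: float, sum_x2: float) -> tuple[float, float]:
    """Recover the mean and population variance from three running sums.

    Hypothetical helper for illustration; not the aggregate_utils implementation.
    """
    mean = total / n
    # Var(X) = E[X^2] - (E[X])^2, with n as the denominator (an assumption).
    variance = sum_x2 / n - mean**2
    return mean, variance


# Sums taken from the aggregated example in this tutorial.
mean, variance = mean_and_variance(n=6711, total=239566.0, sum_x2=15334292.0)
print(mean, variance)
```

The subtraction in this formula can lose precision when the variance is small relative to the mean, so it is best suited to moderate-sized values like those here.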
93+
94+
```python
variance_intermediates = {
    k: make_variance_intermediate_from_json(Path(v))
    for k, v in paths.items()
}
variance_intermediates
```
101+
102+
`'tre1': VarianceIntermediate(n=2140, total=10819.0, sum_x2=76707.0),`
103+
104+
`'tre2': VarianceIntermediate(n=4571, total=228747.0, sum_x2=15257585.0)`
105+
106+
For example, the `aggregate_variance_intermediates` function will aggregate these as if they came from a single sample.
107+
108+
```python
aggregated_intermediate = aggregate_variance_intermediates(variance_intermediates.values())
aggregated_intermediate
```
112+
113+
`VarianceIntermediate(n=6711, total=239566.0, sum_x2=15334292.0)`
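
The aggregated values above are the element-wise sums of the per-TRE values. A minimal sketch of that idea follows, using a hypothetical `Sums` dataclass as a stand-in for `VarianceIntermediate`; the real class may be implemented differently.

```python
import operator
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Sums:
    """Hypothetical stand-in for VarianceIntermediate (illustration only)."""

    n: int
    total: float
    sum_x2: float

    def __add__(self, other: "Sums") -> "Sums":
        # Counts, sums and sums of squares are all additive across disjoint samples.
        return Sums(
            self.n + other.n,
            self.total + other.total,
            self.sum_x2 + other.sum_x2,
        )


# Per-TRE values copied from the output shown earlier.
tre1 = Sums(n=2140, total=10819.0, sum_x2=76707.0)
tre2 = Sums(n=4571, total=228747.0, sum_x2=15257585.0)

pooled = reduce(operator.add, [tre1, tre2])
print(pooled)
```

Because addition is associative and commutative, the order in which TRE results arrive does not matter, and only the three sums need to leave each TRE.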
114+
115+
The `mean` and `variance` properties of the aggregated intermediate describe the whole pooled sample.
116+
```python
aggregated_intermediate.mean
```
119+
120+
`35.69751154820444`
121+
122+
```python
aggregated_intermediate.variance
```
125+
126+
`1010.6365591480935`
127+
128+
The values are from a very skewed distribution.
129+
To demonstrate how the same information can be used to conduct other common statistical analyses, random samples from a normal distribution can be generated with this code:
130+
131+
```python
mu, sigma = 50, 10
rng = np.random.default_rng()
s = rng.normal(mu, sigma, 10)
```
136+
137+
A `TTestIntermediate` uses the same three pieces of information as a `VarianceIntermediate`.
138+
139+
```python
gaussian_t_test_intermediate = TTestIntermediate(
    n=len(s),
    total=np.sum(s),
    sum_x2=np.sum(s**2)
)

gaussian_t_test_intermediate
```
148+
149+
`TTestIntermediate(n=10, total=np.float64(476.9742997017046), sum_x2=np.float64(23459.825051161773))`
150+
151+
You can also use this information to perform a one-sample t-test.
152+
153+
```python
gaussian_t_test_intermediate.one_sample_t_test(52)
```
156+
157+
`0.07033644170438572`
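
For readers who want to see how a t-test can be built from the same three sums, here is a sketch of the textbook one-sample t-test. The function name `one_sample_t_from_sums` is hypothetical, and the conventions used (sample variance with an `n - 1` denominator, two-sided p-value) are assumptions that may differ from those in `aggregate_utils`, so it is not guaranteed to reproduce the value above. The fixed seed is also an addition, used only to make the illustration repeatable.

```python
import math

import numpy as np
from scipy import stats


def one_sample_t_from_sums(
    n: int, total: float, sum_x2: float, popmean: float
) -> tuple[float, float]:
    """Textbook one-sample t-test from n, sum(x) and sum(x^2).

    Hypothetical helper; assumes sample variance (n - 1) and a two-sided test.
    """
    mean = total / n
    sample_var = (sum_x2 - total**2 / n) / (n - 1)
    t_stat = (mean - popmean) / math.sqrt(sample_var / n)
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
    return float(t_stat), float(p_value)


# Fixed seed so the illustration is repeatable.
rng = np.random.default_rng(0)
s = rng.normal(50, 10, 10)

t_stat, p_value = one_sample_t_from_sums(len(s), s.sum(), (s**2).sum(), popmean=52)

# Cross-check against SciPy, which works from the raw sample instead.
reference = stats.ttest_1samp(s, 52)
print(t_stat, p_value)
print(reference.statistic, reference.pvalue)
```

The two printed lines should agree up to floating-point error, which shows that the raw sample is not needed once the three sums are known.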
158+
159+
There are many other analyses that can be performed with a few building blocks like this.

src/examples-in-five-safes-tes/contingency-tables.md

Lines changed: 155 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,4 +3,159 @@ theme: air
33
style: ../entrust-style.css
44
title: Aggregating contingency tables
55
---
6+
# Aggregating data for contingency tables
67

8+
This tutorial can be run as a Jupyter notebook from the [5s-TES notebooks repository](https://github.com/Health-Informatics-UoN/5s-TES-notebooks/).
9+
10+
Federated analysis on contingency tables is relatively simple.
11+
Counts are easy to federate: each TRE calculates its local count for some group, then these are aggregated by adding the counts together.
12+
Each cell of a contingency table is a count, so the table can be federated by requesting these counts, and then statistical analyses can be performed on the aggregate.
13+
14+
```mermaid
graph TD
    subgraph 5s-TES
        sub(Submission layer)
        tre1(TRE 1)
        tre2(TREs ..n)
    end
    user -- Request counts --> sub
    sub -- Request counts --> tre1
    sub -- Request counts --> tre2
    tre1 -- counts --> agg(User)
    tre2 -- counts --> agg
    agg -- Sum counts --> Result
```
28+
29+
The example data were produced by running the [Custom Image wizard](submission-layer-wizards#custom-image) using the following parameters:
30+
31+
| Field | value |
32+
| ----- | ----- |
33+
| Docker image| ghcr.io/health-informatics-uon/five-safes-tes-analytics-dev:sha-9ac04bc |
34+
| Workdir | /app |
35+
| Commands | --user-query=SELECT g.concept_name AS gender_name, r.concept_name AS race_name\\nFROM public.person p\\nJOIN public.concept g ON p.gender_concept_id = g.concept_id\\nJOIN public.concept r ON p.race_concept_id = r.concept_id\\nWHERE p.race_concept_id IN (8515, 8516, 8527)<br>--analysis=contingency_table<br>--output-filename=/outputs/output<br>--output-format=json |
36+
37+
The UI should look like this:
38+
![A screenshot of the web application showing the custom image wizard](contingency-table-wizard.png)
39+
40+
<details>
41+
<summary>Expand for example JSON</summary>
42+
43+
```json
{
  "id": "504",
  "name": "test chi-sq",
  "description": "Federated analysis task",
  "inputs": null,
  "outputs": [
    {
      "name": "Query Results",
      "description": "Results from the requested query execution",
      "url": "s3://",
      "path": "/outputs",
      "type": "DIRECTORY"
    }
  ],
  "resources": null,
  "executors": [
    {
      "image": "ghcr.io/health-informatics-uon/five-safes-tes-analytics-dev:sha-9ac04bc",
      "command": [
        "--user-query=SELECT g.concept_name AS gender_name, r.concept_name AS race_name\nFROM public.person p\nJOIN public.concept g ON p.gender_concept_id = g.concept_id\nJOIN public.concept r ON p.race_concept_id = r.concept_id\nWHERE p.race_concept_id IN (8515, 8516, 8527)",
        "--analysis=contingency_table",
        "--output-filename=/outputs/output",
        "--output-format=json"
      ],
      "workdir": "/app",
      "stdin": null,
      "stdout": null,
      "stderr": null,
      "env": {}
    }
  ],
  "volumes": null,
  "tags": {
    "Project": "NottinghamDemo",
    "tres": "Nottingham TRE 01|Nottingham TRE 02"
  },
  "logs": null,
  "creation_time": null
}
```
84+
</details>
85+
86+
```python
import pandas as pd
from contingency_table_utils import read_contingency_table_from_json, aggregate_tables

from scipy.stats import chi2_contingency
```
92+
93+
The JSON produced by this analysis can be read into tables using the supplied `contingency_table_utils` module.
94+
95+
```python
tre1 = read_contingency_table_from_json("data/tre1.json")
tre2 = read_contingency_table_from_json("data/tre2.json")
tre1.data
```
100+
101+
| gender_name | race_name | n |
102+
| ----------- | -------------------------- | ---- |
103+
| FEMALE | Asian | 29 |
104+
| FEMALE | Black or African American | 44 |
105+
| FEMALE | White | 411 |
106+
| MALE | Asian | 41 |
107+
| MALE | Black or African American | 38 |
108+
| MALE | White | 433 |
109+
110+
The data aren't very interesting, as they simply report how many men and women there are of three ethnicities in the synthetic datasets, but they serve to show how contingency tables can be assembled.
111+
112+
`aggregate_tables` checks that your tables have the same variables, and sums the counts if they do.
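
One way such a check-and-sum step could work is sketched below with plain pandas. `sum_count_tables` is a hypothetical function, not the `aggregate_tables` implementation, and it assumes each table is a long-format DataFrame with a count column named `n`, as in the output above. The counts are toy values for illustration, not the tutorial data.

```python
import pandas as pd


def sum_count_tables(tables: list[pd.DataFrame], count_col: str = "n") -> pd.DataFrame:
    """Sum long-format count tables that share the same variables.

    Hypothetical helper for illustration only.
    """
    variables = [c for c in tables[0].columns if c != count_col]
    for t in tables[1:]:
        if [c for c in t.columns if c != count_col] != variables:
            raise ValueError("Tables have different variables and cannot be aggregated")
    # Stack all rows, then add up the counts for each combination of categories.
    stacked = pd.concat(tables, ignore_index=True)
    return stacked.groupby(variables, as_index=False)[count_col].sum()


# Toy counts for illustration only (not the tutorial data).
site_a = pd.DataFrame(
    {"gender_name": ["FEMALE", "MALE"], "race_name": ["Asian", "Asian"], "n": [3, 5]}
)
site_b = pd.DataFrame(
    {"gender_name": ["FEMALE", "MALE"], "race_name": ["Asian", "Asian"], "n": [10, 20]}
)

combined = sum_count_tables([site_a, site_b])
# Rearrange into a wide table: one row per race, one column per gender.
wide = combined.pivot(index="race_name", columns="gender_name", values="n")
print(combined)
print(wide)
```

Refusing to sum tables whose variables differ guards against silently combining results from two different queries.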
113+
114+
```python
aggregate = aggregate_tables([tre1, tre2])
```
117+
118+
The `contingency_table` property organises these data into the format needed for statistical analyses.
119+
120+
```python
aggregate.contingency_table
```
123+
124+
| | FEMALE | MALE |
125+
| ------------------------- | ------ | ------ |
126+
| Asian | 1011 | 982 |
127+
| Black or African American | 1080 | 1022 |
128+
| White | 1426 | 1441 |
129+
130+
This format can be passed directly to the contingency table functions in `scipy.stats`.
131+
132+
```python
chisq = chi2_contingency(aggregate.contingency_table)
print(f"The p-value for the chi-squared test is {chisq.pvalue:.3f}")
```
136+
137+
`The p-value for the chi-squared test is 0.508`
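
To make explicit what `chi2_contingency` computes, the sketch below works through Pearson's chi-squared statistic by hand for the aggregated table shown above. The function name `pearson_chi2` is hypothetical; the counts are copied from the table in this tutorial. No continuity correction is applied, which matches SciPy's default behaviour for tables with more than one degree of freedom.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency


def pearson_chi2(observed) -> tuple[float, float, int]:
    """Pearson chi-squared test of independence for an r x c table of counts."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts under independence: (row total * column total) / grand total.
    expected = row_totals @ col_totals / observed.sum()
    statistic = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return float(statistic), float(chi2.sf(statistic, dof)), dof


# Rows: Asian, Black or African American, White; columns: FEMALE, MALE.
table = np.array([[1011, 982], [1080, 1022], [1426, 1441]])
statistic, p_value, dof = pearson_chi2(table)

# Cross-check against SciPy's implementation.
reference = chi2_contingency(table)
print(f"by hand: chi2 = {statistic:.4f}, dof = {dof}, p = {p_value:.4f}")
print(f"scipy:   chi2 = {reference[0]:.4f}, dof = {reference[2]}, p = {reference[1]:.4f}")
```

Only the cell counts enter the calculation, which is why summing counts across TREs is enough to run the test on the pooled population.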
138+
139+
Phew, the synthetic data haven't got any surprising imbalances.
140+
141+
This tutorial shows how you can perform federated analyses based on contingency tables of count data.
142+
The key requirement for writing your own analyses is writing a SQL query that, like
143+
144+
```sql
SELECT g.concept_name AS gender_name, r.concept_name AS race_name
FROM public.person p
JOIN public.concept g ON p.gender_concept_id = g.concept_id
JOIN public.concept r ON p.race_concept_id = r.concept_id;
```
150+
151+
produces a table of two categorical columns, e.g.
152+
153+
| gender_name | race_name |
154+
| ----------- | ----------------------------------------- |
155+
| FEMALE | Asian |
156+
| FEMALE | Black or African American |
157+
| FEMALE | White |
158+
| MALE | White |
159+
| FEMALE | Native Hawaiian or Other Pacific Islander |
160+
| MALE | Native Hawaiian or Other Pacific Islander |
161+
| ... | ... |
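
For intuition about what happens to rows like these, the following sketch counts each combination of the two categorical columns, which is the step that turns rows into contingency-table cells. It uses pandas on a few made-up rows. How the analytics container performs this step internally is not described here, so treat this as an assumption rather than a description of its implementation.

```python
import pandas as pd

# A few made-up rows in the shape returned by the query (illustration only).
rows = pd.DataFrame(
    {
        "gender_name": ["FEMALE", "FEMALE", "MALE", "MALE", "FEMALE"],
        "race_name": ["Asian", "White", "White", "White", "Asian"],
    }
)

# Long format: one row per observed combination of categories, with its count.
counts = rows.groupby(["gender_name", "race_name"]).size().reset_index(name="n")

# Wide format: the same counts arranged as a contingency table.
table = pd.crosstab(rows["race_name"], rows["gender_name"])
print(counts)
print(table)
```

Any query that returns exactly two categorical columns can be counted this way, so the same pattern applies to variables other than gender and race.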