Commit 84162a4

tnagamine and claude committed
Update README.md for Example 1 and Example 2
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 6ccc1af commit 84162a4

2 files changed

Lines changed: 200 additions & 41 deletions

examples/example1/README.md

Lines changed: 100 additions & 21 deletions
@@ -1,39 +1,118 @@
1-
# Example 1
1+
# Example 1 — Clinical Events from EHR Diagnoses, Vitals, and Notes
22

3-
Contains a simplified example of input RWD from EHR and output SDTM.
3+
This example demonstrates RWD lineage traceability for the **SDTM CE (Clinical Events)** domain, where the occurrence of two prespecified conditions — hypertension and acute myocardial infarction — is determined by combining evidence from three distinct EHR source tables: structured diagnosis codes, vital-sign measurements, and free-text clinical notes.
44

5-
**SDTM: CE** contains an excerpt of a CE table with two prespecified events, *hypertension* and *acute myocardial infarction*.
6-
The **source** tables contain source tables corresponding to patient diagnoses, vital signs, and clinical notes.
7-
Notes are modified from mtsamples.com.
5+
## Scenario
86

9-
### Algorithm
7+
A study tracks two prespecified clinical events across two subjects. Each event's occurrence (`CEOCCUR = Y/N`) is determined by a multi-source algorithm that draws on ICD-10 diagnosis codes, blood pressure readings, and NLP-extracted findings from clinical notes. The lineage captures every piece of evidence that contributes to each determination.
108

11-
**Hypertension**
12-
Patient experiences *hypertension* after the index event within the followup period:
13-
1. Patient is assigned ICDI0 diagnosis code I10–I16 (Hypertensive diseases)
14-
2. Patient has at least two elevated blood pressure measurements over the course of a 3-month period within the followup period
9+
### Source Data
10+
11+
| Table | Columns | Records | Description |
12+
|-------|---------|---------|-------------|
13+
| `pt_dx.csv` | `PT_ID`, `DATE`, `ICD10`, `TERM` | 8 | ICD-10 diagnosis codes from the EHR problem list |
14+
| `vitals.csv` | `Patno`, `Date`, `Vital`, `Value` | 8 | Vital signs including blood pressure and BMI |
15+
| `notes.csv` | `PT_ID`, `DATE`, `TEXT` | 72 | Free-text clinical notes (modified from [mtsamples.com](https://mtsamples.com)) |
16+
17+
### Target Data
18+
19+
| Table | Columns | Records | Description |
20+
|-------|---------|---------|-------------|
21+
| `ce.csv` | `USUBJID`, `CESEQ`, `CETERM`, `CEPRESP`, `CEOCCUR` | 4 | SDTM CE domain — 2 subjects × 2 prespecified conditions |
22+
23+
### SDTM Output Summary
24+
25+
| USUBJID | CESEQ | CETERM | CEPRESP | CEOCCUR |
26+
|---------|-------|--------|---------|---------|
27+
| 001 | 1 | MYOCARDIAL INFARCTION | Y | Y |
28+
| 001 | 2 | HYPERTENSION | Y | Y |
29+
| 002 | 1 | MYOCARDIAL INFARCTION | Y | N |
30+
| 002 | 2 | HYPERTENSION | Y | Y |
31+
32+
---
33+
34+
## Algorithms
35+
36+
### Hypertension
37+
38+
Patient experiences *hypertension* after the index event within the follow-up period:
39+
40+
1. Patient is assigned an ICD-10 diagnosis code I10–I16 (Hypertensive diseases)
41+
2. Patient has at least two elevated blood pressure measurements over the course of a 3-month period within the follow-up period
1542
3. Patient has diagnosed hypertension in clinical notes
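
The three hypertension criteria above can be sketched in Python as follows. This is a minimal illustration, not the study's actual implementation: the function names, the systolic-only readings, the 92-day approximation of "3 months", and the choice to combine the three criteria with AND are all assumptions. The README does not state a blood pressure threshold, so it is left as a required parameter.

```python
from datetime import date, timedelta


def is_hypertensive_code(icd10: str) -> bool:
    """True for ICD-10 codes in the I10-I16 block, e.g. 'I10' or 'I11.9'."""
    code = icd10.strip().upper()
    return (
        len(code) >= 3
        and code[0] == "I"
        and code[1:3].isdecimal()
        and 10 <= int(code[1:3]) <= 16
    )


def has_two_elevated_bp(readings, threshold, window_days=92):
    """readings: (date, systolic) pairs already restricted to the follow-up period.

    True if at least two readings at or above `threshold` fall within
    `window_days` of each other. Checking adjacent sorted dates is enough,
    because the closest pair of dates is always adjacent after sorting.
    """
    elevated = sorted(day for day, systolic in readings if systolic >= threshold)
    window = timedelta(days=window_days)
    return any(later - earlier <= window for earlier, later in zip(elevated, elevated[1:]))


def hypertension_occurred(icd10_codes, bp_readings, note_mentions_htn, threshold):
    """Combine the three criteria with AND (an assumption, see the lead-in)."""
    return bool(
        any(is_hypertensive_code(code) for code in icd10_codes)
        and has_two_elevated_bp(bp_readings, threshold)
        and note_mentions_htn
    )
```

A caller would first filter each source table to records after the index date, then pass the remaining codes, readings, and the NLP result for the subject into `hypertension_occurred`.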
1643

17-
**Myocardial Infarction**
18-
Patient experiences *acute myocardial infarction* after the index event within the followup period:
19-
1. Patient is assigned a diagnosis code I21 (Acute myocardial infarction) or I122 (Subsequent ST elevation (STEMI) and non-ST elevation (NSTEMI) myocardial infarction)
20-
2. Patient has records of acute myocardial infarction in notes within the followup period
44+
### Acute Myocardial Infarction
45+
46+
Patient experiences *acute myocardial infarction* after the index event within the follow-up period:
47+
48+
1. Patient is assigned a diagnosis code I21 (Acute myocardial infarction) or I22 (Subsequent ST elevation (STEMI) and non-ST elevation (NSTEMI) myocardial infarction)
49+
2. Patient has records of acute myocardial infarction in notes within the follow-up period
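
As a rough sketch of the two AMI criteria above, the following Python collects the evidence from each source separately. The names `is_ami_code` and `ami_evidence`, the input shapes, and the explicit `followup_end` parameter are hypothetical. The README does not say whether the criteria are combined with AND or OR, so the sketch returns both flags and leaves that decision to the caller.

```python
from datetime import date


def is_ami_code(icd10: str) -> bool:
    """True for ICD-10 codes under I21 or I22, e.g. 'I21.3'."""
    code = icd10.strip().upper()
    return code.startswith(("I21", "I22"))


def ami_evidence(dx_rows, note_finding_dates, index_date, followup_end):
    """Report which sources support an AMI within the follow-up period.

    dx_rows: (date, icd10) pairs from the diagnosis table.
    note_finding_dates: dates of notes in which NLP found an AMI mention.
    """
    def in_followup(day):
        return index_date < day <= followup_end

    return {
        "diagnosis": any(is_ami_code(code) and in_followup(day) for day, code in dx_rows),
        "notes": any(in_followup(day) for day in note_finding_dates),
    }
```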
50+
51+
---
52+
53+
## Lineage Overview
54+
55+
The `rwd-lineage.xml` file contains **20 `<MapID>` elements** tracing source EHR data to SDTM CE. All 20 mappings target the `CEOCCUR` column, reflecting the fact that the clinical event occurrence flag is the outcome of a multi-source evidence evaluation.
2156

57+
### Transformation Types
58+
59+
| Type | Count | Description |
60+
|------|-------|-------------|
61+
| `DirectMap` | 7 | One-to-one value mapping (e.g., ICD-10 code or vital type used as evidence) |
62+
| `AfterIndexDate` | 5 | Temporal filter confirming the source event falls within the study follow-up window |
63+
| `NLPExtraction` | 5 | Structured finding extracted from free-text clinical notes |
64+
| `FilterByValue` | 3 | Conditional filter based on source value (e.g., blood pressure exceeding a threshold) |
65+
66+
### Lineage by CE Record
67+
68+
**Subject 001 — Myocardial Infarction (`CEOCCUR = Y`, CE row 1)**
69+
Three lineage entries trace to `pt_dx.csv` row 1 (ICD-10 code I21.3): a `DirectMap` on the ICD-10 code, a `DirectMap` on the diagnosis term, and an `AfterIndexDate` filter on the date. Two additional `NLPExtraction` entries link to `notes.csv` confirming the finding in clinical text.
70+
71+
**Subject 002 — Hypertension (`CEOCCUR = Y`, CE row 4)**
72+
Fifteen lineage entries converge on this single cell from all three source tables. Diagnosis evidence comes from `pt_dx.csv` (3 entries: ICD-10 code, term, and date filter). Blood pressure evidence comes from `vitals.csv` across three visits (9 entries: 3 rows × vital type + value filter + date filter each). NLP evidence comes from `notes.csv` (3 entries across 2 note records).
73+
74+
---
2275

2376
## Contents
2477

2578
```
2679
example1/
2780
├── README.md # This file
28-
├── Example1.xlsx # Source workbook (SDTM CE, Source PT_DX, Source VITALS, Source NOTES, RWDLineage-Table)
81+
├── Example1.xlsx # Companion workbook (SDTM CE, Source PT_DX, Source VITALS, Source NOTES, RWDLineage-Table)
2982
└── data/
3083
├── sdtm/
31-
│ └── ce.csv # SDTM CE domain (4 records: subjects 001/002 x 2 conditions)
84+
│ └── ce.csv # SDTM CE domain (4 records: subjects 001/002 × 2 conditions)
3285
├── source/
33-
│ ├── pt_dx.csv # EHR diagnoses table (PT_ID, DATE, ICD10, TERM)
34-
│ ├── vitals.csv # EHR vitals table (Patno, Date, Vital, Value)
35-
│ └── notes.csv # EHR clinical notes table (PT_ID, DATE, TEXT)
86+
│ ├── pt_dx.csv # EHR diagnoses table — 8 rows (PT_ID, DATE, ICD10, TERM)
87+
│ ├── vitals.csv # EHR vitals table — 8 rows (Patno, Date, Vital, Value)
88+
│ └── notes.csv # EHR clinical notes — 72 rows (PT_ID, DATE, TEXT)
3689
└── define/
37-
├── define.xml # Define-XML 2.1 describing the CE domain with RWD lineage reference
38-
└── rwd-lineage.xml # RWD-Lineage XML with 20 MapID elements linking source EHR data to SDTM CE
90+
├── define.xml # Define-XML 2.1 with rwdl namespace extension referencing rwd-lineage.xml
91+
└── rwd-lineage.xml # RWD-Lineage XML with 20 MapID elements linking source EHR data to SDTM CE
3992
```
93+
94+
---
95+
96+
## Validation
97+
98+
From the repository root:
99+
100+
```bash
101+
# Validate the RWD-Lineage XML structure
102+
python3 tools/validate.py rwd-lineage examples/example1/data/define/rwd-lineage.xml
103+
104+
# Validate the Define-XML against CDISC XSD (requires lxml)
105+
python3 tools/validate.py define-xml examples/example1/data/define/define.xml
106+
107+
# Check that every SDTM cell has lineage coverage
108+
python3 tools/validate.py coverage examples/example1/data/sdtm examples/example1/data/define/rwd-lineage.xml
109+
```
110+
111+
---
112+
113+
## Key Concepts Demonstrated
114+
115+
- **Multi-source evidence**: A single SDTM cell (`CEOCCUR`) traces back to diagnosis codes, vital signs, *and* NLP-extracted findings — the lineage captures all contributing sources.
116+
- **Composite algorithms**: The hypertension determination requires evidence from all three source types; the lineage documents each step.
117+
- **NLP as a lineage source**: Free-text clinical notes are a first-class source in the lineage, with `NLPExtraction` documenting the transformation from unstructured text to structured evidence.
118+
- **Temporal filtering**: `AfterIndexDate` entries explicitly record that a source event was validated against the study's follow-up window.

examples/example2/README.md

Lines changed: 100 additions & 20 deletions
@@ -1,40 +1,120 @@
1-
# Example 2
1+
# Example 2 — Laboratory Results and Adverse Events from EHR Lab Data
22

3-
Contains a simplified example of input RWD from EHR lab results and output SDTM.
3+
This example demonstrates RWD lineage traceability for two SDTM domains — **LB (Laboratory Test Results)** and **AE (Adverse Events)** — derived from a single EHR source table of LOINC-coded lab results. It showcases multi-step transformations (value parsing, unit conversion) and cross-domain derivation (lab abnormalities triggering an adverse event record).
44

5-
**SDTM: LB** contains an excerpt of a laboratory test results table derived from raw EHR lab data.
6-
**SDTM: AE** contains an adverse event record for a subject with elevated liver enzymes, derived from the LB domain.
7-
**LabResults** contains raw lab results from an EHR system including LOINC-coded lab tests, visit dates, and results in original units.
5+
## Scenario
86

9-
### Algorithm
7+
A study collects liver-enzyme lab panels (ALT, AST, ALP) for two subjects across two visits each. Raw EHR lab results arrive as composite strings with values and units combined (e.g., `"0.3507 µkat/L"`). These are parsed, converted to standard units, and mapped into the SDTM LB domain. When a subject's results cross the normal range threshold, a hepatic enzyme elevation adverse event is derived in the AE domain.
8+
9+
### Source Data
10+
11+
| Table | Columns | Records | Description |
12+
|-------|---------|---------|-------------|
13+
| `LabResults.csv` | `PATID`, `LOINC Code`, `Lab Test`, `Visit Date`, `Lab Result` | 12 | Raw LOINC-coded lab results from the EHR, with results as composite value+unit strings |
14+
15+
### Target Data
16+
17+
| Table | Columns | Records | Description |
18+
|-------|---------|---------|-------------|
19+
| `LB.csv` | `USUBJID`, `LBSEQ`, `LBTESTCD`, `LBTEST`, `LBDTC`, `LBORRES`, `LBORRESU`, `LBSTRES`, `LBSTRESU`, `LBSTNRLO`, `LBSTNRHI`, `LBNRIND` | 12 | SDTM LB domain — 2 subjects × 3 tests × 2 visits |
20+
| `AE.csv` | `USUBJID`, `AESEQ`, `AETERM`, `AEDECOD`, `AELLTCD`, `AESER`, `AEREL`, `AESTDTC` | 1 | SDTM AE domain — 1 hepatic enzyme elevation event for subject 002 |
21+
22+
### Lab Tests Covered
23+
24+
| LOINC Code | Test Name | SDTM `LBTESTCD` |
25+
|------------|-----------|------------------|
26+
| 1742-6 | Alanine Aminotransferase | ALT |
27+
| 1920-8 | Aspartate Aminotransferase | AST |
28+
| 1775-6 | Alkaline Phosphatase | ALP |
29+
30+
---
31+
32+
## Algorithms
33+
34+
### Laboratory Test Results (LB)
1035

11-
**Laboratory Test Results (LB)**
1236
Lab results from the EHR are mapped to the SDTM LB domain through the following steps:
13-
1. LOINC codes and lab test names from the source are mapped directly to `LBTESTCD`/`LBTEST`
14-
2. Visit dates are mapped directly to `LBDTC`
15-
3. Raw lab result strings (e.g. `0.3507 µkat/L`) are parsed to extract numeric values into `LBORRES`/`LBORRESU` (Lab Value Parsing)
16-
4. Original units (µkat/L) are converted to standard units (U/L) and stored in `LBSTRES`/`LBSTRESU` (Unit Conversion)
1737

18-
**Adverse Events (AE)**
38+
1. **Direct mapping**: LOINC codes and lab test names are mapped to `LBTESTCD`/`LBTEST`; visit dates are mapped to `LBDTC`
39+
2. **Lab value parsing**: Raw result strings (e.g., `"0.3507 µkat/L"`) are parsed to extract the numeric value into `LBORRES` and the unit into `LBORRESU`
40+
3. **Unit conversion**: Original units (µkat/L) are converted to standard units (U/L), with results stored in `LBSTRES`/`LBSTRESU`
41+
4. **Normal range evaluation**: Standard results are compared against reference ranges (`LBSTNRLO`, `LBSTNRHI`) to determine the normal-range indicator (`LBNRIND`)
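
Steps 2 to 4 can be illustrated with a short Python sketch. The function names and the regular expression are assumptions made for illustration, and the reference range is passed in by the caller because the README does not list the values. The conversion factor follows from the unit definitions: 1 µkat is 1 µmol/s, which is 60 µmol/min, or 60 U.

```python
import re
import unicodedata

U_PER_UKAT = 60.0                 # 1 µkat = 1 µmol/s = 60 µmol/min = 60 U
MICRO_KAT_PER_L = "\u03bckat/L"   # NFKC-normalized spelling of the unit

# A number, optional whitespace, then a unit that does not start with a digit or dot.
_RESULT = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*([^\d\s.].*?)\s*$")


def parse_lab_result(raw: str):
    """Split a composite string such as '0.3507 µkat/L' into (value, unit)."""
    match = _RESULT.match(raw)
    if match is None:
        raise ValueError(f"cannot parse lab result {raw!r}")
    return float(match.group(1)), match.group(2)


def to_standard_units(value: float, unit: str):
    """Convert µkat/L to U/L; any other unit is rejected in this sketch."""
    # NFKC maps the micro sign (U+00B5) to Greek mu (U+03BC), so both spellings match.
    if unicodedata.normalize("NFKC", unit) == MICRO_KAT_PER_L:
        return value * U_PER_UKAT, "U/L"
    raise ValueError(f"no conversion defined for unit {unit!r}")


def range_indicator(value: float, low: float, high: float) -> str:
    """Return 'LOW', 'HIGH', or 'NORMAL' relative to the reference range."""
    if value < low:
        return "LOW"
    if value > high:
        return "HIGH"
    return "NORMAL"
```

For the README's sample string, parsing yields 0.3507 and the unit µkat/L, and the conversion gives approximately 21.04 U/L.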
42+
43+
### Adverse Events (AE)
44+
1945
A hepatic enzyme elevation adverse event is derived from the LB domain:
20-
1. LB records with `LBNRIND = HIGH` for ALT, AST, or ALP are identified (Elevated Liver Enzyme)
21-
2. The dictionary-derived term `AEDECOD` is populated from the elevated lab test names
46+
47+
1. LB records where `LBNRIND = HIGH` for ALT, AST, or ALP are identified
48+
2. The dictionary-derived term `AEDECOD` is populated as "Hepatic enzyme increased"
2249
3. The adverse event start date `AESTDTC` is taken from the earliest elevated lab result date
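
The derivation above might look like the following in Python. This is a sketch under stated assumptions: `derive_hepatic_ae` is a hypothetical name, LB rows are represented as plain dicts, only a few AE columns are filled in, and `LBDTC` values are assumed to be ISO 8601 dates in a uniform format so that string comparison orders them chronologically.

```python
LIVER_TESTS = {"ALT", "AST", "ALP"}


def derive_hepatic_ae(lb_rows):
    """Build one AE record per subject with any HIGH liver-enzyme result.

    lb_rows: dicts with at least USUBJID, LBTESTCD, LBDTC, and LBNRIND.
    """
    earliest = {}  # USUBJID -> earliest LBDTC among elevated results
    for row in lb_rows:
        if row["LBTESTCD"] in LIVER_TESTS and row["LBNRIND"] == "HIGH":
            subject = row["USUBJID"]
            if subject not in earliest or row["LBDTC"] < earliest[subject]:
                earliest[subject] = row["LBDTC"]
    return [
        {
            "USUBJID": subject,
            "AESEQ": 1,
            "AEDECOD": "Hepatic enzyme increased",
            "AESTDTC": start,
        }
        for subject, start in sorted(earliest.items())
    ]
```

Because the elevated results are grouped by subject before a record is emitted, three elevated enzymes on the same visit produce one AE record rather than three.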
2350

51+
In this example, subject 002's second visit (2025-11-06) shows all three liver enzymes elevated above normal range, producing a single AE record.
52+
53+
---
54+
55+
## Lineage Overview
56+
57+
The `rwd-lineage.xml` file contains **99 `<MapID>` elements** tracing source EHR lab data to the SDTM LB and AE domains.
58+
59+
### Transformation Types
60+
61+
| Type | Count | Target Domain | Description |
62+
|------|-------|---------------|-------------|
63+
| `DirectMap` | 36 | LB | One-to-one mapping (LOINC code → `LBTEST`, visit date → `LBDTC`, patient ID → `USUBJID`, etc.) |
64+
| `LabValueParsing` | 24 | LB | Parsing composite result strings into numeric value and unit components |
65+
| `UnitConversion` | 24 | LB | Converting original units (µkat/L) to standard units (U/L) |
66+
| `DirectMap` | 3 | AE | Direct mappings for AE fields derived from LB data |
67+
| `ElevatedLiverEnzyme` | 12 | AE | Algorithmic derivation identifying elevated lab values to produce the adverse event |
68+
69+
### Lineage by Domain
70+
71+
**LB domain (84 lineage entries across 12 records)**
72+
Each of the 12 LB records receives 7 lineage entries (84 / 12 = 7), which matches the per-type counts in the table above: 3 `DirectMap` entries (36 / 12, for fields such as `LBTEST` and `LBDTC`), 2 `LabValueParsing` entries (24 / 12, populating `LBORRES` and `LBORRESU`), and 2 `UnitConversion` entries (24 / 12, populating `LBSTRES` and `LBSTRESU`). This pattern demonstrates the multi-step transformation pipeline where a single raw result string undergoes parsing and then conversion.
73+
74+
**AE domain (15 lineage entries for 1 record)**
75+
The single AE record for subject 002 traces to 12 `ElevatedLiverEnzyme` entries (one per source lab result contributing to the elevated-enzyme determination) plus 3 `DirectMap` entries for the derived fields (`AEDECOD`, `AESTDTC`). This demonstrates cross-domain derivation: the AE record's lineage points back through the LB transformation pipeline to the original EHR source.
76+
77+
---
2478

2579
## Contents
2680

2781
```
2882
example2/
2983
├── README.md # This file
30-
├── Example2.xlsx # Source workbook (SDTM AE, SDTM LB, Source LabResults, RWDLineage-Table)
84+
├── Example2.xlsx # Companion workbook (SDTM AE, SDTM LB, Source LabResults, RWDLineage-Table)
3185
└── data/
3286
├── sdtm/
33-
│ ├── AE.csv # SDTM AE domain (1 record: subject 002 hepatic enzyme elevation)
34-
│ └── LB.csv # SDTM LB domain (12 records: subjects 001/002 x 3 tests x 2 visits)
87+
│ ├── AE.csv # SDTM AE domain, 1 record (subject 002 hepatic enzyme elevation)
88+
│ └── LB.csv # SDTM LB domain, 12 records (2 subjects × 3 tests × 2 visits)
3589
├── source/
36-
│ └── LabResults.csv # LabResults table (PATID, LOINC Code, Lab Test, Visit Date, Lab Result)
90+
│ └── LabResults.csv # EHR lab results — 12 rows (PATID, LOINC Code, Lab Test, Visit Date, Lab Result)
3791
└── define/
38-
├── define.xml # Define-XML 2.1 describing the AE and LB domains with RWD lineage reference
39-
└── rwd-lineage.xml # RWD-Lineage XML with 99 MapID elements linking source EHR data to SDTM AE and LB
92+
├── define.xml # Define-XML 2.1 with rwdl namespace extension referencing rwd-lineage.xml
93+
└── rwd-lineage.xml # RWD-Lineage XML with 99 MapID elements linking source EHR data to SDTM AE and LB
4094
```
95+
96+
---
97+
98+
## Validation
99+
100+
From the repository root:
101+
102+
```bash
103+
# Validate the RWD-Lineage XML structure
104+
python3 tools/validate.py rwd-lineage examples/example2/data/define/rwd-lineage.xml
105+
106+
# Validate the Define-XML against CDISC XSD (requires lxml)
107+
python3 tools/validate.py define-xml examples/example2/data/define/define.xml
108+
109+
# Check that every SDTM cell has lineage coverage
110+
python3 tools/validate.py coverage examples/example2/data/sdtm examples/example2/data/define/rwd-lineage.xml
111+
```
112+
113+
---
114+
115+
## Key Concepts Demonstrated
116+
117+
- **Multi-step transformations**: A single lab result passes through parsing → unit conversion → range evaluation, with each step recorded as a separate lineage entry.
118+
- **Cross-domain derivation**: The AE record is derived from LB domain data, which is itself derived from source EHR data — the lineage captures both hops.
119+
- **High coverage density**: 99 lineage entries across 13 SDTM records demonstrate cell-level traceability at scale, covering every column including standard-range metadata (`LBSTNRLO`, `LBSTNRHI`, `LBNRIND`).
120+
- **Composite string parsing**: The `LabValueParsing` transformation type documents how a single source field (`"0.3507 µkat/L"`) fans out into multiple target fields (numeric value + unit).
