Commit 553714b

Deep committed
feat: Implement RWD-Lineage and Define-XML validation tools with new XSD schemas, update example XMLs, and delete a placeholder file.
1 parent 465de36 commit 553714b

21 files changed

Lines changed: 5853 additions & 140 deletions

README.md

Lines changed: 66 additions & 0 deletions
@@ -6,6 +6,72 @@
 
 The objective of this project is to create a machine-readable CDISC data exchange standard for lineage metadata that is supplied along with RWD-derived SDTM, which provides the data reliability required by FDA and other regulators to use RWE as primary evidence.
 
+## Validation
+
+The [`tools/validate.py`](tools/validate.py) script provides two validators and a coverage check, all of which can be run from the repo root.
+
+### Requirements
+
+```bash
+pip install lxml  # required for Define-XML XSD validation only
+```
+
+> `lxml` is optional: the validator will run and report structural errors without it, but full XSD validation of `define.xml` files will be skipped with a warning.
+
+### Validate an `rwd-lineage.xml` file
+
+Checks the file against the rules in the [RWD-Lineage Data Standard Specification](RWD-Lineage_Data_Standard_Specification.md): required attributes, valid `storage`/`structure` enum values, UUID uniqueness, and required child elements per coordinate type.
+
+```bash
+python3 tools/validate.py rwd-lineage <path/to/rwd-lineage.xml>
+
+# Examples
+python3 tools/validate.py rwd-lineage examples/example1/data/define/rwd-lineage.xml
+python3 tools/validate.py rwd-lineage examples/example2/data/define/rwd-lineage.xml
+```
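As an illustration of the UUID uniqueness rule mentioned above, here is a minimal Python sketch. It is not the code in `tools/validate.py`: the `uuid` attribute name, the `Lineage` root element, and the helper name `find_duplicate_ids` are assumptions made for this example, and only `<Coordinate>` appears in the README itself.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_duplicate_ids(xml_text, attr="uuid"):
    """Return the values of `attr` that appear on more than one element.

    The attribute name is a placeholder; the real specification may
    use a different identifier attribute.
    """
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get(attr) for el in root.iter() if el.get(attr) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical sample document, not taken from the repository examples.
SAMPLE = """<Lineage>
  <Coordinate uuid="a1"/>
  <Coordinate uuid="b2"/>
  <Coordinate uuid="a1"/>
</Lineage>"""

print(find_duplicate_ids(SAMPLE))  # prints ['a1']
```

A check like this would typically report each duplicate as a validation error instead of printing it.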
+
+### Validate a `define.xml` file
+
+Checks the `rwdl` namespace extension block (required for RWD Lineage) and validates against the CDISC Define-XML 2.1 XSD schema.
+
+```bash
+python3 tools/validate.py define-xml <path/to/define.xml> [path/to/define2-1-0.xsd]
+
+# Examples
+python3 tools/validate.py define-xml examples/example1/data/define/define.xml
+python3 tools/validate.py define-xml examples/example2/data/define/define.xml
+```
+
+If no XSD path is provided, the script attempts to download and cache the schema automatically in `tools/schema/`.
+
+> [!NOTE]
+> The validator uses `tools/schema/define2-1-0.xsd` as the entry point, which depends on base ODM schemas in `tools/cdisc-odm-1.3.2/`. Both directories are required for successful validation.
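The optional-`lxml` behaviour described above could look roughly like the following sketch. It is an assumption about the general shape, not the project's implementation: the function name `xsd_errors` and the warning text are invented for illustration, while `etree.XMLSchema`, `validate`, and `error_log` are standard `lxml` calls.

```python
import warnings

try:
    from lxml import etree
except ImportError:  # lxml is optional, so fall back gracefully
    etree = None


def xsd_errors(xml_path, xsd_path):
    """Validate xml_path against xsd_path.

    Returns a list of error strings (empty when the document is valid),
    or None when lxml is not installed and validation was skipped.
    """
    if etree is None:
        warnings.warn("lxml not installed; skipping XSD validation")
        return None
    # Relative xs:import / xs:include paths resolve against the XSD's
    # own location, which is why the base ODM schemas must be present.
    schema = etree.XMLSchema(etree.parse(xsd_path))
    doc = etree.parse(xml_path)
    if schema.validate(doc):
        return []
    return [f"line {e.line}: {e.message}" for e in schema.error_log]
```

Returning `None` for "skipped" keeps that case distinct from an empty error list, so a caller can still warn that full validation did not run.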
+
+### Check lineage coverage against SDTM files
+
+Verifies that every cell in the SDTM CSV files (each row × column combination) has a corresponding Target `<Coordinate>` entry in the lineage XML. Reports:
+
+- **Missing coverage**: cells present in the SDTM data with no lineage entry (affects validity)
+- **Phantom entries** *(warning only)*: lineage entries pointing to rows/columns that don't exist in the data
+- A summary line showing the fraction of cells covered
+
+```bash
+python3 tools/validate.py coverage <path/to/sdtm/dir> <path/to/rwd-lineage.xml>
+
+# Example
+python3 tools/validate.py coverage examples/example2/data/sdtm examples/example2/data/define/rwd-lineage.xml
+```
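The coverage check described above amounts to comparing two sets of cells. The sketch below shows that idea under stated assumptions: cells are keyed as `(dataset, 1-based row number, column name)`, and the lineage targets are already parsed into the same key shape. How the real lineage XML addresses rows and columns is not shown in this README, so the key format and the helper names are hypothetical.

```python
import csv
import io


def data_cells(dataset, csv_text):
    """Return every (dataset, row_number, column) cell in a CSV table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        (dataset, row_no, column)
        for row_no, row in enumerate(reader, start=1)
        for column in row
    }


def coverage(cells, targets):
    """Compare data cells with lineage targets."""
    missing = cells - targets   # data cells with no lineage entry
    phantom = targets - cells   # lineage entries with no matching cell
    covered = len(cells & targets)
    return missing, phantom, covered, len(cells)


# Hypothetical two-row AE table and lineage targets.
AE_CSV = "USUBJID,AETERM\nS1,Headache\nS2,Nausea\n"
cells = data_cells("AE", AE_CSV)
targets = {
    ("AE", 1, "USUBJID"),
    ("AE", 1, "AETERM"),
    ("AE", 2, "USUBJID"),
    ("AE", 3, "AETERM"),  # points at a row that does not exist
}
missing, phantom, covered, total = coverage(cells, targets)
print(f"{covered}/{total} cells covered")  # prints 3/4 cells covered
```

In this example `missing` holds the uncovered `AETERM` cell in row 2 and `phantom` holds the entry for the nonexistent row 3, which matches the two report categories listed above.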
+
+### Run the tests
+
+```bash
+python3 -m unittest tools.tests.test_validate -v
+```
+
+Exit codes for the `validate.py` commands: `0` = valid / fully covered, `2` = invalid / missing coverage, `1` = usage error.
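For use in CI or a wrapper script, the exit codes listed above can be interpreted as in this minimal sketch. The helper names `describe_exit` and `run_validator` are assumptions for illustration; only the script path and the three exit codes come from the README.

```python
import subprocess
import sys

# Meanings taken from the exit-code list in the README.
EXIT_MEANING = {
    0: "valid / fully covered",
    1: "usage error",
    2: "invalid / missing coverage",
}


def describe_exit(code):
    """Translate a validate.py exit code into a readable label."""
    return EXIT_MEANING.get(code, f"unexpected exit code {code}")


def run_validator(args):
    """Run tools/validate.py from the repo root and return its exit code."""
    result = subprocess.run([sys.executable, "tools/validate.py", *args])
    return result.returncode
```

A caller might run `describe_exit(run_validator(["rwd-lineage", path]))` and fail the build on any non-zero code.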
+
+---
+
 ## Contribution
 
 Contribution is very welcome. When you contribute to this repository you are doing so under the licenses below. Please check out [Contribution](CONTRIBUTING.md) for additional information. All contributions must adhere to the following [Code of Conduct](CODE_OF_CONDUCT.md).
File renamed without changes.

documents/placeholder.md

Lines changed: 0 additions & 5 deletions
This file was deleted.

examples/example1/data/define/define.xml

Lines changed: 12 additions & 9 deletions
@@ -6,11 +6,14 @@
 <StudyDescription>Example 1: CE domain with EHR source lineage</StudyDescription>
 <ProtocolName>RWDL-EX1</ProtocolName>
 </GlobalVariables>
-<MetaDataVersion OID="MDV.Example1" Name="Example 1 Define-XML" def:DefineVersion="2.1.0" def:StandardName="SDTM" def:StandardVersion="1.9">
+<MetaDataVersion OID="MDV.Example1" Name="Example 1 Define-XML" def:DefineVersion="2.1.0">
+<def:Standards>
+<def:Standard OID="STD.1" Name="SDTM" Version="1.9" Type="Tabulation" Status="Final"/>
+</def:Standards>
 <rwdl:lineage>
 <rwdl:ref leafID="LF.RWDLINEAGE">rwd-lineage.xml</rwdl:ref>
 </rwdl:lineage>
-<ItemGroupDef OID="IG.CE" Name="CE" Repeating="Yes" IsReferenceData="No" SASDatasetName="CE" def:Structure="One record per subject per clinical event" def:Purpose="Tabulation" def:StandardOID="STD.1" def:ArchiveLocationID="LF.CE">
+<ItemGroupDef OID="IG.CE" Name="CE" Repeating="Yes" IsReferenceData="No" SASDatasetName="CE" def:Structure="One record per subject per clinical event" def:StandardOID="STD.1" def:ArchiveLocationID="LF.CE">
 <Description>
 <TranslatedText>Clinical Events</TranslatedText>
 </Description>
@@ -22,37 +25,37 @@
 <ItemRef ItemOID="IT.CE.CEPRESP" Mandatory="No"/>
 <ItemRef ItemOID="IT.CE.CEOCCUR" Mandatory="No"/>
 </ItemGroupDef>
-<ItemDef OID="IT.CE.STUDYID" Name="STUDYID" DataType="text" Length="8" def:Label="Study Identifier">
+<ItemDef OID="IT.CE.STUDYID" Name="STUDYID" DataType="text" Length="8">
 <Description>
 <TranslatedText>Study Identifier</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.CE.DOMAIN" Name="DOMAIN" DataType="text" Length="2" def:Label="Domain Abbreviation">
+<ItemDef OID="IT.CE.DOMAIN" Name="DOMAIN" DataType="text" Length="2">
 <Description>
 <TranslatedText>Domain Abbreviation</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.CE.USUBJID" Name="USUBJID" DataType="text" Length="50" def:Label="Unique Subject Identifier">
+<ItemDef OID="IT.CE.USUBJID" Name="USUBJID" DataType="text" Length="50">
 <Description>
 <TranslatedText>Unique Subject Identifier</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.CE.CESEQ" Name="CESEQ" DataType="integer" Length="8" def:Label="Sequence Number">
+<ItemDef OID="IT.CE.CESEQ" Name="CESEQ" DataType="integer" Length="8">
 <Description>
 <TranslatedText>Sequence Number</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.CE.CETERM" Name="CETERM" DataType="text" Length="200" def:Label="Reported Term for the CE">
+<ItemDef OID="IT.CE.CETERM" Name="CETERM" DataType="text" Length="200">
 <Description>
 <TranslatedText>Reported Term for the CE</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.CE.CEPRESP" Name="CEPRESP" DataType="text" Length="1" def:Label="CE Pre-Specified">
+<ItemDef OID="IT.CE.CEPRESP" Name="CEPRESP" DataType="text" Length="1">
 <Description>
 <TranslatedText>CE Pre-Specified</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.CE.CEOCCUR" Name="CEOCCUR" DataType="text" Length="1" def:Label="CE Occurrence">
+<ItemDef OID="IT.CE.CEOCCUR" Name="CEOCCUR" DataType="text" Length="1">
 <Description>
 <TranslatedText>CE Occurrence</TranslatedText>
 </Description>

examples/example2/data/define/define.xml

Lines changed: 30 additions & 27 deletions
@@ -6,11 +6,14 @@
 <StudyDescription>Example 2: AE and LB domains with EHR lab source lineage</StudyDescription>
 <ProtocolName>RWDL-EX2</ProtocolName>
 </GlobalVariables>
-<MetaDataVersion OID="MDV.Example2" Name="Example 2 Define-XML" def:DefineVersion="2.1.0" def:StandardName="SDTM" def:StandardVersion="1.9">
+<MetaDataVersion OID="MDV.Example2" Name="Example 2 Define-XML" def:DefineVersion="2.1.0">
+<def:Standards>
+<def:Standard OID="STD.1" Name="SDTM" Version="1.9" Type="Tabulation" Status="Final"/>
+</def:Standards>
 <rwdl:lineage>
 <rwdl:ref leafID="LF.RWDLINEAGE">rwd-lineage.xml</rwdl:ref>
 </rwdl:lineage>
-<ItemGroupDef OID="IG.AE" Name="AE" Repeating="Yes" IsReferenceData="No" SASDatasetName="AE" def:Structure="One record per subject per adverse event" def:Purpose="Tabulation" def:StandardOID="STD.1" def:ArchiveLocationID="LF.AE">
+<ItemGroupDef OID="IG.AE" Name="AE" Repeating="Yes" IsReferenceData="No" SASDatasetName="AE" def:Structure="One record per subject per adverse event" def:StandardOID="STD.1" def:ArchiveLocationID="LF.AE">
 <Description>
 <TranslatedText>Adverse Events</TranslatedText>
 </Description>
@@ -25,7 +28,7 @@
 <ItemRef ItemOID="IT.AE.AEREL" Mandatory="No"/>
 <ItemRef ItemOID="IT.AE.AESTDTC" Mandatory="No"/>
 </ItemGroupDef>
-<ItemGroupDef OID="IG.LB" Name="LB" Repeating="Yes" IsReferenceData="No" SASDatasetName="LB" def:Structure="One record per subject per lab test per visit" def:Purpose="Tabulation" def:StandardOID="STD.1" def:ArchiveLocationID="LF.LB">
+<ItemGroupDef OID="IG.LB" Name="LB" Repeating="Yes" IsReferenceData="No" SASDatasetName="LB" def:Structure="One record per subject per lab test per visit" def:StandardOID="STD.1" def:ArchiveLocationID="LF.LB">
 <Description>
 <TranslatedText>Laboratory Test Results</TranslatedText>
 </Description>
@@ -44,122 +47,122 @@
 <ItemRef ItemOID="IT.LB.LBSTNRHI" Mandatory="No"/>
 <ItemRef ItemOID="IT.LB.LBNRIND" Mandatory="No"/>
 </ItemGroupDef>
-<ItemDef OID="IT.AE.STUDYID" Name="STUDYID" DataType="text" Length="8" def:Label="Study Identifier">
+<ItemDef OID="IT.AE.STUDYID" Name="STUDYID" DataType="text" Length="8">
 <Description>
 <TranslatedText>Study Identifier</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.AE.DOMAIN" Name="DOMAIN" DataType="text" Length="2" def:Label="Domain Abbreviation">
+<ItemDef OID="IT.AE.DOMAIN" Name="DOMAIN" DataType="text" Length="2">
 <Description>
 <TranslatedText>Domain Abbreviation</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.AE.USUBJID" Name="USUBJID" DataType="text" Length="50" def:Label="Unique Subject Identifier">
+<ItemDef OID="IT.AE.USUBJID" Name="USUBJID" DataType="text" Length="50">
 <Description>
 <TranslatedText>Unique Subject Identifier</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.AE.AESEQ" Name="AESEQ" DataType="integer" Length="8" def:Label="Sequence Number">
+<ItemDef OID="IT.AE.AESEQ" Name="AESEQ" DataType="integer" Length="8">
 <Description>
 <TranslatedText>Sequence Number</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.AE.AETERM" Name="AETERM" DataType="text" Length="200" def:Label="Reported Term for AE">
+<ItemDef OID="IT.AE.AETERM" Name="AETERM" DataType="text" Length="200">
 <Description>
 <TranslatedText>Reported Term for AE</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.AE.AEDECOD" Name="AEDECOD" DataType="text" Length="200" def:Label="Dictionary-Derived Term">
+<ItemDef OID="IT.AE.AEDECOD" Name="AEDECOD" DataType="text" Length="200">
 <Description>
 <TranslatedText>Dictionary-Derived Term</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.AE.AELLTCD" Name="AELLTCD" DataType="integer" Length="8" def:Label="Lowest Level Term Code">
+<ItemDef OID="IT.AE.AELLTCD" Name="AELLTCD" DataType="integer" Length="8">
 <Description>
 <TranslatedText>Lowest Level Term Code</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.AE.AESER" Name="AESER" DataType="text" Length="1" def:Label="Serious Event">
+<ItemDef OID="IT.AE.AESER" Name="AESER" DataType="text" Length="1">
 <Description>
 <TranslatedText>Serious Event</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.AE.AEREL" Name="AEREL" DataType="text" Length="16" def:Label="Causality">
+<ItemDef OID="IT.AE.AEREL" Name="AEREL" DataType="text" Length="16">
 <Description>
 <TranslatedText>Causality</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.AE.AESTDTC" Name="AESTDTC" DataType="text" Length="20" def:Label="Start Date/Time">
+<ItemDef OID="IT.AE.AESTDTC" Name="AESTDTC" DataType="text" Length="20">
 <Description>
 <TranslatedText>Start Date/Time</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.LB.STUDYID" Name="STUDYID" DataType="text" Length="8" def:Label="Study Identifier">
+<ItemDef OID="IT.LB.STUDYID" Name="STUDYID" DataType="text" Length="8">
 <Description>
 <TranslatedText>Study Identifier</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.LB.DOMAIN" Name="DOMAIN" DataType="text" Length="2" def:Label="Domain Abbreviation">
+<ItemDef OID="IT.LB.DOMAIN" Name="DOMAIN" DataType="text" Length="2">
 <Description>
 <TranslatedText>Domain Abbreviation</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.LB.USUBJID" Name="USUBJID" DataType="text" Length="50" def:Label="Unique Subject Identifier">
+<ItemDef OID="IT.LB.USUBJID" Name="USUBJID" DataType="text" Length="50">
 <Description>
 <TranslatedText>Unique Subject Identifier</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.LB.LBSEQ" Name="LBSEQ" DataType="integer" Length="8" def:Label="Sequence Number">
+<ItemDef OID="IT.LB.LBSEQ" Name="LBSEQ" DataType="integer" Length="8">
 <Description>
 <TranslatedText>Sequence Number</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.LB.LBTESTCD" Name="LBTESTCD" DataType="text" Length="8" def:Label="Lab Test Short Name">
+<ItemDef OID="IT.LB.LBTESTCD" Name="LBTESTCD" DataType="text" Length="8">
 <Description>
 <TranslatedText>Lab Test Short Name</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.LB.LBTEST" Name="LBTEST" DataType="text" Length="40" def:Label="Lab Test Name">
+<ItemDef OID="IT.LB.LBTEST" Name="LBTEST" DataType="text" Length="40">
 <Description>
 <TranslatedText>Lab Test Name</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.LB.LBDTC" Name="LBDTC" DataType="text" Length="20" def:Label="Date/Time of Specimen Collection">
+<ItemDef OID="IT.LB.LBDTC" Name="LBDTC" DataType="text" Length="20">
 <Description>
 <TranslatedText>Date/Time of Specimen Collection</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.LB.LBORRES" Name="LBORRES" DataType="text" Length="200" def:Label="Result or Finding in Original Units">
+<ItemDef OID="IT.LB.LBORRES" Name="LBORRES" DataType="text" Length="200">
 <Description>
 <TranslatedText>Result or Finding in Original Units</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.LB.LBORRESU" Name="LBORRESU" DataType="text" Length="40" def:Label="Original Units">
+<ItemDef OID="IT.LB.LBORRESU" Name="LBORRESU" DataType="text" Length="40">
 <Description>
 <TranslatedText>Original Units</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.LB.LBSTRES" Name="LBSTRES" DataType="float" Length="8" def:Label="Numeric Result/Finding in Standard Units">
+<ItemDef OID="IT.LB.LBSTRES" Name="LBSTRES" DataType="float" Length="8">
 <Description>
 <TranslatedText>Numeric Result/Finding in Standard Units</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.LB.LBSTRESU" Name="LBSTRESU" DataType="text" Length="40" def:Label="Standard Units">
+<ItemDef OID="IT.LB.LBSTRESU" Name="LBSTRESU" DataType="text" Length="40">
 <Description>
 <TranslatedText>Standard Units</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.LB.LBSTNRLO" Name="LBSTNRLO" DataType="float" Length="8" def:Label="Reference Range Lower Limit">
+<ItemDef OID="IT.LB.LBSTNRLO" Name="LBSTNRLO" DataType="float" Length="8">
 <Description>
 <TranslatedText>Reference Range Lower Limit</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.LB.LBSTNRHI" Name="LBSTNRHI" DataType="float" Length="8" def:Label="Reference Range Upper Limit">
+<ItemDef OID="IT.LB.LBSTNRHI" Name="LBSTNRHI" DataType="float" Length="8">
 <Description>
 <TranslatedText>Reference Range Upper Limit</TranslatedText>
 </Description>
 </ItemDef>
-<ItemDef OID="IT.LB.LBNRIND" Name="LBNRIND" DataType="text" Length="10" def:Label="Reference Range Indicator">
+<ItemDef OID="IT.LB.LBNRIND" Name="LBNRIND" DataType="text" Length="10">
 <Description>
 <TranslatedText>Reference Range Indicator</TranslatedText>
 </Description>
