documents/RWD-Lineage_Data_Standard_Specification.md
23 additions & 12 deletions
Original file line number
Diff line number
Diff line change
@@ -8,7 +8,7 @@ RWD Lineage is an XML-formatted extension to Define-XML, implemented as a Namesp
8
8
9
9
## Document Structure
10
10
11
-
An RWD Lineage document has a single root element, `rwdl:lineage`, with exactly two kinds of top-level children:
11
+
An RWD Lineage document has a single root element, `rwdl:lineage`, with one or two kinds of top-level children:
12
12
13
13
- **`rwdl:sourceMetadata`** — the *source metadata layer*. A single OPTIONAL element describing the source systems the lineage draws from: their names, the data models or standards they conform to, and the controlled terminologies in which their coded values are encoded. This layer carries *assertions* about the sources.
14
14
- **`rwdl:lineageTrail`** — the *lineage trail*. A single element containing an array of `rwdl:MapID` elements, each a Source–Target pair recording that a value at one physical coordinate became a value at another. This layer carries *forensic facts* about data movement.
@@ -30,10 +30,12 @@ The root element `rwdl:lineage` is the document as a whole; `rwdl:lineageTrail`
30
30
</rwdl:lineage>
31
31
```
32
32
33
-
**The two layers are parallel and independent.** The lineage trail does not reference the source metadata layer, and the source metadata layer does not depend on the trail. Removing `rwdl:sourceMetadata` does not invalidate the lineage — the bytes still flowed from source coordinate to target coordinate. This separation is deliberate: it keeps the trail a record of *what physically happened* and confines *interpretive claims about what the data means* (for example, the controlled terminology a source column is encoded in) to the source metadata layer. A reviewer can always distinguish what the lineage observed from what it asserted about its sources.
33
+
**The two layers are parallel and independent.** By default, the specification defines no formal reference from the lineage trail to the source metadata layer, and the source metadata layer does not depend on the trail. Removing `rwdl:sourceMetadata` does not invalidate the lineage — the bytes still flowed from source coordinate to target coordinate. This separation is deliberate: it keeps the trail a record of *what physically happened* and confines *interpretive claims about what the data means* (for example, the controlled terminology a source column is encoded in) to the source metadata layer. A reviewer can always distinguish what the lineage observed from what it asserted about its sources.
34
34
35
35
The remainder of this specification describes the two layers in turn — first the **Lineage Trail** (the `rwdl:MapID` array and the Coordinate model that addresses values within sources), then the **Source Metadata** layer — followed by the **Controlled Terminology** that governs the enumerated attributes used by both, worked **Examples**, and the mechanism for attaching an RWD Lineage document to **Define-XML**.
36
36
37
+
**Note on Attribute Casing:** To maintain consistency with CDISC Define-XML conventions, attributes and elements adapted from Define-XML (such as `OID`, `Dictionary`, and `Version`) retain their original Define-XML capitalization (uppercase or PascalCase). Native RWDL-defined elements and attributes use camelCase or lowercase (such as `name`, `description`, and `appliesTo`).
38
+
37
39
## Lineage Trail
38
40
39
41
The lineage trail is carried in a single `rwdl:lineageTrail` element containing a collection (array) of `rwdl:MapID` elements. Each `rwdl:MapID` contains exactly one Source Coordinate and one Target Coordinate, establishing a direct link between a raw real-world data value and the standardized clinical data value derived from it. The trail is forensic: it records where a value came from, where it went, and — by reference — the transformation applied, without making semantic claims about what the value means.
@@ -59,7 +61,7 @@ A Coordinate Object locates a single value within a source or target system. Bot
59
61
60
62
The following table defines the attributes and child elements available within a `<rwdl:Coordinate>` element. Usage depends on the storage and structure types selected.
61
63
62
-
| Order |Name| XML Node Type | XML Data Type | Usage | Description |
64
+
| Order |Attribute / Element| XML Node Type | XML Data Type | Usage | Description |
63
65
|---|---|---|---|---|---|
64
66
| 1 |`storage`| XML Attribute | string (Enum) | Required | The container type. Values from the **RWDL Storage Type** codelist (see Controlled Terminology): `DATABASE`, `FILESYSTEM`, `API`, `MESSAGE`. |
65
67
| 2 |`structure`| XML Attribute | string (Enum) | Required | The addressing mechanism for locating a value within the source. Values from the **RWDL Structure Type** codelist (see Controlled Terminology): `TABULAR`, `PATH`, `OBJECT`. |
@@ -70,7 +72,7 @@ The following table defines the attributes and child elements available within a
70
72
| 7 |`rwdl:Table`| Child Element | string | Conditional | The table name (Required for `storage="DATABASE"`). |
71
73
| 8 |`rwdl:RowIndex`| Child Element | integer | Conditional | The row number (One of `RowIndex` or `RowKey` required for `structure="TABULAR"`). |
72
74
| 9 |`rwdl:RowKey`| Child Element | string | Conditional | The Primary Key field name (One of `RowIndex` or `RowKey` required for `structure="TABULAR"`). |
73
-
| 10 |`rwdl:RowKeyValue`| Child Element | string/integer| Conditional | The Primary Key value (Required if `RowKey` is used). |
75
+
| 10 |`rwdl:RowKeyValue`| Child Element | string | Conditional | The Primary Key value (Required if `RowKey` is used; can be a string or numeric identifier). |
74
76
| 11 |`rwdl:ColumnName`| Child Element | string | Conditional | The header/variable name (Optional for `structure="TABULAR"` — omitted for key-value-shaped data with row identifiers but no distinct column dimension). |
75
77
| 12 |`rwdl:Path`| Child Element | string | Conditional | The navigation string used to address a value (e.g., XPath, JSONPath, FHIRPath, Cypher, SPARQL) (Required for `structure="PATH"`). The syntax is declared on the `rwdl:Path` element via the `syntax` attribute. |
76
78
@@ -138,7 +140,7 @@ The `structure` attribute classifies how a value within a source is addressed, n
138
140
139
141
## Source Metadata
140
142
141
-
The source metadata layer is the second of the two top-level layers introduced in Document Structure. It is carried in a single OPTIONAL `rwdl:sourceMetadata` element and is populated once per source system rather than per data point. It holds *assertions* about the sources — their data models and the controlled terminologies their values are encoded in — kept separate from the forensic lineage trail.
143
+
The source metadata layer is one of the two top-level layers introduced in Document Structure. It is carried in a single OPTIONAL `rwdl:sourceMetadata` element and is populated once per source system rather than per data point. It holds *assertions* about the sources — their data models and the controlled terminologies their values are encoded in — kept separate from the forensic lineage trail.
142
144
143
145
Source data characterization is authoritatively documented in the sponsor's Study Data Reviewer's Guide (SDRG) and, for RWE submissions, in the RWD Reliability Assessment. The `rwdl:sourceMetadata` element provides a structured, machine-readable pointer to the same information for reviewers and tooling working directly within the RWD Lineage file, but is not intended to replace the narrative documents that authoritatively characterize source data.
144
146
@@ -174,7 +176,7 @@ The `rwdl:externalCodeList` element declares the controlled terminology (e.g., I
174
176
175
177
**Why source terminology is an assertion, not an observable fact.** As Document Structure notes, interpretive claims about what the data means are kept out of the lineage trail. Source terminology is exactly such a claim. A row identifier or column name is an observable property of the source; the claim that a given column is encoded in "ICD-10-CM 2024" is different in kind, because in many EHR and claims sources the encoding vocabulary is not explicit in the data and the claim is an inference made by a person or process applying judgment to sample data. Recording it on the Coordinate or MapID would make interpretive content indistinguishable from forensic fact to a downstream reviewer. Keeping it in the source metadata layer, declared on the source, keeps that boundary clean.
176
178
177
-
This source-side layer gives the controlled-terminology documentation called for in FDA's 2024 EHR/medical claims guidance §VI.A (accuracy of mappings across coding systems, semantics of local codes to a target terminology, and coding-practice/version changes across the study period) a structured, machine-readable home. It complements, and does not replace, the narrative characterization in the SDRG / Data Characterization Report and the coding-system declarations in the Protocol.
179
+
This source-side layer gives the controlled-terminology documentation called for in FDA's 2024 guidance, *Real-World Data: Assessing Electronic Health Records and Medical Claims Data to Support Regulatory Decision-Making for Drug and Biological Products*, Section VI.A (accuracy of mappings across coding systems, semantics of local codes to a target terminology, and coding-practice/version changes across the study period) a structured, machine-readable home. It complements, and does not replace, the narrative characterization in the SDRG / Data Characterization Report and the coding-system declarations in the Protocol.
178
180
179
181
##### Attributes
180
182
@@ -183,7 +185,7 @@ This source-side layer gives the controlled-terminology documentation called for
183
185
|`Dictionary`| string | Required | The name of the external controlled terminology (e.g., `ICD-10-CM`, `LOINC`, `RxNorm`, `NDC`, `SNOMED CT`). Mirrors Define-XML `ExternalCodeList/@Dictionary`. Free-text and not governed by a CDISC Controlled Terminology codelist; published terminology lists such as the NCI Metathesaurus may be consulted as a reference for dictionary names, but values are not constrained to a CDISC-controlled set. |
184
186
|`Version`| string | Conditional | The version or release of the dictionary (e.g., `2024`, `2024-09-03`). Required where the dictionary is versioned; the literal `continuous` MAY be used for dictionaries that are continuously updated without discrete versions (e.g., NDC). Mirrors Define-XML `ExternalCodeList/@Version`. |
185
187
|`href`| string (URI) | Optional | A resolvable reference to the dictionary or its publisher. Mirrors Define-XML `ExternalCodeList/@href`. |
186
-
|`appliesTo`| string | Optional | Identifies the element, field, or column within the source the declaration applies to. The expression follows the source's own conventions (e.g., FHIRPath for FHIR sources such as `Condition.code`; dot notation for CDM tables such as `DIAGNOSIS.DX`), or uses the Coordinate addressing the specification already defines for finer-grained scoping. When omitted, the declaration applies to the source as a whole. |
188
+
|`appliesTo`| string | Optional | Identifies the element, field, or column within the source the declaration applies to. The expression follows the source's own conventions (e.g., FHIRPath for FHIR sources such as `Condition.code`; dot notation for CDM tables such as `DIAGNOSIS.DX`). When omitted, the declaration applies to the source as a whole. |
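
As an illustration of the attributes above, the following fragment is a minimal sketch of two `rwdl:externalCodeList` declarations. It assumes the `rwdl` prefix is bound on an ancestor element, and it omits the enclosing source element and any child elements. The `href` value is a placeholder, not a real dictionary location.

```xml
<!-- Sketch only: enclosing elements and the rwdl namespace declaration are omitted. -->

<!-- Versioned dictionary, scoped to a single CDM column through appliesTo. -->
<rwdl:externalCodeList Dictionary="ICD-10-CM"
                       Version="2024"
                       href="https://example.org/icd-10-cm"
                       appliesTo="DIAGNOSIS.DX"/>

<!-- Continuously updated dictionary. appliesTo is omitted, so the
     declaration applies to the source as a whole. -->
<rwdl:externalCodeList Dictionary="NDC"
                       Version="continuous"/>
```

The PascalCase attributes (`Dictionary`, `Version`) mirror Define-XML, while `appliesTo` is the RWDL-native addition that narrows the declaration's scope.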
187
189
188
190
##### Child Elements
189
191
@@ -399,7 +401,7 @@ Governs the `syntax` attribute on the `rwdl:Path` element. Required when `struct
399
401
|`DICOMTAG`| DICOM Tag Reference | DICOM data element tag in `(group,element)` notation used to locate metadata within DICOM files (e.g., `(0010,0010)` for Patient Name). |
400
402
|`REGEX`| Regular Expression | Regular expression with capture group locating the target value within a text source. |
401
403
402
-
**Note:** Several values appear in both this codelist and the Data Format codelist (`HL7V2`, `DICOM`/`DICOMTAG`). They are governing different attributes and are not redundant: the Data Format value declares what kind of bytes the source contains; the Path Syntax value declares what addressing language locates a value within those bytes. They commonly co-occur for the same data point.
404
+
**Note:** Several values appear in both this codelist and the Data Format codelist (`HL7V2`, `DICOM`/`DICOMTAG`). They govern different attributes and are not redundant: the Data Format value declares what kind of bytes the source contains; the Path Syntax value declares what addressing language locates a value within those bytes. They commonly co-occur for the same data point.
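
The Path Syntax side of this distinction can be sketched with two values defined in the codelist above. Only the `rwdl:Path` element is shown: the enclosing `rwdl:Coordinate` and its Data Format declaration are omitted, the `rwdl` prefix is assumed to be bound on an ancestor, and the regular expression is a made-up illustration rather than a pattern taken from the specification.

```xml
<!-- DICOMTAG: (group,element) notation addressing the Patient Name data element. -->
<rwdl:Path syntax="DICOMTAG">(0010,0010)</rwdl:Path>

<!-- REGEX: the capture group isolates the target value within a text source
     (hypothetical pattern). -->
<rwdl:Path syntax="REGEX">HbA1c:\s*([0-9.]+)</rwdl:Path>
```

In both cases the `syntax` attribute says how to read the path string. What kind of bytes the path is evaluated against is declared separately by the Data Format value.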
403
405
404
406
405
407
## Examples
@@ -555,7 +557,7 @@ This example illustrates how `rwdl:sourceMetadata` is declared once at the docum
@@ -583,7 +585,7 @@ This example illustrates how `rwdl:sourceMetadata` is declared once at the docum
583
585
584
586
### Example 5 — XML data in filesystem
585
587
586
-
This example sources from an HL7 CDA document and demonstrates `rwdl:sourceMetadata` declaring a CDA-conformant source alongside a date-format transformation.
588
+
This example uses an HL7 CDA document as a source and demonstrates `rwdl:sourceMetadata` declaring a CDA-conformant source alongside a date-format transformation.
@@ -641,6 +643,8 @@ RWD Lineage may be supplied in either of two mutually exclusive ways:
641
643
642
644
For the referenced case, the pointer reuses the standard Define-XML external-document mechanism: a `<def:leaf>` declares the physical file (carrying the filename in `xlink:href`), and `<rwdl:lineageRef>` references it by `leafID` under the standard metadata block.
643
645
646
+
**Migration Note:** This structure replaces the `<rwdl:ref>` element used in the initial draft with `<rwdl:lineageRef>` referencing a standard `<def:leaf>` element. This aligns with Define-XML standards for external documents.
647
+
644
648
```xml
645
649
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
646
650
xmlns:def="http://www.cdisc.org/ns/def/v2.1"
@@ -684,10 +688,17 @@ For the referenced case, the pointer reuses the standard Define-XML external-doc
684
688
| ExternalCodeList | A Define-XML element declaring an external controlled terminology dictionary. RWD Lineage adapts it as `rwdl:externalCodeList` to declare the source-side vocabulary in which coded values are encoded. |
685
689
| FHIR | Fast Healthcare Interoperability Resources |
686
690
| JSONPath | A query language for selecting nodes in a JSON document |
691
+
| MethodDef | A Define-XML element defining a computation or derivation. RWD Lineage references it via `MethodDefOID` to describe the transformation applied from source to target value. |
687
692
| RWD | Real-World Data |
688
693
| RWE | Real-World Evidence |
689
-
| MethodDef | A Define-XML element defining a computation or derivation. RWD Lineage references it via `MethodDefOID` to describe the transformation applied from source to target value. |
694
+
| rwdl:Coordinate| The `rwdl:Coordinate` element: locates a single data point within a storage container and structural format. |
695
+
| rwdl:externalCodeList| The `rwdl:externalCodeList` element: declares the external controlled terminology dictionary (e.g., ICD-10-CM) used for a source element. |
696
+
| rwdl:lineage| The root element of an RWD Lineage metadata document. |
690
697
| rwdl:lineageTrail| The `rwdl:lineageTrail` element: one of the two top-level layers of an RWD Lineage document, containing the array of `rwdl:MapID` Source–Target pairs that form the forensic trail. |
698
+
| rwdl:MapID| The `rwdl:MapID` element: a Source–Target pair linking a raw real-world data point coordinate to a standardized clinical target coordinate. |
699
+
| rwdl:source| The `rwdl:source` element: describes a single real-world source system within `rwdl:sourceMetadata`. |
700
+
| rwdl:sourceMetadata| The `rwdl:sourceMetadata` element: one of the two top-level layers of an RWD Lineage document, containing machine-readable assertions about the source data systems. |
701
+
| rwdl:standard| The `rwdl:standard` element: declares the data model or standard (e.g., FHIR, OMOP) to which a source conforms. |