
Commit 128a72f

Update RWD-Lineage_Data_Standard_Specification.md
Updated with working version of CT updates
1 parent 747a3d6 commit 128a72f

1 file changed

Lines changed: 121 additions & 36 deletions

File tree

documents/RWD-Lineage_Data_Standard_Specification.md

@@ -31,42 +31,45 @@ The following table defines the attributes available within a Coordinate Object.
3131

3232
| Order | Attribute | XML Data Type | Usage | Description |
3333
|-------|-----------|---------------|-------|-------------|
34-
| 1 | `storage` | string (Enum) | Required | The container type. Allowed values: `Database`, `Filesystem`, `API`, `Messages`. |
35-
| 2 | `structure` | string (Enum) | Required | The internal organization. Allowed values: `Tabular`, `Tree`, `Files`. |
34+
| 1 | `storage` | string (Enum) | Required | The container type. Values from the **RWDL Storage Type** codelist (see Controlled Terminology): `DATABASE`, `FILESYSTEM`, `API`, `MESSAGE`. |
35+
| 2 | `structure` | string (Enum) | Required | The addressing mechanism for locating a value within the source. Values from the **RWDL Structure Type** codelist (see Controlled Terminology): `TABULAR`, `PATH`, `OBJECT`. |
3636
| 3 | `URI` | string | Conditional | The full connection string, file path, or API endpoint. |
3737
| 4 | `Database` | string | Conditional | The specific database name (Required for `storage="Database"`). |
3838
| 5 | `Schema` | string | Conditional | The schema name (Required for `storage="Database"`). |
3939
| 6 | `Table` | string | Conditional | The table name (Required for `storage="Database"`). |
40-
| 7 | `RowIndex` | integer | Conditional | The row number (One of `RowIndex` or `RowKey` required for `structure="Tabular"`). |
41-
| 8 | `RowKey` | string/integer | Conditional | The Primary Key field (One of `RowIndex` or `RowKey` required for `structure="Tabular"`). |
40+
| 7 | `RowIndex` | integer | Conditional | The row number (One of `RowIndex` or `RowKey` required for `structure="TABULAR"`). |
41+
| 8 | `RowKey` | string/integer | Conditional | The Primary Key field (One of `RowIndex` or `RowKey` required for `structure="TABULAR"`). |
4242
| 9 | `RowKeyValue` | string/integer | Conditional | The Primary Key value (Required if `RowKey` is used). |
43-
| 10 | `ColumnName` | string | Conditional | The header/variable name (Required for `structure="Tabular"`). |
44-
| 11 | `Path` | string | Conditional | The navigation string (XPath/JSONPath) (Required for `structure="Tree"`). |
45-
| 12 | `Format` | string | Optional | The specific format of the file or response (e.g., "JSON", "XML", "CSV"). |
43+
| 10 | `ColumnName` | string | Conditional | The header/variable name (used with `structure="TABULAR"`; may be omitted for key-value-shaped data that has row identifiers but no distinct column dimension). |
44+
| 11 | `Path` | string | Conditional | The navigation string used to address a value (e.g., XPath, JSONPath, FHIRPath, Cypher, SPARQL) (Required for `structure="PATH"`). The syntax is declared on the `Path` element via the `syntax` attribute. |
45+
| 12 | `Format` | string (Enum) | Optional | The serialization format of the source. Values from the **RWDL Data Format** codelist (see Controlled Terminology), e.g., `JSON`, `XML`, `CSV`, `PARQUET`, `XLSX`, `PDF`. |
4646
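
The conditional rules in the attribute table above can be sketched as a small checker. This is a minimal illustration, not part of the specification: it assumes a Coordinate has already been flattened into a plain Python dict keyed by attribute name, the helper name `check_coordinate` is hypothetical, and it reads "one of `RowIndex` or `RowKey`" as "at least one".

```python
STORAGE_VALUES = {"DATABASE", "FILESYSTEM", "API", "MESSAGE"}
STRUCTURE_VALUES = {"TABULAR", "PATH", "OBJECT"}


def check_coordinate(coord: dict) -> list:
    """Return rule violations for one Coordinate (hypothetical helper)."""
    problems = []
    storage = coord.get("storage")
    structure = coord.get("structure")

    # Both attributes are Required and drawn from fixed codelists.
    if storage not in STORAGE_VALUES:
        problems.append(f"unknown storage: {storage!r}")
    if structure not in STRUCTURE_VALUES:
        problems.append(f"unknown structure: {structure!r}")

    # Database, Schema and Table are required for storage="DATABASE".
    if storage == "DATABASE":
        for name in ("Database", "Schema", "Table"):
            if name not in coord:
                problems.append(f"{name} is required for storage=DATABASE")

    # Assumption: "one of RowIndex or RowKey" means at least one is present.
    if structure == "TABULAR" and not ("RowIndex" in coord or "RowKey" in coord):
        problems.append("RowIndex or RowKey is required for structure=TABULAR")

    if "RowKey" in coord and "RowKeyValue" not in coord:
        problems.append("RowKeyValue is required when RowKey is used")

    if structure == "PATH" and "Path" not in coord:
        problems.append("Path is required for structure=PATH")

    return problems


target = {"storage": "FILESYSTEM", "structure": "TABULAR",
          "URI": "./sdtm/vs.xpt", "RowIndex": 42, "ColumnName": "VSORRES"}
print(check_coordinate(target))  # → []
print(check_coordinate({"storage": "API", "structure": "PATH"}))
# → ['Path is required for structure=PATH']
```

`ColumnName` is deliberately not enforced here, because the table allows it to be omitted for key-value-shaped data.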

4747
### Coordinates
4848

49-
List of supported coordinates — designed to be extensible by defining the `structure` attribute in the XML schema (e.g., `type="Graph"` or `type="Stream"`).
49+
The `structure` and `storage` attributes are governed by controlled terminology. See the Controlled Terminology section for the full codelists, definitions, and submission values.
5050
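
Files written against the earlier draft values (`Database`, `Tabular`, `Tree`, and so on) would need their attribute values rewritten to the codelist submission values. The sketch below shows one way to do that. The mapping is inferred from the removed and added lines of this change rather than stated by the specification: `Tree` to `PATH` appears in the updated examples, while `Files` to `OBJECT` and `Messages` to `MESSAGE` are assumptions based on the matching definitions. The function name is hypothetical.

```python
# Draft value -> codelist submission value (inferred, see lead-in).
LEGACY_STORAGE = {
    "Database": "DATABASE",
    "Filesystem": "FILESYSTEM",
    "API": "API",
    "Messages": "MESSAGE",
}
LEGACY_STRUCTURE = {
    "Tabular": "TABULAR",
    "Tree": "PATH",
    "Files": "OBJECT",
}


def upgrade_coordinate_attrs(storage: str, structure: str) -> tuple:
    """Translate draft attribute values; values already upgraded pass through."""
    return (
        LEGACY_STORAGE.get(storage, storage),
        LEGACY_STRUCTURE.get(structure, structure),
    )


print(upgrade_coordinate_attrs("Filesystem", "Tree"))  # → ('FILESYSTEM', 'PATH')
```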

5151
#### Structural Formats
5252

53-
- **Tabular** — Data organized in a row-and-column format (e.g., CSV, SQL Tables, SAS Datasets).
54-
- **Tree** — Data organized in a hierarchical, nested format (e.g., JSON, XML, FHIR resources).
55-
- **Files** — Data treated as a singular object or blob within a directory structure (e.g., PDF reports, images).
53+
The `structure` attribute classifies how a value within a source is addressed, not the data model of the source itself.
54+
55+
- **TABULAR** — Value addressed by row identifier (index or key) and column name (e.g., SQL tables, SAS XPT, CSV files, key-value stores).
56+
- **PATH** — Value addressed by a path or query expression that locates the value within a structured source (e.g., JSON, XML, FHIR resources, property graphs, RDF triplestores). The syntax of the path expression is declared on the `Path` element.
57+
- **OBJECT** — Value addressed as a whole object with no sub-addressing; the URI is the location (e.g., PDF reports, medical images, binary blobs).
5658

5759
**Scope:**
58-
- *In Scope (Current):* Deterministic, static structures where a value's location can be explicitly defined by a rigid index, key, or path (e.g., "Row 5, Col A" or `$.patient.id`).
59-
- *Out of Scope (Extensible):* Non-deterministic or unstructured data requiring semantic interpretation (e.g., free-text clinical notes requiring NLP, video/audio streams, graph databases relying on complex pattern matching).
60+
- *In Scope (Current):* Deterministic, static structures where a value's location can be explicitly defined by an index, key, path expression, or URI alone.
61+
- *Out of Scope:* Non-deterministic or unstructured data requiring semantic interpretation (e.g., free-text clinical notes requiring NLP, video/audio streams).
6062

6163
#### Storage Formats
6264

63-
- **Database** — Structured data engines requiring connection protocols (e.g., SQL, NoSQL).
64-
- **Filesystem** — Flat files stored on a local disk, network drive, or object storage (e.g., S3).
65-
- **API** — Data accessible via web service endpoints (e.g., REST, SOAP).
65+
- **DATABASE** — Structured data engines accessed via connection protocol (e.g., SQL, NoSQL).
66+
- **FILESYSTEM** — Flat files on local disk, network share, or object storage (e.g., POSIX, S3, Azure Blob, GCS).
67+
- **API** — Data accessed via request/response web service endpoint (e.g., REST, SOAP, GraphQL, FHIR API).
68+
- **MESSAGE** — Data delivered as discrete units over a message transport or event stream (e.g., HL7 v2 over MLLP, FHIR Messaging, Kafka, Kinesis, AMQP, MQTT, webhooks).
6669

6770
**Scope:**
68-
- *In Scope (Current):* Standard digital repositories accessible via common, widely supported protocols (JDBC/ODBC, POSIX/S3, HTTP/REST).
69-
- *Out of Scope (Extensible):* Physical media (paper records requiring OCR), proprietary legacy systems without standard connectivity, and Distributed Ledger Technology (blockchain).
71+
- *In Scope (Current):* Standard digital repositories accessible via common, widely supported protocols (JDBC/ODBC, POSIX/S3, HTTP/REST, message broker protocols).
72+
- *Out of Scope:* Physical media (paper records requiring OCR), proprietary legacy systems without standard connectivity, and Distributed Ledger Technology (blockchain).
7073

7174
### Lineage Trail Attributes
7275

@@ -87,28 +90,110 @@ List of supported coordinates — designed to be extensible by defining the `str
8790

8891
#### Storage Coordinates
8992

90-
**Database:**
93+
**Database (`storage="DATABASE"`):**
9194
- `URI` — The connection string (e.g., `jdbc:postgresql://host:port/db`).
9295
- `Database` — The specific database name context.
9396
- `Schema` — The schema name (e.g., `public`, `dbo`, `clinical_data`).
97+
- `Table` — The table name.
9498

95-
**Filesystem:**
99+
**Filesystem (`storage="FILESYSTEM"`):**
96100
- `URI` — The full file path or object storage URI (e.g., `file://server/share/data.csv` or `s3://bucket/key`).
97101

98-
**API:**
102+
**API (`storage="API"`):**
99103
- `URI` — The full endpoint URL including query parameters (e.g., `https://api.hospital.org/fhir/Patient/123`).
100104

105+
**Message (`storage="MESSAGE"`):**
106+
- `URI` — The transport endpoint or topic identifier (e.g., `kafka://broker:9092/topic-adt`, `mllp://hospital-feed:2575`).
107+
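
The URI examples above suggest a simple consistency check between a `URI` and the declared `storage` value. The following is only a heuristic sketch: the specification requires `storage` to be declared explicitly, the scheme table covers just the schemes shown in the examples, treating a scheme-less relative path as `FILESYSTEM` is an assumption, and `guess_storage` is a hypothetical name.

```python
from urllib.parse import urlsplit

# Schemes taken from the URI examples in this section; not an exhaustive list.
SCHEME_TO_STORAGE = {
    "jdbc": "DATABASE",
    "file": "FILESYSTEM",
    "s3": "FILESYSTEM",
    "http": "API",
    "https": "API",
    "kafka": "MESSAGE",
    "mllp": "MESSAGE",
}


def guess_storage(uri: str):
    """Return the storage value the URI scheme hints at, or None if unknown."""
    scheme = urlsplit(uri).scheme
    if not scheme:
        # Assumption: a relative path such as ./sdtm/vs.xpt is a local file.
        return "FILESYSTEM"
    return SCHEME_TO_STORAGE.get(scheme)


print(guess_storage("jdbc:postgresql://hospital-db:5432/ehr"))  # → DATABASE
print(guess_storage("mllp://hospital-feed:2575"))               # → MESSAGE
```

A mismatch between the guessed and declared value is a prompt for review, not an error: an `https` URL can, for example, point at object storage.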
101108
#### Structural Coordinates
102109

103-
**Tabular Data:**
104-
- `RowIndex` — The specific row number or Primary Key value identifying the record.
105-
- `ColumnName` — The header name or variable name of the specific cell.
110+
**Tabular (`structure="TABULAR"`):**
111+
- `RowIndex` — The specific row number, OR
112+
- `RowKey` + `RowKeyValue` — The primary key field name and its value.
113+
- `ColumnName` — The header or variable name (omitted for key-value-shaped data).
114+
115+
**Path-Addressable (`structure="PATH"`):**
116+
- `Path` — The navigation or query expression used to address the value, with `syntax` attribute declaring the expression language (e.g., XPath for XML, JSONPath for JSON, FHIRPath for FHIR resources, Cypher for property graphs, SPARQL for RDF triplestores).
117+
118+
**Object (`structure="OBJECT"`):**
119+
- `URI` — The identifier of the object as a whole. No sub-addressing.
120+
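
The three structural coordinate shapes can be sketched as a builder that emits a `<Coordinate>` element. This is an illustration under stated assumptions: the layout (`storage` and `structure` as XML attributes, everything else as child elements) follows the examples in this document, representing `RowKey` and `RowKeyValue` as child elements is an assumption because no example shows them, storage-specific children such as `Database`, `Schema`, and `Table` are left out for brevity, and `build_coordinate` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def build_coordinate(storage, structure, uri, *, row_index=None, row_key=None,
                     row_key_value=None, column_name=None, path=None, syntax=None):
    """Build a <Coordinate> element for one of the three structure types."""
    coord = ET.Element("Coordinate", {"storage": storage, "structure": structure})
    ET.SubElement(coord, "URI").text = uri

    if structure == "TABULAR":
        if row_index is not None:
            ET.SubElement(coord, "RowIndex").text = str(row_index)
        elif row_key is not None and row_key_value is not None:
            ET.SubElement(coord, "RowKey").text = str(row_key)
            ET.SubElement(coord, "RowKeyValue").text = str(row_key_value)
        else:
            raise ValueError("TABULAR needs row_index or row_key + row_key_value")
        # ColumnName is omitted for key-value-shaped data.
        if column_name is not None:
            ET.SubElement(coord, "ColumnName").text = column_name
    elif structure == "PATH":
        if path is None or syntax is None:
            raise ValueError("PATH needs both path and syntax")
        node = ET.SubElement(coord, "Path", {"syntax": syntax})
        node.text = path
    elif structure != "OBJECT":
        raise ValueError(f"unknown structure: {structure!r}")
    # OBJECT: the URI alone is the location, so nothing else is added.
    return coord


source = build_coordinate(
    "API", "PATH",
    "https://api.hospital.org/fhir/R4/MedicationRequest/med-abc-123",
    path="$.medicationCodeableConcept.coding[0].code", syntax="JSONPath",
)
print(ET.tostring(source, encoding="unicode"))
```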
121+
122+
123+
## Controlled Terminology
124+
125+
This section defines the controlled terminology (codelists) governing enumerated attributes in RWD Lineage. Codelists are submitted to the CDISC Controlled Terminology team under the `RWDL` prefix and are intended to be published through CDISC and NCI Enterprise Vocabulary Services (NCI-EVS) on the standard CDISC release cadence.
126+
127+
The codelists in this section are finalized for V1. Additional codelists (Path Syntax, Data Model) are under discussion and will be added in a future revision once decisions are settled.
128+
129+
### RWDL Storage Type
130+
131+
Governs the `storage` attribute on the Coordinate element.
132+
133+
**Extensibility:** Non-extensible. The four values comprehensively cover the architectural categories of data access (query-connection, file-path, request/response, message transport).
134+
135+
| Submission Value | Preferred Term | Definition |
136+
|------------------|----------------|------------|
137+
| `DATABASE` | Database | Structured data engine accessed via connection protocol (SQL, NoSQL). |
138+
| `FILESYSTEM` | Filesystem | Flat files on local disk, network share, or object storage (POSIX, S3, Azure Blob, GCS). |
139+
| `API` | Application Programming Interface | Data accessed via request/response web service endpoint (REST, SOAP, GraphQL, FHIR API). |
140+
| `MESSAGE` | Message | Data delivered as discrete units over a message transport or event stream (HL7 v2, FHIR Messaging, Kafka, Kinesis, AMQP, MQTT, webhooks). |
141+
142+
### RWDL Structure Type
143+
144+
Governs the `structure` attribute on the Coordinate element. Each value corresponds to a distinct addressing mechanism rather than to the data model of the source.
145+
146+
**Extensibility:** Non-extensible. The three values correspond directly to the addressing mechanisms the specification itself defines (row-and-column, path expression, whole-object).
147+
148+
| Submission Value | Preferred Term | Definition | Required Addressing |
149+
|------------------|----------------|------------|---------------------|
150+
| `TABULAR` | Tabular | Value addressed by row identifier and column name. | `RowIndex` or (`RowKey` + `RowKeyValue`); plus `ColumnName` (optional for key-value-shaped data). |
151+
| `PATH` | Path-Addressable | Value addressed by a path or query expression that locates the value within a structured source. | `Path` element with `syntax` attribute. |
152+
| `OBJECT` | Object | Value is addressed as a whole object with no sub-addressing; the URI is the location. | `URI` only. No `RowIndex`, `ColumnName`, or `Path`. |
153+
154+
**Coverage notes:**
155+
- Tree-structured sources (JSON, XML, FHIR resources) are addressed as `structure="PATH"` with `syntax="JSONPath"`, `"XPath"`, or `"FHIRPath"`.
156+
- Graph sources (property graphs, RDF triplestores) are addressed as `structure="PATH"` with `syntax="Cypher"` or `"SPARQL"`.
157+
- Key-value stores (Redis, DynamoDB) are addressed as `structure="TABULAR"` with `RowKey`/`RowKeyValue` populated and `ColumnName` omitted.
158+
- Whole-object sources (PDF reports, medical images, opaque blobs) are addressed as `structure="OBJECT"`.
159+
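
The coverage notes can be read as a lookup from source kind to addressing choice. The table below is a sketch only: the keys are hypothetical labels invented for this example, and the syntax pairings (XPath for XML, JSONPath for JSON, and so on) follow the pairings given in the `Path` description, which are typical rather than mandatory. A FHIR resource serialized as JSON can equally be addressed with JSONPath.

```python
# source kind (hypothetical label) -> (structure, typical Path syntax or None)
TYPICAL_ADDRESSING = {
    "json": ("PATH", "JSONPath"),
    "xml": ("PATH", "XPath"),
    "fhir-resource": ("PATH", "FHIRPath"),
    "property-graph": ("PATH", "Cypher"),
    "rdf-triplestore": ("PATH", "SPARQL"),
    "key-value-store": ("TABULAR", None),  # RowKey/RowKeyValue, no ColumnName
    "pdf-report": ("OBJECT", None),        # URI only, no sub-addressing
    "medical-image": ("OBJECT", None),
}


def typical_addressing(source_kind: str) -> tuple:
    """Look up the structure value and typical syntax for a source kind."""
    try:
        return TYPICAL_ADDRESSING[source_kind]
    except KeyError:
        raise ValueError(f"no coverage note for source kind {source_kind!r}")


print(typical_addressing("rdf-triplestore"))  # → ('PATH', 'SPARQL')
print(typical_addressing("key-value-store"))  # → ('TABULAR', None)
```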
160+
### RWDL Data Format
161+
162+
Governs the `Format` attribute on the Coordinate element. Scoped strictly to the serialization layer: how bytes are arranged.
106163

107-
**Tree:**
108-
- `Path` — The navigation string used to traverse the hierarchy (e.g., XPath for XML, JSONPath for JSON).
164+
**Extensibility:** Extensible. Sponsors populating a value not present in the published codelist flag the value as an extension using the Define-XML convention (`def:ExtendedValue="Yes"` on the relevant CodeList element) and are encouraged to contribute commonly-used extensions back to CDISC for consideration in future codelist versions.
109165

110-
**Files:**
111-
- `URI` — The identifier of the specific file if the lineage points to the file as a whole object.
166+
| Submission Value | Preferred Term | Definition |
167+
|------------------|----------------|------------|
168+
| `CSV` | Comma-Separated Values | Delimited text, comma-separated. |
169+
| `TSV` | Tab-Separated Values | Delimited text, tab-separated. |
170+
| `JSON` | JavaScript Object Notation | Tree-structured text format per RFC 8259. |
171+
| `XML` | Extensible Markup Language | Tree-structured markup format per W3C XML 1.0. |
172+
| `NDJSON` | Newline-Delimited JSON | One JSON object per line. |
173+
| `YAML` | YAML | Human-readable structured data serialization format. |
174+
| `PARQUET` | Apache Parquet | Columnar binary format common in data science and analytics pipelines. |
175+
| `AVRO` | Apache Avro | Row-based binary format with embedded schema. |
176+
| `ORC` | Apache ORC | Columnar binary format common in Hadoop and Spark ecosystems. |
177+
| `FEATHER` | Apache Arrow Feather | Arrow-based columnar format for fast dataframe interchange between R and Python. |
178+
| `ARROW` | Apache Arrow IPC | Apache Arrow inter-process communication streaming format. |
179+
| `HDF5` | HDF5 | Hierarchical Data Format v5; used for large numerical datasets, scientific arrays, and clinical waveforms. |
180+
| `NPY` | NumPy Array | NumPy single-array binary format. |
181+
| `PKL` | Python Pickle | Python Pickle format. |
182+
| `XPT` | SAS Transport File | SAS XPORT v5 or v8 format. |
183+
| `SAS7BDAT` | SAS Dataset | Native SAS dataset format. |
184+
| `RDS` | R Data Serialization | R single-object serialization format. |
185+
| `RDA` | R Data | R workspace serialization format (multiple objects). |
186+
| `SPSS-SAV` | SPSS Dataset | IBM SPSS Statistics dataset (.sav). |
187+
| `STATA-DTA` | Stata Dataset | Stata dataset (.dta). |
188+
| `XLSX` | Excel Workbook | Microsoft Excel Office Open XML workbook. |
189+
| `XLS` | Excel Legacy Workbook | Microsoft Excel legacy binary workbook (pre-2007). |
190+
| `DOCX` | Word Document | Microsoft Word Office Open XML document. |
191+
| `RTF` | Rich Text Format | Microsoft Rich Text Format document. |
192+
| `PDF` | Portable Document Format | ISO 32000 document format. |
193+
| `DICOM` | DICOM | ISO 12052 medical imaging format. |
194+
| `JPEG` | JPEG | JPEG image format. |
195+
| `HL7V2` | HL7 v2 Message | Pipe-delimited HL7 v2 message syntax. |
196+
| `TXT` | Plain Text | Unstructured or semi-structured plain text. |
112197
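
Because the Data Format codelist is extensible, a producer needs to tell published values from sponsor extensions. A minimal sketch of that check follows. The set reproduces the submission values from the table above, matching is assumed to be exact and case-sensitive, and `classify_format` is a hypothetical helper. How the extension flag is then written into the metadata is left to the Define-XML convention described above.

```python
# Submission values copied from the RWDL Data Format table above.
PUBLISHED_FORMATS = frozenset({
    "CSV", "TSV", "JSON", "XML", "NDJSON", "YAML",
    "PARQUET", "AVRO", "ORC", "FEATHER", "ARROW", "HDF5", "NPY", "PKL",
    "XPT", "SAS7BDAT", "RDS", "RDA", "SPSS-SAV", "STATA-DTA",
    "XLSX", "XLS", "DOCX", "RTF", "PDF", "DICOM", "JPEG", "HL7V2", "TXT",
})


def classify_format(value: str) -> str:
    """Return 'published' for codelist values, 'extension' for anything else."""
    # Assumption: submission values are matched exactly (case-sensitive).
    return "published" if value in PUBLISHED_FORMATS else "extension"


print(classify_format("PARQUET"))  # → published
print(classify_format("GEOJSON"))  # → extension
```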

113198

114199

@@ -130,7 +215,7 @@ The core of the RWD-Lineage file is a collection (array) of `<MapID>` elements.
130215
<Transformation type="Direct Map">None</Transformation>
131216
<!-- Source: Hospital SQL DB -->
132217
<Source>
133-
<Coordinate storage="Database" structure="Tabular">
218+
<Coordinate storage="DATABASE" structure="TABULAR">
134219
<URI>jdbc:postgresql://hospital-db:5432/ehr</URI>
135220
<Database>ehr_prod</Database>
136221
<Schema>cardiology</Schema>
@@ -141,7 +226,7 @@ The core of the RWD-Lineage file is a collection (array) of `<MapID>` elements.
141226
</Source>
142227
<!-- Target: SDTM VS Domain -->
143228
<Target>
144-
<Coordinate storage="Filesystem" structure="Tabular">
229+
<Coordinate storage="FILESYSTEM" structure="TABULAR">
145230
<URI>./sdtm/vs.xpt</URI>
146231
<RowIndex>42</RowIndex>
147232
<ColumnName>VSORRES</ColumnName>
@@ -158,15 +243,15 @@ The core of the RWD-Lineage file is a collection (array) of `<MapID>` elements.
158243
<Transformation type="Unit Conversion">lb to kg</Transformation>
159244
<!-- Source: CSV Lab Report -->
160245
<Source>
161-
<Coordinate storage="Filesystem" structure="Tabular">
246+
<Coordinate storage="FILESYSTEM" structure="TABULAR">
162247
<URI>file://server/raw_data/labs_2023.csv</URI>
163248
<RowIndex>501</RowIndex>
164249
<ColumnName>RESULT_VAL</ColumnName>
165250
</Coordinate>
166251
</Source>
167252
<!-- Target: SDTM LB Domain -->
168253
<Target>
169-
<Coordinate storage="Filesystem" structure="Tabular">
254+
<Coordinate storage="FILESYSTEM" structure="TABULAR">
170255
<URI>./sdtm/lb.xpt</URI>
171256
<RowIndex>15</RowIndex>
172257
<ColumnName>LBORRES</ColumnName>
@@ -182,14 +267,14 @@ The core of the RWD-Lineage file is a collection (array) of `<MapID>` elements.
182267
<Transformation type="Extraction">JSON Path Extraction</Transformation>
183268
<!-- Source: FHIR API Endpoint -->
184269
<Source>
185-
<Coordinate storage="API" structure="Tree">
270+
<Coordinate storage="API" structure="PATH">
186271
<URI>https://api.hospital.org/fhir/R4/MedicationRequest/med-abc-123</URI>
187272
<Path syntax="JSONPath">$.medicationCodeableConcept.coding[0].code</Path>
188273
</Coordinate>
189274
</Source>
190275
<!-- Target: SDTM CM Domain -->
191276
<Target>
192-
<Coordinate storage="Filesystem" structure="Tabular">
277+
<Coordinate storage="FILESYSTEM" structure="TABULAR">
193278
<URI>./sdtm/cm.xpt</URI>
194279
<RowIndex>8</RowIndex>
195280
<ColumnName>CMDECOD</ColumnName>
@@ -205,14 +290,14 @@ The core of the RWD-Lineage file is a collection (array) of `<MapID>` elements.
205290
<Transformation type="Date Format">ISO8601 to SAS Date</Transformation>
206291
<!-- Source: HL7 CDA XML File -->
207292
<Source>
208-
<Coordinate storage="Filesystem" structure="Tree">
293+
<Coordinate storage="FILESYSTEM" structure="PATH">
209294
<URI>file://server/records/patient_001.xml</URI>
210295
<Path syntax="XPath">/ClinicalDocument/recordTarget/patientRole/patient/birthTime/@value</Path>
211296
</Coordinate>
212297
</Source>
213298
<!-- Target: SDTM DM Domain -->
214299
<Target>
215-
<Coordinate storage="Filesystem" structure="Tabular">
300+
<Coordinate storage="FILESYSTEM" structure="TABULAR">
216301
<URI>./sdtm/dm.xpt</URI>
217302
<RowIndex>1</RowIndex>
218303
<ColumnName>BRTHDTC</ColumnName>
