documents/RWD-Lineage_Data_Standard_Specification.md
Lines changed: 63 additions & 5 deletions
@@ -124,7 +124,7 @@ The `structure` attribute classifies how a value within a source is addressed, n
This section defines the controlled terminology (codelists) governing enumerated attributes in RWD Lineage. Codelists are submitted to the CDISC Controlled Terminology team under the `RWDL` prefix and are intended to be published through CDISC and NCI Enterprise Vocabulary Services (NCI-EVS) on the standard CDISC release cadence.

-The codelists in this section are finalized for V1. Additional codelists (Path Syntax, Data Model) are under discussion and will be added in a future revision once decisions are settled.
+The codelists in this section are finalized for V1. Source data model conformance (e.g., FHIR R4, OMOP CDM 5.4, PCORnet CDM) is not governed by an RWDL codelist; it is declared at the submission level via Define-XML's existing `def:Standards` mechanism. See "Source Data Standards" below.

### RWDL Storage Type
@@ -152,8 +152,8 @@ Governs the `structure` attribute on the Coordinate element. Each value correspo
|`OBJECT`| Object | Value is addressed as a whole object with no sub-addressing; the URI is the location. |`URI` only. No `RowIndex`, `ColumnName`, or `Path`. |

**Coverage notes:**
-- Tree-structured sources (JSON, XML, FHIR resources) are addressed as `structure="PATH"` with `syntax="JSONPath"`, `"XPath"`, or `"FHIRPath"`.
-- Graph sources (property graphs, RDF triplestores) are addressed as `structure="PATH"` with `syntax="Cypher"` or `"SPARQL"`.
+- Tree-structured sources (JSON, XML, FHIR resources) are addressed as `structure="PATH"` with `syntax="JSONPATH"`, `"XPATH"`, or `"FHIRPATH"`.
+- Graph sources (property graphs, RDF triplestores) are addressed as `structure="PATH"` with `syntax="CYPHER"`, `"GREMLIN"`, or `"SPARQL"`.
- Key-value stores (Redis, DynamoDB) are addressed as `structure="TABULAR"` with `RowKey`/`RowKeyValue` populated and `ColumnName` omitted.
- Whole-object sources (PDF reports, medical images, opaque blobs) are addressed as `structure="OBJECT"`.
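
The coverage notes, as revised in this hunk, amount to a small lookup from source kind to `structure` and candidate `syntax` values. A minimal Python sketch of that lookup follows; the source-kind labels (`"tree"`, `"graph"`, and so on), the `Addressing` type, and the `addressing_for` helper are hypothetical names chosen for illustration and are not part of the specification.

```python
from typing import NamedTuple, Tuple


class Addressing(NamedTuple):
    structure: str
    syntaxes: Tuple[str, ...]  # empty when no Path syntax applies


# Hypothetical source-kind labels mapped to the addressing described in the
# coverage notes (uppercase submission values, as introduced by this change).
COVERAGE = {
    "tree": Addressing("PATH", ("JSONPATH", "XPATH", "FHIRPATH")),
    "graph": Addressing("PATH", ("CYPHER", "GREMLIN", "SPARQL")),
    "key-value": Addressing("TABULAR", ()),
    "whole-object": Addressing("OBJECT", ()),
}


def addressing_for(source_kind: str) -> Addressing:
    """Look up the structure and candidate syntaxes for a source kind."""
    try:
        return COVERAGE[source_kind]
    except KeyError:
        raise ValueError(f"no coverage note for source kind {source_kind!r}") from None
```

A call such as `addressing_for("graph")` would return the `PATH` structure together with the three graph query syntaxes; an unknown label raises `ValueError` rather than guessing.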
@@ -171,6 +171,7 @@ Governs the `Format` attribute on the Coordinate element. Scoped strictly to ser
|`XML`| Extensible Markup Language | Tree-structured markup format per W3C XML 1.0. |
|`NDJSON`| Newline-Delimited JSON | One JSON object per line. |
|`YAML`| YAML | Human-readable structured data serialization format. |
+|`TTL`| Turtle | Terse RDF Triple Language per W3C Turtle specification; text serialization of RDF graph data. |
|`PARQUET`| Apache Parquet | Columnar binary format common in data science and analytics pipelines. |
|`AVRO`| Apache Avro | Row-based binary format with embedded schema. |
|`ORC`| Apache ORC | Columnar binary format common in Hadoop and Spark ecosystems. |
@@ -193,8 +194,65 @@ Governs the `Format` attribute on the Coordinate element. Scoped strictly to ser
|`DICOM`| DICOM | ISO 12052 medical imaging format. |
|`X12`| ASC X12 EDI | ASC X12 Electronic Data Interchange transaction sets used in healthcare claims and eligibility (e.g., 837 claims, 835 remittance, 270/271 eligibility). |
|`TXT`| Plain Text | Unstructured or semi-structured plain text. |

+### RWDL Path Syntax
+
+Governs the `syntax` attribute on the `Path` element. Required when `structure="PATH"`.
+
+**Extensibility:** Extensible. Sponsors populating a value not present in the published codelist flag the value as an extension using the Define-XML convention (`def:ExtendedValue="Yes"`).
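
The extensibility rule can be pictured as a membership test against the published codelist. The Python sketch below is illustrative only: `PUBLISHED_PATH_SYNTAX` lists just the submission values visible in this diff (the published codelist may contain more), the helper name `extended_value_flag` is hypothetical, and the case-sensitive comparison is an assumption rather than something the specification states.

```python
from typing import FrozenSet, Optional

# Submission values that appear in this excerpt only; treat as a partial list.
PUBLISHED_PATH_SYNTAX: FrozenSet[str] = frozenset({
    "JSONPATH", "XPATH", "FHIRPATH", "GRAPHQL", "SQL", "CYPHER",
    "GREMLIN", "SPARQL", "HL7V2", "DICOMTAG", "REGEX",
})


def extended_value_flag(
    submission_value: str,
    published: FrozenSet[str] = PUBLISHED_PATH_SYNTAX,
) -> Optional[str]:
    """Return "Yes" when the value would need def:ExtendedValue, else None.

    The comparison is case-sensitive (an assumption of this sketch).
    """
    return None if submission_value in published else "Yes"
```

Under these assumptions, `extended_value_flag("SPARQL")` yields `None`, while a sponsor-defined value such as `"JMESPATH"` yields `"Yes"`.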
+
+| Submission Value | Preferred Term | Definition |
+|---|---|---|
+|`GRAPHQL`| GraphQL | GraphQL query expression used to extract values from a GraphQL API response. |
+|`SQL`| Structured Query Language | SQL `SELECT` statement used to address values that are not naturally captured by the decomposed `Database`/`Schema`/`Table`/`RowKey`/`ColumnName` fields, e.g., values produced by joins, computed expressions, views, or materialized views. |
+|`CYPHER`| Cypher | Cypher query language for property graphs (Neo4j and openCypher-compatible databases, ISO/IEC 39075 GQL). |
+|`GREMLIN`| Gremlin | Apache TinkerPop Gremlin graph traversal language for property graphs (JanusGraph, Amazon Neptune, Azure Cosmos DB Gremlin API). |
+|`SPARQL`| SPARQL | SPARQL query language for RDF triplestores per W3C SPARQL specification. |
+|`HL7V2`| HL7 v2 Segment Notation | Segment-field-component-subcomponent addressing used to locate values within HL7 v2 pipe-delimited messages (e.g., `PID-5.1.1`). |
+|`DICOMTAG`| DICOM Tag Reference | DICOM data element tag in `(group,element)` notation used to locate metadata within DICOM files (e.g., `(0010,0010)` for Patient Name). |
+|`REGEX`| Regular Expression | Regular expression with capture group locating the target value within a text source. |
+
+**Note:** Some Path Syntax values mirror values in the Data Format codelist (`HL7V2`, and `DICOMTAG` alongside `DICOM`). They govern different attributes and are not redundant: the Data Format value declares what kind of bytes the source contains; the Path Syntax value declares which addressing language locates a value within those bytes. The two commonly co-occur for the same data point.
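
To make the format-versus-syntax distinction concrete, the Python sketch below resolves an `HL7V2` path such as `PID-5.1.1` against a message whose Data Format would be HL7 v2. It is a simplified illustration, not a conformant HL7 parser: the function name `resolve_hl7v2` is hypothetical, the default delimiters (`|`, `^`, `&`, `~`) are assumed, only the first matching segment and first field repetition are read, escape sequences are ignored, and the `MSH` segment (whose field numbering is offset) is not handled.

```python
import re

# SEG-field[.component[.subcomponent]], e.g. "PID-5.1.1"
_HL7_PATH = re.compile(r"^([A-Z][A-Z0-9]{2})-([1-9]\d*)(?:\.([1-9]\d*))?(?:\.([1-9]\d*))?$")


def resolve_hl7v2(message: str, path: str) -> str:
    """Return the value a segment-field-component-subcomponent path points at."""
    match = _HL7_PATH.match(path)
    if match is None:
        raise ValueError(f"not an HL7 v2 path: {path!r}")
    seg_id, field_no, comp_no, sub_no = match.groups()
    if seg_id == "MSH":
        raise ValueError("MSH field numbering is not handled by this sketch")

    for line in re.split(r"[\r\n]+", message.strip()):
        fields = line.split("|")
        if fields[0] != seg_id:
            continue
        try:
            # fields[0] is the segment ID, so SEG-n sits at index n.
            value = fields[int(field_no)].split("~")[0]  # first repetition only
            if comp_no is not None:
                value = value.split("^")[int(comp_no) - 1]
                if sub_no is not None:
                    value = value.split("&")[int(sub_no) - 1]
            return value
        except IndexError:
            raise LookupError(f"{path} is not present in the message") from None
    raise LookupError(f"segment {seg_id} not found")


# Fabricated sample message, for illustration only.
sample = (
    "MSH|^~\\&|LAB|HOSP|||202401011200||ORU^R01|1|P|2.5\r"
    "PID|1||12345^^^HOSP^MR||DOE&VAN^JANE^Q||19800101|F"
)
print(resolve_hl7v2(sample, "PID-5.1.1"))
```

Here the Data Format declaration tells a consumer how to read the bytes as an HL7 v2 message, and the Path Syntax declaration tells it how to interpret `PID-5.1.1` within that message.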
+
+### Source Data Standards
+
+The data model that a source conforms to (e.g., FHIR R4, OMOP CDM 5.4, PCORnet CDM, FDA Sentinel CDM, HL7 CDA R2) is not governed by an RWDL codelist. It is declared at the submission level via Define-XML's existing `def:Standards` element (Define-XML v2.1, Section 5.3.6).
+
+A sponsor pulling from one or more RWD source systems declares each source standard as a `def:Standard` child element of the `def:Standards` container, alongside the CDISC target standards. Per-coordinate declaration of source data model is therefore unnecessary: each Coordinate's source data model is implicit from the URI and the submission-level `def:Standards` declaration.
+
+**Example:**
+
+```xml
+<def:Standards>
+  <def:Standard OID="STD.SDTMIG-3.4"
+                Name="SDTMIG"
+                Type="IG"
+                Version="3.4"
+                Status="Final"/>
+  <def:Standard OID="STD.OMOP-CDM-5.4"
+                Name="OMOP-CDM"
+                Type="IG"
+                Version="5.4"
+                Status="Final"/>
+  <def:Standard OID="STD.FHIR-R4"
+                Name="FHIR"
+                Type="IG"
+                Version="R4"
+                Status="Final"/>
+</def:Standards>
+```
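
A consumer of the submission could recover the declared standards by reading the `def:Standard` elements, roughly as in the Python sketch below. The helper name `list_standards` is hypothetical, and the namespace URI is an assumption that should be checked against the Define-XML file being read; the example fragment declares the `def` prefix itself because the snippet above omits the enclosing document.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Assumed Define-XML v2.1 namespace URI; confirm it against the file being read.
DEF_NS = "http://www.cdisc.org/ns/def/v2.1"


def list_standards(standards_xml: str) -> List[Dict[str, str]]:
    """Return the attributes of every def:Standard element, in document order."""
    root = ET.fromstring(standards_xml)
    return [dict(el.attrib) for el in root.iter(f"{{{DEF_NS}}}Standard")]


example = f"""
<def:Standards xmlns:def="{DEF_NS}">
  <def:Standard OID="STD.SDTMIG-3.4" Name="SDTMIG" Type="IG" Version="3.4" Status="Final"/>
  <def:Standard OID="STD.OMOP-CDM-5.4" Name="OMOP-CDM" Type="IG" Version="5.4" Status="Final"/>
  <def:Standard OID="STD.FHIR-R4" Name="FHIR" Type="IG" Version="R4" Status="Final"/>
</def:Standards>
"""

for std in list_standards(example):
    print(std["OID"], std["Name"], std["Version"])
```

Because the declaration is centralized, a single pass over `def:Standards` is enough to learn every source data model in play; nothing needs to be read from individual Coordinates.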
+
+The `def:Standard/@Name` attribute is constrained by an extensible CDISC Controlled Terminology codelist that currently scopes allowed values to CDISC standards (SDTMIG, SENDIG, ADaMIG, etc.). RWD Lineage submissions require this codelist to be extended to include common source-data standards (`OMOP-CDM`, `FHIR`, `PCORNET-CDM`, `SENTINEL-CDM`, `CDA`, etc.). This extension is submitted to the CDISC Controlled Terminology team as a separate, smaller request alongside the four RWDL codelists.
+
+Sponsors with multiple source data models (e.g., FHIR for EHR data and OMOP-CDM for warehouse data) declare each as a separate `def:Standard` element. Coordinates that need to reference a specific source data model in their lineage MAY do so via implementer-defined conventions (for example, embedding the standard OID in the source URI), but the controlled declaration of which standards the submission uses is centralized in `def:Standards`.
+

## Lineage trail metadata
@@ -269,7 +327,7 @@ The core of the RWD-Lineage file is a collection (array) of `<MapID>` elements.