Commit 0fc4f95

Update RWD-Lineage_Data_Standard_Specification.md
May 5 meeting update to CT > data format and path syntax
1 parent 128a72f commit 0fc4f95

1 file changed

Lines changed: 63 additions & 5 deletions

documents/RWD-Lineage_Data_Standard_Specification.md

@@ -124,7 +124,7 @@ The `structure` attribute classifies how a value within a source is addressed, n
124124

125125
This section defines the controlled terminology (codelists) governing enumerated attributes in RWD Lineage. Codelists are submitted to the CDISC Controlled Terminology team under the `RWDL` prefix and are intended to be published through CDISC and NCI Enterprise Vocabulary Services (NCI-EVS) on the standard CDISC release cadence.
126126

127-
The codelists in this section are finalized for V1. Additional codelists (Path Syntax, Data Model) are under discussion and will be added in a future revision once decisions are settled.
127+
The codelists in this section are finalized for V1. Source data model conformance (e.g., FHIR R4, OMOP CDM 5.4, PCORnet CDM) is not governed by an RWDL codelist; it is declared at the submission level via Define-XML's existing `def:Standards` mechanism. See "Source Data Standards" below.
128128

129129
### RWDL Storage Type
130130

@@ -152,8 +152,8 @@ Governs the `structure` attribute on the Coordinate element. Each value correspo
152152
| `OBJECT` | Object | Value is addressed as a whole object with no sub-addressing; the URI is the location. | `URI` only. No `RowIndex`, `ColumnName`, or `Path`. |
153153

154154
**Coverage notes:**
155-
- Tree-structured sources (JSON, XML, FHIR resources) are addressed as `structure="PATH"` with `syntax="JSONPath"`, `"XPath"`, or `"FHIRPath"`.
156-
- Graph sources (property graphs, RDF triplestores) are addressed as `structure="PATH"` with `syntax="Cypher"` or `"SPARQL"`.
155+
- Tree-structured sources (JSON, XML, FHIR resources) are addressed as `structure="PATH"` with `syntax="JSONPATH"`, `"XPATH"`, or `"FHIRPATH"`.
156+
- Graph sources (property graphs, RDF triplestores) are addressed as `structure="PATH"` with `syntax="CYPHER"`, `"GREMLIN"`, or `"SPARQL"`.
157157
- Key-value stores (Redis, DynamoDB) are addressed as `structure="TABULAR"` with `RowKey`/`RowKeyValue` populated and `ColumnName` omitted.
158158
- Whole-object sources (PDF reports, medical images, opaque blobs) are addressed as `structure="OBJECT"`.
159159
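The addressing rules in the Structure codelist and the coverage notes can be checked mechanically. The following is a minimal sketch, not part of the specification: the function name `check_coordinate` and the problem messages are hypothetical, and it only covers two rules as read from this section (an `OBJECT` coordinate carries no `RowIndex`, `ColumnName`, or `Path`; a `PATH` coordinate carries a `Path` element with a `syntax` attribute).

```python
import xml.etree.ElementTree as ET

# Child elements that the OBJECT row of the Structure codelist rules out.
OBJECT_FORBIDDEN = ("RowIndex", "ColumnName", "Path")


def check_coordinate(xml_text: str) -> list[str]:
    """Return the problems found in a single <Coordinate> element (hypothetical helper)."""
    coord = ET.fromstring(xml_text)
    structure = coord.get("structure")
    problems = []
    if structure == "OBJECT":
        # OBJECT: the URI is the location, so no sub-addressing children are allowed.
        for tag in OBJECT_FORBIDDEN:
            if coord.find(tag) is not None:
                problems.append(f"OBJECT coordinate must not contain <{tag}>")
    elif structure == "PATH":
        # PATH: a <Path> element with a syntax attribute is expected.
        path = coord.find("Path")
        if path is None:
            problems.append("PATH coordinate has no <Path> element")
        elif not path.get("syntax"):
            problems.append("<Path> is missing its syntax attribute")
    return problems


good = (
    '<Coordinate storage="API" structure="PATH">'
    "<URI>https://example.org/resource/1</URI>"
    '<Path syntax="JSONPATH">$.code</Path>'
    "</Coordinate>"
)
bad = (
    '<Coordinate storage="FILESYSTEM" structure="OBJECT">'
    "<URI>file://server/report.pdf</URI>"
    '<Path syntax="XPATH">/a</Path>'
    "</Coordinate>"
)
print(check_coordinate(good))
print(check_coordinate(bad))
```

The URIs above are placeholders. Rules for `TABULAR` coordinates are left out because this section shows them only in part.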

@@ -171,6 +171,7 @@ Governs the `Format` attribute on the Coordinate element. Scoped strictly to ser
171171
| `XML` | Extensible Markup Language | Tree-structured markup format per W3C XML 1.0. |
172172
| `NDJSON` | Newline-Delimited JSON | One JSON object per line. |
173173
| `YAML` | YAML | Human-readable structured data serialization format. |
174+
| `TTL` | Turtle | Terse RDF Triple Language per W3C Turtle specification; text serialization of RDF graph data. |
174175
| `PARQUET` | Apache Parquet | Columnar binary format common in data science and analytics pipelines. |
175176
| `AVRO` | Apache Avro | Row-based binary format with embedded schema. |
176177
| `ORC` | Apache ORC | Columnar binary format common in Hadoop and Spark ecosystems. |
@@ -193,8 +194,65 @@ Governs the `Format` attribute on the Coordinate element. Scoped strictly to ser
193194
| `DICOM` | DICOM | ISO 12052 medical imaging format. |
194195
| `JPEG` | JPEG | JPEG image format. |
195196
| `HL7V2` | HL7 v2 Message | Pipe-delimited HL7 v2 message syntax. |
197+
| `X12` | ASC X12 EDI | ASC X12 Electronic Data Interchange transaction sets used in healthcare claims and eligibility (e.g., 837 claims, 835 remittance, 270/271 eligibility). |
196198
| `TXT` | Plain Text | Unstructured or semi-structured plain text. |
197199

200+
### RWDL Path Syntax
201+
202+
Governs the `syntax` attribute on the `Path` element. Required when `structure="PATH"`.
203+
204+
**Extensibility:** Extensible. Sponsors populating a value not present in the published codelist flag the value as an extension using the Define-XML convention (`def:ExtendedValue="Yes"`).
205+
206+
| Submission Value | Preferred Term | Definition |
207+
|------------------|----------------|------------|
208+
| `XPATH` | XPath | XML Path Language expression per W3C XPath specification. |
209+
| `JSONPATH` | JSONPath | JSON path expression per RFC 9535. |
210+
| `JSONPOINTER` | JSON Pointer | JSON Pointer syntax per RFC 6901, used to address values within JSON documents (distinct from JSONPath). |
211+
| `FHIRPATH` | FHIRPath | FHIRPath expression per HL7 FHIRPath specification. |
212+
| `JMESPATH` | JMESPath | JMESPath query expression. |
213+
| `GRAPHQL` | GraphQL | GraphQL query expression used to extract values from a GraphQL API response. |
214+
| `SQL` | Structured Query Language | SQL `SELECT` statement used to address values that are not naturally captured by the decomposed `Database`/`Schema`/`Table`/`RowKey`/`ColumnName` fields, e.g., values produced by joins, computed expressions, views, or materialized views. |
215+
| `CYPHER` | Cypher | Cypher query language for property graphs (Neo4j and openCypher-compatible databases); closely related to ISO/IEC 39075 GQL. |
216+
| `GREMLIN` | Gremlin | Apache TinkerPop Gremlin graph traversal language for property graphs (JanusGraph, Amazon Neptune, Azure Cosmos DB Gremlin API). |
217+
| `SPARQL` | SPARQL | SPARQL query language for RDF triplestores per W3C SPARQL specification. |
218+
| `HL7V2` | HL7 v2 Segment Notation | Segment-field-component-subcomponent addressing used to locate values within HL7 v2 pipe-delimited messages (e.g., `PID-5.1.1`). |
219+
| `DICOMTAG` | DICOM Tag Reference | DICOM data element tag in `(group,element)` notation used to locate metadata within DICOM files (e.g., `(0010,0010)` for Patient Name). |
220+
| `REGEX` | Regular Expression | Regular expression with a capture group that locates the target value within a text source. |
221+
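As an illustration of how a consumer might apply this codelist, the sketch below classifies a `syntax` value as published, extension, or invalid. It is an assumption-laden example rather than specified behavior: the function name `classify_syntax` is hypothetical, matching is assumed to be case-sensitive (consistent with the uppercase submission values introduced in this commit), and the `extended` argument stands in for however the caller has detected `def:ExtendedValue="Yes"`.

```python
# Submission values copied from the RWDL Path Syntax table above.
RWDL_PATH_SYNTAX = frozenset({
    "XPATH", "JSONPATH", "JSONPOINTER", "FHIRPATH", "JMESPATH", "GRAPHQL",
    "SQL", "CYPHER", "GREMLIN", "SPARQL", "HL7V2", "DICOMTAG", "REGEX",
})


def classify_syntax(value: str, extended: bool = False) -> str:
    """Classify a Path syntax value (hypothetical helper, case-sensitive by assumption)."""
    if value in RWDL_PATH_SYNTAX:
        return "published"
    if extended:
        # The sponsor has flagged the value as an extension to the codelist.
        return "extension"
    raise ValueError(f"{value!r} is not in the codelist and is not flagged as an extension")


print(classify_syntax("JSONPATH"))
print(classify_syntax("KQL", extended=True))
```

Under the case-sensitivity assumption, the pre-change spelling `JSONPath` is rejected unless it is flagged as an extension, which is why the examples in this commit switch to `JSONPATH` and `XPATH`.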
222+
**Note:** Some values appear in both this codelist and the Data Format codelist, either identically (`HL7V2`) or as a closely related pair (`DICOM`/`DICOMTAG`). They govern different attributes and are not redundant: the Data Format value declares what kind of bytes the source contains; the Path Syntax value declares which addressing language locates a value within those bytes. The two commonly co-occur for the same data point.
223+
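To make the format-versus-syntax distinction concrete, the sketch below resolves an `HL7V2` path such as `PID-5.1.1` against a message whose Data Format is also `HL7V2`. It is a simplified illustration under stated assumptions: the function name `resolve_hl7v2` and the sample message are hypothetical, the default delimiters (`|`, `^`, `&`) are assumed, and field repetitions (`~`), escape sequences, and the offset field numbering of the `MSH` segment are not handled.

```python
import re

# Segment-field[.component[.subcomponent]] reference, e.g. PID-5.1.1
REF_PATTERN = re.compile(r"([A-Z][A-Z0-9]{2})-(\d+)(?:\.(\d+))?(?:\.(\d+))?")


def pick(parts: list[str], number: int) -> str:
    """Return the 1-based item from parts, or an empty string when absent."""
    return parts[number - 1] if 1 <= number <= len(parts) else ""


def resolve_hl7v2(message: str, ref: str) -> str:
    """Resolve a segment notation reference in a pipe-delimited message (hypothetical helper)."""
    match = REF_PATTERN.fullmatch(ref)
    if match is None:
        raise ValueError(f"unsupported reference: {ref!r}")
    segment_id, field_no, comp_no, sub_no = match.groups()
    for segment in re.split(r"[\r\n]+", message.strip()):
        fields = segment.split("|")
        if fields[0] != segment_id:
            continue
        # fields[0] is the segment name, so field N sits at index N (non-MSH segments).
        value = fields[int(field_no)] if int(field_no) < len(fields) else ""
        if comp_no is not None:
            value = pick(value.split("^"), int(comp_no))
        if sub_no is not None:
            value = pick(value.split("&"), int(sub_no))
        return value
    raise LookupError(f"segment {segment_id} not found")


SAMPLE_MESSAGE = (
    "MSH|^~\\&|SENDER|FAC|RECEIVER|FAC|20240101||ADT^A01|1|P|2.5\r"
    "PID|1||12345^^^HOSP^MR||DOE^JOHN^A||19800101|M"
)
print(resolve_hl7v2(SAMPLE_MESSAGE, "PID-5.1.1"))  # family name component
```

Here the Data Format value tells the reader how to tokenize the bytes, while the Path Syntax value tells it how to interpret `PID-5.1.1`.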
224+
### Source Data Standards
225+
226+
The data model that a source conforms to (e.g., FHIR R4, OMOP CDM 5.4, PCORnet CDM, FDA Sentinel CDM, HL7 CDA R2) is not governed by an RWDL codelist. It is declared at the submission level via Define-XML's existing `def:Standards` element (Define-XML v2.1, Section 5.3.6).
227+
228+
A sponsor pulling from one or more RWD source systems declares each source standard as a `def:Standard` child element of the `def:Standards` container, alongside the CDISC target standards. Per-coordinate declaration of the source data model is therefore unnecessary: each Coordinate's source data model is implied by its URI together with the submission-level `def:Standards` declaration.
229+
230+
**Example:**
231+
232+
```xml
<def:Standards>
  <def:Standard OID="STD.SDTMIG-3.4"
                Name="SDTMIG"
                Type="IG"
                Version="3.4"
                Status="Final"/>
  <def:Standard OID="STD.OMOP-CDM-5.4"
                Name="OMOP-CDM"
                Type="IG"
                Version="5.4"
                Status="Final"/>
  <def:Standard OID="STD.FHIR-R4"
                Name="FHIR"
                Type="IG"
                Version="R4"
                Status="Final"/>
</def:Standards>
```
251+
252+
The `def:Standard/@Name` attribute is constrained by an extensible CDISC Controlled Terminology codelist that currently scopes allowed values to CDISC standards (SDTMIG, SENDIG, ADaMIG, etc.). RWD Lineage submissions require this codelist to be extended to include common source-data standards (`OMOP-CDM`, `FHIR`, `PCORNET-CDM`, `SENTINEL-CDM`, `CDA`, etc.). This extension is submitted to the CDISC Controlled Terminology team as a separate, smaller request alongside the four RWDL codelists.
253+
254+
Sponsors with multiple source data models (e.g., FHIR for EHR data and OMOP-CDM for warehouse data) declare each as a separate `def:Standard` element. Coordinates that need to reference a specific source data model in their lineage MAY do so via implementer-defined conventions (for example, embedding the standard OID in the source URI), but the controlled declaration of which standards the submission uses is centralized in `def:Standards`.
255+
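A reviewer's tooling could read the submission-level declaration back out of Define-XML to list the source data models in play. The sketch below is illustrative only: the function name `source_standards` is hypothetical, the namespace URI is assumed to be the Define-XML v2.1 one (the fragment in this section omits namespace declarations, so one is added to make it parseable), and `CDISC_TARGET_NAMES` is a partial, illustrative set rather than the full codelist.

```python
import xml.etree.ElementTree as ET

# Assumed Define-XML v2.1 namespace URI for the def: prefix.
DEF_NS = "http://www.cdisc.org/ns/def/v2.1"

# Illustrative subset of CDISC standard names; not the full codelist.
CDISC_TARGET_NAMES = {"SDTMIG", "SENDIG", "ADaMIG"}

SAMPLE = f"""<def:Standards xmlns:def="{DEF_NS}">
  <def:Standard OID="STD.SDTMIG-3.4" Name="SDTMIG" Type="IG" Version="3.4" Status="Final"/>
  <def:Standard OID="STD.OMOP-CDM-5.4" Name="OMOP-CDM" Type="IG" Version="5.4" Status="Final"/>
  <def:Standard OID="STD.FHIR-R4" Name="FHIR" Type="IG" Version="R4" Status="Final"/>
</def:Standards>"""


def source_standards(xml_text: str) -> list[tuple[str, str, str]]:
    """Return (OID, Name, Version) for each declared standard outside the CDISC subset."""
    root = ET.fromstring(xml_text)
    found = []
    for std in root.iter(f"{{{DEF_NS}}}Standard"):
        if std.get("Name") not in CDISC_TARGET_NAMES:
            found.append((std.get("OID"), std.get("Name"), std.get("Version")))
    return found


for oid, name, version in source_standards(SAMPLE):
    print(oid, name, version)
```

Because the declaration is centralized, this lookup runs once per submission rather than once per Coordinate.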
198256

199257

200258
## Lineage trail metadata
@@ -269,7 +327,7 @@ The core of the RWD-Lineage file is a collection (array) of `<MapID>` elements.
269327
<Source>
270328
<Coordinate storage="API" structure="PATH">
271329
<URI>https://api.hospital.org/fhir/R4/MedicationRequest/med-abc-123</URI>
272-
<Path syntax="JSONPath">$.medicationCodeableConcept.coding[0].code</Path>
330+
<Path syntax="JSONPATH">$.medicationCodeableConcept.coding[0].code</Path>
273331
</Coordinate>
274332
</Source>
275333
<!-- Target: SDTM CM Domain -->
@@ -292,7 +350,7 @@ The core of the RWD-Lineage file is a collection (array) of `<MapID>` elements.
292350
<Source>
293351
<Coordinate storage="FILESYSTEM" structure="PATH">
294352
<URI>file://server/records/patient_001.xml</URI>
295-
<Path syntax="XPath">/ClinicalDocument/recordTarget/patientRole/patient/birthTime/@value</Path>
353+
<Path syntax="XPATH">/ClinicalDocument/recordTarget/patientRole/patient/birthTime/@value</Path>
296354
</Coordinate>
297355
</Source>
298356
<!-- Target: SDTM DM Domain -->
