Define-JSON Introduction
"Don't send data without its Definition"
Define-JSON is an open data model for describing clinical trial datasets — what they contain, where the data came from, how it was derived, and what it means. It was developed under CDISC as part of the 360i project, and is designed to complement three existing CDISC standards:
- USDM (Unified Study Definitions Model) — the study protocol layer
- ODM (Operational Data Model) — the data collection layer
- Dataset-JSON — the data exchange layer
Where those standards handle the study design and the data itself, Define-JSON handles the metadata contract in between: the precise, machine-readable description of what data was agreed upon, what was delivered, and how the two relate.
In clinical trials today, datasets and their definitions travel separately — typically as a Define-XML file attached to a submission package. This works for regulatory review but breaks down for everything else: data integration, automated transformation, cross-study comparison, and supplier agreements.
Define-JSON addresses this by making metadata a first-class, structured, computable artifact — not a document attachment. It can express both sides of a data contract:
- Supply: "Here is what we are delivering, and this is its structure, origin, and derivation."
- Demand: "Here is what we need, with the expected structure and semantics."
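The supply and demand sides can be compared mechanically once both are structured data. The following is a minimal sketch under stated assumptions: the two dictionaries and the `unmet_demand` helper are hypothetical and do not reflect the actual Define-JSON schema. They only illustrate how a consumer's expected structure might be checked against what a provider declares.

```python
from __future__ import annotations

# Hypothetical, heavily simplified contracts: variable name -> data type.
# Real Define-JSON documents carry far more detail; these names are illustrative.
supply = {"USUBJID": "text", "VSTESTCD": "text", "VSORRES": "float"}
demand = {"USUBJID": "text", "VSORRES": "integer", "VSDTC": "datetime"}


def unmet_demand(supply: dict[str, str], demand: dict[str, str]) -> dict[str, str]:
    """Return the demanded variables that the supply side does not satisfy."""
    problems = {}
    for name, wanted_type in demand.items():
        if name not in supply:
            problems[name] = "missing from supply"
        elif supply[name] != wanted_type:
            problems[name] = f"type mismatch: supplied {supply[name]}, wanted {wanted_type}"
    return problems


gaps = unmet_demand(supply, demand)
for name, reason in gaps.items():
    print(f"{name}: {reason}")
```

Because both sides use the same shape, the same comparison works in either direction, which is the point of describing supply and demand with one model.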
The model is built in LinkML and structured around a small set of core concepts that nest and compose together.
Everything lives inside a MetaDataVersion — a complete, self-contained snapshot of all metadata for a given study or context. It holds all item groups, items, code lists, methods, conditions, concepts, and data products.
Each version is immutable once created. If something changes, a new version is derived from the old one using wasDerivedFrom, preserving a full audit trail. This "copy-and-link" approach — similar to how Git handles snapshots — ensures regulatory reproducibility and prevents cascade failures when upstream standards change.
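The copy-and-link idea can be sketched with a frozen dataclass. This is an illustration only: `MetaDataVersionSketch`, its field names, and the `derive` and `lineage` helpers are assumptions made for this example, not the model's actual classes or API.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MetaDataVersionSketch:
    """Illustrative stand-in for a MetaDataVersion; field names are assumptions."""
    oid: str
    items: tuple[str, ...]
    was_derived_from: Optional[str] = None  # OID of the version this one was copied from


def derive(parent: MetaDataVersionSketch, new_oid: str,
           items: tuple[str, ...]) -> MetaDataVersionSketch:
    """Copy-and-link: build a new snapshot that points back at its parent."""
    return replace(parent, oid=new_oid, items=items, was_derived_from=parent.oid)


def lineage(oid: str, registry: dict[str, MetaDataVersionSketch]) -> list[str]:
    """Follow was_derived_from links back to the first version."""
    chain = []
    current: Optional[str] = oid
    while current is not None:
        chain.append(current)
        current = registry[current].was_derived_from
    return chain


v1 = MetaDataVersionSketch(oid="MDV.1", items=("USUBJID", "VSORRES"))
v2 = derive(v1, "MDV.2", v1.items + ("VSDTC",))
registry = {v.oid: v for v in (v1, v2)}

try:
    v1.items = ()  # in-place mutation is rejected by the frozen dataclass
except FrozenInstanceError:
    print("v1 is immutable")

print(lineage("MDV.2", registry))
```

The original snapshot stays untouched, and the chain of `was_derived_from` links is what provides the audit trail.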
Items and ItemGroups are the workhorses of the model.
- An `Item` is a single data element: a variable with a name, data type, code list, derivation method, and origin.
- An `ItemGroup` is a collection of items: a dataset, a FHIR resource profile, a biomedical concept specialisation, or a form section.
ItemGroups can contain slices (nested sub-groups), enabling parameter-specific definitions. For example, a Vital Signs dataset can have a top-level IG.VS group with individual slices for DIABP, TEMP, WEIGHT, etc. — each with their own items, where-clauses, and constraints.
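The Vital Signs example above can be pictured as a small tree. In this sketch the `ItemGroupSketch` class, the slice OIDs such as `IG.VS.DIABP`, and the item names inside each slice are hypothetical; only the `IG.VS` group and the DIABP, TEMP, and WEIGHT parameters come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class ItemGroupSketch:
    """Illustrative item group that may nest further groups as slices."""
    oid: str
    items: list[str] = field(default_factory=list)
    slices: list[ItemGroupSketch] = field(default_factory=list)


def walk(group: ItemGroupSketch, depth: int = 0) -> Iterator[tuple[int, ItemGroupSketch]]:
    """Yield each group with its nesting depth, parent before children."""
    yield depth, group
    for child in group.slices:
        yield from walk(child, depth + 1)


vs = ItemGroupSketch(
    oid="IG.VS",
    items=["USUBJID", "VSTESTCD", "VSORRES", "VSORRESU"],
    slices=[
        # Slice OIDs and item lists below are made up for the example.
        ItemGroupSketch(oid="IG.VS.DIABP", items=["VSORRES", "VSORRESU"]),
        ItemGroupSketch(oid="IG.VS.TEMP", items=["VSORRES", "VSORRESU"]),
        ItemGroupSketch(oid="IG.VS.WEIGHT", items=["VSORRES", "VSORRESU"]),
    ],
)

outline = [("  " * depth) + group.oid for depth, group in walk(vs)]
print("\n".join(outline))
```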
- A `Condition` is a reusable logical expression, composable and nestable, using AND/OR operators or formal expressions.
- A `WhereClause` attaches conditions to a structure, defining when a particular context applies (e.g. a value list that only applies when `VSTESTCD = "DIABP"`).
- A `RangeCheck` performs simple value comparisons, resolving to pass/fail (soft warnings or hard errors).
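To make the nesting concrete, here is a minimal sketch of how composable conditions built from simple comparisons might be evaluated against a record. The class names, the `logic` field, the comparator codes, and the `VSPOS` values are assumptions chosen for the example, not the model's actual definitions.

```python
from __future__ import annotations

import operator as op
from dataclasses import dataclass, field
from typing import Any

# Comparator codes are an assumption for this sketch.
COMPARATORS = {"EQ": op.eq, "NE": op.ne, "LT": op.lt, "LE": op.le, "GT": op.gt, "GE": op.ge}


@dataclass
class RangeCheckSketch:
    """A single value comparison that resolves to pass or fail."""
    item: str
    comparator: str
    value: Any

    def passes(self, record: dict[str, Any]) -> bool:
        return COMPARATORS[self.comparator](record[self.item], self.value)


@dataclass
class ConditionSketch:
    """Combines checks and nested conditions with AND or OR."""
    logic: str  # "AND" or "OR"
    checks: list[RangeCheckSketch] = field(default_factory=list)
    children: list[ConditionSketch] = field(default_factory=list)

    def holds(self, record: dict[str, Any]) -> bool:
        results = [check.passes(record) for check in self.checks]
        results += [child.holds(record) for child in self.children]
        return all(results) if self.logic == "AND" else any(results)


# VSTESTCD = "DIABP" AND (VSPOS = "SITTING" OR VSPOS = "SUPINE")
position = ConditionSketch("OR", checks=[
    RangeCheckSketch("VSPOS", "EQ", "SITTING"),
    RangeCheckSketch("VSPOS", "EQ", "SUPINE"),
])
where_diabp = ConditionSketch(
    "AND", checks=[RangeCheckSketch("VSTESTCD", "EQ", "DIABP")], children=[position]
)

sitting_diabp = {"VSTESTCD": "DIABP", "VSPOS": "SITTING"}
standing_diabp = {"VSTESTCD": "DIABP", "VSPOS": "STANDING"}
print(where_diabp.holds(sitting_diabp))   # True
print(where_diabp.holds(standing_diabp))  # False
```

In this sketch a where-clause is simply a condition attached to a structure: the structure applies to a record only when `holds` returns true.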
- A `Method` is a reusable computational procedure describing how to derive a value. Items reference methods; methods contain formal expressions.
- An `Analysis` extends `Method` to capture analysis-specific context: the reason, purpose, input datasets, and traceability.
- A `FormalExpression` holds the actual executable logic, in a named context (e.g. SAS, R, Python), with typed parameters and a return value.
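The relationship between a method and its formal expressions can be sketched as follows. Everything here is hypothetical: the class and field names, the `MT.BMI` identifier, and the `run_python` helper are illustrative, and the use of `eval` is a shortcut for the example rather than a recommendation for untrusted input.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FormalExpressionSketch:
    """Executable logic for one named context (e.g. Python, R, SAS)."""
    context: str
    expression: str
    parameters: tuple[str, ...]


@dataclass(frozen=True)
class MethodSketch:
    """A reusable derivation that may carry one expression per context."""
    oid: str
    description: str
    expressions: tuple[FormalExpressionSketch, ...]

    def expression_for(self, context: str) -> FormalExpressionSketch:
        for expression in self.expressions:
            if expression.context == context:
                return expression
        raise LookupError(f"{self.oid} has no expression for context {context!r}")


def run_python(expr: FormalExpressionSketch, **values: float) -> float:
    """Evaluate a Python-context expression with the declared parameters only."""
    missing = set(expr.parameters) - set(values)
    if missing:
        raise TypeError(f"missing parameters: {sorted(missing)}")
    # eval with builtins disabled keeps the sketch short; do not use on untrusted input.
    return eval(expr.expression, {"__builtins__": {}}, dict(values))


bmi = MethodSketch(
    oid="MT.BMI",
    description="Body mass index from weight and height",
    expressions=(
        FormalExpressionSketch("Python", "weight_kg / height_m ** 2", ("weight_kg", "height_m")),
        FormalExpressionSketch("R", "weight_kg / height_m^2", ("weight_kg", "height_m")),
    ),
)

value = run_python(bmi.expression_for("Python"), weight_kg=70.0, height_m=1.75)
print(round(value, 2))
```

Keeping the description on the method and the code on the expression lets one derivation be stated once and implemented in several languages.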
- A `CodeList` defines the permissible values for an item, either as inline `CodeListItem` entries or as a reference to an external code list.
- Each `CodeListItem` has a coded value, a decoded label, an optional weight (for scoring), and an `other` flag.
- A `Dictionary` references external medical coding systems (e.g. MedDRA, SNOMED).
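A short sketch of how an inline code list might be used to validate and score values. The class and field names and the `CL.SEVERITY` identifier are assumptions for illustration; the `other` flag is carried along but its exact semantics are not modelled here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodeListItemSketch:
    coded_value: str
    decode: str
    weight: Optional[float] = None  # optional numeric weight, e.g. for scoring
    other: bool = False             # flag named in the text; semantics not modelled


@dataclass(frozen=True)
class CodeListSketch:
    oid: str
    items: tuple[CodeListItemSketch, ...]

    def find(self, coded_value: str) -> Optional[CodeListItemSketch]:
        """Return the entry for a coded value, or None if it is not permitted."""
        return next((i for i in self.items if i.coded_value == coded_value), None)

    def allows(self, coded_value: str) -> bool:
        return self.find(coded_value) is not None


severity = CodeListSketch(
    oid="CL.SEVERITY",  # hypothetical identifier
    items=(
        CodeListItemSketch("MILD", "Mild", weight=1.0),
        CodeListItemSketch("MODERATE", "Moderate", weight=2.0),
        CodeListItemSketch("SEVERE", "Severe", weight=3.0),
    ),
)

observed = ["MILD", "SEVERE", "FATAL"]
invalid = [v for v in observed if not severity.allows(v)]
score = sum(severity.find(v).weight for v in observed if severity.allows(v))

print(invalid)
print(score)
```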
Semantic annotation is what makes Define-JSON more than just a structural schema.
- `Coding` attaches standardised semantic tags to any element: a code, code system, and version. Think of it as linking a field to an ontology entry.
- A `ReifiedConcept` makes an abstract concept explicit and referenceable, for example a CDISC Biomedical Concept like "Diastolic Blood Pressure". An `ItemGroup` or `Item` can declare that it implements a concept.
- `ConceptProperty` defines a typed, constrained property within a concept: the blueprint that concrete items specialise from.
Together, these three classes form a semantic bridge that allows different implementations across CDISC, FHIR, OMOP, and SDMX to be compared and mapped to shared meaning.
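One way to picture the semantic bridge: two elements from different standards are considered to share meaning when they carry the same coding. The sketch below is an assumption-laden illustration; the class names, the `shared_codes` helper, and the element names are hypothetical. The LOINC codes used as tags (8462-4 for diastolic and 8480-6 for systolic blood pressure) are real.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CodingSketch:
    """A semantic tag: code system, code, and optional version."""
    system: str
    code: str
    version: str = ""


@dataclass(frozen=True)
class TaggedItemSketch:
    """An element from some standard, annotated with codings."""
    standard: str
    name: str
    codings: frozenset[CodingSketch]


def shared_codes(a: TaggedItemSketch, b: TaggedItemSketch) -> set[tuple[str, str]]:
    """Return the (system, code) pairs both items carry, ignoring version."""
    codes_a = {(c.system, c.code) for c in a.codings}
    codes_b = {(c.system, c.code) for c in b.codings}
    return codes_a & codes_b


LOINC = "http://loinc.org"
diastolic = CodingSketch(LOINC, "8462-4")  # LOINC: Diastolic blood pressure
systolic = CodingSketch(LOINC, "8480-6")   # LOINC: Systolic blood pressure

# Element names below are illustrative, not taken from the model.
cdisc_item = TaggedItemSketch("CDISC", "VSORRES where VSTESTCD = DIABP", frozenset({diastolic}))
fhir_item = TaggedItemSketch("FHIR", "Observation component (diastolic)", frozenset({diastolic}))
fhir_other = TaggedItemSketch("FHIR", "Observation component (systolic)", frozenset({systolic}))

print(shared_codes(cdisc_item, fhir_item))   # one shared LOINC code
print(shared_codes(cdisc_item, fhir_other))  # empty set
```

The structural shapes of the two elements differ, but the shared tag is enough to line them up for comparison or mapping.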
The model also supports describing data at the product and pipeline level:
- A `DataProduct` is a governed collection of datasets and services, with an owner, lifecycle status, and explicit input/output interfaces.
- A `Dataflow` is an abstract description of a data provision agreement (what structure is expected, what dimensions are constrained).
- A `Dataset` is a concrete collection of observations sharing the same dimensionality, linked to a `DataStructureDefinition`.
- A `Distribution` represents how a dataset is made available: its format and access service.
Several mixins add cross-cutting metadata to most classes:
| Mixin | What it adds |
|---|---|
| `Governed` | mandatory, owner, purpose, lastUpdated, wasDerivedFrom |
| `Identifiable` | OID, uuid, aliases |
| `Labelled` | name, label, description, translations |
| `Versioned` | version, href, resource references |
| `IsProfile` | FHIR profile, security tags, validityPeriod |
The wasDerivedFrom slot (inspired by PROV-O) is used throughout: items can derive from template items, versions from prior versions, analyses from upstream datasets.
Define-JSON is explicitly designed as a Rosetta Stone across clinical and health data standards:
| Standard | Role in Define-JSON |
|---|---|
| CDISC ODM / Define-XML | Primary heritage; full bidirectional conversion supported |
| CDISC Dataset-JSON | The data layer; Define-JSON is the accompanying metadata |
| CDISC USDM | Study protocol layer; Condition can reference USDM conditions |
| FHIR | ItemGroups can model FHIR resource profiles; IsProfile mixin |
| SDMX | DataStructureDefinition, Dimension, Measure, DataAttribute map directly |
| OMOP | Concepts and properties can map via ReifiedConcept |
| RDF / Linked Data | Full URI support; LinkML generates OWL/RDF serialisations |
The quickest entry points depending on your use case:
If you have an existing Define-XML file:

    poetry run python -m define_json xml2json data/define.xml data/output.json

If you have Dataset-JSON data and want to reverse-engineer metadata:

    python scripts/reverse_engineer_define.py examples/sample_dataset_lb.json

If you want to explore interactively:

    cd notebooks && jupyter lab datacube_end_to_end.ipynb

Full documentation: https://temeta.github.io/define-json
Source repository: https://github.com/TeMeta/define-json
- Contracts over documents — metadata should be structured and computable, not a PDF or XML attachment.
- Context-specific definitions — every element is defined within its context; the same variable can have different meanings in different ItemGroups.
- Immutable versioning — snapshots with explicit provenance, never in-place mutation.
- Semantic grounding — every element can be anchored to an ontology or abstract concept, enabling cross-standard comparison.
- Supply and demand symmetry — the same model describes what a provider delivers and what a consumer requires.
© 2026 Clinical Data Interchange Standards Consortium
CDISC is a 501(c)(3) global nonprofit charitable organization with administrative offices in Austin, Texas, with hundreds of employees, volunteers, and member organizations around the world.