
define json introduction

pendingintent edited this page May 28, 2026 · 1 revision

Define-JSON: Introduction to the Data Model

"Don't send data without its Definition"

What is Define-JSON?

Define-JSON is an open data model for describing clinical trial datasets — what they contain, where the data came from, how it was derived, and what it means. It was developed under CDISC as part of the 360i project, and is designed to complement three existing CDISC standards:

  • USDM (Unified Study Definitions Model) — the study protocol layer
  • ODM (Operational Data Model) — the data collection layer
  • Dataset-JSON — the data exchange layer

Where those standards handle the study design and the data itself, Define-JSON handles the metadata contract in between: the precise, machine-readable description of what data was agreed upon, what was delivered, and how the two relate.


The Core Problem It Solves

In clinical trials today, datasets and their definitions travel separately, with the definitions typically shipped as a Define-XML file attached to a submission package. This works for regulatory review but breaks down for everything else: data integration, automated transformation, cross-study comparison, and supplier agreements.

Define-JSON addresses this by making metadata a first-class, structured, computable artifact — not a document attachment. It can express both sides of a data contract:

  • Supply: "Here is what we are delivering, and this is its structure, origin, and derivation."
  • Demand: "Here is what we need, with the expected structure and semantics."
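
The supply and demand view can be made concrete with a small comparison. The sketch below is only an illustration: the two dictionaries are a hypothetical, heavily simplified stand-in for a contract (variable name mapped to data type), not the Define-JSON serialisation, and compare_contracts is an invented helper.

```python
# Hypothetical, simplified contracts: variable name -> data type.
# These dicts are illustrative only, not the Define-JSON format.
demand = {"USUBJID": "text", "VSTESTCD": "text", "VSORRES": "float"}
supply = {"USUBJID": "text", "VSTESTCD": "text", "VSORRES": "text", "VSDTC": "datetime"}


def compare_contracts(demand: dict[str, str], supply: dict[str, str]) -> dict[str, list[str]]:
    """Report where a delivery (supply) diverges from a request (demand)."""
    missing = sorted(name for name in demand if name not in supply)
    extra = sorted(name for name in supply if name not in demand)
    mismatched = sorted(
        name for name in demand
        if name in supply and supply[name] != demand[name]
    )
    return {"missing": missing, "extra": extra, "type_mismatch": mismatched}


report = compare_contracts(demand, supply)
print(report)
```

Because both sides use the same shape, the comparison needs no translation step, which is the point of describing supply and demand with one model.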

How the Model is Organised

The model is built in LinkML and structured around a small set of core concepts that nest and compose together.

The Container: MetaDataVersion

Everything lives inside a MetaDataVersion — a complete, self-contained snapshot of all metadata for a given study or context. It holds all item groups, items, code lists, methods, conditions, concepts, and data products.

Each version is immutable once created. If something changes, a new version is derived from the old one using wasDerivedFrom, preserving a full audit trail. This "copy-and-link" approach, similar to how Git handles snapshots, keeps earlier versions reproducible for regulatory purposes and stops a change in an upstream standard from cascading into versions that already depend on it.
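
A minimal sketch of the copy-and-link idea, assuming a heavily reduced MetaDataVersion with only an OID, a tuple of item OIDs, and a wasDerivedFrom link. The field names other than wasDerivedFrom, and the derive and lineage helpers, are hypothetical and not taken from the published schema.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MetaDataVersion:
    """Hypothetical, reduced stand-in for the real class."""
    oid: str
    item_oids: tuple[str, ...]
    wasDerivedFrom: Optional[str] = None  # OID of the parent version, if any


def derive(parent: MetaDataVersion, new_oid: str, **changes) -> MetaDataVersion:
    """Copy-and-link: build a new snapshot that points back at its parent."""
    return replace(parent, oid=new_oid, wasDerivedFrom=parent.oid, **changes)


def lineage(version: MetaDataVersion, registry: dict[str, MetaDataVersion]) -> list[str]:
    """Walk wasDerivedFrom links back to the first version."""
    chain = [version.oid]
    while version.wasDerivedFrom is not None:
        version = registry[version.wasDerivedFrom]
        chain.append(version.oid)
    return chain


v1 = MetaDataVersion(oid="MDV.1", item_oids=("IT.VSORRES",))
v2 = derive(v1, "MDV.2", item_oids=v1.item_oids + ("IT.VSDTC",))
registry = {v.oid: v for v in (v1, v2)}

print(lineage(v2, registry))  # ['MDV.2', 'MDV.1']
```

The frozen dataclass rejects in-place assignment, so the only way to change anything is to derive a new version, and v1 stays exactly as it was.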


Data Structure: ItemGroup and Item

These are the workhorses of the model.

  • An Item is a single data element: a variable with a name, data type, code list, derivation method, and origin.
  • An ItemGroup is a collection of items — a dataset, a FHIR resource profile, a biomedical concept specialisation, or a form section.

ItemGroups can contain slices (nested sub-groups), enabling parameter-specific definitions. For example, a Vital Signs dataset can have a top-level IG.VS group with individual slices for DIABP, TEMP, WEIGHT, etc., each with its own items, where-clauses, and constraints.
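
The nesting can be pictured as a small tree. The following sketch assumes simplified Item and ItemGroup classes with invented field names (oid, data_type, slices) and invented OIDs below IG.VS; it shows the shape of the idea, not the actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    oid: str
    name: str
    data_type: str


@dataclass
class ItemGroup:
    oid: str
    items: list[Item] = field(default_factory=list)
    slices: list["ItemGroup"] = field(default_factory=list)

    def walk(self, depth: int = 0):
        """Yield (depth, group) for this group and every nested slice."""
        yield depth, self
        for child in self.slices:
            yield from child.walk(depth + 1)


vs = ItemGroup(
    oid="IG.VS",
    items=[Item("IT.VS.VSTESTCD", "VSTESTCD", "text")],
    slices=[
        ItemGroup("IG.VS.DIABP", items=[Item("IT.VS.DIABP.VSORRES", "VSORRES", "integer")]),
        ItemGroup("IG.VS.TEMP", items=[Item("IT.VS.TEMP.VSORRES", "VSORRES", "float")]),
    ],
)

for depth, group in vs.walk():
    print("  " * depth + group.oid, [item.name for item in group.items])
```

Here VSORRES appears in two slices with different data types, which is the kind of parameter-specific definition that slices allow.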


Logic and Conditions: Condition, WhereClause, RangeCheck

  • A Condition is a reusable logical expression — composable and nestable, using AND/OR operators or formal expressions.
  • A WhereClause attaches conditions to a structure, defining when a particular context applies (e.g. a value list that only applies when VSTESTCD = "DIABP").
  • A RangeCheck performs a simple value comparison that resolves to pass or fail; a failure is reported as either a soft warning or a hard error.
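
One way these three classes could fit together is sketched below. The class shapes, the comparator codes (borrowed from the ODM convention), the severity labels, and the numeric limits are all assumptions made for illustration; the limits in particular are not clinical guidance.

```python
import operator
from dataclasses import dataclass

# Comparator codes follow the ODM convention; treat them as an assumption here.
COMPARATORS = {
    "EQ": operator.eq, "NE": operator.ne,
    "LT": operator.lt, "LE": operator.le,
    "GT": operator.gt, "GE": operator.ge,
}


@dataclass
class RangeCheck:
    item: str
    comparator: str
    value: object
    severity: str = "Hard"  # "Soft" = warning, "Hard" = error

    def passes(self, record: dict) -> bool:
        return COMPARATORS[self.comparator](record[self.item], self.value)


@dataclass
class Condition:
    logic: str      # "AND" or "OR"
    children: list  # RangeCheck or nested Condition objects

    def passes(self, record: dict) -> bool:
        results = [child.passes(record) for child in self.children]
        return all(results) if self.logic == "AND" else any(results)


@dataclass
class WhereClause:
    condition: Condition

    def applies_to(self, record: dict) -> bool:
        return self.condition.passes(record)


is_diabp = WhereClause(Condition("AND", [RangeCheck("VSTESTCD", "EQ", "DIABP")]))
diabp_checks = [
    RangeCheck("VSORRES", "GE", 30, severity="Soft"),   # illustrative limits,
    RangeCheck("VSORRES", "LE", 150, severity="Soft"),  # not clinical guidance
]


def validate(record: dict) -> list[tuple[str, str]]:
    """Return (severity, comparator) for each failed check that applies."""
    if not is_diabp.applies_to(record):
        return []
    return [(c.severity, c.comparator) for c in diabp_checks if not c.passes(record)]


print(validate({"VSTESTCD": "DIABP", "VSORRES": 200}))  # [('Soft', 'LE')]
print(validate({"VSTESTCD": "TEMP", "VSORRES": 200}))   # []
```

The where-clause decides whether the checks are relevant at all, so a temperature record is never tested against blood pressure limits.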

Derivation and Methods: Method, Analysis, FormalExpression

  • A Method is a reusable computational procedure describing how to derive a value. Items reference methods; methods contain formal expressions.
  • An Analysis extends Method to capture analysis-specific context: the reason, purpose, input datasets, and traceability.
  • A FormalExpression holds the actual executable logic, in a named context (e.g. SAS, R, Python), with typed parameters and a return value.
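
The relationship between a Method and its FormalExpressions might look like the sketch below. The field names, the BMI example, and the run_python helper are hypothetical; evaluating expression text with eval is used only to keep the illustration short and is not a safe way to run untrusted logic.

```python
from dataclasses import dataclass, field


@dataclass
class FormalExpression:
    context: str     # e.g. "Python", "SAS", "R"
    expression: str  # source text in that language
    parameters: dict[str, type] = field(default_factory=dict)
    return_type: type = float


@dataclass
class Method:
    oid: str
    description: str
    expressions: list[FormalExpression]

    def expression_for(self, context: str) -> FormalExpression:
        """Pick the expression written for the requested context."""
        return next(e for e in self.expressions if e.context == context)


def run_python(expr: FormalExpression, **values):
    """Evaluate a Python-context expression after checking parameter types."""
    for name, expected in expr.parameters.items():
        if not isinstance(values[name], expected):
            raise TypeError(f"{name} must be {expected.__name__}")
    # eval is for illustration only; it is not a sandbox for untrusted text.
    result = eval(expr.expression, {"__builtins__": {}}, dict(values))
    return expr.return_type(result)


bmi = Method(
    oid="MT.BMI",
    description="Body mass index from weight (kg) and height (cm)",
    expressions=[
        FormalExpression("Python", "WEIGHT / (HEIGHT / 100) ** 2",
                         {"WEIGHT": float, "HEIGHT": float}),
        FormalExpression("SAS", "BMI = WEIGHT / (HEIGHT / 100) ** 2;"),
    ],
)

bmi_value = run_python(bmi.expression_for("Python"), WEIGHT=70.0, HEIGHT=175.0)
print(round(bmi_value, 1))  # 22.9
```

Keeping one Method with several context-specific expressions lets the same derivation be described once and executed in whichever language a consumer uses.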

Controlled Vocabulary: CodeList, CodeListItem, Dictionary

  • A CodeList defines the permissible values for an item — either as inline CodeListItem entries or as a reference to an external code list.
  • Each CodeListItem has a coded value, a decoded label, an optional weight (for scoring), and an "other" flag.
  • A Dictionary references external medical coding systems (e.g. MedDRA, SNOMED).
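
As a rough illustration of how a code list constrains and scores values, the sketch below uses simplified CodeList and CodeListItem classes. The attribute names, the CL.AESEV OID, and the weights are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodeListItem:
    coded_value: str
    decode: str
    weight: Optional[float] = None
    other: bool = False


@dataclass
class CodeList:
    oid: str
    items: list[CodeListItem]

    def find(self, coded_value: str) -> Optional[CodeListItem]:
        """Return the matching entry, or None if the value is not permitted."""
        return next((i for i in self.items if i.coded_value == coded_value), None)


severity = CodeList("CL.AESEV", [
    CodeListItem("MILD", "Mild", weight=1),
    CodeListItem("MODERATE", "Moderate", weight=2),
    CodeListItem("SEVERE", "Severe", weight=3),
])

collected = ["MILD", "SEVERE", "UNKNOWN"]
not_permitted = [v for v in collected if severity.find(v) is None]
score = sum(severity.find(v).weight for v in collected if severity.find(v))

print(not_permitted)  # ['UNKNOWN']
print(score)          # 4
```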

Semantics: Coding, ReifiedConcept, ConceptProperty

This is what makes Define-JSON more than just a structural schema.

  • Coding attaches standardised semantic tags to any element — a code, code system, and version. Think of it as linking a field to an ontology entry.
  • A ReifiedConcept makes an abstract concept explicit and referenceable — for example, a CDISC Biomedical Concept like "Diastolic Blood Pressure". An ItemGroup or Item can declare that it implements a concept.
  • ConceptProperty defines a typed, constrained property within a concept — the blueprint that concrete items specialise from.

Together, these three classes form a semantic bridge that allows different implementations across CDISC, FHIR, OMOP, and SDMX to be compared and mapped to shared meaning.
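
The bridging idea can be sketched by grouping items from different standards under the concept they implement. The classes below are hypothetical simplifications, and the OIDs and element paths are illustrative; the only external identifier used is the LOINC code 8462-4 for diastolic blood pressure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Coding:
    code: str
    code_system: str
    version: str = ""


@dataclass(frozen=True)
class ReifiedConcept:
    oid: str
    name: str
    codings: tuple[Coding, ...] = ()


@dataclass(frozen=True)
class Item:
    oid: str
    standard: str
    path: str
    implements: str  # OID of the ReifiedConcept this item realises


diabp = ReifiedConcept(
    oid="RC.DIABP",
    name="Diastolic Blood Pressure",
    codings=(Coding("8462-4", "http://loinc.org"),),
)

# Paths are illustrative locations of the value in each standard.
items = [
    Item("IT.SDTM.DIABP", "CDISC SDTM", "VS.VSORRES", diabp.oid),
    Item("IT.FHIR.DIABP", "FHIR", "Observation.component.valueQuantity", diabp.oid),
    Item("IT.OMOP.DIABP", "OMOP", "measurement.value_as_number", diabp.oid),
]


def by_concept(all_items: list[Item]) -> dict[str, list[str]]:
    """Group implementations by the concept they declare."""
    grouped = defaultdict(list)
    for item in all_items:
        grouped[item.implements].append(f"{item.standard}: {item.path}")
    return dict(grouped)


for line in by_concept(items)[diabp.oid]:
    print(line)
```

Items that share a concept OID are candidates for mapping to one another, even though their structural locations have nothing in common.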


Data Flow: DataProduct, Dataflow, Dataset, Distribution

The model also supports describing data at the product and pipeline level:

  • A DataProduct is a governed collection of datasets and services — with an owner, lifecycle status, and explicit input/output interfaces.
  • A Dataflow is an abstract description of a data provision agreement (what structure is expected, what dimensions are constrained).
  • A Dataset is a concrete collection of observations sharing the same dimensionality, linked to a DataStructureDefinition.
  • A Distribution represents how a dataset is made available — its format and access service.
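
The statement that a Dataset's observations share the same dimensionality can be checked mechanically. This sketch assumes a reduced DataStructureDefinition holding only dimension and measure names, plus a hypothetical nonconforming helper; the real classes carry considerably more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStructureDefinition:
    dimensions: tuple[str, ...]  # keys that identify an observation
    measures: tuple[str, ...]    # values that are observed


@dataclass
class Dataset:
    structure: DataStructureDefinition
    observations: list[dict]

    def nonconforming(self) -> list[int]:
        """Return indexes of observations whose keys differ from the structure."""
        expected = set(self.structure.dimensions) | set(self.structure.measures)
        return [i for i, obs in enumerate(self.observations) if set(obs) != expected]


vs_structure = DataStructureDefinition(
    dimensions=("USUBJID", "VSTESTCD", "VISITNUM"),
    measures=("VSORRES",),
)
vs_data = Dataset(vs_structure, [
    {"USUBJID": "001", "VSTESTCD": "DIABP", "VISITNUM": 1, "VSORRES": 80},
    {"USUBJID": "001", "VSTESTCD": "DIABP", "VSORRES": 82},  # VISITNUM missing
])

print(vs_data.nonconforming())  # [1]
```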

Governance and Provenance

Several mixins add cross-cutting metadata to most classes:

| Mixin | What it adds |
| --- | --- |
| Governed | mandatory, owner, purpose, lastUpdated, wasDerivedFrom |
| Identifiable | OID, uuid, aliases |
| Labelled | name, label, description, translations |
| Versioned | version, href, resource references |
| IsProfile | FHIR profile, security tags, validityPeriod |

The wasDerivedFrom slot (inspired by PROV-O) is used throughout: items can derive from template items, versions from prior versions, analyses from upstream datasets.
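
In Python terms, the mixins behave roughly like small base classes that a concrete class combines. The sketch below is an analogy only: it reproduces a subset of the slots listed for Identifiable, Labelled, and Governed as dataclass fields, and the Item class, OIDs, and owners are invented for the example.

```python
from dataclasses import dataclass, fields


@dataclass
class Identifiable:
    OID: str = ""
    uuid: str = ""


@dataclass
class Labelled:
    name: str = ""
    label: str = ""
    description: str = ""


@dataclass
class Governed:
    owner: str = ""
    purpose: str = ""
    wasDerivedFrom: str = ""  # OID of the element this one was derived from


@dataclass
class Item(Identifiable, Labelled, Governed):
    data_type: str = "text"


template = Item(OID="IT.TEMPLATE.VSORRES", name="VSORRES", owner="Standards team")
study_item = Item(
    OID="IT.STUDY1.VSORRES",
    name="VSORRES",
    owner="Study 1 data management",
    wasDerivedFrom=template.OID,
)

print(sorted(f.name for f in fields(Item)))
print(study_item.wasDerivedFrom)  # IT.TEMPLATE.VSORRES
```

The study-level item keeps its own owner while still pointing back at the template it specialises, which is the item-from-template case described above.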


Interoperability

Define-JSON is explicitly designed as a Rosetta Stone across clinical and health data standards:

| Standard | Role in Define-JSON |
| --- | --- |
| CDISC ODM / Define-XML | Primary heritage; full bidirectional conversion supported |
| CDISC Dataset-JSON | The data layer; Define-JSON is the accompanying metadata |
| CDISC USDM | Study protocol layer; Condition can reference USDM conditions |
| FHIR | ItemGroups can model FHIR resource profiles; IsProfile mixin |
| SDMX | DataStructureDefinition, Dimension, Measure, DataAttribute map directly |
| OMOP | Concepts and properties can map via ReifiedConcept |
| RDF / Linked Data | Full URI support; LinkML generates OWL/RDF serialisations |

Getting Started

The quickest entry point depends on your use case:

If you have an existing Define-XML file:

poetry run python -m define_json xml2json data/define.xml data/output.json

If you have Dataset-JSON data and want to reverse-engineer metadata:

python scripts/reverse_engineer_define.py examples/sample_dataset_lb.json

If you want to explore interactively:

cd notebooks && jupyter lab datacube_end_to_end.ipynb

Full documentation: https://temeta.github.io/define-json
Source repository: https://github.com/TeMeta/define-json


Key Design Principles

  1. Contracts over documents — metadata should be structured and computable, not a PDF or XML attachment.
  2. Context-specific definitions — every element is defined within its context; the same variable can have different meanings in different ItemGroups.
  3. Immutable versioning — snapshots with explicit provenance, never in-place mutation.
  4. Semantic grounding — every element can be anchored to an ontology or abstract concept, enabling cross-standard comparison.
  5. Supply and demand symmetry — the same model describes what a provider delivers and what a consumer requires.
