6 changes: 6 additions & 0 deletions .gitignore
@@ -4,3 +4,9 @@ node_modules
.DS_Store
tmp
*.tsv.gz

# OMOP sample data is generated on-demand by omop/scripts/setup.sh
omop/data/

# Local dev helper for running OMOP Malloy queries; not part of the sample
omop/scripts/run_malloy.cjs
22 changes: 22 additions & 0 deletions omop/LICENSE_NOTICES.md
@@ -0,0 +1,22 @@
# License notices for the OMOP sample

The sample data is the **`synthea-covid19-10k`** dataset: an OMOP CDM v5.3
synthetic dataset generated with [Synthea](https://github.com/synthetichealth/synthea)
(Apache License 2.0) and the [ETL-Synthea](https://github.com/OHDSI/ETL-Synthea)
pipeline, distributed by the OHDSI / darwin-eu community as a
[CDMConnector](https://darwin-eu.github.io/CDMConnector/) example dataset. It is
downloaded by `omop/scripts/setup.sh` from the public CDMConnector example-data
store.

The data is generated synthetically and contains no real patient information
(no PHI).

The dataset bundles the OMOP standardized vocabularies (SNOMED, RxNorm, LOINC,
and others) via the [OHDSI Athena](https://athena.ohdsi.org/) project. Concept
names are reproduced here for analytical readability under those vocabularies'
redistribution terms; please attribute OHDSI Athena and the individual source
vocabularies in any derivative work.

The Malloy semantic model (`omop.malloy`), the dataset configuration
(`omop_synthea_covid.malloy`), and the notebooks in this directory are released
under the same [MIT License](../LICENSE) as the rest of `malloy-samples`.
123 changes: 123 additions & 0 deletions omop/README.malloynb
@@ -0,0 +1,123 @@
>>>markdown
# OMOP CDM Sample - Synthea COVID-19

## What is OMOP?

The **OMOP Common Data Model** is the dominant open standard for structuring observational health data across health systems. Maintained by [OHDSI](https://ohdsi.org/) (Observational Health Data Sciences and Informatics), OMOP is used by hundreds of academic medical centers, health systems, and research networks globally to standardize electronic health records into a single schema. This means a cohort study, propensity score analysis, or phenotype built on one OMOP database runs unchanged on any other - enabling evidence synthesis across health systems at scale. Learn more at [OMOP CDM docs](https://ohdsi.github.io/CommonDataModel/).

## Why Malloy for OMOP?

OMOP's power comes from standardization, but its standard schema creates inherent complexity in querying. Here's what Malloy solves:

- **The Join Explosion Problem**: OMOP practitioners routinely write queries like "show me patients with Type 2 Diabetes who were prescribed Metformin during an inpatient stay and later developed kidney disease." This seemingly simple question requires joining a patient to the condition, drug, visit, and procedure tables, each of which can hold many events per patient. In SQL, joining several of these one-to-many tables at once multiplies rows per patient, so row counts explode and aggregates become unreliable. Malloy's **symmetric aggregates** solve this automatically. When a single query on `patient` computes both `conditions.count()` and `drugs.count()`, both counts are correct even though drugs and conditions each fan out from a single patient record. The model semantics prevent the classic OMOP fan-out bug.

- **Reusable Complex Logic**: Clinical researchers repeatedly define the same cohort logic, temporal sequences, and demographic breakdowns across studies. SQL forces copy-paste, and errors creep in. Malloy's **named views** let you define these patterns once in the semantic model (`patient.demographics`, `patient.cohort_with_washout_period`) and call them by name in any analysis. Queries stay readable, and a fix to the shared logic lands in one place.

- **Vocabulary Lookups Made Simple**: OMOP stores clinical facts as numeric codes, not words. A drug exposure isn't recorded as "amlodipine" - it's `drug_concept_id = 1332419`; a patient's gender isn't "Female," it's `8532`. To turn those numbers back into readable labels, you join them to the `concept` table - OMOP's master dictionary of every standardized code (RxNorm drugs, SNOMED conditions, ICD-10, and more). In SQL you rewrite that lookup join every time you want a human-readable name - once per code column, in every query. Malloy's **reusable joins** let you declare each lookup once in the model under a plain-English alias (`gender`, `drug_concept`, `condition_concept`), so any query just asks for `drug_concept.concept_name` and gets "amlodipine 5 MG Oral Tablet" back - no repeated joins, and no memorizing which numeric column maps to which vocabulary.

- **Temporal Reasoning**: Sequence-of-events queries (drug A → condition B → drug C) are error-prone in raw SQL because they involve self-joins and window functions. Malloy's nested results and filtering let you reason about sequences more naturally: "for each patient, list their conditions ordered by date, then for each condition, show drugs given afterward."

The result: researchers spend time on medicine, not SQL debugging. Analyses are reproducible across institutions because the Malloy model captures the clinical logic once, correctly.
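
As a rough sketch of the symmetric-aggregates point above, the query below counts condition and drug events side by side. It assumes the model exposes the one-to-many joins on `patient` under the names `conditions` and `drugs`, as the shorthand above suggests; the output names (`patients`, `condition_events`, `drug_events`) are illustrative and not part of the model.

```malloy
import 'omop.malloy'

// One query, two one-to-many joins from patient.
// Assumption: the joins are named `conditions` and `drugs` in omop.malloy.
run: patient -> {
  aggregate:
    patients is count()                      // one per patient row
    condition_events is conditions.count()   // distinct condition rows
    drug_events is drugs.count()             // distinct drug rows
}
```

In hand-written SQL, joining both event tables to `person` before counting would pair each patient's condition rows with each of their drug rows; here each aggregate is evaluated against its own join, so neither count is inflated by the other.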

## About this sample

The Malloy model (`omop.malloy`) is **generic for OMOP CDM v5.3 and v5.4**. It defines tables, joins, and measures that conform to the OMOP standard and is database-agnostic. Note: OMOP 5.4 renamed four visit columns (e.g., `admitting_source_concept_id` → `admitted_from_concept_id`), but this model doesn't use those fields, so it works unchanged with both versions.

This set of Malloy examples uses the **`synthea-covid19-10k`** dataset - ~10,700 synthetic patients built with [Synthea](https://github.com/synthetichealth/synthea)'s COVID-19 module and the [ETL-Synthea](https://github.com/OHDSI/ETL-Synthea) pipeline, distributed by the OHDSI / darwin-eu community as a [CDMConnector](https://darwin-eu.github.io/CDMConnector/) example dataset (OMOP CDM v5.3). It has real demographic diversity (race/ethnicity), a populated `death` table, and the full OMOP vocabulary - enough for a realistic, stratified "Table 1" and mortality analyses.

What it does **not** have: laboratory results. The OMOP `measurement` and `observation` tables are empty in this extract, so lab/vital-based analyses are out of scope here. The `drug_exposure` dosing fields (`days_supply`, `quantity`, `refills`) are also unpopulated - `days_supply` is ~99% zero with a few artifact values - so prescription-length analyses are out of scope too.

**To point the model at a different OMOP v5.3 or v5.4 database:** edit the table paths in [`omop_synthea_covid.malloy`](omop_synthea_covid.malloy) to reference your parquet or SQL tables. The semantic model in [`omop.malloy`](omop.malloy) and all the notebooks work unchanged.
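
A hypothetical sketch of what such an edit might look like, assuming the configuration file declares one source per OMOP table over a DuckDB connection. The source name `person_table`, the connection names, and the paths are placeholders, not the names the file actually uses; keep whatever source names the file already defines and change only the table paths.

```malloy
// Hypothetical: read an OMOP table from a local Parquet file via DuckDB.
// `person_table` and the path are placeholders.
source: person_table is duckdb.table('path/to/your/omop/person.parquet')

// Hypothetical alternative: read the same table from a SQL warehouse.
// `bigquery` is an assumed connection name; substitute your own.
// source: person_table is bigquery.table('your_project.your_dataset.person')
```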

> **Before opening other notebooks:** run `bash omop/scripts/setup.sh` once from the repo root. It downloads the dataset (~840 MB) from the CDMConnector example-data blob and extracts the tables this sample uses (~195 MB) as Parquet under `omop/data/` (gitignored). The files are already Parquet, so there is no conversion step. The data includes no PHI.

## What's in the model

[`omop.malloy`](omop.malloy) defines a **generic semantic model for OMOP CDM v5.3 and v5.4**. Every OMOP database has these same tables and relationships; this model captures them once:

**Vocabulary backbone**
- **`concept`** - the OMOP vocabulary (standardized codes: RxNorm drugs, SNOMED conditions, ICD-10, etc.). One table, reused under many join aliases (`gender`, `condition_concept`, `drug_concept`, `race`, `ethnicity`). This is the "Vocabulary Lookups Made Simple" pattern in practice: instead of repeating the join to `concept` in every query, the model declares each vocabulary mapping once, and every query gets readable condition names, drug names, and gender/race labels for free.
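
For instance, a label lookup might read as follows. This is a minimal sketch that assumes the `gender` alias is reachable from the `patient` source; `gender_name` and `patients` are illustrative output names.

```malloy
import 'omop.malloy'

// Group by the readable label instead of the numeric gender_concept_id.
// Assumption: `gender` is a join to `concept` available on `patient`.
run: patient -> {
  group_by: gender_name is gender.concept_name
  aggregate: patients is count()
}
```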

**Patient-centric tables**
- **`person`** - patient demographics (age, gender, race, ethnicity), joined to `concept` for human-readable demographic values. Single source of truth for patient identifiers and baseline characteristics.
>>>malloy
import 'omop.malloy'

# bar_chart
run: patient -> demographics
>>>markdown

- **`condition_occurrence`**, **`drug_exposure`**, **`visit_occurrence`**, **`procedure_occurrence`** - event-level fact tables (one row per event), each with:
  - Foreign key to `person` (which patient had this event)
  - Foreign key to `concept` (the standardized code - e.g., RxNorm, SNOMED, ICD-10)
  - Temporal columns (start_date, end_date)
  - This structure is identical across all OMOP databases, enabling model portability.
- **`death`** - one row per deceased patient (date + cause). Joined one-to-one onto `patient`, it powers the `vital_status` dimension and mortality outcomes.
- **`observation_period`** - the window over which a patient is observed; the model derives `followup_years` from it.

**The symmetric aggregates pattern**
- **`patient`** - extends `person` with `join_many:` relationships to conditions, drugs, visits, and procedures. This source solves the "Join Explosion Problem" by encoding the fan-out structure in the model:
  - In a query on `patient`, `conditions.count()` correctly counts condition records
  - In the same way, `drugs.count()` correctly counts drug records
  - Both can appear in the same query with correct totals, even though both are one-to-many joins from `patient`
  - Pre-built measures (`total_conditions`, `total_drugs`, etc.) let queries reuse the same aggregation logic
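
A sketch of reusing the pre-built measures, assuming `total_conditions` and `total_drugs` are measures defined on `patient` as described above. The derived names (`patients`, `conditions_per_patient`, `drugs_per_patient`) are illustrative.

```malloy
import 'omop.malloy'

run: patient -> {
  aggregate:
    patients is count()
    total_conditions
    total_drugs
    // Per-patient averages built from the model's own measures.
    conditions_per_patient is total_conditions / count()
    drugs_per_patient is total_drugs / count()
}
```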
>>>malloy
run: patient -> top_conditions
>>>malloy
run: patient -> top_drugs
>>>markdown

**Named views for clinical patterns**
- `patient.demographics` - groups patients by gender and age band with counts. Reusable template for stratified analysis.
- `patient.top_conditions`, `patient.top_drugs` - show the most frequent conditions and drugs across the cohort. Eliminates copy-paste of "top N by frequency" logic.
- These views enable the "Reusable Complex Logic" pattern: define a cohort, demographic breakdown, or scoring rule once in the model; call it by name in any analysis.
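
A sketch of calling the named views, assuming `top_conditions` is a view on `patient` and `gender` is a join on it. Refinement (`+ { ... }`) and `nest:` are standard Malloy; the output names are illustrative, and the filter uses the standard `gender_concept_id` column with the Female code `8532` mentioned earlier.

```malloy
import 'omop.malloy'

// Refine a named view with an extra filter; the view's logic is unchanged.
run: patient -> top_conditions + { where: gender_concept_id = 8532 }

// Reuse the same view as a nested result inside another query.
run: patient -> {
  group_by: gender_name is gender.concept_name
  aggregate: patients is count()
  nest: top_conditions
}
```

The second query returns one row per gender label, each carrying its own top-conditions table, without restating the ranking logic.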

## A few simple examples

We start with a quick look at the patient population. The `patient.demographics` view rolls patients up by gender × age band. This demonstrates Malloy's **named views**: reusable query templates defined once in the model and called by name. The view encapsulates the grouping and aggregation logic, so individual analyses do not have to copy and paste it.
>>>markdown
## Where to go next

The notebooks below are a learning path. Each stage builds on the Malloy concepts
from the one before, so if you're new to Malloy or OMOP, work through them top to
bottom. Within a notebook the cells also ramp from a simple query to a more
involved one.

### 1. Basics: query a single table

| Notebook | What you'll do | Malloy concepts / ideas |
|---|---|---|
| [`vocabulary_explorer.malloynb`](vocabulary_explorer.malloynb) | Search the OMOP vocabulary for the condition / drug names you'll filter on | `where`, regex match (`~`), `select`, reusing a model measure |
| [`prevalence.malloynb`](prevalence.malloynb) | Count a condition over time, then break it down by age band | `group_by` + `aggregate`, reaching a dimension through a join, `all()` for share-of-total |

### 2. The patient model: explore the fact tables

These introduce the `patient` source and *symmetric aggregates* - the feature that keeps per-patient counts correct when conditions, drugs, visits, and procedures all fan out from one patient.

| Notebook | What you'll do | New Malloy concept |
|---|---|---|
| [`procedures.malloynb`](procedures.malloynb) | Top procedures, by demographics, and co-occurring with a condition | `top:`, `patient` symmetric aggregates, two-stage "aggregate then analyze" pipeline |
| [`drug_exposure.malloynb`](drug_exposure.malloynb) | Top drugs, days-supply spread, drug↔condition co-occurrence | symmetric aggregates across two join paths |
| [`healthcare_utilization.malloynb`](healthcare_utilization.malloynb) | Visits by setting, visits per patient, utilization by outcome group | multi-stage pipelines, the cohort → `join_one: patient` pattern |

### 3. Define and reuse cohorts

| Notebook | What you'll do | New Malloy concept |
|---|---|---|
| [`cohorts.malloynb`](cohorts.malloynb) | Define a study population with condition + drug filters; list its comorbidities | multi-clause `where:` across join-many, `nest:` for drill-downs |
| [`parameterized_cohorts.malloynb`](parameterized_cohorts.malloynb) | Run the same cohort analysis for different conditions and compare | reusing one query shape by changing only the `where:` |

### 4. Outcomes over time: building to a Table 1

| Notebook | What you'll do | New Malloy concept |
|---|---|---|
| [`treatment_pathways.malloynb`](treatment_pathways.malloynb) | Rank each patient's prescriptions by date; find first-line therapy | window functions: `row_number()` with `partition_by`, rank-then-aggregate |
| [`comorbidity_burden.malloynb`](comorbidity_burden.malloynb) | Bucket patients by comorbidity count; show mortality rising with burden | filtered distinct count, `pick` bucketing, inline mortality numerator |
| [`cohort_outcomes_followup.malloynb`](cohort_outcomes_followup.malloynb) | Set an index event, follow forward to death: rate by age/race, time-to-death | index event with `min()`, `days()` intervals, longitudinal pipeline |
| [`baseline_characteristics.malloynb`](baseline_characteristics.malloynb) | Assemble a publishable, stratified **Table 1** (survived vs died) | `source:` definitions, filtered aggregates as pivots, wide→long reshape, `exclude()` |

For background on the Malloy patterns used here - symmetric aggregates, views, nested results - see the [`patterns/`](../patterns/) directory.
>>>markdown
