malloydata · ofermend · Jun 1, 2026 · Jun 3, 2026 · Jun 4, 2026
diff --git a/.gitignore b/.gitignore
@@ -4,3 +4,9 @@ node_modules
 .DS_Store
 tmp
 *.tsv.gz
+
+# OMOP sample data is generated on-demand by omop/scripts/setup.sh
+omop/data/
+
+# Local dev helper for running OMOP Malloy queries; not part of the sample
+omop/scripts/run_malloy.cjs
diff --git a/omop/LICENSE_NOTICES.md b/omop/LICENSE_NOTICES.md
@@ -0,0 +1,22 @@
+# License notices for the OMOP sample
+
+The sample data is the **`synthea-covid19-10k`** dataset: an OMOP CDM v5.3
+synthetic dataset generated with [Synthea](https://github.com/synthetichealth/synthea)
+(Apache License 2.0) and the [ETL-Synthea](https://github.com/OHDSI/ETL-Synthea)
+pipeline, distributed by the OHDSI / darwin-eu community as a
+[CDMConnector](https://darwin-eu.github.io/CDMConnector/) example dataset. It is
+downloaded by `omop/scripts/setup.sh` from the public CDMConnector example-data
+store.
+
+The data is generated synthetically and contains no real patient information
+(no PHI).
+
+The dataset bundles the OMOP standardized vocabularies (SNOMED, RxNorm, LOINC,
+and others) via the [OHDSI Athena](https://athena.ohdsi.org/) project. Concept
+names are reproduced here for analytical readability under those vocabularies'
+redistribution terms; please attribute OHDSI Athena and the individual source
+vocabularies in any derivative work.
+
+The Malloy semantic model (`omop.malloy`), the dataset configuration
+(`omop_synthea_covid.malloy`), and the notebooks in this directory are released
+under the same [MIT License](../LICENSE) as the rest of `malloy-samples`.
diff --git a/omop/README.malloynb b/omop/README.malloynb
@@ -0,0 +1,123 @@
+>>>markdown
+# OMOP CDM Sample - Synthea COVID-19
+
+## What is OMOP?
+
+The **OMOP Common Data Model** is the dominant open standard for structuring observational health data across health systems. Maintained by [OHDSI](https://ohdsi.org/) (Observational Health Data Sciences and Informatics), OMOP is used by hundreds of academic medical centers, health systems, and research networks globally to standardize electronic health records into a single schema. This means a cohort study, propensity score analysis, or phenotype built on one OMOP database runs unchanged on any other - enabling evidence synthesis across health systems at scale. Learn more at [OMOP CDM docs](https://ohdsi.github.io/CommonDataModel/).
+
+## Why Malloy for OMOP?
+
+OMOP's power comes from standardization, but its standard schema creates inherent complexity in querying. Here's what Malloy solves:
+
+- **The Join Explosion Problem**: OMOP practitioners routinely write queries like "show me patients with Type 2 Diabetes who were prescribed Metformin during an inpatient stay and later developed kidney disease." This seemingly simple question requires joining a patient to the conditions, drugs, visits, and procedures tables - each with many events per patient. In SQL, joining several one-to-many tables in one query multiplies rows, so counts and sums are silently inflated. Malloy's **symmetric aggregates** solve this automatically. When you compute `patient -> conditions.count()` and `patient -> drugs.count()` in the same query, both counts are correct even though drugs and conditions both fan out from a single patient record. The model semantics prevent the classic OMOP fan-out bug.
+
+- **Reusable Complex Logic**: Clinical researchers repeatedly define the same cohort logic, temporal sequences, and demographic breakdowns across studies. SQL forces copy-paste, and errors creep in. Malloy's **named views** let you define these patterns once in the semantic model (`patient.demographics`, `patient.cohort_with_washout_period`) and call them by name in any analysis. Queries stay readable, and a mistake only has to be fixed in one place.
+
+- **Vocabulary Lookups Made Simple**: OMOP stores clinical facts as numeric codes, not words. A drug exposure isn't recorded as "amlodipine" - it's `drug_concept_id = 1332419`; a patient's gender isn't "Female," it's `8532`. To turn those numbers back into readable labels, you join them to the `concept` table - OMOP's master dictionary of every standardized code (RxNorm drugs, SNOMED conditions, ICD-10, and more). In SQL you rewrite that lookup join every time you want a human-readable name - once per code column, in every query. Malloy's **reusable joins** let you declare each lookup once in the model under a plain-English alias (`gender`, `drug_concept`, `condition_concept`), so any query just asks for `drug_concept.concept_name` and gets a readable name such as "amlodipine 5 MG Oral Tablet" back - no repeated joins, and no memorizing which numeric column maps to which vocabulary.
+
+- **Temporal Reasoning**: Sequence-of-events queries (drug A → condition B → drug C) are error-prone in raw SQL because they involve self-joins and window functions. Malloy's nested results and filtering let you reason about sequences more naturally: "for each patient, list their conditions ordered by date, then for each condition, show drugs given afterward."
+
+The result: researchers spend time on medicine, not SQL debugging. Analyses are reproducible across institutions because the Malloy model captures the clinical logic once, correctly.
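+
+As a rough sketch of the symmetric-aggregates point above: the query below assumes the `patient` source exposes `conditions` and `drugs` as `join_many:` joins and `gender` as a `concept` alias, as this README describes. Those names are taken from the prose here rather than checked against `omop.malloy`, and the output names (`patient_count`, `condition_rows`, `drug_rows`) are made up for illustration.
+
+```malloy
+import 'omop.malloy'
+
+// Assumed names: `conditions`, `drugs`, and `gender` are the joins this README describes.
+run: patient -> {
+  group_by: gender.concept_name
+  aggregate:
+    patient_count is count()              // counts person rows, not joined event rows
+    condition_rows is conditions.count()  // counted along the conditions join only
+    drug_rows is drugs.count()            // counted along the drugs join only
+}
+```
+
+In plain SQL, the same three numbers would typically need one subquery per event table or a `COUNT(DISTINCT ...)` per key to avoid multiplying rows; here each aggregate is computed against its own join path.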
+
+## About this sample
+
+The Malloy model (`omop.malloy`) is **generic for OMOP CDM v5.3 and v5.4**. The model defines tables, joins, and measures that conform to the OMOP standard - it's database-agnostic. Note: OMOP 5.4 renamed four visit columns (e.g., `admitting_source_concept_id` → `admitted_from_concept_id`), but this model doesn't use those fields, so it works unchanged with both versions.
+
+This set of Malloy examples uses the **`synthea-covid19-10k`** dataset - ~10,700 synthetic patients built with [Synthea](https://github.com/synthetichealth/synthea)'s COVID-19 module and the [ETL-Synthea](https://github.com/OHDSI/ETL-Synthea) pipeline, distributed by the OHDSI / darwin-eu community as a [CDMConnector](https://darwin-eu.github.io/CDMConnector/) example dataset (OMOP CDM v5.3). It has real demographic diversity (race/ethnicity), a populated `death` table, and the full OMOP vocabulary - enough for a realistic, stratified "Table 1" and mortality analyses.
+
+What it does **not** have: laboratory results. The OMOP `measurement` and `observation` tables are empty in this extract, so lab/vital-based analyses are out of scope here. The `drug_exposure` dosing fields (`days_supply`, `quantity`, `refills`) are also unpopulated - `days_supply` is ~99% zero with a few artifact values - so prescription-length analyses are out of scope too.
+
+**To point the model at a different OMOP v5.3 or v5.4 database:** edit the table paths in [`omop_synthea_covid.malloy`](omop_synthea_covid.malloy) to reference your parquet or SQL tables. The semantic model in [`omop.malloy`](omop.malloy) and all the notebooks work unchanged.
+
+> **Before opening other notebooks:** run `bash omop/scripts/setup.sh` once from the repo root. It downloads the dataset (~840 MB) from the CDMConnector example-data blob and extracts the tables this sample uses (~195 MB) as Parquet under `omop/data/` (gitignored). The files are already Parquet, so there is no conversion step. The data includes no PHI.
+
+## What's in the model
+
+[`omop.malloy`](omop.malloy) defines a **generic semantic model for OMOP CDM v5.3 and v5.4**. Every OMOP database has these same tables and relationships; this model captures them once:
+
+**Vocabulary backbone**
+- **`concept`** - the OMOP vocabulary (standardized codes: RxNorm drugs, SNOMED conditions, ICD-10, etc.). One table, reused under many join aliases (`gender`, `condition_concept`, `drug_concept`, `race`, `ethnicity`). This is the model's answer to the vocabulary-lookup problem described above: instead of re-joining to `concept` in every query, the model declares each lookup once, and every query can read condition names, drug names, and gender/race labels through those aliases.
+
+**Patient-centric tables**
+- **`person`** - patient demographics (age, gender, race, ethnicity), joined to `concept` for human-readable demographic values. Single source of truth for patient identifiers and baseline characteristics.
+>>>malloy
+import 'omop.malloy'
+
+# bar_chart
+run: patient -> demographics
+>>>markdown
+
+- **`condition_occurrence`**, **`drug_exposure`**, **`visit_occurrence`**, **`procedure_occurrence`** - event-level fact tables (one row per event), each with:
+  - Foreign key to `person` (which patient had this event)
+  - Foreign key to `concept` (the standardized code - e.g., RxNorm, SNOMED, ICD-10)
+  - Temporal columns (start_date, end_date)
+  - This structure is identical across all OMOP databases, enabling model portability.
+- **`death`** - one row per deceased patient (date + cause). Joined one-to-one onto `patient`, it powers the `vital_status` dimension and mortality outcomes.
+- **`observation_period`** - the window over which a patient is observed; the model derives `followup_years` from it.
+
+**The symmetric aggregates pattern**
+- **`patient`** - extends `person` with `join_many:` relationships to conditions, drugs, visits, and procedures. This source solves the "Join Explosion Problem" by encoding the fan-out structure in the model:
+  - `patient -> conditions.count()` correctly counts conditions per patient
+  - `patient -> drugs.count()` correctly counts drugs per patient
+  - Both can appear in the same query with correct totals, even though both are one-to-many joins from `patient`
+  - Pre-built measures (`total_conditions`, `total_drugs`, etc.) let queries reuse the same aggregation logic
+>>>malloy
+run: patient -> top_conditions
+>>>malloy
+run: patient -> top_drugs
+>>>markdown
+
+**Named views for clinical patterns**
+- `patient.demographics` - groups patients by gender and age band with counts. Reusable template for stratified analysis.
+- `patient.top_conditions`, `patient.top_drugs` - show the most frequent conditions and drugs across the cohort. Eliminates copy-paste of "top N by frequency" logic.
+- These views enable the "Reusable Complex Logic" pattern: define a cohort, demographic breakdown, or scoring rule once in the model; call it by name in any analysis.
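+
+A minimal sketch of that define-once, call-by-name pattern, assuming the `patient` source, its `gender` join, and the `total_conditions` and `total_drugs` measures exist as described above. The names `patient_sketch` and `burden_by_gender` are hypothetical and are not part of `omop.malloy`.
+
+```malloy
+import 'omop.malloy'
+
+// Hypothetical extension: adds one named view on top of the sample's `patient` source.
+source: patient_sketch is patient extend {
+  view: burden_by_gender is {
+    group_by: gender.concept_name
+    aggregate: total_conditions, total_drugs  // measures the README says the model pre-builds
+  }
+}
+
+// Any later analysis can now call the view by name.
+run: patient_sketch -> burden_by_gender
+```
+
+Extending the source, rather than editing `omop.malloy`, keeps the shared model untouched while still giving the new view a reusable name.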
+
+## A few simple examples
+
+The cells above already run three of the model's named views: `demographics`, `top_conditions`, and `top_drugs`.
+
+The first gives a quick look at the patient population: `patient.demographics` rolls patients up by gender × age band. It illustrates Malloy's **named views**, reusable query templates defined once in the model and called by name. The grouping and aggregation logic lives in the view, which reduces copy-paste and keeps each query short.
+>>>markdown
+## Where to go next
+
+The notebooks below are a learning path. Each stage builds on the Malloy concepts
+from the one before, so if you're new to Malloy or OMOP, work through them top to
+bottom. Within a notebook the cells also ramp from a simple query to a more
+involved one.
+
+### 1. Basics: query a single table
+
+| Notebook | What you'll do | Malloy concepts / ideas |
+|---|---|---|
+| [`vocabulary_explorer.malloynb`](vocabulary_explorer.malloynb) | Search the OMOP vocabulary for the condition / drug names you'll filter on | `where`, regex match (`~`), `select`, reusing a model measure |
+| [`prevalence.malloynb`](prevalence.malloynb) | Count a condition over time, then break it down by age band | `group_by` + `aggregate`, reaching a dimension through a join, `all()` for share-of-total |
+
+### 2. The patient model: explore the fact tables
+
+These introduce the `patient` source and *symmetric aggregates* - the feature that keeps per-patient counts correct when conditions, drugs, visits, and procedures all fan out from one patient.
+
+| Notebook | What you'll do | New Malloy concept |
+|---|---|---|
+| [`procedures.malloynb`](procedures.malloynb) | Top procedures, by demographics, and co-occurring with a condition | `top:`, `patient` symmetric aggregates, two-stage "aggregate then analyze" pipeline |
+| [`drug_exposure.malloynb`](drug_exposure.malloynb) | Top drugs, days-supply spread, drug↔condition co-occurrence | symmetric aggregates across two join paths |
+| [`healthcare_utilization.malloynb`](healthcare_utilization.malloynb) | Visits by setting, visits per patient, utilization by outcome group | multi-stage pipelines, the cohort → `join_one: patient` pattern |
+
+### 3. Define and reuse cohorts
+
+| Notebook | What you'll do | New Malloy concept |
+|---|---|---|
+| [`cohorts.malloynb`](cohorts.malloynb) | Define a study population with condition + drug filters; list its comorbidities | multi-clause `where:` across join-many, `nest:` for drill-downs |
+| [`parameterized_cohorts.malloynb`](parameterized_cohorts.malloynb) | Run the same cohort analysis for different conditions and compare | reusing one query shape by changing only the `where:` |
+
+### 4. Outcomes over time: building to a Table 1
+
+| Notebook | What you'll do | New Malloy concept |
+|---|---|---|
+| [`treatment_pathways.malloynb`](treatment_pathways.malloynb) | Rank each patient's prescriptions by date; find first-line therapy | window functions: `row_number()` with `partition_by`, rank-then-aggregate |
+| [`comorbidity_burden.malloynb`](comorbidity_burden.malloynb) | Bucket patients by comorbidity count; show mortality rising with burden | filtered distinct count, `pick` bucketing, inline mortality numerator |
+| [`cohort_outcomes_followup.malloynb`](cohort_outcomes_followup.malloynb) | Set an index event, follow forward to death: rate by age/race, time-to-death | index event with `min()`, `days()` intervals, longitudinal pipeline |
+| [`baseline_characteristics.malloynb`](baseline_characteristics.malloynb) | Assemble a publishable, stratified **Table 1** (survived vs died) | `source:` definitions, filtered aggregates as pivots, wide→long reshape, `exclude()` |
+
+For background on the Malloy patterns used here - symmetric aggregates, views, nested results - see the [`patterns/`](../patterns/) directory.
+>>>markdown
+