Commit c81dd63

docs: align relational workflow model page with 2026 preprint
Rewrite core definitions to match the paper's formal articulations: paradigm table, table tiers, dependencies, DAG, workflow normalization, query algebra (algebraic closure), active vs passive schemas.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent e2272d4 commit c81dd63

1 file changed

+109
-158
lines changed

@@ -1,71 +1,62 @@
11
# The Relational Workflow Model
22

3-
DataJoint implements the **Relational Workflow Model**—a paradigm that extends
4-
relational databases with native support for computational workflows. This model
5-
defines a new class of databases called **Computational Databases**, where
6-
computational transformations are first-class citizens of the data model.
7-
8-
The relational workflow model and computational databases are formally defined in
9-
[Yatsenko & Nguyen, 2026](https://arxiv.org/abs/2602.16585). DataJoint's schema
10-
definition language and query algebra were first formalized in
11-
[Yatsenko et al., 2018](https://doi.org/10.48550/arXiv.1807.11104).
12-
13-
## The Problem with Traditional Approaches
14-
15-
Traditional relational databases excel at storing and querying data but struggle
16-
with computational workflows. They can store inputs and outputs, but:
17-
18-
- The database doesn't understand that outputs were *computed from* inputs
19-
- It doesn't automatically recompute when inputs change
20-
- It doesn't track provenance
21-
22-
**DataJoint solves these problems by treating your database schema as an
23-
executable workflow specification.**
3+
The relational data model has historically been interpreted through two
4+
conceptual frameworks. The relational workflow model introduces a third
5+
paradigm: **tables represent workflow steps, rows represent workflow
6+
artifacts, and foreign key dependencies prescribe execution order.** This
7+
adds an operational dimension absent from both predecessors—the schema
8+
specifies not only what data exists but how it is derived.
9+
10+
The relational workflow model and its technical innovations are formally
11+
defined in [Yatsenko & Nguyen, 2026](https://arxiv.org/abs/2602.16585).
12+
DataJoint's schema definition language and query algebra were first
13+
formalized in [Yatsenko et al., 2018](https://doi.org/10.48550/arXiv.1807.11104).
2414

2515
## Three Paradigms Compared
2616

27-
The relational data model has been interpreted through different conceptual
28-
frameworks, each with distinct strengths and limitations:
29-
3017
| Aspect | Mathematical (Codd) | Entity-Relationship (Chen) | **Relational Workflow (DataJoint)** |
3118
|--------|---------------------|----------------------------|-------------------------------------|
32-
| **Core Question** | "What functional dependencies exist?" | "What entity types exist?" | **"When/how are entities created?"** |
33-
| **Time Dimension** | Not addressed | Not central | **Fundamental** |
34-
| **Implementation Gap** | High (abstract to SQL) | High (ERM to SQL) | **None (unified approach)** |
35-
| **Workflow Support** | None | None | **Native workflow modeling** |
19+
| **Core question** | What functional dependencies exist? | What entity types exist? | **When/how are entities created?** |
20+
| **Table semantics** | Logical predicate | Entity or relationship | **Workflow step** |
21+
| **Row semantics** | True proposition | Entity instance | **Workflow artifact** |
22+
| **Foreign keys** | Referential integrity | Relationship | **Execution order** |
23+
| **Computation** | Not addressed | Not addressed | **Declared in schema** |
24+
| **Provenance** | Not addressed | Not addressed | **Structural** |
25+
| **Implementation gap** | High | High | **None** |
3626

3727
### Codd's Mathematical Foundation
3828

39-
Edgar F. Codd's original relational model is rooted in predicate calculus and
40-
set theory. Tables represent logical predicates; rows assert true propositions.
41-
While mathematically rigorous, this approach requires abstract reasoning that
42-
doesn't map to intuitive domain thinking.
29+
Codd's mathematical foundation views tables as logical predicates and rows as
30+
true propositions—rigorous but abstract.
4331

4432
### Chen's Entity-Relationship Model
4533

46-
Peter Chen's Entity-Relationship Model (ERM) shifted focus to concrete domain
47-
modeling—entities and relationships visualized in diagrams. However, ERM:
34+
Chen's Entity-Relationship Model shifted focus to domain modeling with
35+
entities, attributes, and relationships—more intuitive, but lacking any
36+
workflow or computational dimension.
4837

49-
- Creates a gap between conceptual design and SQL implementation
50-
- Lacks temporal dimension ("when" entities are created)
51-
- Treats relationships as static connections, not dynamic processes
38+
## Core Concepts
5239

53-
## The Relational Workflow Model
40+
### Workflow Steps and Artifacts
5441

55-
The Relational Workflow Model introduces four fundamental concepts:
42+
Tables are classified into tiers by data entry mode:
5643

57-
### 1. Workflow Entities
44+
| Tier | Role | `make()` |
45+
|------|------|----------|
46+
| **Manual** | Receive direct user entry | No |
47+
| **Lookup** | Hold reference data | No |
48+
| **Imported** | Reach out to data sources outside the DataJoint system (instruments, electronic lab notebooks, external databases) | Yes |
49+
| **Computed** | Derive their contents entirely from upstream DataJoint tables | Yes |
5850

59-
Unlike traditional entities that exist independently, **workflow entities** are
60-
artifacts of workflow execution—they represent the products of specific
61-
operations. This temporal dimension allows us to understand not just *what*
62-
exists, but *when* and *how* it came to exist.
51+
Imported and Computed tables define computations via `make()` methods. The
52+
`make()` method specifies how each entity is derived—this computation logic is
53+
declared within the table definition, making it part of the schema itself
54+
rather than an external workflow specification.
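The role of `make()` can be sketched without a database. The following is a toy illustration only: plain dictionaries stand in for tables, and the `populate()` helper, the table names, and the attributes are hypothetical and are not DataJoint's API. It shows the idea that a downstream step is filled one entity at a time from upstream rows, and that re-running it only handles keys that have no result yet.

```python
# Toy illustration only: plain dicts stand in for tables.
# None of these names belong to the DataJoint API.

# "Manual" tier: rows entered directly, keyed by (subject, session number).
session = {
    ("mouse1", 1): {"duration_s": 120.0},
    ("mouse1", 2): {"duration_s": 90.0},
}

# "Computed" tier: starts empty and is filled only through make().
session_stats = {}


def make(key):
    """Derive one downstream row from the upstream row with the same key."""
    upstream = session[key]
    session_stats[key] = {"duration_min": upstream["duration_s"] / 60.0}


def populate():
    """Call make() for every upstream key that has no result yet."""
    pending = [key for key in session if key not in session_stats]
    for key in pending:
        make(key)
    return pending


first_run = populate()   # both keys are pending on the first call
second_run = populate()  # nothing is pending on the second call
```

Because the derivation lives next to the table it fills, the question "how was this row produced?" has a single answer in this sketch: the body of `make()`.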
6355

64-
### 2. Workflow Dependencies
56+
### Dependencies as Foreign Keys
6557

66-
**Workflow dependencies** extend foreign keys with operational semantics. They
67-
don't just ensure referential integrity—they prescribe the order of operations.
68-
Parent entities must be created before child entities.
58+
Foreign keys define computational dependencies, not only referential integrity.
59+
The dependency graph is explicit, queryable, and enforced by the database.
6960

7061
```mermaid
7162
graph LR
@@ -74,142 +65,102 @@ graph LR
7465
C --> D[Analysis]
7566
```
7667

77-
### 3. Workflow Steps (Table Tiers)
78-
79-
Each table represents a distinct **workflow step** with a specific role:
80-
81-
```mermaid
82-
graph TD
83-
subgraph "Lookup (Gray)"
84-
L[Parameters]
85-
end
86-
subgraph "Manual (Green)"
87-
M[Subject]
88-
S[Session]
89-
end
90-
subgraph "Imported (Blue)"
91-
I[Recording]
92-
end
93-
subgraph "Computed (Red)"
94-
C[Analysis]
95-
end
96-
97-
L --> C
98-
M --> S
99-
S --> I
100-
I --> C
101-
```
102-
103-
| Tier | Role | Examples |
104-
|------|------|----------|
105-
| **Lookup** | Reference data, parameters | Species, analysis methods |
106-
| **Manual** | Human-entered observations | Subjects, sessions |
107-
| **Imported** | Automated data acquisition | Recordings, images |
108-
| **Computed** | Derived results | Analyses, statistics |
109-
110-
### 4. Directed Acyclic Graph (DAG)
68+
### Master-Part Relationships
11169

112-
The schema forms a **DAG** that:
70+
Master-part relationships declare transactional grouping directly in the
71+
schema: the master table represents the workflow step, while part tables hold
72+
the individual items. Insertions and deletions cascade as a unit, enforcing
73+
transactional semantics without application code.
11374

114-
- Prohibits circular dependencies
115-
- Ensures valid execution sequences
116-
- Enables efficient parallel execution
117-
- Supports resumable computation
75+
### Directed Acyclic Graph
11876

119-
## The Workflow Normalization Principle
77+
Dependencies between tables form a directed acyclic graph (DAG); aggregated
78+
dependencies between schemas likewise form a DAG. Unlike task DAGs in
79+
workflow managers, these are *relational schema* DAGs—they define data
80+
structure and relationships, not just execution steps.
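To make "foreign keys prescribe execution order" concrete, here is a minimal sketch that uses Python's standard `graphlib` module. The table names and the dependency mapping are assumptions chosen for illustration; the sketch does not show how DataJoint resolves dependencies internally. It derives one valid processing order from the declared dependencies and shows that a circular dependency is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical schema: each table maps to the tables it references by foreign key.
depends_on = {
    "Subject": set(),
    "Parameters": set(),
    "Session": {"Subject"},
    "Recording": {"Session"},
    "Analysis": {"Recording", "Parameters"},
}

# One valid execution order: every table appears after all of its parents.
order = list(TopologicalSorter(depends_on).static_order())
position = {name: index for index, name in enumerate(order)}
order_is_valid = all(
    position[parent] < position[child]
    for child, parents in depends_on.items()
    for parent in parents
)

# Introducing a circular dependency makes the graph unorderable.
cyclic = dict(depends_on, Subject={"Analysis"})
try:
    list(TopologicalSorter(cyclic).static_order())
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
```

Several orders may be valid for the same graph (for example, `Parameters` can come before or after `Subject`), which is also what leaves room for independent branches to run in parallel.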
12081

121-
> **"Every table represents an entity type that is created at a specific step
122-
> in a workflow, and all attributes describe that entity as it exists at that
123-
> workflow step."**
82+
## Active Schemas
12483

125-
This principle extends entity normalization with temporal and operational
126-
dimensions.
84+
The key distinction from classical models: traditional schemas are
85+
*passive*—containers for data produced by external processes. In the
86+
relational workflow model, the schema is *active*—Computed tables declare how
87+
their contents are derived, making the schema itself the workflow
88+
specification. Schemas are defined as Python classes, and entire pipelines are
89+
organized as self-contained code repositories—version-controlled, testable,
90+
and deployable using standard software engineering practices.
12791

128-
## Why This Matters
92+
A useful analogy: electronic spreadsheets unified data and computation—cells
93+
with values alongside cells with formulas. Yet this integration never
94+
penetrated relational databases in more than 50 years of history. The relational
95+
workflow model brings to databases what spreadsheets brought to tabular
96+
calculation: the recognition that data and the computations that produce it
97+
belong together. The analogy has limits: spreadsheets' coupling is also the
98+
source of their well-known fragility. DataJoint addresses this through formal
99+
schema constraints and explicit dependency declaration rather than ad-hoc cell
100+
references.
129101

130-
### Unified Design and Implementation
102+
## Workflow Normalization
131103

132-
Unlike the ERM-SQL gap, DataJoint provides unified:
104+
> **"Every table represents an entity type created at a specific workflow
105+
> step, and all attributes describe that entity as it exists at that step."**
133106
134-
- **Diagramming** — Schema diagrams reflect actual structure
135-
- **Definition** — Table definitions are executable code
136-
- **Querying** — Operators understand workflow semantics
107+
Database normalization decomposes data into tables to eliminate redundancy.
108+
Classical normalization theory achieves this through normal forms based on
109+
functional dependencies. Entity normalization asks whether each attribute
110+
describes the entity identified by the primary key. Workflow normalization
111+
extends these principles with a temporal dimension.
137112

138-
No translation needed between conceptual design and implementation.
113+
A Session table contains attributes known when the session is entered (date,
114+
experimenter, subject). Analysis parameters determined later belong in
115+
Computed tables that depend on Session. This discipline keeps tables from
116+
accumulating attributes from different workflow stages, which would obscure
117+
provenance and complicate updates.
139118

140-
### Temporal and Operational Awareness
119+
## Entity Integrity
141120

142-
The model captures the dynamic nature of workflows:
121+
All data is represented as well-formed entity sets with primary keys
122+
identifying each entity uniquely. This eliminates redundancy and ensures
123+
consistent updates.
143124

144-
- Data processing sequences
145-
- Computational dependencies
146-
- Operation ordering
125+
When upstream data is deleted, dependent results cascade-delete
126+
automatically—including associated objects in external storage. To correct
127+
errors, you delete, reinsert, and recompute, ensuring every result represents
128+
a consistent computation from valid inputs.
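The cascade behavior has a familiar analog in plain SQL. The sketch below uses Python's built-in `sqlite3` module with `ON DELETE CASCADE`; it is an analogy under assumed table names, not a description of how DataJoint implements its own cascading deletes. Deleting an upstream session removes the result that was derived from it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical upstream table and a downstream table that depends on it.
conn.execute("CREATE TABLE session (session_id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE analysis ("
    " session_id INTEGER PRIMARY KEY"
    " REFERENCES session(session_id) ON DELETE CASCADE,"
    " result REAL)"
)
conn.executemany("INSERT INTO session VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO analysis VALUES (?, ?)", [(1, 0.5), (2, 0.7)])

# Deleting the upstream row also deletes the dependent result.
conn.execute("DELETE FROM session WHERE session_id = 1")
remaining = conn.execute("SELECT session_id FROM analysis").fetchall()
```

After the delete, no analysis row refers to a session that no longer exists, so a corrected session can be reinserted and its result recomputed from scratch.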
147129

148-
### Immutability and Provenance
130+
## Query Algebra
149131

150-
Workflow artifacts are immutable once created:
151-
152-
- Preserves execution history
153-
- Maintains data provenance
154-
- Enables reproducible science
155-
156-
When you delete upstream data, dependent results cascade-delete automatically.
157-
To correct errors, you delete, reinsert, and recompute—ensuring every result
158-
represents a consistent computation from valid inputs.
159-
160-
### Workflow Integrity
161-
162-
The DAG structure guarantees:
163-
164-
- No circular dependencies
165-
- Valid operation sequences
166-
- Enforced temporal order
167-
- Computational validity
168-
169-
## Query Algebra with Workflow Semantics
170-
171-
DataJoint's five operators provide a complete query algebra:
132+
DataJoint provides a five-operator algebra:
172133

173134
| Operator | Symbol | Purpose |
174135
|----------|--------|---------|
175-
| **Restriction** | `&` | Filter entities |
176-
| **Join** | `*` | Combine from converging paths |
177-
| **Projection** | `.proj()` | Select/compute attributes |
178-
| **Aggregation** | `.aggr()` | Summarize groups |
179-
| **Union** | `+` | Combine parallel branches |
180-
181-
These operators:
182-
183-
- Take entity sets as input, produce entity sets as output
184-
- Preserve entity integrity
185-
- Respect declared dependencies (no ambiguous joins)
136+
| **Restrict** | `&` | Filter entities by attribute values or membership in other relations |
137+
| **Project** | `.proj()` | Select and rename attributes, compute derived values |
138+
| **Join** | `*` | Combine related entities across relations |
139+
| **Aggregate** | `.aggr()` | Group entities and compute summary statistics |
140+
| **Union** | `+` | Combine entity sets with compatible structure |
141+
142+
The algebra achieves *algebraic closure*: every operator produces a valid
143+
entity set with a well-defined primary key, enabling unlimited composition.
144+
This preservation of entity integrity—every query result is itself a proper
145+
entity set with clear identity—distinguishes DataJoint's algebra from SQL,
146+
where query results lack both a well-defined primary key and a clear entity
147+
type.
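Algebraic closure can be illustrated with a toy relation class. Everything here is a hypothetical sketch: the `EntitySet` class, its methods, and the rule that a join's key is the union of both keys are simplifications for illustration and are not DataJoint's implementation. The point is that each operator returns another `EntitySet` that still knows its primary key, so operators can be chained freely.

```python
class EntitySet:
    """A toy relation: rows (dicts) plus the attribute names forming its primary key."""

    def __init__(self, primary_key, rows):
        self.primary_key = tuple(primary_key)
        self.rows = list(rows)

    def has_unique_keys(self):
        """Check that no two rows share the same primary key values."""
        keys = [tuple(row[name] for name in self.primary_key) for row in self.rows]
        return len(keys) == len(set(keys))

    def restrict(self, **conditions):
        """Keep rows matching all conditions; the primary key is unchanged."""
        kept = [
            row for row in self.rows
            if all(row[name] == value for name, value in conditions.items())
        ]
        return EntitySet(self.primary_key, kept)

    def join(self, other):
        """Natural join on shared attribute names; the key is the union of both keys."""
        joined = []
        for a in self.rows:
            for b in other.rows:
                shared = a.keys() & b.keys()
                if all(a[name] == b[name] for name in shared):
                    joined.append({**a, **b})
        extra = tuple(n for n in other.primary_key if n not in self.primary_key)
        return EntitySet(self.primary_key + extra, joined)


subject = EntitySet(
    ("subject",),
    [{"subject": "m1", "species": "mouse"}, {"subject": "m2", "species": "rat"}],
)
session = EntitySet(
    ("subject", "session"),
    [
        {"subject": "m1", "session": 1},
        {"subject": "m1", "session": 2},
        {"subject": "m2", "session": 1},
    ],
)

# Operators compose because each result is again an EntitySet with a known key.
result = session.join(subject).restrict(species="mouse")
```

In this sketch, `result` keeps the primary key `("subject", "session")`, so it could be restricted or joined again without any extra bookkeeping.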
186148

187149
## From Transactions to Transformations
188150

189-
The Relational Workflow Model represents a conceptual shift:
190-
191151
| Traditional View | Workflow View |
192152
|------------------|---------------|
193-
| Tables store data | Entity sets are workflow steps |
194-
| Rows are records | Entities are execution instances |
195-
| Foreign keys enforce consistency | Dependencies specify information flow |
153+
| Tables store data | Tables represent workflow steps |
154+
| Rows are records | Rows are workflow artifacts |
155+
| Foreign keys enforce consistency | Foreign keys prescribe execution order |
196156
| Updates modify state | Computations create new states |
197157
| Schemas organize storage | Schemas specify pipelines |
198158
| Queries retrieve data | Queries trace provenance |
199159

200-
This makes DataJoint feel less like a traditional database and more like a
201-
**workflow engine with persistent state**—one that maintains computational
202-
validity while supporting scientific flexibility.
203-
204160
## Summary
205161

206-
The Relational Workflow Model:
207-
208-
1. **Extends** relational theory (doesn't replace it)
209-
2. **Adds** temporal and operational semantics
210-
3. **Eliminates** the design-implementation gap
211-
4. **Enables** reproducible computational workflows
212-
5. **Maintains** mathematical rigor
213-
214-
It's not a departure from relational databases—it's their evolution for
215-
computational workflows.
162+
The relational workflow model offers a new way to understand relational
163+
databases—not merely as storage systems but as computational substrates. By
164+
interpreting tables as workflow steps and foreign keys as execution
165+
dependencies, the schema becomes a complete specification of how data is
166+
derived, not just what data exists.
