Organize multi-schema pipelines for team collaboration.
A production DataJoint pipeline typically involves:
- Multiple schemas — Organized by experimental modality or processing stage
- Team of users — With different roles and access levels
- Shared infrastructure — Database server, object storage, code repository
- Coordination — Between code, database, and storage permissions
This guide covers practical project organization. For conceptual background on pipeline architecture and the DAG structure, see Data Pipelines.
For a fully managed solution, request a DataJoint Platform account.
Use a modern Python project layout with source code under src/:
```
my_pipeline/
├── datajoint.json          # Shared settings (committed)
├── .secrets/               # Local credentials (gitignored)
│   ├── database.password
│   └── storage.credentials
├── .gitignore
├── pyproject.toml          # Package metadata and dependencies
├── README.md
├── src/
│   └── my_pipeline/
│       ├── __init__.py
│       ├── subject.py      # subject schema
│       ├── session.py      # session schema
│       ├── ephys.py        # ephys schema
│       ├── imaging.py      # imaging schema
│       ├── analysis.py     # analysis schema
│       └── utils/
│           └── __init__.py
├── tests/
│   ├── conftest.py
│   └── test_ephys.py
└── docs/
    └── ...
```
Each module defines and binds to its schema:
```python
# src/my_pipeline/ephys.py
import datajoint as dj

from . import session  # Import dependency

schema = dj.Schema('ephys')


@schema
class Probe(dj.Lookup):
    definition = """
    probe_type : varchar(32)
    ---
    num_channels : int32
    """


@schema
class Recording(dj.Imported):
    definition = """
    -> session.Session
    -> Probe
    ---
    recording_path : varchar(255)
    """
```

Module imports reflect the schema DAG:
```python
# src/my_pipeline/analysis.py
import datajoint as dj

# analysis depends on both ephys and imaging
from . import ephys
from . import imaging

schema = dj.Schema('analysis')


@schema
class MultiModalAnalysis(dj.Computed):
    definition = """
    -> ephys.Recording
    -> imaging.Scan
    ---
    correlation : float64
    """
```

Store non-secret configuration in `datajoint.json` at the project root:
`datajoint.json` (committed):

```json
{
    "database": {
        "host": "db.example.com",
        "port": 3306
    },
    "stores": {
        "main": {
            "protocol": "s3",
            "endpoint": "s3.example.com",
            "bucket": "my-org-data",
            "location": "my_pipeline"
        }
    }
}
```

Credentials are stored locally and never committed:
Option 1: `.secrets/` directory

```
.secrets/
├── database.user
├── database.password
├── storage.access_key
└── storage.secret_key
```
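With the file-based option, the pipeline can read these files at startup. A minimal sketch of the loading step, assuming one secret per file named as in the layout above (how the values are then handed to DataJoint is up to your configuration):

```python
# Read every file in .secrets/ into a dict keyed by file name.
# Illustrative only: assumes one value per file.
from pathlib import Path


def load_secrets(secrets_dir: str = ".secrets") -> dict:
    """Return {file_name: stripped_contents} for each file in the directory."""
    return {
        p.name: p.read_text().strip()
        for p in Path(secrets_dir).iterdir()
        if p.is_file()
    }
```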
Option 2: Environment variables

```shell
export DJ_USER=alice
export DJ_PASS=alice_password
export DJ_STORES__MAIN__ACCESS_KEY=...
export DJ_STORES__MAIN__SECRET_KEY=...
```

A typical `.gitignore`:

```
# Credentials
.secrets/

# Python
__pycache__/
*.pyc
*.egg-info/
dist/
build/

# Environment
.env
.venv/

# IDE
.idea/
.vscode/
```

And `pyproject.toml`:

```toml
[project]
name = "my-pipeline"
version = "1.0.0"
requires-python = ">=3.10"
dependencies = [
    "datajoint>=2.0",
    "numpy",
]

[project.optional-dependencies]
dev = ["pytest", "pytest-cov"]

[tool.setuptools.packages.find]
where = ["src"]
```

Multi-user database access requires:
- User accounts — Individual credentials per team member
- Schema permissions — Which users can access which schemas
- Operation permissions — SELECT, INSERT, UPDATE, DELETE, CREATE, DROP
- Role hierarchy — Admin, developer, analyst, viewer
- Audit trail — Who modified what and when
```sql
-- Create user
CREATE USER 'alice'@'%' IDENTIFIED BY 'password';

-- Grant read-only on a specific schema
GRANT SELECT ON ephys.* TO 'alice'@'%';

-- Grant read-write on a specific schema
GRANT SELECT, INSERT, UPDATE, DELETE ON analysis.* TO 'alice'@'%';

-- Grant full access (developers); % matches any schema with this prefix
GRANT ALL PRIVILEGES ON `my_pipeline_%`.* TO 'bob'@'%';
```

| Role | Permissions | Typical Use |
|---|---|---|
| Viewer | SELECT | Browse data, run queries |
| Analyst | SELECT, INSERT on analysis | Add analysis results |
| Operator | SELECT, INSERT, DELETE on data schemas | Run pipeline |
| Developer | ALL on development schemas | Schema changes |
| Admin | ALL + GRANT | User management |
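One way to realize this hierarchy is with MySQL 8 roles rather than per-user grants; the role and schema names below are illustrative:

```sql
-- Define reusable roles
CREATE ROLE 'viewer', 'analyst', 'developer';
GRANT SELECT ON ephys.* TO 'viewer';
GRANT 'viewer' TO 'analyst';                        -- analysts inherit read access
GRANT SELECT, INSERT ON analysis.* TO 'analyst';
GRANT ALL PRIVILEGES ON dev_ephys.* TO 'developer';

-- Assign a role to a user and activate it on login
GRANT 'analyst' TO 'alice'@'%';
SET DEFAULT ROLE ALL TO 'alice'@'%';
```

Roles keep grants in one place: changing what an analyst may do is a single `GRANT`/`REVOKE` on the role instead of one statement per user.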
- Users need SELECT on parent schemas to INSERT into child schemas (FK validation)
- Cascading deletes require DELETE on all dependent schemas
- Schema creation requires CREATE privilege
- Coordinating permissions across many schemas becomes complex
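For example, letting a user populate ephys tables that reference session entries takes both grants below (schema names as in this project); without read access to the parent schema, inserts into the child schema cannot be validated:

```sql
GRANT SELECT ON session.* TO 'alice'@'%';           -- read parents for FK validation
GRANT SELECT, INSERT ON ephys.* TO 'alice'@'%';     -- write child tables
```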
Object storage permissions must align with database permissions:
- Bucket/prefix policies — Map to schema access
- Read vs write — Match SELECT vs INSERT/UPDATE
- Credential distribution — Per-user or shared service accounts
- Cross-schema objects — When computed tables reference multiple inputs
A DataJoint project creates a structured storage pattern:
```
📁 project_name/
├── 📁 schema_name1/
├── 📁 schema_name2/
├── 📁 schema_name3/
│   ├── objects/
│   │   └── table1/
│   │       └── key1-value1/
│   └── fields/
│       └── table1-field1/
└── ...
```
An example S3 bucket policy scoping access by schema prefix:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-lab-data/datajoint/ephys/*"
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-lab-data/datajoint/analysis/*"
        }
    ]
}
```

- Object paths include the schema name: `{project}/{schema}/{table}/...`
- Users need read access to fetch blobs from upstream schemas
- Content-addressed storage (`<blob@>`) shares objects across tables
- Garbage collection requires coordinated delete permissions
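The path layout above can be made concrete with a small sketch; the exact key format here (primary-key attributes as `key-value` path segments) is an assumption for illustration, not DataJoint's internal scheme:

```python
def object_key(project: str, schema: str, table: str, primary_key: dict) -> str:
    """Build an object-store key following {project}/{schema}/objects/{table}/...
    with primary-key attributes encoded as key-value path segments (assumed format)."""
    key_part = "/".join(f"{k}-{v}" for k, v in sorted(primary_key.items()))
    return f"{project}/{schema}/objects/{table}/{key_part}"


print(object_key("my_pipeline", "ephys", "recording",
                 {"subject_id": "M001", "session_idx": 3}))
# → my_pipeline/ephys/objects/recording/session_idx-3/subject_id-M001
```

Because the schema name is the second path component, prefix-based bucket policies like the one above line up naturally with schema-level database grants.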
Initialize schemas in dependency order:
```python
# src/my_pipeline/__init__.py
from . import subject    # No dependencies
from . import session    # Depends on subject
from . import ephys      # Depends on session
from . import imaging    # Depends on session
from . import analysis   # Depends on ephys, imaging


def initialize():
    """Create all schemas in dependency order."""
    # Schemas are created when modules are imported
    # and tables are first accessed.
    subject.Subject()
    session.Session()
    ephys.Recording()
    imaging.Scan()
    analysis.MultiModalAnalysis()
```

Track schema versions with your code:
```python
# src/my_pipeline/version.py
__version__ = "1.2.0"

SCHEMA_VERSIONS = {
    'subject': '1.0.0',
    'session': '1.1.0',
    'ephys': '1.2.0',
    'imaging': '1.2.0',
    'analysis': '1.2.0',
}
```

Keep development and production schemas separate:

```
┌─────────────────┐      ┌─────────────────┐
│  Development    │      │  Production     │
├─────────────────┤      ├─────────────────┤
│  dev_subject    │      │  subject        │
│  dev_session    │      │  session        │
│  dev_ephys      │      │  ephys          │
└─────────────────┘      └─────────────────┘
        │                        │
        │    Schema promotion    │
        └────────────────────────┘
```
Branch-based development follows the same pattern:

```
main ────────────────────────────────────▶
   │                    │
   │ feature/           │ hotfix/
   ▼                    ▼
ephys-v2          fix-recording
   │                    │
   └─────────────┴──▶ main
```
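To catch drift between deployed schemas and code, the `SCHEMA_VERSIONS` mapping shown earlier can drive a startup check. A minimal sketch; where the deployed versions come from (e.g. a metadata table you maintain) is assumed:

```python
def version_mismatches(deployed: dict, expected: dict) -> list:
    """Return (schema, deployed_version, expected_version) for each mismatch."""
    return [
        (name, deployed.get(name), version)
        for name, version in expected.items()
        if deployed.get(name) != version
    ]


expected = {'subject': '1.0.0', 'session': '1.1.0', 'ephys': '1.2.0'}
print(version_mismatches(
    {'subject': '1.0.0', 'session': '1.0.0', 'ephys': '1.2.0'}, expected))
# → [('session', '1.0.0', '1.1.0')]
```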
Managing a team pipeline requires coordinating:
| Component | Challenges |
|---|---|
| Code | Module dependencies, version control, deployment |
| Database | User accounts, schema permissions, role hierarchy |
| Object Storage | Bucket policies, credential distribution, path alignment |
| Compute | Worker deployment, job distribution, resource allocation |
| Monitoring | Progress tracking, error alerting, audit logging |
These challenges grow with team size and pipeline complexity. The DataJoint Platform provides integrated management for all these concerns.
- Deploy to Production — Production mode and environment configuration
- Data Pipelines — Conceptual overview and architecture
- Configure Object Storage — Storage setup
- Distributed Computing — Multi-worker pipelines
- Model Relationships — Foreign key patterns