Skip to content

feature request: per-run sqlite metadata database at demux/RunID/metadata/ #187

@georgemarselis-nvi

Description

@georgemarselis-nvi

Each demultiplexed run should have a self-contained sqlite database at demux/RunID/metadata/run.db that makes the run portable and self-describing independent of any central infrastructure.

Created by: the pipeline at the start of process_run(), updated at each pipeline stage completion.

Schema (initial):

CREATE TABLE run (
    run_id          TEXT PRIMARY KEY,
    instrument      TEXT,
    run_date        TEXT,
    flowcell_id     TEXT,
    operator        TEXT,
    note            TEXT,
    status          TEXT,
    created_at      TEXT,
    updated_at      TEXT
);

CREATE TABLE files (
    path            TEXT PRIMARY KEY,
    filename        TEXT,
    sha512          TEXT,
    size_bytes      INTEGER,
    stage           TEXT,       -- bcl2fastq, fastqc, multiqc, delivery, etc.
    created_at      TEXT
);

CREATE TABLE delivery (
    destination     TEXT,       -- NIRD, VIGASP
    nird_path       TEXT,
    status          TEXT,       -- pending, delivered, failed
    delivered_at    TEXT,
    sha512_verified INTEGER     -- boolean
);

CREATE TABLE pipeline_stages (
    stage           TEXT PRIMARY KEY,
    status          TEXT,       -- pending, running, complete, failed, skipped
    started_at      TEXT,
    completed_at    TEXT
);

CREATE TABLE tags (
    tag             TEXT PRIMARY KEY,   -- badrun, control
    set_by          TEXT,
    set_at          TEXT
);

Database access control:

Filesystem permissions are the primary access control:

chmod 700 demux/RunID/metadata/
chown <service_user> demux/RunID/metadata/

The database password is derived deterministically from the run metadata so it is computable from first principles without being stored anywhere:

import hashlib
password = hashlib.sha512(
    f"{run_id}{flowcell_id}{run_date}{instrument_serial}".encode()
).hexdigest()

Any process that knows the run metadata can recompute the password. No password file, no secrets management, no static string to lose.

The derived password is a lightweight secondary deterrent only. It is intentionally not a strong cryptographic boundary - its purpose is to keep casual users out and make the password self-documenting. Filesystem permissions are what actually controls access.

If stronger encryption is required in the future, SQLCipher is the recommended upgrade path - it is a drop-in sqlite replacement with identical Python API and works with the same password derivation scheme without architectural changes.

Relationship to postgres:

This sqlite file is the portable snapshot. The postgres database is the operational store. Sync direction is sqlite -> postgres on pipeline completion or daemon ingestion. Runs predating postgres can be backfilled by reading their sqlite files.

Lifecycle:

  • created at process_run() start
  • updated at each stage transition
  • survives clean - it is metadata, not intermediate output
  • removed by delete along with all of demux/RunID/
  • included in export-rawdata and archive tarballs (see separate ticket)

Acceptance criteria:

  • demux/RunID/metadata/run.db exists after a successful run.
  • demux/RunID/metadata/ is owned by the service user with permissions 700.
  • all pipeline stage transitions are recorded with timestamps.
  • sha512 hashes of all files under demux/RunID/ are recorded.
  • NIRD delivery paths match those used by deliver_files_to_NIRD().
  • database password is deterministically derivable from run metadata alone.
  • schema is forward-compatible with the planned postgres migration.

Related: reconstruct-rawdata ticket, reconstruct-samplesheet ticket, --create-samplesheet-with-paths ticket, future postgres migration

Metadata

Metadata

Labels

enhancementNew feature or request

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions