Skip to content

WIP: refgenie1 migration (do not merge)#327

Draft
nsheff wants to merge 6 commits into
devfrom
refgenie1
Draft

WIP: refgenie1 migration (do not merge)#327
nsheff wants to merge 6 commits into
devfrom
refgenie1

Conversation

@nsheff
Copy link
Copy Markdown
Member

@nsheff nsheff commented May 4, 2026

Do not merge. This PR exists so reviewers can read the diff and the
findings; it is not a merge candidate.

Summary

Hard-cut migration of PEPATAC from legacy refgenconf to refgenie 1.0
(refgenie1, the SQLModel-backed reimplementation). Removes the dead
refgenconf import in pipelines/pepatac.py, replaces the populator
hook in sample_pipeline_interface.yaml with refgenie1's
looper_refgenie_populate_local, swaps \$REFGENIE
\$REFGENIE_DB_CONFIG_PATH in var_templates, renames
bowtie2_index.dirbowtie2_index.bowtie2_index in Jinja
(refgenie1's bowtie2_index asset class has no dir seek_key —
the prefix path is what bowtie2 -x consumes anyway), updates
requirements.txt and requirements-conda.yml, and rewrites the
refgenie-touching docs to use refgenie 1.0 CLI syntax.

Companion changes (refgenie1)

The looper-style local populator is added on
refgenie/refgenie1@nsheff-refactor-2:

  • refgenie/populator.pylooper_refgenie_populate_local (drop-in
    replacement for refgenconf.looper_refgenie_populate)
  • refgenie/__init__.py — export
  • tests/test_populator.py — unit tests against the session refgenie
    fixture

Two bugs surfaced and were fixed there:

  • Refgenie(database_config_path=...) rejected str (Path-only contract,
    no coercion) → coerce to Path in the populator.
  • Looper's _update_namespaces requires the populated namespace key
    to pre-exist on input → mutate input dict before returning.

Findings (dogfooding)

The full set of refgenie1 API gaps, naming divergences, asset-class
gaps, CLI/install footguns, and cluster-integration issues found
during this migration are recorded in findings.md at the repo root.
Highlights:

  • No Refgenie.list_seek_keys_values() equivalent — the populator
    has to walk every (genome, group, asset, seek_key) leaf manually.
  • bowtie2_index.dir is gone (no built-in dir seek_key in refgenie1).
  • Refgenie.asset.seek returns Path (refgenconf returned str).
  • PyPI name collision: legacy refgenie 0.12.x and refgenie1 1.0.x
    both publish under refgenie. Pin refgenie>=1.0.0.
  • \$REFGENIE (legacy) → \$REFGENIE_DB_CONFIG_PATH (refgenie1).
    Anyone with \$REFGENIE in .bashrc needs migration guidance.
  • refgenie pull g/a is gone; replaced by refgenie genome init +
    refgenie add g/a --recipe r.
  • Refgenie1 has no --version flag and no asset subcommand.
  • The cluster's broken legacy-refgenie binary at ~/.local/bin/refgenie
    shadows refgenie1 unless refgenie1.env is sourced.

Validation

End-to-end on Rivanna (SLURM 12499800, 2:16 wall-clock, 4 cores). The
populator delivers a fully-resolved command line; PEPATAC consumes
refgenie1 paths for hg38 (fasta, chrom_sizes, refgene_tss, blacklist,
feat_annotation, bowtie2_index) and rCRSd (fasta, bowtie2_index)
without modification. Pipeline ran through trimming → fastqc →
prealignment → primary alignment → sort/index → dedup → fragment
classification, and FAILED at signal generation in gtars-uniwig due
to a Rust panic on BAM header parsing (unrelated to refgenie1 — a
gtars-rs / PEPATAC bug).

Outputs (sort.bam, dedup BAM, fragment classes) live on cluster at
`/project/shefflab/brickyard/results_analysis/atacbase/forge/pilot/refgenie1_validation/results_pipeline/results_pipeline/test1/`.

Cluster run log:
`/project/shefflab/brickyard/results_analysis/atacbase/forge/pilot/refgenie1_validation/run_validation.log`.

Plan

`assistant/pepatac_refgenie1_branch_plan_v1.md` — Phase 3 of
`assistant/accbase_refgenie1_dogfood_metaplan_v1.md`.

Test plan

  • dry-run looper produces fully-resolved sample command
  • populator unit tests pass (refgenie1 `tests/test_populator.py`)
  • end-to-end SLURM run reaches alignment + dedup
  • full pipeline completion (blocked on gtars-rs BAM header fix)
  • master-branch parity diff (blocked on full completion)

nsheff added 6 commits May 4, 2026 18:28
The refgenconf RGC and select_genome_config imports in pipelines/pepatac.py
are never referenced anywhere else in the file. Confirmed via grep for
\bRGC\b and select_genome.

Adds tools/audit_refgenie_surface.sh which enumerates all refgenconf /
RefGenConf / looper_refgenie_populate / REFGENIE references in the repo
(sources, docs, tests, configs).

Adds findings.md as the dogfooding deliverable for the refgenie1 branch.
Initial sections: audit, API gaps, seek-key naming divergences, asset
class gaps, CLI/install gaps, cluster integration. Validation run
section will be filled by step 9 of the plan.

Plan: assistant/pepatac_refgenie1_branch_plan_v1.md (step 2)
- var_templates: refgenie_config ($REFGENIE) → refgenie_db_config
  ($REFGENIE_DB_CONFIG_PATH). Refgenie1 uses its own env var.
- pre_submit.python_functions: refgenconf.looper_refgenie_populate
  → refgenie.looper_refgenie_populate_local. The new populator lives
  in refgenie1's refgenie/populator.py (added on the
  refgenie/refgenie1#nsheff-refactor-2 branch alongside this one).
- Jinja: refgenie[g].bowtie2_index.dir → bowtie2_index.bowtie2_index.
  Refgenie1's bowtie2_index asset class has no 'dir' seek_key (the
  legacy refgenconf 'dir' built-in was removed); the bowtie2_index
  seek_key returns the index prefix path which is what bowtie2 -x
  consumes. Same change for bwa_index.
- pipelines/pepatac.yaml resources.genome_config: $REFGENIE →
  $REFGENIE_DB_CONFIG_PATH (consistency, the field is unused but
  was confusing).

Plan: assistant/pepatac_refgenie1_branch_plan_v1.md (steps 4-5)
Findings: see findings.md for the full divergence audit.
requirements.txt: refgenconf>=0.12.2 → refgenie>=1.0.0
requirements-conda.yml: refgenconf==0.12.2, refgenie==0.12.1
  → refgenie>=1.0.0 (single dep — refgenie1 supersedes both).

Note: refgenie 1.0.x and legacy refgenie 0.12.x share the PyPI name
'refgenie' but are different packages. The >=1.0.0 pin disambiguates.
This is a known footgun — see findings.md.

Plan: assistant/pepatac_refgenie1_branch_plan_v1.md (step 6)
docs/assets.md gets the canonical refgenie 1.0 setup walk-through.
The other refgenie-touching docs (tutorial, run-conda, detailed-install,
run-bulker, howto/install-refgenie) get a NOTE banner pointing at
docs/assets.md plus inline replacements of:

  pip install refgenie  →  pip install "refgenie>=1.0.0"
  export REFGENIE=...   →  export REFGENIE_HOME_PATH +
                            REFGENIE_DB_CONFIG_PATH
  refgenie init -c ...  →  refgenie init
  refgenie pull g/a     →  refgenie genome init <fasta> &&
                            refgenie add g/a --recipe <r>

This is the minimum migration to keep the docs consistent with the
branch. The legacy refgenie 0.12.x flow has no automatic CLI shim in
refgenie 1.0; users on that flow must follow the new commands.

Plan: assistant/pepatac_refgenie1_branch_plan_v1.md (step 7)
Adds findings discovered during the end-to-end Rivanna validation:
- Refgenie() rejects str for database_config_path (signature says
  Path | None but no coercion → AttributeError deep in get_database_config)
- Looper _update_namespaces requires the namespace to pre-exist on
  input — the populator must mutate the input dict (not just return)
- Looper 2.1.x dropped the positional config argument; needs -c
- Refgenie1 venv has no pip; need uv pip install
- bulker activate must be wrapped in eval "$(...)" to take effect

Each finding has a one-line recommended fix.

Plan: assistant/pepatac_refgenie1_branch_plan_v1.md (steps 9-11)
End-to-end PEPATAC run on Rivanna against refgenie1-registered hg38 +
rCRSd assets. Wall-clock 2:16 on a 4-core 12GB node (SLURM 12499800).

Pipeline ran successfully through:
  trimming (skewer) → fastqc → rCRSd prealignment (bowtie2 +
  refgenie1's rCRSd index) → hg38 alignment (bowtie2 + refgenie1's
  hg38 index) → sort/index → dedup (samblaster) → fragment
  classification (NFR/mono/di/tri/poly BAMs) → genome size from
  refgenie1 chrom_sizes

It failed at signal generation in gtars-uniwig with a Rust panic on
a BAM header (missing VN: tag in @pg record). That's a gtars-rs /
PEPATAC bug, NOT a refgenie1 issue. The refgenie1 dogfooding goal
is met: every refgenie1 asset path PEPATAC referenced was resolved
correctly and consumed by the pipeline.

tests/refgenie1_validation/RUN_NOTES.md captures cluster paths for
the artefacts (sort.bam, sort_dedup.bam, fragment-class BAMs,
prealignment summary, fastqc reports). findings.md gets the full
run report appended.

Plan: assistant/pepatac_refgenie1_branch_plan_v1.md (steps 9, 11)
jpsmith5 added a commit that referenced this pull request May 11, 2026
…ps3dp/tools/refgenie_config.yaml workaround

- faq.md: expand the TSSE entry to name the refgene_anno asset / UCSC
  RefGene as the source of TSS coords and note that the cutoff-of-6
  threshold is hg38-tuned and empirical. Point at ENCODE ATAC-seq
  data standards for per-assembly reference numbers. (#235)
- assets.md: add a Using a custom adapter file subsection
  documenting the adapters resource override in
  pipelines/pepatac.yaml. (#252)
- assets.md: document the /home/jps3dp/tools/refgenie_config.yaml-required-even-with-manual-paths
  quirk and the empty-refgenie-config workaround. The proper fix is
  in the in-progress refgenie 1.0 migration (PR #327). (#251)
- count_table.md: make the per-sample PEPATAC_completed.flag handling
  explicit in the consensus-peak-set count table workflow. Two paths:
  delete the flag files (one-liner with find -delete) or pass
  --ignore-flags to looper run. (#215)
- assets.md: troubleshooting subsection for TypeError: 'NoneType'
  object is not iterable — root-caused to incomplete refgenie assets
  (commonly missing prealignment FASTA), with diagnostic and fix
  commands. The error itself is upstream refgenconf behavior;
  replaced by the refgenie 1.0 migration (PR #327). (#216)
- glossary.md: document column formats for _peaks_coverage.bed
  (8 columns) and _ref_peaks_coverage.bed (15 columns; narrowPeak
  coordinates + bedtools coverage stats + normalized count). (#233)
- assets.md: Running a non-refgenie genome through looper
  subsection — sample_modifiers/imply pattern with chrom_sizes,
  genome_index, etc. set per-sample. (#231 docs portion)

Closes #235, #252, #251, #215, #216, #233.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant