The refgenconf RGC and select_genome_config imports in pipelines/pepatac.py are never referenced anywhere else in the file. Confirmed via grep for \bRGC\b and select_genome.
Adds tools/audit_refgenie_surface.sh, which enumerates all refgenconf / RefGenConf / looper_refgenie_populate / REFGENIE references in the repo (sources, docs, tests, configs).
Adds findings.md as the dogfooding deliverable for the refgenie1 branch. Initial sections: audit, API gaps, seek-key naming divergences, asset class gaps, CLI/install gaps, cluster integration. The validation run section will be filled by step 9 of the plan.
Plan: assistant/pepatac_refgenie1_branch_plan_v1.md (step 2)
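The audit described above can be approximated in Python. This is a minimal sketch, not the contents of tools/audit_refgenie_surface.sh: the function name `audit_refgenie_surface` and the file-suffix filter are assumptions, and only the search terms come from the commit message.

```python
import re
from pathlib import Path

# Search terms named in the commit message.
PATTERN = re.compile(
    r"\bRGC\b|select_genome|refgenconf|RefGenConf|looper_refgenie_populate|REFGENIE"
)
# Assumption: which file types count as sources, docs, tests, and configs.
TEXT_SUFFIXES = {".py", ".sh", ".md", ".yaml", ".yml", ".txt", ".cfg", ".toml"}


def audit_refgenie_surface(root):
    """Return (relative_path, line_number, stripped_line) for every match under root."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip unreadable or non-UTF-8 files
        for number, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits
```

Calling `audit_refgenie_surface(".")` from the repo root returns one tuple per matching line, which is enough to confirm that an import is never referenced again.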
- var_templates: refgenie_config ($REFGENIE) → refgenie_db_config ($REFGENIE_DB_CONFIG_PATH). Refgenie1 uses its own env var.
- pre_submit.python_functions: refgenconf.looper_refgenie_populate → refgenie.looper_refgenie_populate_local. The new populator lives in refgenie1's refgenie/populator.py (added on the refgenie/refgenie1#nsheff-refactor-2 branch alongside this one).
- Jinja: refgenie[g].bowtie2_index.dir → bowtie2_index.bowtie2_index. Refgenie1's bowtie2_index asset class has no 'dir' seek_key (the legacy refgenconf 'dir' built-in was removed); the bowtie2_index seek_key returns the index prefix path, which is what bowtie2 -x consumes. Same change for bwa_index.
- pipelines/pepatac.yaml resources.genome_config: $REFGENIE → $REFGENIE_DB_CONFIG_PATH (consistency, the field is unused but was confusing).
Plan: assistant/pepatac_refgenie1_branch_plan_v1.md (steps 4-5)
Findings: see findings.md for the full divergence audit.
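To illustrate why the seek-key rename is safe, the sketch below builds a bowtie2 command from a namespace shaped like the Jinja lookup. The helper `build_bowtie2_command`, the mock paths, and the read file names are hypothetical; the only claim carried over from the commit is that `bowtie2 -x` takes the index prefix rather than a directory.

```python
import shlex


def build_bowtie2_command(refgenie_ns, genome, read1, read2, cores=4):
    """Assemble a paired-end bowtie2 call using the index prefix seek_key."""
    # Mirrors the Jinja expression refgenie[g].bowtie2_index.bowtie2_index.
    prefix = refgenie_ns[genome]["bowtie2_index"]["bowtie2_index"]
    args = ["bowtie2", "-p", str(cores), "-x", str(prefix), "-1", read1, "-2", read2]
    return " ".join(shlex.quote(arg) for arg in args)


# Hypothetical namespace; real paths come from the populator.
mock_ns = {"hg38": {"bowtie2_index": {"bowtie2_index": "/assets/hg38/bowtie2_index/hg38"}}}
command = build_bowtie2_command(mock_ns, "hg38", "r1.fq.gz", "r2.fq.gz")
```

Note that the mock namespace has no `dir` entry at all, so a template still using `bowtie2_index.dir` would fail at lookup time rather than silently pass a directory to the aligner.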
requirements.txt: refgenconf>=0.12.2 → refgenie>=1.0.0
requirements-conda.yml: refgenconf==0.12.2, refgenie==0.12.1 → refgenie>=1.0.0 (single dep; refgenie1 supersedes both).
Note: refgenie 1.0.x and legacy refgenie 0.12.x share the PyPI name 'refgenie' but are different packages. The >=1.0.0 pin disambiguates. This is a known footgun; see findings.md.
Plan: assistant/pepatac_refgenie1_branch_plan_v1.md (step 6)
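Because both generations install under the same distribution name, a runtime guard can catch an environment that still has the legacy package. The following is a sketch under that assumption; `is_refgenie1` and `installed_refgenie_is_v1` are hypothetical helpers, and the check looks only at the major version component.

```python
from importlib import metadata


def is_refgenie1(version):
    """Return True when the major component of a version string is >= 1."""
    major = version.split(".")[0]
    return major.isdigit() and int(major) >= 1


def installed_refgenie_is_v1():
    """Check the installed 'refgenie' distribution; False if it is absent."""
    try:
        return is_refgenie1(metadata.version("refgenie"))
    except metadata.PackageNotFoundError:
        return False
```

A pipeline entry point could call `installed_refgenie_is_v1()` early and stop with a clear message instead of failing later on a missing import.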
docs/assets.md gets the canonical refgenie 1.0 setup walk-through.
The other refgenie-touching docs (tutorial, run-conda, detailed-install,
run-bulker, howto/install-refgenie) get a NOTE banner pointing at
docs/assets.md plus inline replacements of:
  pip install refgenie → pip install "refgenie>=1.0.0"
  export REFGENIE=... → export REFGENIE_HOME_PATH + REFGENIE_DB_CONFIG_PATH
  refgenie init -c ... → refgenie init
  refgenie pull g/a → refgenie genome init <fasta> && refgenie add g/a --recipe <r>
This is the minimum migration to keep the docs consistent with the
branch. The legacy refgenie 0.12.x flow has no automatic CLI shim in
refgenie 1.0; users on that flow must follow the new commands.
Plan: assistant/pepatac_refgenie1_branch_plan_v1.md (step 7)
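For user scripts that still rely on the legacy variable, the environment-variable change can be surfaced explicitly. This is a minimal sketch: `resolve_db_config` is a hypothetical helper, not part of refgenie, and whether refgenie 1.0 itself applies any fallback is not stated in this PR.

```python
import os
from pathlib import Path


def resolve_db_config(env=None):
    """Return the refgenie 1.0 DB config path, or explain what to export."""
    env = os.environ if env is None else env
    new_value = env.get("REFGENIE_DB_CONFIG_PATH")
    if new_value:
        return Path(new_value)
    if env.get("REFGENIE"):
        # Legacy setup detected: point the user at the new variables.
        raise RuntimeError(
            "Only the legacy $REFGENIE variable is set; export "
            "REFGENIE_DB_CONFIG_PATH (and REFGENIE_HOME_PATH) for refgenie 1.0."
        )
    raise RuntimeError("REFGENIE_DB_CONFIG_PATH is not set.")
```

Passing a plain dict as `env` keeps the helper easy to test without touching the real environment.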
Adds findings discovered during the end-to-end Rivanna validation:
- Refgenie() rejects str for database_config_path (signature says Path | None but no coercion → AttributeError deep in get_database_config)
- Looper _update_namespaces requires the namespace to pre-exist on input; the populator must mutate the input dict (not just return)
- Looper 2.1.x dropped the positional config argument; needs -c
- Refgenie1 venv has no pip; need uv pip install
- bulker activate must be wrapped in eval "$(...)" to take effect
Each finding has a one-line recommended fix.
Plan: assistant/pepatac_refgenie1_branch_plan_v1.md (steps 9-11)
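The first finding comes down to a str being used where Path attributes are expected. The sketch below shows the general failure mode and a coercion step; `coerce_config_path` and the file name are hypothetical, and which attribute get_database_config actually touches is not stated in the findings.

```python
from pathlib import Path


def coerce_config_path(value):
    """Return None unchanged; turn str or os.PathLike into a Path."""
    if value is None:
        return None
    return Path(value)


# A plain str has no Path attributes, so an uncoerced value can surface
# later as an AttributeError far from the original call site.
try:
    "refgenie_db.yaml".parent
except AttributeError as err:
    failure = str(err)

config = coerce_config_path("refgenie_db.yaml")  # hypothetical file name
```

Coercing at the boundary (in the constructor or in the caller) keeps the `Path | None` contract intact for everything downstream.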
End-to-end PEPATAC run on Rivanna against refgenie1-registered hg38 + rCRSd assets. Wall-clock 2:16 on a 4-core 12GB node (SLURM 12499800).
Pipeline ran successfully through:
  trimming (skewer) → fastqc → rCRSd prealignment (bowtie2 + refgenie1's rCRSd index) → hg38 alignment (bowtie2 + refgenie1's hg38 index) → sort/index → dedup (samblaster) → fragment classification (NFR/mono/di/tri/poly BAMs) → genome size from refgenie1 chrom_sizes
It failed at signal generation in gtars-uniwig with a Rust panic on a BAM header (missing VN: tag in @PG record). That's a gtars-rs / PEPATAC bug, NOT a refgenie1 issue.
The refgenie1 dogfooding goal is met: every refgenie1 asset path PEPATAC referenced was resolved correctly and consumed by the pipeline.
tests/refgenie1_validation/RUN_NOTES.md captures cluster paths for the artefacts (sort.bam, sort_dedup.bam, fragment-class BAMs, prealignment summary, fastqc reports). findings.md gets the full run report appended.
Plan: assistant/pepatac_refgenie1_branch_plan_v1.md (steps 9, 11)
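To narrow down which tool wrote the offending header line, the text header (for example, as printed by `samtools view -H`) can be scanned for @PG records without a VN tag. This is a diagnostic sketch only; `pg_records_missing_vn` is a hypothetical helper and says nothing about how gtars-uniwig parses headers.

```python
def pg_records_missing_vn(header_text):
    """Return the ID (or the raw line) of each @PG header line lacking a VN tag."""
    missing = []
    for line in header_text.splitlines():
        if not line.startswith("@PG\t"):
            continue
        fields = line.split("\t")[1:]
        # SAM header fields have the form TAG:VALUE with a two-character tag.
        tags = {f[:2]: f[3:] for f in fields if len(f) >= 3 and f[2] == ":"}
        if "VN" not in tags:
            missing.append(tags.get("ID", line))
    return missing
```

An empty result would suggest the panic has a different trigger than a missing VN tag.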
jpsmith5 added a commit that referenced this pull request May 11, 2026
…ps3dp/tools/refgenie_config.yaml workaround
- faq.md: expand the TSSE entry to name the refgene_anno asset / UCSC RefGene as the source of TSS coords and note that the cutoff-of-6 threshold is hg38-tuned and empirical. Point at ENCODE ATAC-seq data standards for per-assembly reference numbers. (#235)
- assets.md: add a "Using a custom adapter file" subsection documenting the adapters resource override in pipelines/pepatac.yaml. (#252)
- assets.md: document the /home/jps3dp/tools/refgenie_config.yaml-required-even-with-manual-paths quirk and the empty-refgenie-config workaround. The proper fix is in the in-progress refgenie 1.0 migration (PR #327). (#251)
- count_table.md: make the per-sample PEPATAC_completed.flag handling explicit in the consensus-peak-set count table workflow. Two paths: delete the flag files (one-liner with find -delete) or pass --ignore-flags to looper run. (#215)
- assets.md: troubleshooting subsection for TypeError: 'NoneType' object is not iterable, root-caused to incomplete refgenie assets (commonly missing prealignment FASTA), with diagnostic and fix commands. The error itself is upstream refgenconf behavior; replaced by the refgenie 1.0 migration (PR #327). (#216)
- glossary.md: document column formats for _peaks_coverage.bed (8 columns) and _ref_peaks_coverage.bed (15 columns; narrowPeak coordinates + bedtools coverage stats + normalized count). (#233)
- assets.md: "Running a non-refgenie genome through looper" subsection: sample_modifiers/imply pattern with chrom_sizes, genome_index, etc. set per-sample. (#231 docs portion)
Closes #235, #252, #251, #215, #216, #233.
Do not merge. This PR exists so reviewers can read the diff and the
findings; it is not a merge candidate.
Summary
Hard-cut migration of PEPATAC from legacy `refgenconf` to `refgenie` 1.0
(refgenie1, the SQLModel-backed reimplementation). Removes the dead
`refgenconf` import in `pipelines/pepatac.py`, replaces the populator
hook in `sample_pipeline_interface.yaml` with refgenie1's
`looper_refgenie_populate_local`, swaps `$REFGENIE` →
`$REFGENIE_DB_CONFIG_PATH` in var_templates, renames
`bowtie2_index.dir` → `bowtie2_index.bowtie2_index` in Jinja
(refgenie1's bowtie2_index asset class has no `dir` seek_key; the
prefix path is what bowtie2 -x consumes anyway), updates
`requirements.txt` and `requirements-conda.yml`, and rewrites the
refgenie-touching docs to use refgenie 1.0 CLI syntax.
Companion changes (refgenie1)
The looper-style local populator is added on
`refgenie/refgenie1@nsheff-refactor-2`:
- `refgenie/populator.py`: `looper_refgenie_populate_local` (drop-in replacement for `refgenconf.looper_refgenie_populate`)
- `refgenie/__init__.py`: export
- `tests/test_populator.py`: unit tests against the session refgenie fixture
Two bugs surfaced and were fixed there:
- `Refgenie(database_config_path=...)` rejected str (Path-only contract, no coercion) → coerce to Path in the populator.
- `_update_namespaces` requires the populated namespace key to pre-exist on input → mutate input dict before returning.
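The second bug is easier to see with a toy model. The sketch below is not the code in `refgenie/populator.py` or in looper: `nest_seek_keys`, `populate_local`, the record layout, and the mock paths are all hypothetical, and `merge_existing_keys` is a stand-in that models only the behaviour described above (keys are updated only if they already exist on the input).

```python
def nest_seek_keys(records):
    """Group (genome, asset, seek_key, path) tuples into nested dicts."""
    nested = {}
    for genome, asset, seek_key, path in records:
        nested.setdefault(genome, {}).setdefault(asset, {})[seek_key] = str(path)
    return nested


def populate_local(namespaces, records):
    """Hypothetical populator: create namespaces['refgenie'] in place, then return it."""
    target = namespaces.setdefault("refgenie", {})  # the in-place mutation
    target.update(nest_seek_keys(records))
    return {"refgenie": target}


def merge_existing_keys(namespaces, returned):
    """Stand-in for the caller: only keys already present on the input are updated."""
    for key, value in returned.items():
        if key in namespaces:
            namespaces[key] = value


# Mock asset records; real values come from the refgenie1 database.
records = [
    ("hg38", "bowtie2_index", "bowtie2_index", "/assets/hg38/bowtie2_index/hg38"),
    ("hg38", "fasta", "chrom_sizes", "/assets/hg38/fasta/hg38.chrom.sizes"),
]
namespaces = {"sample": {"genome": "hg38"}}
merge_existing_keys(namespaces, populate_local(namespaces, records))
```

If `populate_local` only returned a fresh dict without touching `namespaces`, the stand-in merge would drop the `refgenie` key entirely, which matches the symptom reported in the findings.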
Findings (dogfooding)
The full set of refgenie1 API gaps, naming divergences, asset-class
gaps, CLI/install footguns, and cluster-integration issues found
during this migration are recorded in
`findings.md` at the repo root. Highlights:
- No `Refgenie.list_seek_keys_values()` equivalent; the populator has to walk every (genome, group, asset, seek_key) leaf manually.
- `bowtie2_index.dir` is gone (no built-in `dir` seek_key in refgenie1).
- `Refgenie.asset.seek` returns `Path` (refgenconf returned `str`).
- `refgenie` 0.12.x and refgenie1 1.0.x both publish under `refgenie`. Pin `refgenie>=1.0.0`.
- `$REFGENIE` (legacy) → `$REFGENIE_DB_CONFIG_PATH` (refgenie1). Anyone with `$REFGENIE` in `.bashrc` needs migration guidance.
- `refgenie pull g/a` is gone; replaced by `refgenie genome init` + `refgenie add g/a --recipe r`.
- No `--version` flag and no `asset` subcommand.
- `~/.local/bin/refgenie` shadows refgenie1 unless `refgenie1.env` is sourced.
Validation
End-to-end on Rivanna (SLURM 12499800, 2:16 wall-clock, 4 cores). The
populator delivers a fully-resolved command line; PEPATAC consumes
refgenie1 paths for hg38 (fasta, chrom_sizes, refgene_tss, blacklist,
feat_annotation, bowtie2_index) and rCRSd (fasta, bowtie2_index)
without modification. Pipeline ran through trimming → fastqc →
prealignment → primary alignment → sort/index → dedup → fragment
classification, and FAILED at signal generation in gtars-uniwig due
to a Rust panic on BAM header parsing (unrelated to refgenie1 — a
gtars-rs / PEPATAC bug).
Outputs (sort.bam, dedup BAM, fragment classes) live on cluster at
`/project/shefflab/brickyard/results_analysis/atacbase/forge/pilot/refgenie1_validation/results_pipeline/results_pipeline/test1/`.
Cluster run log:
`/project/shefflab/brickyard/results_analysis/atacbase/forge/pilot/refgenie1_validation/run_validation.log`.
Plan
`assistant/pepatac_refgenie1_branch_plan_v1.md` — Phase 3 of
`assistant/accbase_refgenie1_dogfood_metaplan_v1.md`.
Test plan