Docs: TSSE cutoff guidance, custom adapter, count-table flag, $REFGENIE workaround

jpsmith5 · jpsmith5 · commit 41fbb9ed5e8f · 2026-05-11T09:18:07.000-04:00
- faq.md: expand the TSSE entry to name the `refgene_anno` asset / UCSC RefGene as the source of TSS coords and note that the cutoff-of-6 threshold is hg38-tuned and empirical. Point at ENCODE ATAC-seq data standards for per-assembly reference numbers. (#235)
- assets.md: add a "Using a custom adapter file" subsection documenting the `adapters` resource override in `pipelines/pepatac.yaml`. (#252)
- assets.md: document the `$REFGENIE`-required-even-with-manual-paths quirk and the empty-refgenie-config workaround. The proper fix is in the in-progress refgenie 1.0 migration (PR #327). (#251)
- count_table.md: make the per-sample `PEPATAC_completed.flag` handling explicit in the consensus-peak-set count table workflow. Two paths: delete the flag files (one-liner with `find -delete`) or pass `--ignore-flags` to `looper run`. (#215)
- assets.md: troubleshooting subsection for `TypeError: 'NoneType' object is not iterable`, root-caused to incomplete refgenie assets (commonly a missing prealignment FASTA), with diagnostic and fix commands. The error itself is upstream refgenconf behavior; replaced by the refgenie 1.0 migration (PR #327). (#216)
- glossary.md: document column formats for `_peaks_coverage.bed` (8 columns) and `_ref_peaks_coverage.bed` (15 columns; narrowPeak coordinates + bedtools coverage stats + normalized count). (#233)
- assets.md: "Running a non-refgenie genome through looper" subsection covering the `sample_modifiers`/`imply` pattern with `chrom_sizes`, `genome_index`, etc. set per-sample. (#231 docs portion)

Closes #235, #252, #251, #215, #216, #233.
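
For reference, the normalized-count column described in the glossary.md item above can be reproduced with a minimal Python sketch. It only restates the documented formula `base_count / sum(base_count) * 1e6`. It assumes tab-delimited input, and the helper names `read_base_counts` and `normalized_counts` are hypothetical (they are not part of PEPATAC). The example rows are made up.

```python
import csv
import io


def normalized_counts(base_counts):
    """Scale per-peak base counts so that they sum to 1e6 across the sample's peaks."""
    total = sum(base_counts)
    if total == 0:
        # No covered bases at all: avoid dividing by zero.
        return [0.0 for _ in base_counts]
    return [b / total * 1e6 for b in base_counts]


def read_base_counts(handle, base_count_col=5):
    """Read the 1-based `base_count_col` column from a tab-delimited coverage file.

    Assumption: column 5 holds base_count in the 8-column layout and
    column 12 holds it in the 15-column layout.
    """
    reader = csv.reader(handle, delimiter="\t")
    return [int(row[base_count_col - 1]) for row in reader if row]


# Made-up rows holding columns 1-7 of the 8-column layout; column 8 is recomputed.
example = io.StringIO(
    "chr1\t100\t300\t12\t150\t200\t0.75\n"
    "chr1\t500\t600\t4\t50\t100\t0.50\n"
)
print(normalized_counts(read_base_counts(example)))
```

For a `_ref_peaks_coverage.bed` file the same helpers would be called with `base_count_col=12`, under the same assumptions.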
diff --git a/docs/assets.md b/docs/assets.md
@@ -64,12 +64,75 @@ looper run examples/test_project/test_config_refgenie.yaml
 
 Assets may also be managed manually and specified directly to the pipeline.  While this frees you from needing `refgenie` installed and initialized, it does require a few more arguments to be specified.
 
+> **Note**: even when you provide every path manually, the pipeline interface (`sample_pipeline_interface.yaml`) currently runs a `refgenconf.looper_refgenie_populate` pre-submit hook that expands `$REFGENIE`. If `$REFGENIE` is unset you'll see `FileNotFoundError: [Errno 2] No such file or directory: '$REFGENIE'`. The simplest workaround is to point `$REFGENIE` at an empty refgenie config — the hook succeeds against an empty config, and your manual paths are used regardless:
+>
+> ```console
+> refgenie init -c /tmp/empty_refgenie.yaml
+> export REFGENIE=/tmp/empty_refgenie.yaml
+> ```
+>
+> A proper "no refgenie" path is being addressed as part of the in-progress refgenie 1.0 migration.
+
+### Running a non-refgenie genome through `looper`
+
+If your samples use a genome that isn't in your refgenie config (e.g. `galGal6`, `bosTau9`, an unaligned custom assembly), the pipeline interface jinja template will only succeed if every required asset path is provided at the sample level — otherwise the `refgenie[sample.genome].<asset>` lookup falls through to a missing-key error. Set these in your PEP `sample_modifiers` block (alongside `genome`):
+
+```yaml
+sample_modifiers:
+  imply:
+    - if:
+        organism: ["chicken"]
+      then:
+        genome: galGal6
+        chrom_sizes: /path/to/galGal6.chrom.sizes
+        genome_index: /path/to/galGal6_bowtie2_index/galGal6
+        # Optional, only if you have them:
+        TSS_name: /path/to/galGal6_TSS.bed
+        blacklist: /path/to/galGal6_blacklist.bed
+        anno_name: /path/to/galGal6_feat_annotation.bed.gz
+        genome_size: "1.05e9"
+```
+
+Each `sample.X` attribute short-circuits the corresponding `refgenie[sample.genome].X` lookup in `sample_pipeline_interface.yaml`, so refgenie is never queried for that genome. Samples using a refgenie-managed genome (e.g. `hg38`, `mm10`) and samples using a manually-managed genome can be processed in the same project.
+
+### Troubleshooting: `TypeError: 'NoneType' object is not iterable`
+
+If `looper run` fails before the pipeline starts, and the traceback passes through `refgenconf/populator.py` and `refgenconf/refgenconf.py` before ending at `for seek_key_name in get_tag_seek_keys(tag_mapping)` with `TypeError: 'NoneType' object is not iterable`, the cause is **a refgenie genome with one or more incomplete assets**. The pre-submit hook iterates over every asset of every registered genome, and one of them returned `None`. This commonly happens when prealignment genomes are only partially pulled (e.g. `rCRSd/bowtie2_index` is pulled but `rCRSd/fasta` is not).
+
+To diagnose, list every asset in your refgenie config:
+
+```console
+refgenie list -g rCRSd       # list assets for one genome
+refgenie list                # list all genomes
+```
+
+Then pull anything missing — the most common gap is the prealignment FASTA:
+
+```console
+refgenie pull rCRSd/fasta
+refgenie pull human_repeats/fasta
+```
+
+The unhelpful error message is upstream behavior in `refgenconf` and will be replaced by the in-progress refgenie 1.0 migration (PR #327).
+
 Custom blacklisted regions may be specified using the `--blacklist </path/to/your_blacklist.bed.gz>`. The blacklisted region file must simply be a `BED` formatted file to function correctly. The [`refgenie blacklist` asset](http://refgenie.databio.org/en/latest/available_assets/#blacklist) uses the [ENCODE blacklists](https://github.com/Boyle-Lab/Blacklist) by default.
 
 The TSS annotation file may be specified using `--TSS-name </path/to/your_TSS_annotations.bed>`. This file is also a `BED` formatted file.
 
 The `feat_annotation` asset may also be directly specified using `--anno-name </path/to/your_custom_feature_annotations.bed.gz>`.  Read [more about using custom reference data](annotation.md).
 
+### Using a custom adapter file
+
+`PEPATAC` defaults to the bundled Nextera adapter file (`tools/NexteraPE-PE.fa`). To use your own adapter sequences (e.g. for non-Nextera library preps), set the `adapters` resource in the pipeline configuration file at `pipelines/pepatac.yaml`:
+
+```yaml
+resources:
+  genome_config: ${REFGENIE}
+  adapters: /path/to/your/adapters.fa
+```
+
+The file must be in FASTA format (the same format consumed by `trimmomatic`'s `ILLUMINACLIP` and `skewer`'s `-x` option). Set `adapters: null` to fall back to the bundled default.
+
 ### Example using manually managed assets
 
 Even when *not* using `refgenie`, you can still grab premade `--chrom-sizes` and `--genome-index` files from the `refgenie` servers. `Refgenie` uses algorithmically derived genome digests under-the-hood to unambiguously define genomes. That's what you'll see being used in the example below when we manually download these assets. Therefore, `2230c535660fb4774114bfa966a62f823fdb6d21acf138d4` is the digest for the human readable alias, "hg38", and `94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4` is the digest for "rCRSd."
diff --git a/docs/count_table.md b/docs/count_table.md
@@ -20,7 +20,20 @@ For a run with a [reference peak set](reference_peaks.md) the project processing
 
 ## Count table *with* a PEPATAC produced consensus peak set
 
-To produce a count table using the project derived consensus peak set *requires* an iterative approach. After generating the initial consensus peak set for a project, you will need to use that as your reference peak set in your `PEP` (with the `frip_ref_peaks:` parameter) and run the sample processing pipeline again to produce peak counts for each of the samples. Because the pipeline *knows* what files have been produced already, it will only perform this step and skip the rest of the pipeline. Then, simply run the project level pipeline again and the count table will be derived from the consensus peak set! 
+To produce a count table using the project derived consensus peak set *requires* an iterative approach. After generating the initial consensus peak set for a project, you will need to use that as your reference peak set in your `PEP` (with the `frip_ref_peaks:` parameter) and run the sample processing pipeline again to produce peak counts for each of the samples. Because the pipeline *knows* what files have been produced already, it will only perform this step and skip the rest of the pipeline. Then, simply run the project level pipeline again and the count table will be derived from the consensus peak set!
+
+**Important:** the per-sample `<sample>_PEPATAC_completed.flag` files left from the first pass will block the re-run unless you either:
+
+- delete the completion flags before re-running:
+  ```console
+  find <output_dir>/results_pipeline -name '*_PEPATAC_completed.flag' -delete
+  ```
+- or pass `--ignore-flags` to `looper run` (see [Looper sample flags](https://looper.databio.org/en/latest/faq/#why-isnt-a-sample-being-processed-by-a-pipeline-not-submitting-flag-found-_statusflag)):
+  ```console
+  looper run --looper-config /path/to/.looper_config.yaml --ignore-flags
+  ```
+
+Once the flags are deleted or ignored, the pipeline detects that only the reference-peak-coverage step is missing and skips everything else.
 
 ## Run the count table generation manually
 
diff --git a/docs/faq.md b/docs/faq.md
@@ -11,7 +11,7 @@ When deciding whether or not to merge technical replicates, you should first fol
 ## How do I know if my samples or replicates are high quality?
 
 - Look over the sample [fragment length distribution](glossary.md#qc-output) plot(s). For a good quality sample you should observe a well-defined peak &lt; 100-bp representing nucleosome-free regions, a second peak around 200-bp representing mono-nucleosomes, then sequentially weaker peaks representing multiple nucleosomes.
-- Observe the individual [TSS enrichment](glossary.md#qc-output) scores for each sample, which is a representation of signal to noise.  A score below 6 is a general cutoff for a sample to be "concerning."  This is an empirical metric and may vary based on the individual data set, but represents a comfortable starting point.
+- Observe the individual [TSS enrichment](glossary.md#qc-output) (TSSE) scores for each sample, which represent signal to noise. The TSS reference is the [`refgene_anno`](http://refgenie.databio.org/en/latest/available_assets/#refgene_anno) refgenie asset, which is built from the UCSC RefGene tables (e.g. `refGene.txt.gz`). A TSSE score below 6 is a general cutoff for a sample to be "concerning". This threshold is empirical, was tuned against `hg38` data, and represents a comfortable starting point; it may vary with the individual data set, sequencing depth, and tissue. For `mm10` and other assemblies, score magnitudes can shift slightly because chromosome composition and gene density differ. For published reference numbers, see the [ENCODE ATAC-seq data standards](https://www.encodeproject.org/atac-seq/), which give per-assembly TSSE bands.
 - Library complexity metrics (for complete explanations, see [terms and definitions from ENCODE](https://www.encodeproject.org/data-standards/terms/)):
     - [Non-redundant fraction (NRF)](glossary.md#qc-output): values &lt; 0.7 are considered concerning; values &gt; 0.9 are ideal
     - [PCR Bottleneck Coefficient 1 (PBC1)](glossary.md#qc-output): values &lt; 0.7 are considered concerning; values &gt; 0.9 are ideal
diff --git a/docs/glossary.md b/docs/glossary.md
@@ -29,12 +29,24 @@ The following files are included in default `PEPATAC` analyses:
 - **&lt;sample_name&gt;_peaks.xls**: An XLS formatted file containing call peak information with 1-based coordinates.
 - **&lt;sample_name&gt;_peaks.narrowPeak**: A BED6+4 format file containing peak locations, peak summits, p-values, and q-values.
 - **&lt;sample_name&gt;_summits.bed**: A BED format file containing the peak summit locations for each peak. Useful for finding motifs at these sites.
-- **&lt;sample_name&gt;_peaks_coverage.bed**: A BED format file containing the number of overlapping reads in each peak.  
+- **&lt;sample_name&gt;_peaks_coverage.bed**: An 8-column BED file with read coverage over the per-sample called peaks (output of `bedtools coverage -a <peaks> -b <dedup.bam>`, then normalized).  
   Column format:
     1. chromosome name
     2. start position of peak
     3. end position of peak
-    4. read count
+    4. number of reads overlapping the peak (`read_count`)
+    5. number of bases at depth ≥ 1 (`base_count`)
+    6. peak width
+    7. fraction of peak covered at depth ≥ 1
+    8. normalized counts: `base_count / sum(base_count) * 1e6` (RPM-style across the sample's peaks)
+- **&lt;sample_name&gt;_ref_peaks_coverage.bed**: A 15-column file produced when the pipeline is run with `--frip-ref-peaks <reference.narrowPeak>`. The reference narrowPeak coordinates are preserved (so all samples share an identical peak set), with read coverage and a normalized count appended. Suitable for direct cross-sample concatenation into a count matrix.  
+  Column format:
+    1–10. The 10 narrowPeak columns from the reference peak set: chrom, start, end, name, score, strand, signalValue, pValue, qValue, peak summit offset
+    11. number of reads overlapping the peak (`read_count`)
+    12. number of bases at depth ≥ 1 (`base_count`)
+    13. peak width
+    14. fraction of peak covered at depth ≥ 1
+    15. normalized counts: `base_count / sum(base_count) * 1e6`
 - **&lt;sample_name&gt;_peaks.bigBed**: A bigNarrowPeak (bigBed) formatted version of the narrowPeak file produced by `MACS2`. Check out the [bigNarrowPeak track format](https://genome.ucsc.edu/goldenpath/help/bigNarrowPeak.html) page for more information.
 
 ## QC output