Commit 41fbb9e

Docs: TSSE cutoff guidance, custom adapter, count-table flag, $REFGENIE workaround

- faq.md: expand the TSSE entry to name the refgene_anno asset / UCSC RefGene as the source of TSS coords and note that the cutoff-of-6 threshold is hg38-tuned and empirical. Point at ENCODE ATAC-seq data standards for per-assembly reference numbers. (#235)
- assets.md: add a "Using a custom adapter file" subsection documenting the adapters resource override in pipelines/pepatac.yaml. (#252)
- assets.md: document the $REFGENIE-required-even-with-manual-paths quirk and the empty-refgenie-config workaround. The proper fix is in the in-progress refgenie 1.0 migration (PR #327). (#251)
- count_table.md: make the per-sample PEPATAC_completed.flag handling explicit in the consensus-peak-set count table workflow. Two paths: delete the flag files (one-liner with find -delete) or pass --ignore-flags to looper run. (#215)
- assets.md: troubleshooting subsection for "TypeError: 'NoneType' object is not iterable", root-caused to incomplete refgenie assets (commonly a missing prealignment FASTA), with diagnostic and fix commands. The error itself is upstream refgenconf behavior; replaced by the refgenie 1.0 migration (PR #327). (#216)
- glossary.md: document column formats for _peaks_coverage.bed (8 columns) and _ref_peaks_coverage.bed (15 columns; narrowPeak coordinates + bedtools coverage stats + normalized count). (#233)
- assets.md: "Running a non-refgenie genome through looper" subsection showing the sample_modifiers/imply pattern with chrom_sizes, genome_index, etc. set per-sample. (#231 docs portion)

Closes #235, #252, #251, #215, #216, #233.
1 parent 8bda2e4 commit 41fbb9e

4 files changed

Lines changed: 92 additions & 4 deletions

docs/assets.md

Lines changed: 63 additions & 0 deletions
@@ -64,12 +64,75 @@ looper run examples/test_project/test_config_refgenie.yaml

 Assets may also be managed manually and specified directly to the pipeline. While this frees you from needing `refgenie` installed and initialized, it does require a few more arguments to be specified.

+> **Note**: even when you provide every path manually, the pipeline interface (`sample_pipeline_interface.yaml`) currently runs a `refgenconf.looper_refgenie_populate` pre-submit hook that expands `$REFGENIE`. If `$REFGENIE` is unset you'll see `FileNotFoundError: [Errno 2] No such file or directory: '$REFGENIE'`. The simplest workaround is to point `$REFGENIE` at an empty refgenie config — the hook succeeds against an empty config, and your manual paths are used regardless:
+>
+> ```console
+> refgenie init -c /tmp/empty_refgenie.yaml
+> export REFGENIE=/tmp/empty_refgenie.yaml
+> ```
+>
+> A proper "no refgenie" path is being addressed as part of the in-progress refgenie 1.0 migration.
+
+### Running a non-refgenie genome through `looper`
+
+If your samples use a genome that isn't in your refgenie config (e.g. `galGal6`, `bosTau9`, an unaligned custom assembly), the pipeline interface jinja template will only succeed if every required asset path is provided at the sample level — otherwise the `refgenie[sample.genome].<asset>` lookup falls through to a missing-key error. Set these in your PEP `sample_modifiers` block (alongside `genome`):
+
+```yaml
+sample_modifiers:
+  imply:
+    - if:
+        organism: ["chicken"]
+      then:
+        genome: galGal6
+        chrom_sizes: /path/to/galGal6.chrom.sizes
+        genome_index: /path/to/galGal6_bowtie2_index/galGal6
+        # Optional, only if you have them:
+        TSS_name: /path/to/galGal6_TSS.bed
+        blacklist: /path/to/galGal6_blacklist.bed
+        anno_name: /path/to/galGal6_feat_annotation.bed.gz
+        genome_size: "1.05e9"
+```
+
+Each `sample.X` attribute short-circuits the corresponding `refgenie[sample.genome].X` lookup in `sample_pipeline_interface.yaml`, so refgenie is never queried for that genome. Samples using a refgenie-managed genome (e.g. `hg38`, `mm10`) and samples using a manually-managed genome can be processed in the same project.
+
+### Troubleshooting: `TypeError: 'NoneType' object is not iterable`
+
+If `looper run` fails before the pipeline starts with this trace, ending in `refgenconf/populator.py` → `refgenconf/refgenconf.py` → `for seek_key_name in get_tag_seek_keys(tag_mapping)` → `TypeError: 'NoneType' object is not iterable`, the cause is **a refgenie genome with one or more incomplete assets**. The pre-submit hook iterates every asset for every registered genome, and one of them returned `None`. This commonly happens when prealignment genomes are partially pulled (e.g. `rCRSd/bowtie2_index` is pulled but `rCRSd/fasta` is not).
+
+To diagnose, list every asset in your refgenie config:
+
+```console
+refgenie list -g rCRSd # list assets for one genome
+refgenie list # list all genomes
+```
+
+Then pull anything missing — the most common gap is the prealignment FASTA:
+
+```console
+refgenie pull rCRSd/fasta
+refgenie pull human_repeats/fasta
+```
+
+The unhelpful error message is upstream behavior in `refgenconf` and is replaced wholesale by the in-progress refgenie 1.0 migration (PR #327).
+
 Custom blacklisted regions may be specified using the `--blacklist </path/to/your_blacklist.bed.gz>`. The blacklisted region file must simply be a `BED` formatted file to function correctly. The [`refgenie blacklist` asset](http://refgenie.databio.org/en/latest/available_assets/#blacklist) uses the [ENCODE blacklists](https://github.com/Boyle-Lab/Blacklist) by default.

 The TSS annotation file may be specified using `--TSS-name </path/to/your_TSS_annotations.bed>`. This file is also a `BED` formatted file.

 The `feat_annotation` asset may also be directly specified using `--anno-name </path/to/your_custom_feature_annotations.bed.gz>`. Read [more about using custom reference data](annotation.md).

+### Using a custom adapter file
+
+`PEPATAC` defaults to the bundled Nextera adapter file (`tools/NexteraPE-PE.fa`). To use your own adapter sequences (e.g. for non-Nextera library preps), set the `adapters` resource in the pipeline configuration file at `pipelines/pepatac.yaml`:
+
+```yaml
+resources:
+  genome_config: ${REFGENIE}
+  adapters: /path/to/your/adapters.fa
+```
+
+The file must be in FASTA format (the same format consumed by `trimmomatic`'s `ILLUMINACLIP` and `skewer`'s `-x` option). Set `adapters: null` to fall back to the bundled default.
+
 ### Example using manually managed assets

 Even when *not* using `refgenie`, you can still grab premade `--chrom-sizes` and `--genome-index` files from the `refgenie` servers. `Refgenie` uses algorithmically derived genome digests under-the-hood to unambiguously define genomes. That's what you'll see being used in the example below when we manually download these assets. Therefore, `2230c535660fb4774114bfa966a62f823fdb6d21acf138d4` is the digest for the human readable alias, "hg38", and `94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4` is the digest for "rCRSd."
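
Note on docs/assets.md: the added text describes two failure modes before the pipeline starts, an unset `$REFGENIE` and a non-refgenie genome missing sample-level asset paths. The following is a minimal preflight sketch of those two checks, not code from PEPATAC, looper, or refgenconf. The helper names `missing_manual_assets` and `refgenie_env_problem` are hypothetical, and the list of required attributes is an assumption read off the YAML example, which marks everything after `genome_index` as optional.

```python
import os

# Assumption: attributes a non-refgenie sample must set itself,
# taken from the non-optional keys in the docs' YAML example.
REQUIRED_MANUAL_ASSETS = ("genome", "chrom_sizes", "genome_index")


def missing_manual_assets(sample):
    """Return the required attribute names that are absent or empty on a sample."""
    return [key for key in REQUIRED_MANUAL_ASSETS if not sample.get(key)]


def refgenie_env_problem(environ=None):
    """Describe why $REFGENIE would trip the pre-submit hook, or return None."""
    environ = os.environ if environ is None else environ
    config_path = environ.get("REFGENIE")
    if not config_path:
        return "REFGENIE is unset"
    if not os.path.isfile(config_path):
        return f"REFGENIE points at a missing file: {config_path}"
    return None
```

Called on a sample dict such as `{"genome": "galGal6", "chrom_sizes": "/path/to/galGal6.chrom.sizes"}`, `missing_manual_assets` reports `genome_index` as the gap to fill before running `looper run`.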

docs/count_table.md

Lines changed: 14 additions & 1 deletion
@@ -20,7 +20,20 @@ For a run with a [reference peak set](reference_peaks.md) the project processing

 ## Count table *with* a PEPATAC produced consensus peak set

-To produce a count table using the project derived consensus peak set *requires* an iterative approach. After generating the initial consensus peak set for a project, you will need to use that as your reference peak set in your `PEP` (with the `frip_ref_peaks:` parameter) and run the sample processing pipeline again to produce peak counts for each of the samples. Because the pipeline *knows* what files have been produced already, it will only perform this step and skip the rest of the pipeline. Then, simply run the project level pipeline again and the count table will be derived from the consensus peak set!
+To produce a count table using the project derived consensus peak set *requires* an iterative approach. After generating the initial consensus peak set for a project, you will need to use that as your reference peak set in your `PEP` (with the `frip_ref_peaks:` parameter) and run the sample processing pipeline again to produce peak counts for each of the samples. Because the pipeline *knows* what files have been produced already, it will only perform this step and skip the rest of the pipeline. Then, simply run the project level pipeline again and the count table will be derived from the consensus peak set!
+
+**Important:** the per-sample `<sample>_PEPATAC_completed.flag` files left from the first pass will block the re-run unless you either:
+
+- delete the completion flags before re-running:
+  ```console
+  find <output_dir>/results_pipeline -name '*_PEPATAC_completed.flag' -delete
+  ```
+- or pass `--ignore-flags` to `looper run` (see [Looper sample flags](https://looper.databio.org/en/latest/faq/#why-isnt-a-sample-being-processed-by-a-pipeline-not-submitting-flag-found-_statusflag)):
+  ```console
+  looper run --looper-config /path/to/.looper_config.yaml --ignore-flags
+  ```
+
+With the flag cleared, the pipeline detects that only the reference-peak-coverage step is missing and skips everything else.

 ## Run the count table generation manually
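
Note on docs/count_table.md: the `find ... -delete` one-liner removes flags with no preview. The following is a hedged Python sketch of the same cleanup with a dry-run default, assuming the flags live somewhere under a `results_pipeline` directory as in the added text. The function names are hypothetical and not part of PEPATAC or looper.

```python
from pathlib import Path


def find_completed_flags(results_dir):
    """List per-sample completion flags anywhere under results_dir."""
    return sorted(Path(results_dir).rglob("*_PEPATAC_completed.flag"))


def clear_completed_flags(results_dir, dry_run=True):
    """Delete the flags, or only report them when dry_run is True."""
    flags = find_completed_flags(results_dir)
    if not dry_run:
        for flag in flags:
            flag.unlink()
    return flags
```

Running it once with the default `dry_run=True` shows which files would go; a second call with `dry_run=False` removes them.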

docs/faq.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ When deciding whether or not to merge technical replicates, you should first fol
 ## How do I know if my samples or replicates are high quality?

 - Look over the sample [fragment length distribution](glossary.md#qc-output) plot(s). For a good quality sample you should observe a well-defined peak &lt; 100-bp representing nucleosome-free regions, a second peak around 200-bp representing mono-nucleosomes, then sequentially weaker peaks representing multiple nucleosomes.
-- Observe the individual [TSS enrichment](glossary.md#qc-output) scores for each sample, which is a representation of signal to noise. A score below 6 is a general cutoff for a sample to be "concerning." This is an empirical metric and may vary based on the individual data set, but represents a comfortable starting point.
+- Observe the individual [TSS enrichment](glossary.md#qc-output) scores for each sample, which is a representation of signal to noise. The TSS reference is the [`refgene_anno`](http://refgenie.databio.org/en/latest/available_assets/#refgene_anno) refgenie asset, which is built from the UCSC RefGene tables (e.g. `refGene.txt.gz`). A TSSE score below 6 is a general cutoff for a sample to be "concerning" — this is an empirical threshold tuned against `hg38` data and represents a comfortable starting point. For `mm10` and other assemblies, score magnitudes can shift slightly because chromosome composition and gene density differ; the thresholds are empirical and may vary based on the individual data set, sequencing depth, and tissue. For published reference numbers, see the [ENCODE ATAC-seq data standards](https://www.encodeproject.org/atac-seq/), which give per-assembly TSSE bands.
 - Library complexity metrics (for complete explanations, see [terms and definitions from ENCODE](https://www.encodeproject.org/data-standards/terms/)):
     - [Non-redundant fraction (NRF)](glossary.md#qc-output): values &lt; 0.7 are considered concerning; values &gt; 0.9 are ideal
     - [PCR Bottleneck Coefficient 1 (PBC1)](glossary.md#qc-output): values &lt; 0.7 are considered concerning; values &gt; 0.9 are ideal
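
Note on docs/faq.md: the cutoffs visible in this hunk (TSSE below 6, NRF below 0.7, PBC1 below 0.7) can be applied mechanically as a first-pass screen. The following is a small sketch under the assumption that each sample's metrics are already available as a plain dict. The metric keys and the function name `concerning_metrics` are hypothetical, and the TSSE cutoff is the hg38-tuned value from the text, so it should be read as a starting point for other assemblies.

```python
# Cutoffs quoted in the FAQ; a value below the cutoff is "concerning".
CONCERNING_BELOW = {"TSSE": 6.0, "NRF": 0.7, "PBC1": 0.7}


def concerning_metrics(qc):
    """Return the names of metrics present in qc that fall below their cutoff."""
    return sorted(
        name
        for name, cutoff in CONCERNING_BELOW.items()
        if name in qc and qc[name] < cutoff
    )
```

For a sample with `{"TSSE": 4.2, "NRF": 0.95, "PBC1": 0.65}` this flags `PBC1` and `TSSE`; metrics missing from the dict are skipped, not treated as failures.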

docs/glossary.md

Lines changed: 14 additions & 2 deletions
@@ -29,12 +29,24 @@ The following files are included in default `PEPATAC` analyses:
 - **&lt;sample_name&gt;_peaks.xls**: An XLS formatted file containing call peak information with 1-based coordinates.
 - **&lt;sample_name&gt;_peaks.narrowPeak**: A BED6+4 format file containing peak locations, peak summits, p-values, and q-values.
 - **&lt;sample_name&gt;_summits.bed**: A BED format file containing the peak summit locations for each peak. Useful for finding motifs at these sites.
-- **&lt;sample_name&gt;_peaks_coverage.bed**: A BED format file containing the number of overlapping reads in each peak.
+- **&lt;sample_name&gt;_peaks_coverage.bed**: An 8-column BED file with read coverage over the per-sample called peaks (output of `bedtools coverage -a <peaks> -b <dedup.bam>`, then normalized).
     Column format:
     1. chromosome name
     2. start position of peak
     3. end position of peak
-    4. read count
+    4. number of reads overlapping the peak (`read_count`)
+    5. number of bases at depth ≥ 1 (`base_count`)
+    6. peak width
+    7. fraction of peak covered at depth ≥ 1
+    8. normalized counts: `base_count / sum(base_count) * 1e6` (RPM-style across the sample's peaks)
+- **&lt;sample_name&gt;_ref_peaks_coverage.bed**: A 15-column file produced when the pipeline is run with `--frip-ref-peaks <reference.narrowPeak>`. The reference narrowPeak coordinates are preserved (so all samples share an identical peak set), with read coverage and a normalized count appended. Suitable for direct cross-sample concatenation into a count matrix.
+    Column format:
+    1–10. The 10 narrowPeak columns from the reference peak set: chrom, start, end, name, score, strand, signalValue, pValue, qValue, peak summit offset
+    11. number of reads overlapping the peak (`read_count`)
+    12. number of bases at depth ≥ 1 (`base_count`)
+    13. peak width
+    14. fraction of peak covered at depth ≥ 1
+    15. normalized counts: `base_count / sum(base_count) * 1e6`
 - **&lt;sample_name&gt;_peaks.bigBed**: A bigNarrowPeak (bigBed) formatted version of the narrowPeak file produced by `MACS2`. Check out the [bigNarrowPeak track format](https://genome.ucsc.edu/goldenpath/help/bigNarrowPeak.html) page for more information.

 ## QC output
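
Note on docs/glossary.md: the normalized count in column 8 (column 15 of the reference-peak file) is described as `base_count / sum(base_count) * 1e6`. The following is a minimal sketch that recomputes it from the first seven columns of a `_peaks_coverage.bed` file, following the formula exactly as written in this diff. It assumes tab-separated fields and integer counts, and the function names are hypothetical, not the pipeline's own code.

```python
def parse_peaks_coverage(lines):
    """Parse the first seven tab-separated columns of each coverage row."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        rows.append({
            "chrom": fields[0],
            "start": int(fields[1]),
            "end": int(fields[2]),
            "read_count": int(fields[3]),
            "base_count": int(fields[4]),
            "width": int(fields[5]),
            "fraction": float(fields[6]),
        })
    return rows


def normalized_counts(rows):
    """Apply base_count / sum(base_count) * 1e6 across one sample's peaks."""
    total = sum(row["base_count"] for row in rows)
    if total == 0:
        return [0.0] * len(rows)
    return [row["base_count"] / total * 1e6 for row in rows]
```

Because the denominator is the per-sample sum over peaks, the values are comparable across samples only when every sample uses the same peak set, which is the case the `_ref_peaks_coverage.bed` entry describes.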
