Skip to content

Commit 8bda2e4

Browse files
committed
Fix pepatac.py SE+bwa NameError, harden filter_pair check, add --skip-dedup
- _align() referenced an undefined in the single-end + bwa branch paired chain (minus filter_pair) for both --keep and no-keep paths: pm.run([cmd1, cmd2, cmd3, cmd4], <target>). (#299) - Fix missing pm.fail_pipeline in the unmap_fq1 branch of the filter_paired_fq.pl handle check; previously a stuck filter on R1 set an error string that was never raised. Reworked the error message into a shared template that points at the underlying psutil introspection issue and recommends both --keep and --noFIFO as workarounds. (#234) - Add --skip-dedup flag for protocols where duplicates are biologically meaningful (CUT&Tag, CUT&RUN). When set: copy mapping_genome_bam to _sort_dedup.bam so downstream peak calling finds the expected path; report Duplicate_reads=0 and pass through Dedup_aligned_reads/Dedup_alignment_rate/Dedup_total_efficiency from the pre-dedup metrics. Plumbed through sample_pipeline_interface.yaml so it can be set per-sample. (#249) - Drop redundant Time/Success keys from pepatac_output_schema.yaml (both samples: and project: blocks). These are pipestat's auto- tracked status fields and the duplicate declaration triggered "SchemaError: Overlap between project- and sample-level keys" on newer pipestat. (#322, #305) - Fix _LOGGER NameError in tools/bamQC.py and bamSitesToWig.py: the variable was only defined inside , so pararead workers re-importing the module under multiprocessing 'spawn' (macOS default) hit NameError when class methods logged. Added a module-level fallback logger above each class definition. (#266) - Fix peakCounts() ref-peaks ignored when *_peaks_coverage.bed.gz coexists with *_ref_peaks_coverage.bed: the shared variable preferred .bed.gz from the regular peaks file and then looked for a non-existent _ref_peaks_coverage.bed.gz, falling through to the "not derived from a singular reference peak set" warning. Detect ref vs regular extensions independently. (#218, #219) - Guard refgenie[sample.genome] lookups in sample_pipeline_interface.yaml with , so projects with non-refgenie genomes (e.g. galGal6, bosTau9) no longer crash the Jinja template with an attmap AttributeError; instead they fall through to the per-sample paths or error cleanly from pepatac.py. (#231) - Fix plotAnno() empty-input fallback path bug: was constructing file.path(<output_file>, "<sample>_partition_dist.pdf") (treating the output pdf as a directory) and quit()ing the R session. Replaced with return(ggplot()), matching the function's other empty-data branches; the caller writes a clean blank placeholder at the expected target. (#232) Closes #299, #234, #249, #322, #305, #266, #218, #219, #231, #232.
1 parent 6c2eed2 commit 8bda2e4

6 files changed

Lines changed: 88 additions & 55 deletions

File tree

PEPATACr/R/PEPATACr.R

Lines changed: 27 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -1423,12 +1423,13 @@ plotAnno <- function(plot = c("chromosome", "tss", "genomic"),
14231423
if (file.exists(file.path(input)) && info$size != 0) {
14241424
in_file <- data.table::fread(file.path(input))
14251425
} else {
1426-
out_file <- file.path(output, paste(basename(sample_path),
1427-
output_type,
1428-
"partition_dist.pdf",
1429-
sep="_"))
1430-
system2(paste("touch"), out_file)
1431-
quit()
1426+
# No input data — return an empty ggplot so the caller's
1427+
# pdf()/png() writes a blank placeholder at the expected
1428+
# target path. (Earlier versions tried to touch a hard-coded
1429+
# *_partition_dist.pdf inside `output` treating it as a
1430+
# directory, which silently dropped the placeholder when
1431+
# `output` was a file path. See #232.)
1432+
return(ggplot())
14321433
}
14331434
}
14341435

@@ -2855,29 +2856,33 @@ peakCounts <- function(sample_table, summary_dir, results_subdir, assets,
28552856
}
28562857
}
28572858

2858-
# check if coverage files are compressed
2859-
if (any(file.exists(file.path(results_subdir,
2860-
sample_names, paste0("peak_calling_", genomes),
2861-
paste0(sample_names, "_ref_peaks_coverage.bed.gz"))))) {
2862-
ext <- ".bed.gz"
2863-
} else if (any(file.exists(file.path(results_subdir,
2864-
sample_names, paste0("peak_calling_", genomes),
2865-
paste0(sample_names, "_peaks_coverage.bed.gz"))))) {
2866-
ext <- ".bed.gz"
2867-
} else {
2868-
ext <- ".bed"
2859+
# Detect extension independently for reference vs sample peak coverage,
2860+
# since users commonly have *_peaks_coverage.bed.gz from the first sample
2861+
# run plus *_ref_peaks_coverage.bed from the re-run (or vice versa). A
2862+
# single shared `ext` defeats the ref-peaks lookup in that mixed state.
2863+
detect_ext <- function(suffix) {
2864+
for (e in c(".bed.gz", ".bed")) {
2865+
if (any(file.exists(file.path(
2866+
results_subdir,
2867+
sample_names, paste0("peak_calling_", genomes),
2868+
paste0(sample_names, suffix, e))))) {
2869+
return(e)
2870+
}
2871+
}
2872+
return(NA_character_)
28692873
}
2874+
ref_ext <- detect_ext("_ref_peaks_coverage")
2875+
fallback_ext <- detect_ext("_peaks_coverage")
2876+
if (is.na(fallback_ext)) fallback_ext <- ".bed"
28702877

28712878
# Use reference peak coverage file if available
2872-
if (any(file.exists(file.path(results_subdir,
2873-
sample_names, paste0("peak_calling_", genomes),
2874-
paste0(sample_names, "_ref_peaks_coverage", ext))))) {
2875-
peak_file_name = paste0("_ref_peaks_coverage", ext)
2879+
if (!is.na(ref_ext)) {
2880+
peak_file_name = paste0("_ref_peaks_coverage", ref_ext)
28762881
reference = TRUE
28772882
} else {
28782883
warning("Peak coverage files are not derived from a singular reference peak set.")
28792884
reference = FALSE
2880-
peak_file_name = paste0("_peaks_coverage", ext)
2885+
peak_file_name = paste0("_peaks_coverage", fallback_ext)
28812886
}
28822887

28832888
# generate paths to peak coverage files

pepatac_output_schema.yaml

Lines changed: 1 addition & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -309,12 +309,6 @@ properties:
309309
required:
310310
- path
311311
- title
312-
Time:
313-
type: string
314-
description: "time"
315-
Success:
316-
type: string
317-
description: "success"
318312
project:
319313
type: object
320314
properties:
@@ -411,10 +405,4 @@ properties:
411405
type: string
412406
required:
413407
- path
414-
- title
415-
Time:
416-
type: string
417-
description: "Total elapsed pipeline time"
418-
Success:
419-
type: string
420-
description: "Timestamp when pipeline completed successfully"
408+
- title

pipelines/pepatac.py

Lines changed: 34 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -118,6 +118,11 @@ def parse_arguments():
118118
help="Skip FastQC. Useful for bugs in FastQC "
119119
"that appear with some sequence read files.")
120120

121+
parser.add_argument("--skip-dedup", dest="skip_dedup", action='store_true',
122+
help="Skip duplicate removal. Recommended for protocols "
123+
"where duplicates are biologically meaningful "
124+
"(e.g. CUT&Tag, CUT&RUN).")
125+
121126
# Prealignment genome assets
122127
parser.add_argument("--prealignment-names", default=[], type=str,
123128
nargs="+",
@@ -342,9 +347,9 @@ def _align(args, tools, paired, useFIFO, unmap_fq1, unmap_fq2,
342347
pm.run([cmd1, cmd2, cmd3, cmd4, filter_pair], out_fastq_r2_gz)
343348
else:
344349
if args.keep:
345-
pm.run(cmd, mapped_bam)
350+
pm.run([cmd1, cmd2, cmd3, cmd4], mapped_bam)
346351
else:
347-
pm.run(cmd, out_fastq_tmp_gz)
352+
pm.run([cmd1, cmd2, cmd3, cmd4], out_fastq_tmp_gz)
348353

349354
cmd = tools.samtools + " view -c " + mapped_bam
350355
align_exact = pm.checkprint(cmd)
@@ -981,6 +986,12 @@ def no_handle(fq):
981986
pm.debug("{} is released! \n".format(os.path.abspath(fq)))
982987
return True
983988

989+
handle_fail_msg = (
990+
"Fastq filter_paired_fq.pl function did not complete successfully. "
991+
"Re-run with `--keep` or `--noFIFO` to bypass the non-blocking "
992+
"filter_pair path, which relies on psutil process introspection "
993+
"and can fail in environments where it lacks permission to inspect "
994+
"other processes' file handles.")
984995
if args.paired_end and not os.path.exists(mapping_genome_bam):
985996
if not pypiper.is_gzipped_fastq(unmap_fq1):
986997
checks = 1
@@ -989,20 +1000,15 @@ def no_handle(fq):
9891000
checks += 1
9901001
pm.debug("Check count fq1: {}".format(str(checks)))
9911002
if checks > 100 and not no_handle(unmap_fq1):
992-
err_msg = ("Fastq filter_paired_fq.pl function did not "
993-
"complete successfully. Try running the pipeline "
994-
"with `--keep`.")
1003+
pm.fail_pipeline(IOError(handle_fail_msg))
9951004
if not pypiper.is_gzipped_fastq(unmap_fq2):
9961005
checks = 1
9971006
# Check unmap_fq2
9981007
while not no_handle(unmap_fq2) and checks < 10000:
9991008
checks += 1
10001009
pm.debug("Check count fq2: {}".format(str(checks)))
10011010
if checks > 100 and not no_handle(unmap_fq2):
1002-
err_msg = ("Fastq filter_paired_fq.pl function did not "
1003-
"complete successfully. Try running the pipeline "
1004-
"with `--keep`.")
1005-
pm.fail_pipeline(IOError(err_msg))
1011+
pm.fail_pipeline(IOError(handle_fail_msg))
10061012

10071013
for unmapped_fq in to_compress:
10081014
# Compress unmapped fastq reads
@@ -1222,6 +1228,17 @@ def estimate_lib_size(dedup_log):
12221228
pm.report_result("Picard_est_lib_size", picard_est_lib_size)
12231229

12241230
def post_dup_aligned_reads(dedup_log):
1231+
if args.skip_dedup:
1232+
ar = float(pm.get_stat("Aligned_reads"))
1233+
tr = float(pm.get_stat("Trimmed_reads"))
1234+
rr = float(pm.get_stat("Raw_reads"))
1235+
pm.report_result("Duplicate_reads", 0)
1236+
pm.report_result("Dedup_aligned_reads", ar)
1237+
pm.report_result("Dedup_alignment_rate",
1238+
round(float(ar) * 100 / float(tr), 2))
1239+
pm.report_result("Dedup_total_efficiency",
1240+
round(float(ar) * 100 / float(rr), 2))
1241+
return
12251242
if args.deduplicator == "picard":
12261243
cmd = ("grep -A2 'METRICS CLASS' " + dedup_log +
12271244
" | tail -n 1 | awk '{print $(NF-3)}'")
@@ -1269,8 +1286,14 @@ def post_dup_aligned_reads(dedup_log):
12691286
java_settings = '-Xmx{mem}'.format(mem=pm.mem)
12701287
else:
12711288
java_settings = param.java_settings.params
1272-
if args.deduplicator == "picard":
1273-
cmd1 = (tools.java + " " + java_settings + " -jar " +
1289+
if args.skip_dedup:
1290+
# User opted out of duplicate removal (e.g. CUT&Tag/CUT&RUN protocols).
1291+
# Reuse the post-alignment BAM as the "dedup" endpoint so downstream
1292+
# steps can find _sort_dedup.bam without a code-path fork.
1293+
cmd1 = "cp {} {}".format(mapping_genome_bam, rmdup_bam)
1294+
cmd2 = tools.samtools + " index " + rmdup_bam
1295+
elif args.deduplicator == "picard":
1296+
cmd1 = (tools.java + " " + java_settings + " -jar " +
12741297
tools.picard + " MarkDuplicates")
12751298
cmd1 += " INPUT=" + mapping_genome_bam
12761299
cmd1 += " OUTPUT=" + rmdup_bam

sample_pipeline_interface.yaml

Lines changed: 10 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -12,15 +12,15 @@ sample_interface:
1212
{% if sample.read2 is defined %} --input2 { sample.read2 } {% endif %}
1313
--single-or-paired { sample.read_type }
1414
--genome { sample.genome }
15-
{% if sample.chrom_sizes is defined %} --chrom-sizes { sample.chrom_sizes } {% elif refgenie[sample.genome].fasta is defined %} --chrom-sizes { refgenie[sample.genome].fasta.chrom_sizes } {% endif %}
16-
{% if sample.TSS_name is defined %} --TSS-name { sample.TSS_name } {% elif refgenie[sample.genome].refgene_anno is defined %} --TSS-name { refgenie[sample.genome].refgene_anno.refgene_tss } {% endif %}
17-
{% if sample.blacklist is defined %} --blacklist { sample.blacklist } {% elif refgenie[sample.genome].blacklist is defined %} --blacklist { refgenie[sample.genome].blacklist.blacklist } {% endif %}
18-
{% if sample.anno_name is defined %} --anno-name { sample.anno_name } {% elif refgenie[sample.genome].feat_annotation is defined %} --anno-name { refgenie[sample.genome].feat_annotation.feat_annotation } {% endif %}
15+
{% if sample.chrom_sizes is defined %} --chrom-sizes { sample.chrom_sizes } {% elif sample.genome in refgenie and refgenie[sample.genome].fasta is defined %} --chrom-sizes { refgenie[sample.genome].fasta.chrom_sizes } {% endif %}
16+
{% if sample.TSS_name is defined %} --TSS-name { sample.TSS_name } {% elif sample.genome in refgenie and refgenie[sample.genome].refgene_anno is defined %} --TSS-name { refgenie[sample.genome].refgene_anno.refgene_tss } {% endif %}
17+
{% if sample.blacklist is defined %} --blacklist { sample.blacklist } {% elif sample.genome in refgenie and refgenie[sample.genome].blacklist is defined %} --blacklist { refgenie[sample.genome].blacklist.blacklist } {% endif %}
18+
{% if sample.anno_name is defined %} --anno-name { sample.anno_name } {% elif sample.genome in refgenie and refgenie[sample.genome].feat_annotation is defined %} --anno-name { refgenie[sample.genome].feat_annotation.feat_annotation } {% endif %}
1919
{% if sample.trimmer is defined %} --trimmer { sample.trimmer } {% else %} --trimmer "skewer" {% endif %}
2020
{% if sample.aligner is defined %} --aligner { sample.aligner } {% set aligner = sample.aligner %} {% else %} --aligner "bowtie2" {% set aligner = "bowtie2" %} {% endif %}
21-
{% if aligner == "bowtie2" or sample.aligner == "bowtie2" %} {% if sample.genome_index is defined %} --genome-index { sample.genome_index } {% elif refgenie[sample.genome].bowtie2_index is defined %} --genome-index { refgenie[sample.genome].bowtie2_index.dir } {% endif %} {% else %} {% if sample.genome_index is defined %} --genome-index { sample.genome_index } {% elif refgenie[sample.genome].bwa_index is defined %} --genome-index { refgenie[sample.genome].bwa_index.dir } {% endif %} {% endif %}
21+
{% if aligner == "bowtie2" or sample.aligner == "bowtie2" %} {% if sample.genome_index is defined %} --genome-index { sample.genome_index } {% elif sample.genome in refgenie and refgenie[sample.genome].bowtie2_index is defined %} --genome-index { refgenie[sample.genome].bowtie2_index.dir } {% endif %} {% else %} {% if sample.genome_index is defined %} --genome-index { sample.genome_index } {% elif sample.genome in refgenie and refgenie[sample.genome].bwa_index is defined %} --genome-index { refgenie[sample.genome].bwa_index.dir } {% endif %} {% endif %}
2222
{% if sample.prealignment_index is defined %} --prealignment-index { sample.prealignment_index } {% endif %}
23-
{% if sample.prealignment_names is defined %} {% if aligner == "bowtie2" or sample.aligner == "bowtie2" %} --prealignment-index {% for p in sample.prealignment_names %} { p ~ '=' ~ refgenie[p].bowtie2_index.dir } {% endfor %} {% else %} --prealignment-index {% for p in sample.prealignment_names %} { p ~ '=' ~ refgenie[p].bwa_index.dir } {% endfor %} {% endif %} {% endif %}
23+
{% if sample.prealignment_names is defined %} {% if aligner == "bowtie2" or sample.aligner == "bowtie2" %} --prealignment-index {% for p in sample.prealignment_names %} {% if p in refgenie %} { p ~ '=' ~ refgenie[p].bowtie2_index.dir } {% endif %} {% endfor %} {% else %} --prealignment-index {% for p in sample.prealignment_names %} {% if p in refgenie %} { p ~ '=' ~ refgenie[p].bwa_index.dir } {% endif %} {% endfor %} {% endif %} {% endif %}
2424
{% if sample.deduplicator is defined %} --deduplicator { sample.deduplicator } {% endif %}
2525
{% if sample.peak_caller is defined %} --peak-caller { sample.peak_caller } {% else %} --peak-caller "macs3" {% endif %}
2626
{% if sample.peak_type is defined %} --peak-type { sample.peak_type } {% else %} --peak-type "fixed" {% endif %}
@@ -29,15 +29,16 @@ sample_interface:
2929
{% if sample.frip_ref_peaks is defined %} --frip-ref-peaks { sample.frip_ref_peaks } {% endif %}
3030
{% if sample.motif is defined %} --motif {% endif %}
3131
{% if sample.sob is defined %} --sob {% endif %}
32-
{% if sample.sob is defined %} {% if refgenie[sample.genome].tallymer_index is defined %} --search-file { refgenie[sample.genome].tallymer_index.search_file } {% endif %} {% endif %}
33-
{% if sample.sob is defined %} {% if refgenie[sample.genome].fasta is defined %} --fasta { refgenie[sample.genome].fasta.fasta } {% endif %} {% endif %}
34-
{% if sample.fasta is defined %} --fasta { sample.fasta } {% elif refgenie[sample.genome].fasta is defined %} --fasta { refgenie[sample.genome].fasta.fasta } {% endif %}
32+
{% if sample.sob is defined %} {% if sample.genome in refgenie and refgenie[sample.genome].tallymer_index is defined %} --search-file { refgenie[sample.genome].tallymer_index.search_file } {% endif %} {% endif %}
33+
{% if sample.sob is defined %} {% if sample.genome in refgenie and refgenie[sample.genome].fasta is defined %} --fasta { refgenie[sample.genome].fasta.fasta } {% endif %} {% endif %}
34+
{% if sample.fasta is defined %} --fasta { sample.fasta } {% elif sample.genome in refgenie and refgenie[sample.genome].fasta is defined %} --fasta { refgenie[sample.genome].fasta.fasta } {% endif %}
3535
{% if sample.no_scale is defined %} --no-scale {% endif %}
3636
{% if sample.prioritize is defined %} --prioritize {% endif %}
3737
{% if sample.keep is defined %} --keep {% endif %}
3838
{% if sample.no_fifo is defined %} --noFIFO {% endif %}
3939
{% if sample.lite is defined %} --lite {% endif %}
4040
{% if sample.skipqc is defined %} --skipqc {% endif %}
41+
{% if sample.skip_dedup is defined %} --skip-dedup {% endif %}
4142
--pipestat-config {pipestat.config_file}
4243
4344
compute:

tools/bamQC.py

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,13 +11,22 @@
1111
__email__ = "jasonsmith@virginia.edu"
1212

1313
from argparse import ArgumentParser
14+
import logging
1415
import os
1516
import sys
1617
import pararead
1718
import logmuse
1819
import pandas as _pd
1920
import numpy as np
2021

22+
# Module-level fallback so class methods always have a logger, even when
23+
# pararead workers re-import this module under multiprocessing 'spawn'
24+
# (macOS default, and elsewhere when fork is unavailable). The __main__
25+
# block below upgrades this to a logmuse-configured logger for the
26+
# parent process.
27+
_LOGGER = logging.getLogger(__name__)
28+
29+
2130
class bamQC(pararead.ParaReadProcessor):
2231
def __init__(self, reads_filename, n_proc, out_filename, verbosity):
2332
"""

tools/bamSitesToWig.py

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@
88

99
from argparse import ArgumentParser
1010
import itertools # Used for nested region looping across reads
11+
import logging
1112
import numpy
1213
from operator import methodcaller
1314
import os
@@ -21,6 +22,12 @@
2122

2223
MODES = ["dnase", "atac"]
2324

25+
# Module-level fallback logger so class methods always have one available,
26+
# even when pararead workers re-import this module under multiprocessing
27+
# 'spawn' (macOS default). The __main__ block upgrades this to a
28+
# logmuse-configured logger for the parent process. (#266)
29+
_LOGGER = logging.getLogger(__name__)
30+
2431
# A function object like this will be pickled by the parallel call to map,
2532
# So it cannot contain huge files or the pickling will limit everything.
2633
# For this reason I must rely on global vars for the big stuff.

0 commit comments

Comments
 (0)