STAR drop-in: output-shape gaps (Log.out / Log.progress.out missing, SJ.pass1.out.tab at top level) #28

@pinin4fjords

Summary

There are two small STAR drop-in gaps in the output layout. Neither is worth filing individually, but together they create friction for pipelines that rely on STAR's output convention. Both are verifiable with one paired MRE.

  • Log.out and Log.progress.out are never written (STAR writes them alongside Log.final.out).
  • SJ.pass1.out.tab is emitted at the top level, where STAR keeps its pass-1 intermediates inside <prefix>_STARpass1/.

STAR reference behaviour

STAR's header writer is at source/samHeaders.cpp; separate Log.out and Log.progress.out writers are at source/Parameters_openReadsFiles.cpp and the source/InOutStreams.cpp init. Pass-1 outputs (Log.final.out, SJ.out.tab) live inside <prefix>_STARpass1/ — the two-pass orchestration is in source/twoPass.cpp (it mkdirs the _STARpass1 directory and redirects pass-1 output into it).

Reproducer

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-28 && cd /tmp/rustar-mre-28

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

echo "=== STAR top-level + pass1 dir ==="
ls -d STAR* 2>/dev/null
echo "--- inside STAR._STARpass1: ---"
ls -1 STAR._STARpass1/ 2>/dev/null || echo "(no _STARpass1 directory)"

echo
echo "=== rustar (RUS. is a directory; see issue #26) ==="
ls -1 RUS./
echo "--- pass-1 intermediate dir present? ---"
ls -d RUS./*_STARpass1 2>/dev/null || echo "(no _STARpass1 directory inside RUS./)"

Observed (verified on the same fresh run)

STAR (top level):

STAR.Aligned.out.bam
STAR.Aligned.toTranscriptome.out.bam   # only with --quantMode TranscriptomeSAM (not set in the reproducer)
STAR.Log.final.out
STAR.Log.out
STAR.Log.progress.out
STAR.SJ.out.tab
STAR._STARgenome/
STAR._STARpass1/

STAR pass-1 contents:

Log.final.out
SJ.out.tab

rustar (inside RUS./):

Aligned.out.bam
Aligned.toTranscriptome.out.bam       # only with --quantMode TranscriptomeSAM (not set in the reproducer)
Log.final.out                         # <-- only this; no Log.out / Log.progress.out
SJ.out.tab
SJ.pass1.out.tab                      # <-- at top level, not in a _STARpass1/ dir
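
The comparison above can also be scripted. This is a minimal sketch, assuming the expected names are exactly those in the STAR listing for a two-pass run; `missing_star_outputs` is a hypothetical helper, not part of STAR or rustar.

```shell
# Hypothetical helper: print every STAR-convention output that is missing
# for a given --outFileNamePrefix. The expected names are taken from the
# STAR listing above (two-pass run); this is an assumption, not a spec.
missing_star_outputs() {
    local prefix=$1 f
    for f in Log.out Log.progress.out Log.final.out SJ.out.tab \
             _STARpass1/Log.final.out _STARpass1/SJ.out.tab; do
        # The prefix is prepended verbatim, as STAR does with its own prefix.
        [ -e "${prefix}${f}" ] || printf '%s\n' "${prefix}${f}"
    done
}
```

Run from the reproducer directory, `missing_star_outputs STAR.` should print nothing, while `missing_star_outputs RUS./` should list the files this issue is about.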

Suggested fix

Gap 1: Log.out / Log.progress.out

Add a Log.out writer that records parameters at start and a periodic Log.progress.out writer (per-chunk; STAR updates ~every minute during alignment). Even minimal stubs (a parameter dump for Log.out, a single "done" line for Log.progress.out) would close the file-existence gap.
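
Until rustar writes these files itself, a wrapper could close the file-existence gap on the pipeline side. The following is a minimal sketch of that stopgap, not the proposed fix inside rustar. `write_stub_logs` is a hypothetical helper, and the stub contents are placeholders that do not reproduce STAR's real log format.

```shell
# Hypothetical wrapper-side stopgap: create stub logs next to Log.final.out
# when the aligner did not write them. Existing files are left untouched.
# The contents are placeholders, not STAR's actual Log.out format.
write_stub_logs() {
    local prefix=$1
    shift   # remaining arguments: the aligner command line to record
    [ -e "${prefix}Log.out" ] \
        || printf 'command line: %s\n' "$*" > "${prefix}Log.out"
    [ -e "${prefix}Log.progress.out" ] \
        || printf 'done\n' > "${prefix}Log.progress.out"
}
```

For the run above this could be called as `write_stub_logs RUS./ rustar-aligner --genomeDir idx-rustar "${COMMON[@]}"`, after the aligner exits.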

Gap 2: SJ.pass1.out.tab location

Move pass-1 intermediates inside <prefix>_STARpass1/. STAR uses:

<prefix>_STARpass1/
    Log.final.out
    SJ.out.tab          # <-- pass-1 SJ tab, named SJ.out.tab inside the dir

This also gives a natural home if rustar ever wants to expose pass-1 stats separately, the way STAR does.
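
As an interim measure on the wrapper side, the relocation can be sketched as below. This assumes the only pass-1 intermediate rustar emits is `SJ.pass1.out.tab`, as in the listing above; `relocate_pass1_sj` is a hypothetical helper name.

```shell
# Hypothetical post-run shim: move rustar's top-level pass-1 junction file
# into the location STAR uses, <prefix>_STARpass1/SJ.out.tab.
relocate_pass1_sj() {
    local prefix=$1
    local src="${prefix}SJ.pass1.out.tab"
    local dir="${prefix}_STARpass1"
    [ -e "$src" ] || return 0   # nothing to move (e.g. single-pass run)
    mkdir -p "$dir"
    mv "$src" "$dir/SJ.out.tab"
}
```

With the prefix-as-directory behaviour from #26, `relocate_pass1_sj RUS./` would yield `RUS./_STARpass1/SJ.out.tab`. It does not create the pass-1 `Log.final.out`, which only the aligner can produce.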

Severity

Low. Today nf-core/rnaseq works around both with optional: true outputs and a permissive *.tab glob. Mostly a drop-in compatibility cleanup; if either is out of scope for v0.1.x, please say so and we'll keep the workarounds.

Related: #22 mate fields (functional, higher severity); #26 prefix-as-dir and #25 --limitGenomeGenerateRAM rejection are filed separately as they need real code changes rather than additional output writers.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). To find all sibling issues from this exercise, filter by author:pinin4fjords or search for nf-core/rnaseq#1855.
