
V2 #25

Open
ch99l wants to merge 53 commits into devel from v2

Conversation

ch99l (Collaborator) commented Apr 15, 2026

Major Update of Bambu-Single Cell Pipeline

Summary of Changes Made:

  1. Support for new single cell and spatial 10x kits
  2. Refactor main.nf script (move process-specific code to modules/subworkflows)
  3. Create assets directory (10x_config) to store static configuration files
  4. Update nextflow.config file
  5. Include new processes in pipeline to improve fastq processing (chopper and cutadapt)
  6. Fix Nextflow parallel processing logic
  7. Update README.md file

ClareRobin commented Apr 15, 2026

Code review (from the Claude /code-review plugin, using our best-practices instruction .md file)

Found 5 issues:

  1. All containers use :latest tags — reproducibility is not guaranteed (CLAUDE.md says "Always pin to an explicit version — never use latest")

    Affects every module: bambu-pipe-r:latest, bambu-pipe-preprocess:latest, bambu-pipe-alignment:latest. Replace with pinned digest or immutable tag, e.g. via Seqera/Wave for CLI tools and a digest for GHCR images.

    publishDir "$params.output_dir", mode: 'copy', pattern: '*extended_annotations.gtf'
    container "ghcr.io/ch99l/bambu-pipe-r:latest"
    label "medium_cpu"
    label "high_mem"
    label "medium"
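
A minimal sketch of the requested fix, pinning by digest instead of `:latest` (the digest below is a placeholder, not a real value):

```nextflow
publishDir "$params.output_dir", mode: 'copy', pattern: '*extended_annotations.gtf'
// immutable reference: a digest (or a versioned tag) instead of :latest
container "ghcr.io/ch99l/bambu-pipe-r@sha256:<digest>"
label "medium_cpu"
label "high_mem"
label "medium"
```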

  2. Processes are defined inside subworkflow files rather than imported from separate module files (CLAUDE.md says "Do not define processes inside subworkflows — import from modules only")

    subworkflows/alignment.nf defines MINIMAP_BUILD_INDEX, PAFTOOLS_GFF2BED, and MINIMAP_ALIGNMENT inline. subworkflows/prepare_input_standard.nf defines EXTRACT_10X_BARCODES and EXTRACT_10X_SPATIAL_COORDINATES inline. Each should live in its own modules/<tool>/main.nf.

    process MINIMAP_BUILD_INDEX {
        container "ghcr.io/ch99l/bambu-pipe-alignment:latest"
        label "low_cpu"
        label "medium_mem"
        label "short"

        input:
        path(genome)
        val(fastq_count)

        when: fastq_count > 0 // only build index if there are fastq samples to process

        output:
        path('ref.mmi')

        script:
        """
        minimap2 -k15 -w5 -d ref.mmi $genome # -k and -w flags are used for both splice:hq and splice presets
        """
    }
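
A hedged sketch of the requested split (module paths follow the `modules/<tool>/main.nf` convention named in CLAUDE.md; the exact workflow wiring below is an assumption):

```nextflow
// modules/minimap_build_index/main.nf — the process body above moves here unchanged

// subworkflows/alignment.nf — the subworkflow only imports and wires modules
include { MINIMAP_BUILD_INDEX } from '../modules/minimap_build_index/main.nf'
include { PAFTOOLS_GFF2BED }    from '../modules/paftools_gff2bed/main.nf'
include { MINIMAP_ALIGNMENT }   from '../modules/minimap_alignment/main.nf'

workflow ALIGNMENT {
    take:
    genome
    fastq_count

    main:
    index = MINIMAP_BUILD_INDEX(genome, fastq_count)
}
```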

  3. No process emits versions.yml (CLAUDE.md says "Every process must emit versions.yml alongside its real outputs, using explicit emit: names")

    None of the new or modified modules (BAMBU, BAMBU_EM, BAMBU_CONSTRUCT_READ_CLASS, BAMBU_PREPARE_ANNOTATION, PREPROCESS_FASTQ, SEURAT_CLUSTERING, MINIMAP_ALIGNMENT, etc.) emit a versions.yml file. This blocks version tracking and aggregation.

    output:
    path ('*quantData.rds'), emit: quant_data
    path ('*extended_annotations.rds'), emit: extended_annotations
    path ('*extended_annotations.gtf'), emit: extended_annotations_gtf
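
One common way to satisfy this (nf-core style heredoc; the exact version command is an assumption about what is available in the container):

```nextflow
output:
path('*quantData.rds'), emit: quant_data
path('*extended_annotations.rds'), emit: extended_annotations
path('*extended_annotations.gtf'), emit: extended_annotations_gtf
path('versions.yml'), emit: versions

script:
"""
cat <<-END_VERSIONS > versions.yml
"${task.process}":
    bambu: \$(Rscript -e 'cat(as.character(packageVersion("bambu")))')
END_VERSIONS
"""
```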

  4. timeline, report, trace, and dag are not enabled in nextflow.config (CLAUDE.md says "nextflow.config must enable timeline, report, trace, dag")

    The new nextflow.config adds params, process, and profiles blocks but omits the four reporting directives entirely.

    params {
        // Mandatory input
        input = null // Path to samplesheet .csv file
        genome = null // Path to .fa or .fasta file
        annotation = null // Path to .gtf or .gff file
        // Optional: Output directory
        output_dir = "output" // Path to output directory
        /*
        Optional: Samplesheet settings (Non Visium HD samples only)
        Note: Use this if all samples share the same chemistry/technology
        */
        chemistry = null // Examples: "10x3v2", "10x3v3", "10x5v2", "visium-v1"
        technology = null // Options: "ONT", "PacBio"
        // Optional: Early termination
        early_stop_stage = null // Options: "rds", "bam"
        // Optional: Q-score filtering
        qscore_filtering = true // boolean
        // Optional: Bambu parameters
        ndr = null // null or float
        deduplicate_umis = true // boolean
        // Optional: Quantification mode
        quantification_mode = "EM_clusters" // Options: "no_quant", "EM", "EM_clusters"
        // Optional: Seurat clustering
        resolution = 0.8 // float
        // Development parameters (DO NOT EDIT)
        bambu_path = null
        valid_chemistries = ['10x3v2', '10x3v3', '10x3v4', '10x5v2', '10x5v3', 'visium-v1', 'visium-v2', 'visium-v3', 'visium-v4', 'visium-v5']
        valid_technologies = ['ONT', 'PacBio']
        valid_quantification_modes = ['no_quant', 'EM', 'EM_clusters']
        valid_early_stop_stages = ['rds', 'bam', null]
        save_intermediates = false
        qfilter_threshold = 10
        flexiplex_f_5prime = 8
        flexiplex_f_3prime = 13
        flexiplex_e = 1
        process_by_chromosome = true
        fusion_mode = false
        jaffal_ref_dir = null
        jaffal_code_dir = "$projectDir/jaffal"
        cellranger_dir = "/opt/spaceranger-4.0.1/lib/python/cellranger/barcodes"
    }
    process {
        // Retry strategy: up to 3 attempts, doubling memory and adding CPUs each retry
        maxRetries = 3
        errorStrategy = { task.exitStatus in [130, 137, 139, 143] ? 'retry' : 'finish' }
        // CPU Labels (scale up with retries)
        withLabel: 'low_cpu' { cpus = { 4 * (1 + task.attempt) } }
        withLabel: 'medium_cpu' { cpus = { 16 * task.attempt } }
        withLabel: 'high_cpu' { cpus = { Math.min(32 * task.attempt, 128) } }
        // Memory Labels (double each retry)
        withLabel: 'low_mem' { memory = { 16.GB * Math.pow(2, task.attempt - 1) } }
        withLabel: 'medium_mem' { memory = { 64.GB * Math.pow(2, task.attempt - 1) } }
        withLabel: 'high_mem' { memory = { 128.GB * Math.pow(2, task.attempt - 1) } }
        // Time Labels (increase with retries)
        withLabel: 'short' { time = { 1.h * task.attempt } }
        withLabel: 'medium' { time = { 4.h * task.attempt } }
        withLabel: 'long' { time = { 12.h * task.attempt } }
        // CPU, Memory, and Time allocation for Bambu (Modify if necessary, especially when using pipeline for multiple samples)
        withName: 'BAMBU' {
            cpus = { 16 * task.attempt }
            memory = { 128.GB * Math.pow(2, task.attempt - 1) }
            time = { 4.h * task.attempt }
        }
        withName: 'BAMBU_EM' {
            cpus = { 16 * task.attempt }
            memory = { 128.GB * Math.pow(2, task.attempt - 1) }
            time = { 12.h * task.attempt } // increase time allocation if params.quantification_mode = EM
        }
    }
    profiles {
        // Container profiles
        singularity {
            singularity.enabled = true
            singularity.autoMounts = true
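
For reference, the four missing directives could look like this at the top level of nextflow.config (the output paths are suggestions, not the repo's convention):

```nextflow
timeline {
    enabled = true
    file    = "${params.output_dir}/pipeline_info/timeline.html"
}
report {
    enabled = true
    file    = "${params.output_dir}/pipeline_info/report.html"
}
trace {
    enabled = true
    file    = "${params.output_dir}/pipeline_info/trace.txt"
}
dag {
    enabled = true
    file    = "${params.output_dir}/pipeline_info/dag.html"
}
```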

  5. NDR interpolation works only by coincidence, and bambu is installed from a moving branch (reproducibility bug)

    In main.nf, def ndr = params.ndr ?: 'NULL' produces the Groovy string "NULL". When interpolated into the Rscript block as NDR = $ndr, R sees NDR = NULL, which parses as R's NULL only because the bare word NULL is an R keyword; the fallback happens to work, but by coincidence rather than by design. The clearer reproducibility problem is that the R Dockerfile installs bambu from the branch reference GoekeLab/bambu@devel_pre_v4 rather than a pinned commit or tag, so the installed version is not reproducible. CLAUDE.md says "Pin... all package versions".

    # install Seurat Object (v5.3.0), Seurat (v5.4.0), and Bambu
    RUN R -e "install.packages(c('pak', 'devtools', 'BiocManager'), repos='https://cloud.r-project.org')"
    RUN R -e "pak::pkg_install(c('SeuratObject@5.3.0', 'Seurat@5.4.0', 'GoekeLab/bambu@devel_pre_v4'))"
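
A hedged sketch of the pinning fix (`<commit-sha>` is a placeholder; substitute the actual commit or a release tag):

```dockerfile
# pin bambu to an immutable reference instead of the moving devel_pre_v4 branch
RUN R -e "pak::pkg_install(c('SeuratObject@5.3.0', 'Seurat@5.4.0', 'GoekeLab/bambu@<commit-sha>'))"
```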

🤖 Generated with Claude Code


ClareRobin left a comment


My (human) review: (summary)

Biggest issue:

  • example data (as currently shipped) can't easily be run (and all the samplesheets currently in examples/ are out of date) - it looks like the reads_chr9_1_1000000.fastq.gz files are 10x5v2; to be able to run the pipeline I had to:
  1. make a new samplesheet
  2. fix resource limits in nextflow config for profile local
  3. fix the staging clash
  4. fix sampleData (spatial) parameter in process BAMBU {}

Other (more minor issues):

  • Various README comments (inconsistencies between the README & the actual pipeline).

Comment thread .gitignore


Clare (human) comments :)
Suggested having a more comprehensive .gitignore, e.g.:

```
.nextflow.log*
work/
results/
*.command.*
*.pyc
.DS_Store
.vscode/
```

Comment thread README.md
```
nextflow run GoekeLab/bambu-singlecell-spatial \
--reads $PWD/examples/samplesheet_basic.csv \ # See the arguments section for format specifications
nextflow run $PWD/bambu-singlecell-spatial \
```


Clare (human) comment:
This example command in the README has contradictory $PWD references — $PWD/bambu-singlecell-spatial assumes you're in the parent directory, but $PWD/examples/... assumes you're inside the repo.

It should be either:

```
nextflow run $PWD/bambu-singlecell-spatial \
    --input $PWD/bambu-singlecell-spatial/examples/samplesheet.csv \
    --genome $PWD/bambu-singlecell-spatial/examples/Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fa \
    --annotation $PWD/bambu-singlecell-spatial/examples/Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf \
    -profile singularity,hpc
```

or

```
nextflow run . \
    --input $PWD/examples/samplesheet.csv \
    --genome $PWD/examples/Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fa \
    --annotation $PWD/examples/Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf \
    -profile singularity,hpc
```

(or run main.nf also works)



yup I think run main.nf is the safest to go?

Comment thread subworkflows/alignment.nf Outdated
Comment thread nextflow.config
Comment thread README.md


Clare (human) comment:
Note - all the samplesheets currently in examples/ have incorrect headers for how the pipeline runs now (so they need to be updated)



Another note: the sample sheet that Clare generated for review uses the example data in the repo, which only contains 10x 5' v2 ONT reads. The two files are actually identical, which means Bambu is most likely going to generate the same read class files for both.

Also, this makes it hard to validate whether the pipeline still truly functions as intended when using other chemistries/technologies. Maybe an extra 1 to 2 samples could be sourced and included in the sample sheet to exercise that functionality?

Comment thread main.nf
Comment thread modules/bambu.nf
Comment thread modules/bambu.nf
Comment thread nextflow.config

ClareRobin commented Apr 16, 2026

GitHub Copilot code review (using our Nextflow best-practices GitHub instruction .md file)

Findings:

  1. High: barcode extraction is hard-wired to a path that does not exist in the declared spaceranger container, so the pipeline dies before any real sample processing starts. The failure is in subworkflows/prepare_input_standard.nf, with the path coming from nextflow.config. I reproduced this with a stub-run: EXTRACT_10X_BARCODES failed trying to copy 737K-august-2016.txt from /opt/spaceranger-4.0.1/lib/python/cellranger/barcodes, which is not present in that container. As written, this blocks all chemistries.

  2. High: multi-sample non-visium runs will hit a Nextflow input file collision in BAMBU. For non-visium chemistries, subworkflows/prepare_input_standard.nf creates the same placeholder spatial file name per chemistry and attaches that same path to every sample. Later, main.nf collects all of those paths into one list and passes them to modules/bambu.nf. Nextflow rejects duplicate file names within a single path-list input, so two 10x3v2 samples will fail before BAMBU starts.
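
One possible fix, assuming the placeholder files must stay: have the BAMBU input stage them under unique names so the file names no longer collide (the input name spatial_files is illustrative):

```nextflow
input:
// stageAs with a wildcard renames each incoming file (spatial_01.txt, spatial_02.txt, ...)
path(spatial_files, stageAs: 'spatial_??.txt')
```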

  3. Medium: the branch's own documented and shipped example inputs no longer match the parser contract, so the advertised smoke-test path is broken. The parser in subworkflows/prepare_input_standard.nf now requires a samplesheet with a path column, but the bundled examples still use fastq or bam columns in examples/samplesheet_basic.csv, examples/samplesheet_bam_example.csv, and examples/samplesheet_custom_example.csv. README.md also tells users to run a non-existent sample sheet and points to a non-existent template. I verified that a stub-run against the bundled example data now fails immediately with "Samplesheet is missing a required path column."

Clare's (human) note on this feedback

For issue 1: there are currently docker and singularity profiles; this issue isn't a problem with either of those, but if the user runs the pipeline without a docker or singularity profile (i.e. using locally installed software), the spaceranger file path will fail. If we want to avoid this possible failure and make it unambiguous, we could add a startup validation in main.nf that exits unless docker or singularity is enabled.
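
A minimal sketch of that startup guard (placement and wording are suggestions):

```nextflow
// main.nf — fail fast when no container engine is active
if (!workflow.containerEngine) {
    error "This pipeline expects -profile docker or -profile singularity; local execution lacks the spaceranger barcode files."
}
```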

Comment thread modules/bambu.nf


Clare (human) comment:
As discussed - move all the bambu processes to a modules/bambu folder with .nf files for each command.



Also - related: since bambu is changing, I suggest adding a bambu_container param (or equivalent) so that if there's a new bambu container you don't have to manually go into each .nf file and update the container.
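
A sketch of that parameter (names are suggestions):

```nextflow
// nextflow.config — single source of truth for the bambu container
params.bambu_container = 'ghcr.io/ch99l/bambu-pipe-r:<pinned-tag>'

// modules/bambu/main.nf — every bambu process references the param
process BAMBU {
    container params.bambu_container
    // ...
}
```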

hafiz-ismail left a comment


Mostly general comments


Comment thread README.md

The BambuSC pipeline can be started from raw reads (fastq.gz) or from a demultiplexed .bam file if you have already produced one (from earlier runs of this pipeline or other upstream tools). Therefore either the --reads or the --bams argument is mandatory, depending on your input files.
Executor profiles:
- `hpc`: execute pipeline on an HPC system


Is it good to have a profile for 'local', or a note to 'leave blank for local' as well?


cellMixs = list()
source(file.path(bin_path,"/utilityFunctions.R"))
for(quantData in se){
quantData.gene = transcriptToGeneExpression(quantData)
lingminhao (Collaborator) commented Apr 16, 2026


I am just putting a future comment related to my PR here. After it has been merged, quantData will no longer be a list of se objects. You might need to change the code here for a successful run. I can provide you with the code.

