Tidy readClassList and quantData by lingminhao · Pull Request #565 · GoekeLab/bambu

lingminhao · 2026-04-16T04:44:43Z

Related Issue

Addresses #566

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (please specify if the change breaks existing functionality)
- Non breaking change (the feature doesn't change existing functionality)
- Breaking change (the feature that would cause existing functionality to not work as expected)
Documentation update
- to be updated
Performance optimization

Description

This PR aims to tidy the data structure for readClassList & quantData. In particular, we minimises the information required to store in quantData so that it is sufficient for generating unique count matrix, gene count matrix and can be used to perform EM quantification. We also aim to document all the datas stored in readClassList & quantData for clarify (i.e. what every column in the dataframe does)

Implementation Details

convert quantData into an object storing list of S4 object: I created a S4 class with 5 slots to store all necessary components for generating gene count matrix, unique count matrix and EM quantification.
refactor quantData:
- remove countMatrix and integrate them into readClassDt: I removed countMatrix which is the observed equiRC to cell matrix and directly integrate into the readClassDt as two list columns [columnsIDs, columnCounts].
- remove redundant stored data in quantData: I removed data frames like sampleName, nonuniqueCounts, incompatibleCountMatrix & countMatrix that is originally stored in quantData. Now quantData only contains readClassDt, incompatibleCounts, sampleData (added because it's originally stored in colData in se). Optional dataframe includes readToTranscriptMap (when trackReads is set to TRUE) & distTable (when returnDistTable is set to TRUE)
add gene and unique count matrix helper functions from quantData: generateUniqueCountFromQuantData(quantData, extendedAnno) and generateGeneCountFromQuantData(quantData, extendedAnno) was created to get the gene and unique count matrix for all samples & cells from the quantData

Impact of Changes

results are identical to the previous version.
the data structure for quantData is changed from list of se's to a list of S4 object (with 5 slots storing readClassDt, incompatibleCounts, sampleData, optionally readToTranscriptMap & distTable)
incompatibleCounts are updated to be a gene-by-cell matrix instead of gene-by-sample matrix
final se object now contains both readToTranscriptMaps (list of all samples) and distTables when the arguments are turned on.
readClassList, quantData, readToTranscriptMaps and distTables (from the final se) now becomes a named list instead of indexed numerically

test report on single-cell data:
se_identical.txt
transcript_comparison_report.pdf
bambu_benchmark.pdf

test report on bulk data:
se_identical.txt
transcript_comparison_report.pdf
bambu_benchmark.pdf

Checklist

My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation (vignettes, man pages).
I have tested the code on a full dataset, and any differences have been described in the Impact of Changes section.

…[columnIds, columnCounts] in readClassDt

…ists by sample names

Copilot

Pull request overview

This PR refactors how intermediate quantification inputs are represented and passed through the bambu pipeline by introducing a new quantData S4 class and by embedding per-column observed counts into readClassDt (via list-columns) instead of storing full count matrices in quantData.

Changes:

Introduces an exported S4 quantData class (with a $ accessor) and changes assignReadClasstoTranscripts() to return this object.
Refactors quantification to compute per-sample/cluster nobs from readClassDt’s columnIds/columnCounts list-columns rather than from stored count matrices.
Renames/propagates per-column identifiers in read-class construction (sampleID → columnID) and makes returned lists named rather than numerically indexed.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
R/bambu.R	Adapts quantification loop to new `quantData` structure; adds naming and attaches optional read/dist metadata to output SE.
R/bambu-quantify.R	Updates `bambu.quantify()` signature and computes `nobs` from embedded `columnIds/columnCounts`.
R/bambu-quantify_utilityFunctions.R	Adjusts equivalence-class utility calculations (e.g., removes `nobs` derivation in one helper).
R/bambu-processReads.R	Renames sample identifiers, adds column-level count metadata, and updates incompatible-count handling.
R/bambu-processReads_utilityConstructReadClasses.R	Propagates `columnID/columnIds` into read-class construction and stored metadata columns.
R/bambu-classes.R	Adds exported S4 `quantData` class and `$` method.
R/bambu-assignDist.R	Refactors quantData generation to create the new S4 object and adds quantData-based count helper functions.
NAMESPACE	Exports the new `quantData` class and `$` method.

Comments suppressed due to low confidence (1)

R/bambu-processReads.R:414

The sparseMatrix( call that constructs counts is missing its closing ) before the incompatible-counts section begins, which makes this function a syntax error and prevents the package from loading.

    counts <- sparseMatrix(
        i = rep(seq_along(counts.table), lengths(counts.table)),
        j = as.numeric(names(unlist(counts.table))),
        x = unlist(counts.table),
        dims = c(nrow(eqClasses), nrow(metadata(readClassFile)$sampleData)))

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

jonathangoeke · 2026-04-16T06:41:15Z

Code review

Found 2 issues:

quantData_i$sampleNames accesses a non-existent slot on the new quantData S4 class, which will throw a runtime error when clusters is provided in non-CompressedCharacterList format. The quantData class defines only 5 slots (sampleData, readClassDt, incompatibleCounts, distTable, readToTranscriptMap) and the $ method calls slot(x, name), so accessing sampleNames will fail. Likely should be quantData_i$sampleData$sampleName.

bambu/R/bambu.R

Lines 280 to 288 in ed08825

    
           clusterMaps <- NULL 
        
           for(j in seq_along(quantData_i$sampleNames)){ #load in a file per sample name provided 
        
             clusterMap <- fread(clusters[[j]], header = FALSE,  
        
                                 data.table = FALSE) 
        
             # read.table(clusters[[j]],  
        
             #     sep = ifelse(grepl(".tsv$",clusters[[j]]), "\t", ","),  
        
             #     header = FALSE) 
        
             clusterMap[,1] <- paste0(quantData_i$sampleNames[j], 
        
                                      "_",clusterMap[,1])

Class definition for reference:

bambu/R/bambu-classes.R

Lines 6 to 15 in ed08825

    
           #' @export 
        
           setClass("quantData", 
        
               slots = list( 
        
                   sampleData = "data.frame", 
        
                   readClassDt = "data.table", 
        
                   incompatibleCounts = "ANY", 
        
                   distTable = "ANY", 
        
                   readToTranscriptMap = "ANY" 
        
               ) 
        
           )

generateUniqueCountsFromQuantData, generateNonUniqueCountsFromQuantData, and generateGeneCountsFromQuantData are defined but never called from anywhere in the codebase. They appear to be replacements for the removed generateUniqueCounts / generateNonUniqueCounts functions, but are not wired into the pipeline.

bambu/R/bambu-assignDist.R

Lines 62 to 157 in ed08825

    
           #' @noRd 
        
           generateUniqueCountsFromQuantData <- function(quantData, annotations){ 
        
               uniqueCountsList <- lapply(quantData, function(x) { 
        
                   readClassDt <- x$readClassDt 
        
                   x_filtered <- readClassDt %>% filter(!multi_align & !is.na(eqClass.match)) 
        
                   uniqueCounts <- if(nrow(x_filtered) == 0) { 
        
                       sparseMatrix(i = 1, j = 1, x = 0, dims = c(length(annotations), nrow(x$sampleData))) 
        
                   } else { 
        
                       txids <- mcols(annotations)$txid 
        
                       i <- rep(match(x_filtered$txid, txids), lengths(x_filtered$columnIds)) 
        
                       j <- unlist(x_filtered$columnIds) 
        
                       x_vals <- unlist(x_filtered$columnCounts) 
        
                       sparseMatrix(i = i, j = j, x = x_vals, dims = c(length(annotations), nrow(x$sampleData))) 
        
                   } 
        
                   rownames(uniqueCounts) <- names(annotations) 
        
                   colnames(uniqueCounts) <- rownames(x$sampleData) 
        
                   return(uniqueCounts) 
        
               }) 
        
               return(do.call(cbind, uniqueCountsList)) 
        
           } 
        
           #' Generate non-unique counts 
        
           #' @noRd 
        
           generateNonUniqueCountsFromQuantData <- function(quantData, annotations){ 
        
               nonuniqueCountsList <- lapply(quantData, function(x_obj) { 
        
                   readClassDt <- x_obj$readClassDt 
        
                   #fuse multi align RCs by gene 
        
                   x <- readClassDt %>% filter(multi_align & !is.na(eqClass.match)) 
        
                   x <- x %>% distinct(eqClassId, .keep_all = TRUE) 
        
                   if(nrow(x) == 0) { 
        
                       genes <- levels(factor(unique(mcols(annotations)$GENEID))) 
        
                       geneMat <- sparseMatrix(i = 1, j = 1, x = 0, dims = c(length(genes), nrow(x_obj$sampleData)), dimnames = list(genes, NULL)) 
        
                       colnames(geneMat) <- rownames(x_obj$sampleData) 
        
                       return(geneMat) 
        
                   } 
        
                   # x has columnIds and columnCounts 
        
                   i <- rep(seq_along(x$gene_sid), lengths(x$columnIds)) 
        
                   j <- unlist(x$columnIds) 
        
                   x_vals <- unlist(x$columnCounts) 
        
                   nonuniqueCounts <- sparseMatrix(i = i, j = j, x = x_vals, dims = c(nrow(x), nrow(x_obj$sampleData))) 
        
                   if(nrow(x)>1 & length(unique(x$gene_sid))>1){ 
        
                       nonuniqueCounts.gene <- sparse.model.matrix(~ factor(x$gene_sid) - 1) 
        
                       nonuniqueCounts <- t(nonuniqueCounts.gene) %*% nonuniqueCounts 
        
                   } else{ 
        
                       warning("The factor variable 'gene_sid' has only one level. Adjusting output.") 
        
                       nonuniqueCounts.gene <- Matrix(1, nrow = nrow(x), ncol = 1, sparse = TRUE) 
        
                       nonuniqueCounts <- t(nonuniqueCounts.gene) %*% nonuniqueCounts 
        
                   } 
        
                   #covert ids into gene ids 
        
                   geneids <- as.numeric(levels(factor(x$gene_sid))) 
        
                   geneids <- x$txid[match(geneids, x$gene_sid)] 
        
                   geneids <- mcols(annotations)$GENEID[as.numeric(geneids)] 
        
                   rownames(nonuniqueCounts) <- geneids 
        
                   #create matrix for all annotated genes 
        
                   genes <- levels(factor(unique(mcols(annotations)$GENEID))) 
        
                   geneMat <- sparseMatrix(length(genes), nrow(x_obj$sampleData), x = 0) 
        
                   rownames(geneMat) <- genes 
        
                   if(!is.null(rownames(nonuniqueCounts))){ 
        
                     geneMat[match(rownames(nonuniqueCounts), rownames(geneMat)), ] <- nonuniqueCounts 
        
                   } 
        
                   colnames(geneMat) <- rownames(x_obj$sampleData) 
        
                   return(geneMat) 
        
               }) 
        
               return(do.call(cbind, nonuniqueCountsList)) 
        
           } 
        
           #' Generate gene counts directly from quantData 
        
           #' @noRd 
        
           generateGeneCountsFromQuantData <- function(quantData, annotations){ 
        
               # 1. Get unique transcript counts and aggregate them to gene level 
        
               uniqueCounts <- generateUniqueCountsFromQuantData(quantData, annotations) 
        
               geneIDs <- factor(mcols(annotations)$GENEID, levels = unique(mcols(annotations)$GENEID)) 
        
               geneCounts <- Matrix::fac2sparse(geneIDs) %*% uniqueCounts 
        
               # 2. Get non-unique counts and incompatible counts 
        
               nonuniqueCounts <- generateNonUniqueCountsFromQuantData(quantData, annotations) 
        
               incompatibleCounts <- do.call(cbind, lapply(quantData, function(x) x$incompatibleCounts)) 
        
               # 3. Align the rows/dimensions and sum them up 
        
               if(!is.null(incompatibleCounts)){ 
        
                   incompatibleCounts <- Matrix(incompatibleCounts[match(rownames(geneCounts), rownames(incompatibleCounts)), ], sparse = TRUE) 
        
                   geneCounts <- geneCounts + incompatibleCounts 
        
               } 
        
               if(!is.null(nonuniqueCounts)){ 
        
                   nonuniqueCounts <- Matrix(nonuniqueCounts[match(rownames(geneCounts), rownames(nonuniqueCounts)), ], sparse = TRUE) 
        
                   geneCounts <- geneCounts + nonuniqueCounts 
        
               } 
        
               return(geneCounts) 
        
           }

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

jonathangoeke · 2026-04-16T06:44:58Z

Code review (additional lower-confidence issues)

The following issues did not pass the default confidence threshold (80/100) and may be false positives, but are noted here for completeness.

Incomplete sampleID to columnID rename (confidence: 75/100) -- In the processByBam = FALSE code path, mcols(readGrgList)$sampleID is assigned on lines 92 and 94, while the processByBam = TRUE path and all downstream functions (isore.constructReadClasses, constructUnsplicedReadClasses, etc.) expect columnID. This will cause column-not-found errors when processByBam = FALSE is used.

bambu/R/bambu-processReads.R

Lines 91 to 95 in ed08825

    
           if(!isFALSE(demultiplexed)){  
        
             mcols(readGrgList)$sampleID <- as.numeric(mcols(readGrgList)$CB) 
        
           } else { 
        
             mcols(readGrgList)$sampleID <- i 
        
           }

Downstream code expecting columnID:

bambu/R/bambu-processReads_utilityConstructReadClasses.R

Lines 19 to 20 in ed08825

    
           mcols(reads.singleExon)$columnID <- mcols(readGrgList[ 
        
               elementNROWS(readGrgList) == 1])$columnID

In-place mutation of shared readClassDt via := (confidence: 50/100) -- bambu.quantify modifies readClassDt by reference using data.table's := operator. Since readClassDt is stored inside the quantData S4 slot, this mutates the shared object. Under fork-based parallelism (MulticoreParam) this is safe due to copy-on-write, but in sequential mode or if the quantData object is reused after quantification, the nobs column will contain stale values from the last iteration. The prior code used $ assignment which triggered R's copy-on-modify semantics.

bambu/R/bambu-quantify.R

Lines 11 to 22 in ed08825

    
           # Use data.table syntax for in-place creation of nobs column for the scope of this function 
        
           readClassDt[, nobs := { 
        
               ids <- columnIds[[1]] 
        
               if (is.null(ids) || length(ids) == 0 || is.na(ids[1])) { 
        
                   0L 
        
               } else if (length(columnIdx) == 1) { 
        
                   match_idx <- match(columnIdx, ids) 
        
                   if (is.na(match_idx)) 0L else columnCounts[[1]][match_idx] 
        
               } else { 
        
                   sum(columnCounts[[1]][ids %in% columnIdx]) 
        
               } 
        
           }, by = eqClassId]

sampleData naming ambiguity (confidence: 25/100) -- The new quantData S4 slot sampleData (a data.frame) shares its name with the bambu() function parameter sampleData (a vector of file paths). This was flagged as confusing in a review comment on PR Add sampleData argument to store multi sample's metadata information (bulk, single-cell, spatial) in se object #542. Not a functional bug, but may cause maintenance confusion.

bambu/R/bambu-classes.R

Lines 8 to 9 in ed08825

    
           slots = list( 
        
               sampleData = "data.frame",

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

jonathangoeke

I suggest to simplify how gene and transcript expression is calculated from quantData to be consistent with other bambu output and organise code accordingly
I have not tested the other code

cying111 · 2026-04-16T09:04:08Z

Tested on sample data, incompatibleCounts results changed
Error: Expected se_v4 to equal se_pr.
Differences:
Attributes: < Component “metadata”: Component “incompatibleCounts”: Attributes: < Component “i”: Mean relative difference: 0.7083333 > >
Attributes: < Component “metadata”: Component “incompatibleCounts”: Attributes: < Component “x”: Mean relative difference: 1.030303 > >

When running on multiple samples, same issue.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ment Replace generateUniqueCountsFromQuantData (returning a sparse matrix) with generateUniqueCountsSEFromQuantData, which returns a transcript-level SE with assays$uniqueCounts, metadata$incompatibleCounts, and metadata$nonuniqueCounts. This makes the output directly compatible with transcriptToGeneExpression. Move both functions into a new summarizeExpression.R (renamed from transcriptToGeneExpression.R) to reflect the broader scope of the file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ounts + incompatibleCounts + nonuniqueCounts Previously transcriptToGeneExpression aggregated EM counts by gene then added incompatibleCounts and nonuniqueCounts, which double-counted nonuniqueCounts since EM counts already incorporate them at the gene level. Now use assays$uniqueCounts as the base, aggregate by gene, then add incompatibleCounts and nonuniqueCounts explicitly. This gives the correct total: reads uniquely assigned to a transcript + reads multi-aligning within a gene + reads incompatible with any transcript. Also fix generateIncompatibleCounts and generateNonUniqueCountMatrix to use unique(mcols(annotations)$GENEID) (appearance order) instead of levels(factor(unique(...))), consistent with combineCountSes, preventing gene ID mislabelling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace catch-all $ method with typed accessors (getSampleData, getReadClassDt, getIncompatibleCounts, getNonuniqueCounts, getDistTable, getReadToTranscriptMap), add prototype, validity checks, and exported constructor constructQuantData. Type incompatibleCounts and nonuniqueCounts slots as sparseMatrix. Update all call sites in bambu.R, bambu-assignDist.R, and summarizeExpression.R. Also fixes pre-existing sampleNames typo bug in bambu.R. Drop exportClass(quantData) and exportMethods("$") from NAMESPACE. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

nonuniqueCountMatrix is now computed on demand in bambu.R and generateUniqueCountsSEFromQuantData, and stored only in the uniqueCountsSe and final SE metadata. Moved generateNonUniqueCountMatrix to bambu_utilityFunctions.R as it is shared across both call sites. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cying111 · 2026-04-17T02:48:32Z

finished testing on different test datasets, and also different parameters, surprisingly, this also fixs a bug that readTranscriptMap not reported when readTranscriptMap is set to TRUE in devel_pre_v4.

- revert transcriptToGeneExpression to use assays$counts + incompatibleCounts (+ nonuniqueCounts if present in metadata, for uniqueCounts SE path) - rename summarizeExpression.R to transcriptToGeneExpression.R - move generateNonUniqueCountMatrix to transcriptToGeneExpression.R - remove nonuniqueCounts from bambu.quantify and combineCountSes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tag SE objects with metadata$seType to distinguish their content: "uniqueCounts", "EMCounts", and "geneCounts" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jonathangoeke

test all changes
document downstream issues
add note on metadata slot for other SE outputs in bambu/create draft issue

Use levels(factor(unique(...))) to ensure genes vector is alphabetically sorted, matching the GENEID.i index scheme assigned in bambu_utilityFunctions.R. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

After decoding numeric indices to gene names, reorder the output matrix to match annotation order so values align with the rownames assigned in combineCountSes. Use drop=FALSE to preserve sparse matrix class for single-column inputs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace rowSums+names with direct rownames/as.vector to avoid collapsing multi-column subsets and to keep rownames attached as gene IDs for the GENEID.i join in bambu.quantify. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

lapply on a named quantData list preserves names, causing do.call(cbind/rbind) to prepend the list element name (sampleName) to each column/row name with a period separator — producing e.g. demultiplexed2.demultiplexed2_TCAGATGTCACCAGGC instead of demultiplexed2_TCAGATGTCACCAGGC. unname() the lapply results before cbind/rbind in generateUniqueCountsSEFromQuantData. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tyFunctions.R Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

lingminhao added 7 commits April 15, 2026 20:10

convert quantData into an object storing list of S4 object

bf32099

Refactor quantData: remove countMatrix and replace with list columns …

abc7411

…[columnIds, columnCounts] in readClassDt

remove redundant stored data in quantData

ce1259e

add readToTranscriptMaps to final se when trackReads is TRUE

9ce39c1

Add gene and unique count matrix helpers from quantData

c7ed6d3

add colnames and colData to readClassList

ce96f7b

name readClassList, quantData, readToTranscriptMaps, and distTables l…

ed08825

…ists by sample names

lingminhao changed the title ~~Tidy read class file quant data~~ Tidy readClassList and quantData Apr 16, 2026

lingminhao requested a review from jonathangoeke April 16, 2026 05:12

lingminhao linked an issue Apr 16, 2026 that may be closed by this pull request

[Refactor]: tidy readClassList & quantData #566

Closed

2 tasks

lingminhao requested review from Copilot and cying111 April 16, 2026 06:16

Copilot started reviewing on behalf of lingminhao April 16, 2026 06:17 View session

Copilot AI reviewed Apr 16, 2026

View reviewed changes

jonathangoeke reviewed Apr 16, 2026

View reviewed changes

Comment thread R/bambu-assignDist.R Outdated

Comment thread R/bambu-assignDist.R Outdated

Comment thread R/bambu-assignDist.R Outdated

Comment thread R/bambu-assignDist.R

Comment thread R/bambu-classes.R

lingminhao mentioned this pull request Apr 16, 2026

V2 GoekeLab/bambu-singlecell-spatial#25

Open