Skip to content

Tidy readClassList and quantData#565

Open
lingminhao wants to merge 25 commits intodevel_pre_v4from
tidy_readClassFile_quantData
Open

Tidy readClassList and quantData#565
lingminhao wants to merge 25 commits intodevel_pre_v4from
tidy_readClassFile_quantData

Conversation

@lingminhao
Copy link
Copy Markdown
Collaborator

@lingminhao lingminhao commented Apr 16, 2026

Related Issue

Addresses #566

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (please specify if the change breaks existing functionality)
    • Non breaking change (the feature doesn't change existing functionality)
    • Breaking change (the feature that would cause existing functionality to not work as expected)
  • Documentation update
    - to be updated
  • Performance optimization

Description

This PR aims to tidy the data structure for readClassList & quantData. In particular, we minimises the information required to store in quantData so that it is sufficient for generating unique count matrix, gene count matrix and can be used to perform EM quantification. We also aim to document all the datas stored in readClassList & quantData for clarify (i.e. what every column in the dataframe does)

Implementation Details

  • convert quantData into an object storing list of S4 object: I created a S4 class with 5 slots to store all necessary components for generating gene count matrix, unique count matrix and EM quantification.
  • refactor quantData:
    • remove countMatrix and integrate them into readClassDt: I removed countMatrix which is the observed equiRC to cell matrix and directly integrate into the readClassDt as two list columns [columnsIDs, columnCounts].
    • remove redundant stored data in quantData: I removed data frames like sampleName, nonuniqueCounts, incompatibleCountMatrix & countMatrix that is originally stored in quantData. Now quantData only contains readClassDt, incompatibleCounts, sampleData (added because it's originally stored in colData in se). Optional dataframe includes readToTranscriptMap (when trackReads is set to TRUE) & distTable (when returnDistTable is set to TRUE)
  • add gene and unique count matrix helper functions from quantData: generateUniqueCountFromQuantData(quantData, extendedAnno) and generateGeneCountFromQuantData(quantData, extendedAnno) was created to get the gene and unique count matrix for all samples & cells from the quantData

Impact of Changes

  • results are identical to the previous version.
  • the data structure for quantData is changed from list of se's to a list of S4 object (with 5 slots storing readClassDt, incompatibleCounts, sampleData, optionally readToTranscriptMap & distTable)
  • incompatibleCounts are updated to be a gene-by-cell matrix instead of gene-by-sample matrix
  • final se object now contains both readToTranscriptMaps (list of all samples) and distTables when the arguments are turned on.
  • readClassList, quantData, readToTranscriptMaps and distTables (from the final se) now becomes a named list instead of indexed numerically

test report on single-cell data:
se_identical.txt
transcript_comparison_report.pdf
bambu_benchmark.pdf

test report on bulk data:
se_identical.txt
transcript_comparison_report.pdf
bambu_benchmark.pdf

Checklist

  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation (vignettes, man pages).
  • I have tested the code on a full dataset, and any differences have been described in the Impact of Changes section.

@lingminhao lingminhao changed the title Tidy read class file quant data Tidy readClassList and quantData Apr 16, 2026
@lingminhao lingminhao linked an issue Apr 16, 2026 that may be closed by this pull request
2 tasks
@lingminhao lingminhao requested review from Copilot and cying111 April 16, 2026 06:16
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors how intermediate quantification inputs are represented and passed through the bambu pipeline by introducing a new quantData S4 class and by embedding per-column observed counts into readClassDt (via list-columns) instead of storing full count matrices in quantData.

Changes:

  • Introduces an exported S4 quantData class (with a $ accessor) and changes assignReadClasstoTranscripts() to return this object.
  • Refactors quantification to compute per-sample/cluster nobs from readClassDt’s columnIds/columnCounts list-columns rather than from stored count matrices.
  • Renames/propagates per-column identifiers in read-class construction (sampleIDcolumnID) and makes returned lists named rather than numerically indexed.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
R/bambu.R Adapts quantification loop to new quantData structure; adds naming and attaches optional read/dist metadata to output SE.
R/bambu-quantify.R Updates bambu.quantify() signature and computes nobs from embedded columnIds/columnCounts.
R/bambu-quantify_utilityFunctions.R Adjusts equivalence-class utility calculations (e.g., removes nobs derivation in one helper).
R/bambu-processReads.R Renames sample identifiers, adds column-level count metadata, and updates incompatible-count handling.
R/bambu-processReads_utilityConstructReadClasses.R Propagates columnID/columnIds into read-class construction and stored metadata columns.
R/bambu-classes.R Adds exported S4 quantData class and $ method.
R/bambu-assignDist.R Refactors quantData generation to create the new S4 object and adds quantData-based count helper functions.
NAMESPACE Exports the new quantData class and $ method.
Comments suppressed due to low confidence (1)

R/bambu-processReads.R:414

  • The sparseMatrix( call that constructs counts is missing its closing ) before the incompatible-counts section begins, which makes this function a syntax error and prevents the package from loading.
    counts <- sparseMatrix(
        i = rep(seq_along(counts.table), lengths(counts.table)),
        j = as.numeric(names(unlist(counts.table))),
        x = unlist(counts.table),
        dims = c(nrow(eqClasses), nrow(metadata(readClassFile)$sampleData)))

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread R/bambu-processReads.R
Comment thread R/bambu-assignDist.R
Comment thread R/bambu-classes.R Outdated
Comment thread R/bambu.R Outdated
Comment thread R/bambu-processReads_utilityConstructReadClasses.R
Comment thread R/bambu-processReads_utilityConstructReadClasses.R
Comment thread R/bambu-assignDist.R Outdated
Comment thread R/bambu-assignDist.R Outdated
@jonathangoeke
Copy link
Copy Markdown
Member

Code review

Found 2 issues:

  1. quantData_i$sampleNames accesses a non-existent slot on the new quantData S4 class, which will throw a runtime error when clusters is provided in non-CompressedCharacterList format. The quantData class defines only 5 slots (sampleData, readClassDt, incompatibleCounts, distTable, readToTranscriptMap) and the $ method calls slot(x, name), so accessing sampleNames will fail. Likely should be quantData_i$sampleData$sampleName.

bambu/R/bambu.R

Lines 280 to 288 in ed08825

clusterMaps <- NULL
for(j in seq_along(quantData_i$sampleNames)){ #load in a file per sample name provided
clusterMap <- fread(clusters[[j]], header = FALSE,
data.table = FALSE)
# read.table(clusters[[j]],
# sep = ifelse(grepl(".tsv$",clusters[[j]]), "\t", ","),
# header = FALSE)
clusterMap[,1] <- paste0(quantData_i$sampleNames[j],
"_",clusterMap[,1])

Class definition for reference:

bambu/R/bambu-classes.R

Lines 6 to 15 in ed08825

#' @export
setClass("quantData",
slots = list(
sampleData = "data.frame",
readClassDt = "data.table",
incompatibleCounts = "ANY",
distTable = "ANY",
readToTranscriptMap = "ANY"
)
)

  1. generateUniqueCountsFromQuantData, generateNonUniqueCountsFromQuantData, and generateGeneCountsFromQuantData are defined but never called from anywhere in the codebase. They appear to be replacements for the removed generateUniqueCounts / generateNonUniqueCounts functions, but are not wired into the pipeline.

bambu/R/bambu-assignDist.R

Lines 62 to 157 in ed08825

#' @noRd
generateUniqueCountsFromQuantData <- function(quantData, annotations){
uniqueCountsList <- lapply(quantData, function(x) {
readClassDt <- x$readClassDt
x_filtered <- readClassDt %>% filter(!multi_align & !is.na(eqClass.match))
uniqueCounts <- if(nrow(x_filtered) == 0) {
sparseMatrix(i = 1, j = 1, x = 0, dims = c(length(annotations), nrow(x$sampleData)))
} else {
txids <- mcols(annotations)$txid
i <- rep(match(x_filtered$txid, txids), lengths(x_filtered$columnIds))
j <- unlist(x_filtered$columnIds)
x_vals <- unlist(x_filtered$columnCounts)
sparseMatrix(i = i, j = j, x = x_vals, dims = c(length(annotations), nrow(x$sampleData)))
}
rownames(uniqueCounts) <- names(annotations)
colnames(uniqueCounts) <- rownames(x$sampleData)
return(uniqueCounts)
})
return(do.call(cbind, uniqueCountsList))
}
#' Generate non-unique counts
#' @noRd
generateNonUniqueCountsFromQuantData <- function(quantData, annotations){
nonuniqueCountsList <- lapply(quantData, function(x_obj) {
readClassDt <- x_obj$readClassDt
#fuse multi align RCs by gene
x <- readClassDt %>% filter(multi_align & !is.na(eqClass.match))
x <- x %>% distinct(eqClassId, .keep_all = TRUE)
if(nrow(x) == 0) {
genes <- levels(factor(unique(mcols(annotations)$GENEID)))
geneMat <- sparseMatrix(i = 1, j = 1, x = 0, dims = c(length(genes), nrow(x_obj$sampleData)), dimnames = list(genes, NULL))
colnames(geneMat) <- rownames(x_obj$sampleData)
return(geneMat)
}
# x has columnIds and columnCounts
i <- rep(seq_along(x$gene_sid), lengths(x$columnIds))
j <- unlist(x$columnIds)
x_vals <- unlist(x$columnCounts)
nonuniqueCounts <- sparseMatrix(i = i, j = j, x = x_vals, dims = c(nrow(x), nrow(x_obj$sampleData)))
if(nrow(x)>1 & length(unique(x$gene_sid))>1){
nonuniqueCounts.gene <- sparse.model.matrix(~ factor(x$gene_sid) - 1)
nonuniqueCounts <- t(nonuniqueCounts.gene) %*% nonuniqueCounts
} else{
warning("The factor variable 'gene_sid' has only one level. Adjusting output.")
nonuniqueCounts.gene <- Matrix(1, nrow = nrow(x), ncol = 1, sparse = TRUE)
nonuniqueCounts <- t(nonuniqueCounts.gene) %*% nonuniqueCounts
}
#covert ids into gene ids
geneids <- as.numeric(levels(factor(x$gene_sid)))
geneids <- x$txid[match(geneids, x$gene_sid)]
geneids <- mcols(annotations)$GENEID[as.numeric(geneids)]
rownames(nonuniqueCounts) <- geneids
#create matrix for all annotated genes
genes <- levels(factor(unique(mcols(annotations)$GENEID)))
geneMat <- sparseMatrix(length(genes), nrow(x_obj$sampleData), x = 0)
rownames(geneMat) <- genes
if(!is.null(rownames(nonuniqueCounts))){
geneMat[match(rownames(nonuniqueCounts), rownames(geneMat)), ] <- nonuniqueCounts
}
colnames(geneMat) <- rownames(x_obj$sampleData)
return(geneMat)
})
return(do.call(cbind, nonuniqueCountsList))
}
#' Generate gene counts directly from quantData
#' @noRd
generateGeneCountsFromQuantData <- function(quantData, annotations){
# 1. Get unique transcript counts and aggregate them to gene level
uniqueCounts <- generateUniqueCountsFromQuantData(quantData, annotations)
geneIDs <- factor(mcols(annotations)$GENEID, levels = unique(mcols(annotations)$GENEID))
geneCounts <- Matrix::fac2sparse(geneIDs) %*% uniqueCounts
# 2. Get non-unique counts and incompatible counts
nonuniqueCounts <- generateNonUniqueCountsFromQuantData(quantData, annotations)
incompatibleCounts <- do.call(cbind, lapply(quantData, function(x) x$incompatibleCounts))
# 3. Align the rows/dimensions and sum them up
if(!is.null(incompatibleCounts)){
incompatibleCounts <- Matrix(incompatibleCounts[match(rownames(geneCounts), rownames(incompatibleCounts)), ], sparse = TRUE)
geneCounts <- geneCounts + incompatibleCounts
}
if(!is.null(nonuniqueCounts)){
nonuniqueCounts <- Matrix(nonuniqueCounts[match(rownames(geneCounts), rownames(nonuniqueCounts)), ], sparse = TRUE)
geneCounts <- geneCounts + nonuniqueCounts
}
return(geneCounts)
}

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@jonathangoeke
Copy link
Copy Markdown
Member

Code review (additional lower-confidence issues)

The following issues did not pass the default confidence threshold (80/100) and may be false positives, but are noted here for completeness.

  1. Incomplete sampleID to columnID rename (confidence: 75/100) -- In the processByBam = FALSE code path, mcols(readGrgList)$sampleID is assigned on lines 92 and 94, while the processByBam = TRUE path and all downstream functions (isore.constructReadClasses, constructUnsplicedReadClasses, etc.) expect columnID. This will cause column-not-found errors when processByBam = FALSE is used.

if(!isFALSE(demultiplexed)){
mcols(readGrgList)$sampleID <- as.numeric(mcols(readGrgList)$CB)
} else {
mcols(readGrgList)$sampleID <- i
}

Downstream code expecting columnID:

mcols(reads.singleExon)$columnID <- mcols(readGrgList[
elementNROWS(readGrgList) == 1])$columnID

  1. In-place mutation of shared readClassDt via := (confidence: 50/100) -- bambu.quantify modifies readClassDt by reference using data.table's := operator. Since readClassDt is stored inside the quantData S4 slot, this mutates the shared object. Under fork-based parallelism (MulticoreParam) this is safe due to copy-on-write, but in sequential mode or if the quantData object is reused after quantification, the nobs column will contain stale values from the last iteration. The prior code used $ assignment which triggered R's copy-on-modify semantics.

bambu/R/bambu-quantify.R

Lines 11 to 22 in ed08825

# Use data.table syntax for in-place creation of nobs column for the scope of this function
readClassDt[, nobs := {
ids <- columnIds[[1]]
if (is.null(ids) || length(ids) == 0 || is.na(ids[1])) {
0L
} else if (length(columnIdx) == 1) {
match_idx <- match(columnIdx, ids)
if (is.na(match_idx)) 0L else columnCounts[[1]][match_idx]
} else {
sum(columnCounts[[1]][ids %in% columnIdx])
}
}, by = eqClassId]

  1. sampleData naming ambiguity (confidence: 25/100) -- The new quantData S4 slot sampleData (a data.frame) shares its name with the bambu() function parameter sampleData (a vector of file paths). This was flagged as confusing in a review comment on PR Add sampleData argument to store multi sample's metadata information (bulk, single-cell, spatial) in se object #542. Not a functional bug, but may cause maintenance confusion.

slots = list(
sampleData = "data.frame",

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Copy link
Copy Markdown
Member

@jonathangoeke jonathangoeke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I suggest to simplify how gene and transcript expression is calculated from quantData to be consistent with other bambu output and organise code accordingly
  • I have not tested the other code

Comment thread R/bambu-assignDist.R Outdated
Comment thread R/bambu-assignDist.R Outdated
Comment thread R/bambu-assignDist.R Outdated
Comment thread R/bambu-assignDist.R
Comment thread R/bambu-classes.R
@cying111
Copy link
Copy Markdown
Collaborator

cying111 commented Apr 16, 2026

Tested on sample data, incompatibleCounts results changed
Error: Expected se_v4 to equal se_pr.
Differences:
Attributes: < Component “metadata”: Component “incompatibleCounts”: Attributes: < Component “i”: Mean relative difference: 0.7083333 > >
Attributes: < Component “metadata”: Component “incompatibleCounts”: Attributes: < Component “x”: Mean relative difference: 1.030303 > >

When running on multiple samples, same issue.

Comment thread R/bambu-assignDist.R
Comment thread R/bambu-assignDist.R Outdated
Comment thread R/bambu-assignDist.R Outdated
Comment thread R/bambu-assignDist.R Outdated
Comment thread R/bambu-assignDist.R Outdated
lingminhao and others added 5 commits April 16, 2026 22:39
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ment

Replace generateUniqueCountsFromQuantData (returning a sparse matrix) with
generateUniqueCountsSEFromQuantData, which returns a transcript-level SE with
assays$uniqueCounts, metadata$incompatibleCounts, and metadata$nonuniqueCounts.
This makes the output directly compatible with transcriptToGeneExpression.

Move both functions into a new summarizeExpression.R (renamed from
transcriptToGeneExpression.R) to reflect the broader scope of the file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ounts + incompatibleCounts + nonuniqueCounts

Previously transcriptToGeneExpression aggregated EM counts by gene then added
incompatibleCounts and nonuniqueCounts, which double-counted nonuniqueCounts
since EM counts already incorporate them at the gene level.

Now use assays$uniqueCounts as the base, aggregate by gene, then add
incompatibleCounts and nonuniqueCounts explicitly. This gives the correct
total: reads uniquely assigned to a transcript + reads multi-aligning within
a gene + reads incompatible with any transcript.

Also fix generateIncompatibleCounts and generateNonUniqueCountMatrix to use
unique(mcols(annotations)$GENEID) (appearance order) instead of
levels(factor(unique(...))), consistent with combineCountSes, preventing
gene ID mislabelling.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace catch-all $ method with typed accessors (getSampleData,
getReadClassDt, getIncompatibleCounts, getNonuniqueCounts, getDistTable,
getReadToTranscriptMap), add prototype, validity checks, and exported
constructor constructQuantData. Type incompatibleCounts and nonuniqueCounts
slots as sparseMatrix. Update all call sites in bambu.R,
bambu-assignDist.R, and summarizeExpression.R. Also fixes pre-existing
sampleNames typo bug in bambu.R. Drop exportClass(quantData) and
exportMethods("$") from NAMESPACE.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nonuniqueCountMatrix is now computed on demand in bambu.R and
generateUniqueCountsSEFromQuantData, and stored only in the
uniqueCountsSe and final SE metadata. Moved generateNonUniqueCountMatrix
to bambu_utilityFunctions.R as it is shared across both call sites.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@cying111
Copy link
Copy Markdown
Collaborator

finished testing on different test datasets, and also different parameters, surprisingly, this also fixs a bug that readTranscriptMap not reported when readTranscriptMap is set to TRUE in devel_pre_v4.

lingminhao and others added 3 commits April 17, 2026 11:50
- revert transcriptToGeneExpression to use assays$counts + incompatibleCounts
  (+ nonuniqueCounts if present in metadata, for uniqueCounts SE path)
- rename summarizeExpression.R to transcriptToGeneExpression.R
- move generateNonUniqueCountMatrix to transcriptToGeneExpression.R
- remove nonuniqueCounts from bambu.quantify and combineCountSes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tag SE objects with metadata$seType to distinguish their content:
"uniqueCounts", "EMCounts", and "geneCounts"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@jonathangoeke jonathangoeke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • test all changes
  • document downstream issues
  • add note on metadata slot for other SE outputs in bambu/create draft issue

Comment thread R/bambu.R Outdated
Comment thread R/bambu.R Outdated
Comment thread R/bambu.R Outdated
Comment thread R/bambu_utilityFunctions.R Outdated
Comment thread R/bambu-processReads.R
lingminhao and others added 2 commits April 17, 2026 15:52
Use levels(factor(unique(...))) to ensure genes vector is alphabetically
sorted, matching the GENEID.i index scheme assigned in bambu_utilityFunctions.R.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After decoding numeric indices to gene names, reorder the output matrix
to match annotation order so values align with the rownames assigned in
combineCountSes. Use drop=FALSE to preserve sparse matrix class for
single-column inputs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lingminhao and others added 8 commits April 17, 2026 16:41
Replace rowSums+names with direct rownames/as.vector to avoid collapsing
multi-column subsets and to keep rownames attached as gene IDs for the
GENEID.i join in bambu.quantify.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lapply on a named quantData list preserves names, causing do.call(cbind/rbind)
to prepend the list element name (sampleName) to each column/row name with a
period separator — producing e.g. demultiplexed2.demultiplexed2_TCAGATGTCACCAGGC
instead of demultiplexed2_TCAGATGTCACCAGGC. unname() the lapply results before
cbind/rbind in generateUniqueCountsSEFromQuantData.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tyFunctions.R

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Refactor]: tidy readClassList & quantData

4 participants