Skip to content

Commit 2047190

Browse files
committed
add comments to code
1 parent fdf8db5 commit 2047190

25 files changed

Lines changed: 1012 additions & 107 deletions

CLAUDE.md

Lines changed: 206 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,206 @@
1+
# Bambu — Codebase Guide for AI Agents
2+
3+
## What bambu does
4+
5+
Bambu is an R/Bioconductor package for **multi-sample transcript discovery and quantification** from long-read RNA-Seq data (Oxford Nanopore, PacBio). It takes aligned BAM files plus a reference genome and annotation (GTF/TxDb), and outputs a `SummarizedExperiment` with expression estimates for both known and novel transcripts and genes.
6+
7+
The core challenge bambu solves: long reads are noisy (sequencing errors, alignment artefacts), so naive read-to-transcript assignment produces many false novel isoforms. Bambu addresses this through junction error correction, an XGBoost-based transcript scoring model, NDR (Novel Discovery Rate) control, and an EM algorithm for multi-mapping read assignment.
8+
9+
---
10+
11+
## Pipeline overview
12+
13+
```
14+
BAM files
15+
16+
17+
[1] prepareAnnotations — convert GTF/TxDb into internal GRangesList format
18+
19+
20+
[2] bambu.processReads — per-sample: extract reads, correct junctions,
21+
│ build read classes, score with XGBoost
22+
23+
[3] bambu.extendAnnotations — cross-sample: combine read class candidates,
24+
│ filter, assign NDR score, extend reference annotations
25+
26+
[4] assignReadClasstoTranscripts — build equivalence classes, map read classes
27+
│ to transcripts (distance-based)
28+
29+
[5] bambu.quantify — EM algorithm to resolve multi-mapping reads,
30+
compute counts / CPM / fullLengthCounts
31+
32+
33+
SummarizedExperiment output
34+
```
35+
36+
The top-level orchestrator is `bambu()` in [R/bambu.R](R/bambu.R). All stages are called from there. Stages 2–5 can be run per-sample in parallel via `BiocParallel`.
37+
38+
---
39+
40+
## Module map
41+
42+
### Entry points (user-facing exported functions)
43+
44+
| File | Function | Purpose |
45+
|------|----------|---------|
46+
| [R/bambu.R](R/bambu.R) | `bambu()` | Main entry point; orchestrates all pipeline stages |
47+
| [R/prepareAnnotations.R](R/prepareAnnotations.R) | `prepareAnnotations()` | Convert GTF/TxDb to `GRangesList` annotation object; must be called before `bambu()` |
48+
| [R/readWrite.R](R/readWrite.R) | `writeBambuOutput()`, `importBambuResults()`, `writeToGTF()`, `readFromGTF()` | Save/load bambu results to disk (GTF + count matrices) |
49+
| [R/plotBambu.R](R/plotBambu.R) | `plotBambu()` | Visualise expression and isoform structure |
50+
| [R/compareTranscripts.R](R/compareTranscripts.R) | `compareTranscripts()` | Compare alternatively-spliced transcripts between query and subject |
51+
| [R/transcriptToGeneExpression.R](R/transcriptToGeneExpression.R) | `transcriptToGeneExpression()` | Aggregate transcript-level SE to gene-level counts |
52+
53+
### Module 1 — Annotation preparation
54+
55+
| File | Key functions | Role |
56+
|------|--------------|------|
57+
| [R/prepareAnnotations.R](R/prepareAnnotations.R) | `prepareAnnotations()` | Loads GTF or TxDb, extracts exon-by-transcript GRangesList, adds metadata columns (`TXNAME`, `GENEID`, etc.) |
58+
| [R/prepareAnnotations_utilityFunctions.R](R/prepareAnnotations_utilityFunctions.R) | internal helpers | Intron extraction, annotation metadata formatting |
59+
60+
### Module 2 — Read processing (per sample)
61+
62+
| File | Key functions | Role |
63+
|------|--------------|------|
64+
| [R/bambu-processReads.R](R/bambu-processReads.R) | `bambu.processReads()`, `bambu.processReadsByFile()`, `bambu.readsByFile()`, `constructReadClasses()` | Top-level per-sample dispatcher; iterates over BAM files; calls junction table construction, error correction, read class construction |
65+
| [R/bambu-processReads_utilityCreateJunctionTables.R](R/bambu-processReads_utilityCreateJunctionTables.R) | `isore.constructJunctionTables()`, `createJunctionTable()`, `junctionStrandCorrection()` | Extract splice junctions from alignments; compute strand scores from splice motifs |
66+
| [R/bambu-processReads_utilityJunctionErrorCorrection.R](R/bambu-processReads_utilityJunctionErrorCorrection.R) | `junctionErrorCorrection()`, `fitXGBoostModel()`, `findHighConfidenceJunctions()` | Correct noisy splice junctions using an XGBoost model trained on annotated junctions; core noise-reduction step |
67+
| [R/bambu-processReads_utilityConstructReadClasses.R](R/bambu-processReads_utilityConstructReadClasses.R) | `isore.constructReadClasses()`, `constructSplicedReadClasses()`, `constructUnsplicedReadClasses()`, `assignGeneIds()` | Group corrected reads into read classes (unique exon-chain signatures); assign gene IDs by overlap with annotation |
68+
| [R/bambu-processReads_scoreReadClasses.R](R/bambu-processReads_scoreReadClasses.R) | `trainBambu()` + scoring functions | XGBoost-based transcript scoring; `trainBambu()` is exported for training a custom model on a different species/dataset |
69+
| [R/prepareDataFromBam.R](R/prepareDataFromBam.R) | `prepareDataFromBam()` | Low-level BAM → GRangesList reader; handles CIGAR parsing, clipping, chunked reading via `yieldSize` |
70+
71+
### Module 3 — Annotation extension (cross-sample)
72+
73+
| File | Key functions | Role |
74+
|------|--------------|------|
75+
| [R/bambu-extendAnnotations.R](R/bambu-extendAnnotations.R) | `bambu.extendAnnotations()` | Orchestrates cross-sample combination then extension; calls combine then extend |
76+
| [R/bambu-extendAnnotations-utilityCombine.R](R/bambu-extendAnnotations-utilityCombine.R) | `isore.combineTranscriptCandidates()`, `combineSplicedTranscriptModels()`, `extractFeaturesFromReadClassSE()` | Merge read class candidates across samples into a unified set; compute per-position start/end weighted medians |
77+
| [R/bambu-extendAnnotations-utilityExtend.R](R/bambu-extendAnnotations-utilityExtend.R) | `isore.extendAnnotations()`, `recommendNDR()`, `setNDR()`, `filterTranscriptsByAnnotation()`, `calculateDistToAnnotation()` | Filter candidates by NDR threshold; merge with reference; assign novel transcript IDs; `setNDR()` is exported to re-threshold post-hoc |
78+
79+
### Module 4 — Read class to transcript assignment
80+
81+
| File | Key functions | Role |
82+
|------|--------------|------|
83+
| [R/bambu_utilityFunctions.R](R/bambu_utilityFunctions.R) | `assignReadClasstoTranscripts()` (via `calculateDistTable()`), `combineCountSes()`, `generateColData()`, `checkInputs()`, `setBiocParallelParameters()` | Compute distance table mapping read classes to transcripts; build equivalence classes; validate inputs; manage parallelism |
84+
| [R/bambu-assignDist.R](R/bambu-assignDist.R) | `assignReadClasstoTranscripts()`, `generateUniqueCounts()`, `generateNonUniqueCounts()`, `generateIncompatibleCounts()` | Build count matrices (unique, non-unique, incompatible) from read class equivalence classes |
85+
| [R/bambu-quantify_utilityFunctions.R](R/bambu-quantify_utilityFunctions.R) | `genEquiRCs()`, `modifyIncompatibleAssignment()`, `processIncompatibleCounts()` | Generate equivalence read classes; handle reads incompatible with all transcripts in a gene |
86+
87+
### Module 5 — Quantification
88+
89+
| File | Key functions | Role |
90+
|------|--------------|------|
91+
| [R/bambu-quantify.R](R/bambu-quantify.R) | `bambu.quantify()`, `bambu.quantDT()` | Run EM per sample; aggregate counts; compute CPM; assemble sparse output vectors |
92+
| [src/em.cpp](src/em.cpp) | `em_theta()` (C++/Rcpp) | Expectation-Maximisation algorithm implemented in C++ with RcppArmadillo; resolves multi-mapping reads to transcripts using a probability matrix |
93+
94+
### Module 6 — Visualization, comparison & output
95+
96+
| File | Key functions | Role |
97+
|------|--------------|------|
98+
| [R/readWrite.R](R/readWrite.R) | `writeBambuOutput()`, `importBambuResults()`, `writeToGTF()`, `readFromGTF()` | Serialise/load bambu results (GTF + count matrices) |
99+
| [R/plotBambu.R](R/plotBambu.R) | `plotBambu()` | Visualise expression and isoform structure |
100+
| [R/plotBambu_utilityFunctions.R](R/plotBambu_utilityFunctions.R) | internal helpers | Plot construction helpers for `plotBambu()` |
101+
| [R/compareTranscripts.R](R/compareTranscripts.R) | `compareTranscripts()` | Compare alternatively-spliced transcripts between query and subject |
102+
| [R/compareTranscripts_utilityFunctions.R](R/compareTranscripts_utilityFunctions.R) | internal helpers | GRanges manipulation utilities for transcript comparison |
103+
| [R/transcriptToGeneExpression.R](R/transcriptToGeneExpression.R) | `transcriptToGeneExpression()` | Aggregate transcript-level SE to gene-level counts |
104+
| [R/transcriptToGeneExpression_utilityFunctions.R](R/transcriptToGeneExpression_utilityFunctions.R) | internal helpers | Helper functions for gene-level aggregation |
105+
106+
### Shared utilities
107+
108+
| File | Purpose |
109+
|------|---------|
110+
| [R/utility_spliceHelper_functions.R](R/utility_spliceHelper_functions.R) | Distance-based splice overlap finding (`findSpliceOverlapsByDist()`); used in annotation extension and assignment |
111+
| [R/globals.R](R/globals.R) | `globalVariables()` declarations to suppress R CMD check notes for data.table/dplyr column names |
112+
| [R/RcppExports.R](R/RcppExports.R) | Auto-generated Rcpp bindings — do not edit manually |
113+
114+
---
115+
116+
## Key data structures
117+
118+
- **`bambuAnnotation` / `GRangesList`** — exons grouped by transcript (`names` = transcript IDs); `mcols` carries `TXNAME`, `GENEID`, `NDR`, `newTx` flags. Created by `prepareAnnotations()`.
119+
- **Read class `SummarizedExperiment`** — one row per read class, one column per sample; `rowRanges` holds the exon-chain GRanges; `assays` include `counts` (read counts per class); `metadata` carries `countMatrix`, `incompatibleCountMatrix`, `readClassDist`, `eqClassById`.
120+
- **`readClassDt` (data.table)** — equivalence-class table with columns `txid`, `eqClassId`, `eqClassById`, `gene_sid`, `nobs`, `multi_align`, `aval`; fed directly to `em_theta()`.
121+
- **`distTable`** — mapping of read class to transcript(s) with distance scores; computed by `calculateDistTable()` and stored in `metadata(readClassList)$readClassDist`.
122+
- **Output `SummarizedExperiment`**`rowRanges` = extended annotations; `assays` = `counts`, `CPM`, `fullLengthCounts`, `uniqueCounts`; `metadata` = `incompatibleCounts`, `nonuniqueCounts`, optionally `readToTranscriptMap` and `distTable`.
123+
124+
---
125+
126+
## Important parameters
127+
128+
- **`NDR`** (Novel Discovery Rate) — analogous to FDR for novel transcripts; default is auto-recommended based on `baselineFDR = 0.1`; can be re-applied post-hoc with `setNDR()`.
129+
- **`opt.discovery`** — list controlling isoform reconstruction (min read counts, min sample number, distance thresholds); see `setIsoreParameters()` in [R/bambu_utilityFunctions.R](R/bambu_utilityFunctions.R).
130+
- **`opt.em`** — list controlling EM convergence (`maxiter`, `conv`, `sig.digit`); see `setEmParameters()`.
131+
- **`rcFiles`** — read class files (`.rds`) saved mid-run; allow skipping `processReads` on re-runs.
132+
133+
---
134+
135+
## Code review guidelines
136+
137+
Standard review concerns apply (correctness, tests, clarity, etc.). In
138+
addition, reviewers should check the bambu-specific items below. Extend this
139+
list as shared conventions emerge.
140+
141+
### Bambu-specific review checklist
142+
- **Is `CLAUDE.md` still accurate given this change?** If the PR adds,
143+
removes, renames, or moves a file under `R/`, or materially changes what a
144+
module does, flag the `CLAUDE.md` sections that would need updating (module
145+
map tables, pipeline overview, key data structures, important parameters).
146+
Suggest the wording in a review comment; do not edit `CLAUDE.md` as part of
147+
the review.
148+
149+
---
150+
151+
## On-request annotation passes
152+
153+
The two passes below are **opt-in**: only run them when the user explicitly
154+
asks ("tag code issues", "add function headers", etc.). They produce noisy
155+
diffs and are not part of normal development.
156+
157+
### Pass 1 — Tag code issues with typed `TODO:` comments
158+
Surface possible bugs, dead code, and poor naming inline. Do **not** fix the
159+
issues — just flag them. Place each comment on the line **above** the flagged
160+
code. Use one of these typed prefixes so findings are greppable:
161+
162+
- `# TODO: [BUG] ...` — possible bugs (off-by-one, NULL deref, wrong operator
163+
precedence, vacuous `all()` on empty input, etc.)
164+
- `# TODO: [UNUSED CODE] ...` — unreachable code after `return()`, unused
165+
variables, functions defined but never called, large commented-out blocks
166+
- `# TODO: [POOR NAMING] ...` — generic names (`temp`, `tmp`, `x`, `ov`,
167+
`length_tmp`); the same single letter reused for different concepts
168+
- `# TODO: [OTHER] ...` — magic numbers, debug `print()`s, typos, misleading
169+
comments
170+
171+
### Pass 2 — Add function header blocks with module & call counts
172+
Insert a header immediately above every named top-level function definition
173+
(above the `#'` roxygen block if one exists). Format:
174+
175+
```r
176+
# --- functionName ---
177+
# Module: [module name] | [filename.R]
178+
# Called by: caller1.R, caller2.R
179+
# Call count: N calls, M files
180+
```
181+
182+
For functions with no internal callers, use
183+
`# Call count: 0 internal calls (exported or not called internally)` and
184+
`# Called by: (not called anywhere — user-facing entry point)` for exported
185+
ones.
186+
187+
**Call-graph accuracy** — when counting callers, **exclude matches inside
188+
roxygen `#'` blocks, regular comments, and string literals**. A naïve
189+
`grep 'funcName('` over-counts every `@seealso`, `@examples`, and
190+
`message("… funcName(…) …")`. Filter lines whose first non-whitespace
191+
character is `#` before counting, and sanity-check matches preceded by an
192+
unclosed `"` on the same line. Exported entry points (`bambu`, `plotBambu`,
193+
`writeBambuOutput`, `readFromGTF`, `importBambuResults`, `compareTranscripts`,
194+
`prepareAnnotations`, `trainBambu`, `setNDR`, `writeToGTF`,
195+
`transcriptToGeneExpression`) are user-facing — their roxygen examples and
196+
`@seealso` references are not call sites.
197+
198+
---
199+
200+
## Development notes
201+
202+
- **R package conventions**: uses roxygen2 for documentation; internal functions are tagged `@noRd`. Run `devtools::document()` after changing roxygen headers.
203+
- **C++ code**: `src/em.cpp` uses RcppArmadillo; rebuild with `devtools::compileAttributes()` then `devtools::build()`. `R/RcppExports.R` and `src/RcppExports.cpp` are auto-generated.
204+
- **Tests**: located in [tests/testthat/](tests/testthat/); run with `devtools::test()`. Test data (small chr9 region) is in `inst/extdata/`.
205+
- **Bioconductor branch**: `devel_pre_v4` is the active development branch; `devel` is the main integration branch used for PRs.
206+
- **Parallelism**: `BiocParallel` is used for multi-sample processing; `bpParameters` is configured in `setBiocParallelParameters()` based on `ncore` and platform.

R/bambu-assignDist.R

Lines changed: 20 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,7 @@
1+
# --- assignReadClasstoTranscripts ---
2+
# Module: Module 4 — Read class to transcript assignment | bambu-assignDist.R
3+
# Called by: bambu.R
4+
# Call count: 1 call, 1 file
15
#' Create equivilence classes and assign to transcripts
26
#' @inheritParams bambu
37
#' @import data.table
@@ -46,9 +50,14 @@ assignReadClasstoTranscripts <- function(readClassList, annotations, isoreParame
4650

4751
}
4852

53+
# --- generateUniqueCounts ---
54+
# Module: Module 4 — Read class to transcript assignment | bambu-assignDist.R
55+
# Called by: bambu-assignDist.R
56+
# Call count: 1 call, 1 file
4957
#' Generate unique counts
5058
#' @noRd
5159
generateUniqueCounts <- function(readClassDt, countMatrix, annotations){
60+
# TODO: [POOR NAMING] x is used to store filtered unique read classes; rename to uniqueReadClassDt or similar
5261
x <- readClassDt %>% filter(!multi_align & !is.na(eqClass.match))
5362
uniqueCounts <- countMatrix[x$eqClass.match,]
5463
uniqueCounts.tx <- sparse.model.matrix(~ factor(x$txid) - 1)
@@ -58,15 +67,19 @@ generateUniqueCounts <- function(readClassDt, countMatrix, annotations){
5867
rownames(counts) <- names(annotations)
5968
counts[rownames(uniqueCounts),] <- uniqueCounts
6069
return(counts)
61-
62-
# these three lines appear after return, so it's not used, is this used for debug only?
70+
71+
# TODO: [UNUSED CODE] the three lines below are unreachable (after return); remove them
6372
# counts.total = colSums(countMatrix) + colSums(incompatibleCountMatrix)
6473
# counts.total[counts.total==0] = 1
6574
# counts.CPM = counts/counts.total * 10^6
6675

6776
}
6877

6978

79+
# --- generateIncompatibleCounts ---
80+
# Module: Module 4 — Read class to transcript assignment | bambu-assignDist.R
81+
# Called by: bambu-assignDist.R
82+
# Call count: 1 call, 1 file
7083
#' Generate incompatible counts
7184
#' @noRd
7285
generateIncompatibleCounts <- function(incompatibleCountMatrix, annotations){
@@ -79,10 +92,15 @@ generateIncompatibleCounts <- function(incompatibleCountMatrix, annotations){
7992
}
8093

8194

95+
# --- generateNonUniqueCounts ---
96+
# Module: Module 4 — Read class to transcript assignment | bambu-assignDist.R
97+
# Called by: bambu-assignDist.R
98+
# Call count: 1 call, 1 file
8299
#' Generate non-unique counts
83100
#' @noRd
84101
generateNonUniqueCounts <- function(readClassDt, countMatrix, annotations){
85102
#fuse multi align RCs by gene
103+
# TODO: [POOR NAMING] x reused with a different meaning than in generateUniqueCounts (multi-aligned reads); rename to multiAlignReadClassDt
86104
x <- readClassDt %>% filter(multi_align & !is.na(eqClass.match))
87105
x <- x %>% distinct(eqClassId, .keep_all = TRUE)
88106
nonuniqueCounts <- countMatrix[x$eqClass.match,, drop = FALSE]

0 commit comments

Comments
 (0)