|
| 1 | +# Bambu — Codebase Guide for AI Agents |
| 2 | + |
| 3 | +## What bambu does |
| 4 | + |
| 5 | +Bambu is an R/Bioconductor package for **multi-sample transcript discovery and quantification** from long-read RNA-Seq data (Oxford Nanopore, PacBio). It takes aligned BAM files plus a reference genome and annotation (GTF/TxDb), and outputs a `SummarizedExperiment` with expression estimates for both known and novel transcripts and genes. |
| 6 | + |
| 7 | +The core challenge bambu solves: long reads are noisy (sequencing errors, alignment artefacts), so naive read-to-transcript assignment produces many false novel isoforms. Bambu addresses this through junction error correction, an XGBoost-based transcript scoring model, NDR (Novel Discovery Rate) control, and an EM algorithm for multi-mapping read assignment. |
| 8 | + |
| 9 | +--- |
| 10 | + |
| 11 | +## Pipeline overview |
| 12 | + |
| 13 | +``` |
| 14 | +BAM files |
| 15 | + │ |
| 16 | + ▼ |
| 17 | +[1] prepareAnnotations — convert GTF/TxDb into internal GRangesList format |
| 18 | + │ |
| 19 | + ▼ |
| 20 | +[2] bambu.processReads — per-sample: extract reads, correct junctions, |
| 21 | + │ build read classes, score with XGBoost |
| 22 | + ▼ |
| 23 | +[3] bambu.extendAnnotations — cross-sample: combine read class candidates, |
| 24 | + │ filter, assign NDR score, extend reference annotations |
| 25 | + ▼ |
| 26 | +[4] assignReadClasstoTranscripts — build equivalence classes, map read classes |
| 27 | + │ to transcripts (distance-based) |
| 28 | + ▼ |
| 29 | +[5] bambu.quantify — EM algorithm to resolve multi-mapping reads, |
| 30 | + compute counts / CPM / fullLengthCounts |
| 31 | + │ |
| 32 | + ▼ |
| 33 | +SummarizedExperiment output |
| 34 | +``` |
| 35 | + |
| 36 | +The top-level orchestrator is `bambu()` in [R/bambu.R](R/bambu.R). All stages are called from there. Stages 2–5 can be run per-sample in parallel via `BiocParallel`. |
| 37 | + |
| 38 | +--- |
| 39 | + |
| 40 | +## Module map |
| 41 | + |
| 42 | +### Entry points (user-facing exported functions) |
| 43 | + |
| 44 | +| File | Function | Purpose | |
| 45 | +|------|----------|---------| |
| 46 | +| [R/bambu.R](R/bambu.R) | `bambu()` | Main entry point; orchestrates all pipeline stages | |
| 47 | +| [R/prepareAnnotations.R](R/prepareAnnotations.R) | `prepareAnnotations()` | Convert GTF/TxDb to `GRangesList` annotation object; must be called before `bambu()` | |
| 48 | +| [R/readWrite.R](R/readWrite.R) | `writeBambuOutput()`, `importBambuResults()`, `writeToGTF()`, `readFromGTF()` | Save/load bambu results to disk (GTF + count matrices) | |
| 49 | +| [R/plotBambu.R](R/plotBambu.R) | `plotBambu()` | Visualise expression and isoform structure | |
| 50 | +| [R/compareTranscripts.R](R/compareTranscripts.R) | `compareTranscripts()` | Compare alternatively-spliced transcripts between query and subject | |
| 51 | +| [R/transcriptToGeneExpression.R](R/transcriptToGeneExpression.R) | `transcriptToGeneExpression()` | Aggregate transcript-level SE to gene-level counts | |
| 52 | + |
| 53 | +### Module 1 — Annotation preparation |
| 54 | + |
| 55 | +| File | Key functions | Role | |
| 56 | +|------|--------------|------| |
| 57 | +| [R/prepareAnnotations.R](R/prepareAnnotations.R) | `prepareAnnotations()` | Loads GTF or TxDb, extracts exon-by-transcript GRangesList, adds metadata columns (`TXNAME`, `GENEID`, etc.) | |
| 58 | +| [R/prepareAnnotations_utilityFunctions.R](R/prepareAnnotations_utilityFunctions.R) | internal helpers | Intron extraction, annotation metadata formatting | |
| 59 | + |
| 60 | +### Module 2 — Read processing (per sample) |
| 61 | + |
| 62 | +| File | Key functions | Role | |
| 63 | +|------|--------------|------| |
| 64 | +| [R/bambu-processReads.R](R/bambu-processReads.R) | `bambu.processReads()`, `bambu.processReadsByFile()`, `bambu.readsByFile()`, `constructReadClasses()` | Top-level per-sample dispatcher; iterates over BAM files; calls junction table construction, error correction, read class construction | |
| 65 | +| [R/bambu-processReads_utilityCreateJunctionTables.R](R/bambu-processReads_utilityCreateJunctionTables.R) | `isore.constructJunctionTables()`, `createJunctionTable()`, `junctionStrandCorrection()` | Extract splice junctions from alignments; compute strand scores from splice motifs | |
| 66 | +| [R/bambu-processReads_utilityJunctionErrorCorrection.R](R/bambu-processReads_utilityJunctionErrorCorrection.R) | `junctionErrorCorrection()`, `fitXGBoostModel()`, `findHighConfidenceJunctions()` | Correct noisy splice junctions using an XGBoost model trained on annotated junctions; core noise-reduction step | |
| 67 | +| [R/bambu-processReads_utilityConstructReadClasses.R](R/bambu-processReads_utilityConstructReadClasses.R) | `isore.constructReadClasses()`, `constructSplicedReadClasses()`, `constructUnsplicedReadClasses()`, `assignGeneIds()` | Group corrected reads into read classes (unique exon-chain signatures); assign gene IDs by overlap with annotation | |
| 68 | +| [R/bambu-processReads_scoreReadClasses.R](R/bambu-processReads_scoreReadClasses.R) | `trainBambu()` + scoring functions | XGBoost-based transcript scoring; `trainBambu()` is exported for training a custom model on a different species/dataset | |
| 69 | +| [R/prepareDataFromBam.R](R/prepareDataFromBam.R) | `prepareDataFromBam()` | Low-level BAM → GRangesList reader; handles CIGAR parsing, clipping, chunked reading via `yieldSize` | |
| 70 | + |
| 71 | +### Module 3 — Annotation extension (cross-sample) |
| 72 | + |
| 73 | +| File | Key functions | Role | |
| 74 | +|------|--------------|------| |
| 75 | +| [R/bambu-extendAnnotations.R](R/bambu-extendAnnotations.R) | `bambu.extendAnnotations()` | Orchestrates cross-sample combination then extension; calls combine then extend | |
| 76 | +| [R/bambu-extendAnnotations-utilityCombine.R](R/bambu-extendAnnotations-utilityCombine.R) | `isore.combineTranscriptCandidates()`, `combineSplicedTranscriptModels()`, `extractFeaturesFromReadClassSE()` | Merge read class candidates across samples into a unified set; compute per-position start/end weighted medians | |
| 77 | +| [R/bambu-extendAnnotations-utilityExtend.R](R/bambu-extendAnnotations-utilityExtend.R) | `isore.extendAnnotations()`, `recommendNDR()`, `setNDR()`, `filterTranscriptsByAnnotation()`, `calculateDistToAnnotation()` | Filter candidates by NDR threshold; merge with reference; assign novel transcript IDs; `setNDR()` is exported to re-threshold post-hoc | |
| 78 | + |
| 79 | +### Module 4 — Read class to transcript assignment |
| 80 | + |
| 81 | +| File | Key functions | Role | |
| 82 | +|------|--------------|------| |
| 83 | +| [R/bambu_utilityFunctions.R](R/bambu_utilityFunctions.R) | `assignReadClasstoTranscripts()` (via `calculateDistTable()`), `combineCountSes()`, `generateColData()`, `checkInputs()`, `setBiocParallelParameters()` | Compute distance table mapping read classes to transcripts; build equivalence classes; validate inputs; manage parallelism | |
| 84 | +| [R/bambu-assignDist.R](R/bambu-assignDist.R) | `assignReadClasstoTranscripts()`, `generateUniqueCounts()`, `generateNonUniqueCounts()`, `generateIncompatibleCounts()` | Build count matrices (unique, non-unique, incompatible) from read class equivalence classes | |
| 85 | +| [R/bambu-quantify_utilityFunctions.R](R/bambu-quantify_utilityFunctions.R) | `genEquiRCs()`, `modifyIncompatibleAssignment()`, `processIncompatibleCounts()` | Generate equivalence read classes; handle reads incompatible with all transcripts in a gene | |
| 86 | + |
| 87 | +### Module 5 — Quantification |
| 88 | + |
| 89 | +| File | Key functions | Role | |
| 90 | +|------|--------------|------| |
| 91 | +| [R/bambu-quantify.R](R/bambu-quantify.R) | `bambu.quantify()`, `bambu.quantDT()` | Run EM per sample; aggregate counts; compute CPM; assemble sparse output vectors | |
| 92 | +| [src/em.cpp](src/em.cpp) | `em_theta()` (C++/Rcpp) | Expectation-Maximisation algorithm implemented in C++ with RcppArmadillo; resolves multi-mapping reads to transcripts using a probability matrix | |
| 93 | + |
| 94 | +### Module 6 — Visualization, comparison & output |
| 95 | + |
| 96 | +| File | Key functions | Role | |
| 97 | +|------|--------------|------| |
| 98 | +| [R/readWrite.R](R/readWrite.R) | `writeBambuOutput()`, `importBambuResults()`, `writeToGTF()`, `readFromGTF()` | Serialise/load bambu results (GTF + count matrices) | |
| 99 | +| [R/plotBambu.R](R/plotBambu.R) | `plotBambu()` | Visualise expression and isoform structure | |
| 100 | +| [R/plotBambu_utilityFunctions.R](R/plotBambu_utilityFunctions.R) | internal helpers | Plot construction helpers for `plotBambu()` | |
| 101 | +| [R/compareTranscripts.R](R/compareTranscripts.R) | `compareTranscripts()` | Compare alternatively-spliced transcripts between query and subject | |
| 102 | +| [R/compareTranscripts_utilityFunctions.R](R/compareTranscripts_utilityFunctions.R) | internal helpers | GRanges manipulation utilities for transcript comparison | |
| 103 | +| [R/transcriptToGeneExpression.R](R/transcriptToGeneExpression.R) | `transcriptToGeneExpression()` | Aggregate transcript-level SE to gene-level counts | |
| 104 | +| [R/transcriptToGeneExpression_utilityFunctions.R](R/transcriptToGeneExpression_utilityFunctions.R) | internal helpers | Helper functions for gene-level aggregation | |
| 105 | + |
| 106 | +### Shared utilities |
| 107 | + |
| 108 | +| File | Purpose | |
| 109 | +|------|---------| |
| 110 | +| [R/utility_spliceHelper_functions.R](R/utility_spliceHelper_functions.R) | Distance-based splice overlap finding (`findSpliceOverlapsByDist()`); used in annotation extension and assignment | |
| 111 | +| [R/globals.R](R/globals.R) | `globalVariables()` declarations to suppress R CMD check notes for data.table/dplyr column names | |
| 112 | +| [R/RcppExports.R](R/RcppExports.R) | Auto-generated Rcpp bindings — do not edit manually | |
| 113 | + |
| 114 | +--- |
| 115 | + |
| 116 | +## Key data structures |
| 117 | + |
| 118 | +- **`bambuAnnotation` / `GRangesList`** — exons grouped by transcript (`names` = transcript IDs); `mcols` carries `TXNAME`, `GENEID`, `NDR`, `newTx` flags. Created by `prepareAnnotations()`. |
| 119 | +- **Read class `SummarizedExperiment`** — one row per read class, one column per sample; `rowRanges` holds the exon-chain GRanges; `assays` include `counts` (read counts per class); `metadata` carries `countMatrix`, `incompatibleCountMatrix`, `readClassDist`, `eqClassById`. |
| 120 | +- **`readClassDt` (data.table)** — equivalence-class table with columns `txid`, `eqClassId`, `eqClassById`, `gene_sid`, `nobs`, `multi_align`, `aval`; fed directly to `em_theta()`. |
| 121 | +- **`distTable`** — mapping of read class to transcript(s) with distance scores; computed by `calculateDistTable()` and stored in `metadata(readClassList)$readClassDist`. |
| 122 | +- **Output `SummarizedExperiment`** — `rowRanges` = extended annotations; `assays` = `counts`, `CPM`, `fullLengthCounts`, `uniqueCounts`; `metadata` = `incompatibleCounts`, `nonuniqueCounts`, optionally `readToTranscriptMap` and `distTable`. |
| 123 | + |
| 124 | +--- |
| 125 | + |
| 126 | +## Important parameters |
| 127 | + |
| 128 | +- **`NDR`** (Novel Discovery Rate) — analogous to FDR for novel transcripts; default is auto-recommended based on `baselineFDR = 0.1`; can be re-applied post-hoc with `setNDR()`. |
| 129 | +- **`opt.discovery`** — list controlling isoform reconstruction (min read counts, min sample number, distance thresholds); see `setIsoreParameters()` in [R/bambu_utilityFunctions.R](R/bambu_utilityFunctions.R). |
| 130 | +- **`opt.em`** — list controlling EM convergence (`maxiter`, `conv`, `sig.digit`); see `setEmParameters()`. |
| 131 | +- **`rcFiles`** — read class files (`.rds`) saved mid-run; allow skipping `processReads` on re-runs. |
| 132 | + |
| 133 | +--- |
| 134 | + |
| 135 | +## Code review guidelines |
| 136 | + |
| 137 | +Standard review concerns apply (correctness, tests, clarity, etc.). In |
| 138 | +addition, reviewers should check the bambu-specific items below. Extend this |
| 139 | +list as shared conventions emerge. |
| 140 | + |
| 141 | +### Bambu-specific review checklist |
| 142 | +- **Is `CLAUDE.md` still accurate given this change?** If the PR adds, |
| 143 | + removes, renames, or moves a file under `R/`, or materially changes what a |
| 144 | + module does, flag the `CLAUDE.md` sections that would need updating (module |
| 145 | + map tables, pipeline overview, key data structures, important parameters). |
| 146 | + Suggest the wording in a review comment; do not edit `CLAUDE.md` as part of |
| 147 | + the review. |
| 148 | + |
| 149 | +--- |
| 150 | + |
| 151 | +## On-request annotation passes |
| 152 | + |
| 153 | +The two passes below are **opt-in**: only run them when the user explicitly |
| 154 | +asks ("tag code issues", "add function headers", etc.). They produce noisy |
| 155 | +diffs and are not part of normal development. |
| 156 | + |
| 157 | +### Pass 1 — Tag code issues with typed `TODO:` comments |
| 158 | +Surface possible bugs, dead code, and poor naming inline. Do **not** fix the |
| 159 | +issues — just flag them. Place each comment on the line **above** the flagged |
| 160 | +code. Use one of these typed prefixes so findings are greppable: |
| 161 | + |
| 162 | +- `# TODO: [BUG] ...` — possible bugs (off-by-one, NULL deref, wrong operator |
| 163 | + precedence, vacuous `all()` on empty input, etc.) |
| 164 | +- `# TODO: [UNUSED CODE] ...` — unreachable code after `return()`, unused |
| 165 | + variables, functions defined but never called, large commented-out blocks |
| 166 | +- `# TODO: [POOR NAMING] ...` — generic names (`temp`, `tmp`, `x`, `ov`, |
| 167 | + `length_tmp`); the same single letter reused for different concepts |
| 168 | +- `# TODO: [OTHER] ...` — magic numbers, debug `print()`s, typos, misleading |
| 169 | + comments |
| 170 | + |
| 171 | +### Pass 2 — Add function header blocks with module & call counts |
| 172 | +Insert a header immediately above every named top-level function definition |
| 173 | +(above the `#'` roxygen block if one exists). Format: |
| 174 | + |
| 175 | +```r |
| 176 | +# --- functionName --- |
| 177 | +# Module: [module name] | [filename.R] |
| 178 | +# Called by: caller1.R, caller2.R |
| 179 | +# Call count: N calls, M files |
| 180 | +``` |
| 181 | + |
| 182 | +For functions with no internal callers, use |
| 183 | +`# Call count: 0 internal calls (exported or not called internally)` and |
| 184 | +`# Called by: (not called anywhere — user-facing entry point)` for exported |
| 185 | +ones. |
| 186 | + |
| 187 | +**Call-graph accuracy** — when counting callers, **exclude matches inside |
| 188 | +roxygen `#'` blocks, regular comments, and string literals**. A naïve |
| 189 | +`grep 'funcName('` over-counts every `@seealso`, `@examples`, and |
| 190 | +`message("… funcName(…) …")`. Filter lines whose first non-whitespace |
| 191 | +character is `#` before counting, and sanity-check matches preceded by an |
| 192 | +unclosed `"` on the same line. Exported entry points (`bambu`, `plotBambu`, |
| 193 | +`writeBambuOutput`, `readFromGTF`, `importBambuResults`, `compareTranscripts`, |
| 194 | +`prepareAnnotations`, `trainBambu`, `setNDR`, `writeToGTF`, |
| 195 | +`transcriptToGeneExpression`) are user-facing — their roxygen examples and |
| 196 | +`@seealso` references are not call sites. |
| 197 | + |
| 198 | +--- |
| 199 | + |
| 200 | +## Development notes |
| 201 | + |
| 202 | +- **R package conventions**: uses roxygen2 for documentation; internal functions are tagged `@noRd`. Run `devtools::document()` after changing roxygen headers. |
| 203 | +- **C++ code**: `src/em.cpp` uses RcppArmadillo; rebuild with `devtools::compileAttributes()` then `devtools::build()`. `R/RcppExports.R` and `src/RcppExports.cpp` are auto-generated. |
| 204 | +- **Tests**: located in [tests/testthat/](tests/testthat/); run with `devtools::test()`. Test data (small chr9 region) is in `inst/extdata/`. |
| 205 | +- **Bioconductor branch**: `devel_pre_v4` is the active development branch; `devel` is the main integration branch used for PRs. |
| 206 | +- **Parallelism**: `BiocParallel` is used for multi-sample processing; `bpParameters` is configured in `setBiocParallelParameters()` based on `ncore` and platform. |
0 commit comments