| 1 | +--- |
| 2 | +title: "Enhance and Develop GeneROOT Infrastructure" |
| 3 | +layout: post |
| 4 | +excerpt: "Continuing the GeneROOT project: expanding the benchmark suite, optimizing indexing, evaluating compression algorithms, and bringing more SAMtools features to RAMtools." |
| 5 | +sitemap: false |
| 6 | +author: Jeffrey Zhang |
| 7 | +permalink: blogs/generoot_jeffrey_zhang_blog/ |
| 8 | +banner_image: /images/blog/generoot_project_banner.png |
| 9 | +date: 2026-05-27 |
| 10 | +tags: c++ genome bioinformatics root rntuple |
| 11 | +--- |
| 12 | + |
| 13 | +## Introduction |
| 14 | + |
| 15 | +My name is Jeffrey Zhang, and I'm a third-year undergraduate (B.S.) student in Physics at Nagoya |
| 16 | +University, Japan. I'll be working on extending the GeneROOT infrastructure, building directly on the foundation |
| 17 | +laid by Aditya Pandey during his GSoC 2025 work on using ROOT in genome |
| 18 | +sequencing. |
| 19 | + |
| 20 | +**Mentors**: Martin Vassilev, Vassil Vassilev, Aaron Jomy |
| 21 | + |
| 22 | +## Overview |
| 23 | + |
| 24 | +Biological datasets are large: a single fully sequenced human genome typically occupies $\sim$500 GB, and research that analyzes many such genomes involves data volumes that reach the petabyte scale. Handling data at this scale requires robust software infrastructure. To meet this challenge, the GeneROOT project draws on CERN's extensive expertise in managing massive physics datasets with its columnar ROOT software framework, and aims to adapt that framework for processing biological data. |
| 25 | + |
| 26 | +During the [2025 GeneROOT GSoC](https://compiler-research.org/blogs/GSoC25_aditya_pandey_final_blog/) project, Aditya Pandey established the RNTuple data model for genome sequences. It currently supports region queries, conversion from SAM to RNTuple, and a benchmark comparison against the industry-standard CRAM format on a single test sample `HG00154` from the 1000 Genomes Project. |
| 27 | + |
| 28 | +However, the results reveal several limitations. The benchmark suite relies on a single low-coverage sample with hard-coded file paths, which is insufficient for a credible comparison with established tools and formats such as SAMtools and CRAM. In terms of performance, RNTuple's index lookup performs a linear scan that does not scale to production-sized datasets. In terms of functionality, RAMtools cannot currently export records back to SAM, and it has no merge operation to complement the chromosome splitter, no sort, and no statistics tools. These gaps leave RAMtools as a proof of concept rather than a usable pipeline component. |
| 29 | + |
| 30 | +My project builds on that foundation by expanding the benchmark suite, optimizing indexing, evaluating compression algorithms, and bringing more SAMtools features to RAMtools. |
| 31 | + |
| 32 | +## Technical Implementation |
| 33 | + |
| 34 | +The work breaks into five tasks: |
| 35 | + |
| 36 | +1. **Benchmark on large bioinformatics datasets.** Refactor the benchmark |
| 37 | + suite, replace hard-coded paths with a `benchmark_config.h` and |
| 38 | + CLI-driven dataset selection, run against well-known reference samples |
| 39 | + (`HG001`–`HG007`), and capture additional metrics such as memory usage |
| 40 | + alongside timing. |
| 41 | + |
| 42 | +2. **Cross-format comparison.** Extend the `system()`-call approach already |
| 43 | + used in `chromosome_split_benchmark.cxx` so all benchmark scripts measure |
| 44 | + SAM, BAM, and CRAM against RAMtools/RNTuple on the same datasets. |
| 45 | + |
| 46 | +3. **Genomic compression algorithms.** Evaluate modern quality-score |
| 47 | + compression schemes (Crumble, QVZ, CALQ, P-block), extend the |
| 48 | + `EQualCompressionBits` enum in `RAMNTupleRecord.h`, and add the most |
| 49 | + effective candidates as new quality policies. |
| 50 | + |
| 51 | +4. **Indexing and search optimizations.** `GetRowsInRange()` currently does |
| 52 | + an O(N) linear scan; I'll replace it with an O(log N) binary search over |
| 53 | + a sorted `fIndex` (eliminating the redundant `fIndexMap`/`RebuildMap` |
| 54 | + pair), make `kPositionInterval` and `kMappedInterval` configurable |
| 55 | + parameters, and implement a no-index columnar query fallback in |
| 56 | + `RAMNTupleView.cxx`, similar to the legacy TTree-based `ramview_no_index.cxx`. |
| 57 | + |
| 58 | +5. **Add common SAMtools features to RAMtools.** Add `ramntuplestats`, |
| 59 | + `ramntupleidxstat`, and `ramntupleflagstat`; complete `ramntupleview` |
| 60 | + with N-record, region-filtering, and selective-column output; and add |
| 61 | + `ramntuplesplit`, `ramntuplemerge`, and `ramntuplesort`. |
| 62 | + |
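To make task 4 a bit more concrete, here is a minimal sketch of how a region lookup over a sorted sparse index could use binary search instead of a linear scan. It is an illustration under assumptions, not the actual RAMtools code: the `IndexEntry` layout, the `RowsInRangeSketch` name, and the `totalRows` parameter are hypothetical, and the real `fIndex` may store its checkpoints differently.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical checkpoint: the first row at or after (refId, pos).
// The real fIndex layout in RAMtools may differ.
struct IndexEntry {
   std::int32_t refId;
   std::int64_t pos;
   std::uint64_t row;
};

// Returns a half-open row range [first, last) guaranteed to contain every
// record on `refId` whose start position lies in [start, end].
// Assumes `index` is sorted by (refId, pos) and rows are sorted the same way.
std::pair<std::uint64_t, std::uint64_t>
RowsInRangeSketch(const std::vector<IndexEntry> &index, std::int32_t refId,
                  std::int64_t start, std::int64_t end, std::uint64_t totalRows)
{
   auto less = [](const IndexEntry &a, const IndexEntry &b) {
      return std::tie(a.refId, a.pos) < std::tie(b.refId, b.pos);
   };

   // First checkpoint strictly after (refId, start); the entry before it is
   // the last checkpoint at or before the query start.
   auto lo = std::upper_bound(index.begin(), index.end(),
                              IndexEntry{refId, start, 0}, less);
   std::uint64_t first = (lo == index.begin()) ? 0 : std::prev(lo)->row;

   // First checkpoint strictly after (refId, end) bounds the scan from above.
   auto hi = std::upper_bound(lo, index.end(), IndexEntry{refId, end, 0}, less);
   std::uint64_t last = (hi == index.end()) ? totalRows : hi->row;

   return {first, last};
}
```

Both lookups are O(log N) in the number of checkpoints. The returned range is deliberately conservative: rows inside it still need an exact position check, and reads that begin before the first checkpoint but overlap the query start are not covered by this sketch.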
| 63 | +## Goals |
| 64 | + |
| 65 | +By the end of the coding period I aim to have: |
| 66 | + |
| 67 | +- A reproducible benchmark suite that runs against multiple |
| 68 | + genomic datasets with custom commands and outputs. |
| 69 | +- Quantitative cross-format comparisons against SAM, BAM, and CRAM. |
| 70 | +- A measurable storage-efficiency improvement on `QUAL` data using modern |
| 71 | + compression algorithms. |
| 72 | +- Faster region queries that scale to production-sized datasets. |
| 73 | +- RAMtools at parity with commonly used SAMtools functionality, such as `stats` and `view`. |
| 74 | + |
| 75 | +The combined effect should move RAMtools from a working proof of concept |
| 76 | +toward a more usable component of a real genomics pipeline. |