
Commit a0a3621

Add GeneROOT project's contributor Intro Blog (#404)
* spelling correction
* add intro blog
* blog edit
* added generoot vocab to terms.txt
1 parent 958ed36 commit a0a3621

5 files changed

Lines changed: 91 additions & 4 deletions

File tree

.github/actions/spelling/allow/names.txt

Lines changed: 3 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -42,6 +42,7 @@ Ilieva
4242
Isemann
4343
JLange
4444
JRembser
45+
Jeffrey
4546
Jiayang
4647
Jomy
4748
Joshi
@@ -111,6 +112,7 @@ Yehor
111112
Yuka
112113
Yuquan
113114
Zarytskyi
115+
Zhang
114116
zarytskyi
115117
aaronj
116118
aaronjomyjoseph
@@ -171,6 +173,7 @@ isaacmoralessantana
171173
izvekov
172174
jacklqiu
173175
jeaye
176+
jeffrey
174177
jiayang
175178
jiayangli
176179
jomy

.github/actions/spelling/allow/terms.txt

Lines changed: 12 additions & 0 deletions
@@ -179,8 +179,20 @@ Oncoprotein
179179
oncoprotein
180180
organoids
181181
paraview
182+
CALQ
182183
CARTOPIAXROOT
184+
Mtools
183185
pld
186+
QVZ
187+
generoot
188+
petabytes
189+
RAMN
190+
ramntupleflagstat
191+
ramntupleidxstat
192+
ramntuplemerge
193+
ramntuplesort
194+
ramntuplesplit
195+
ramntuplestats
184196
CAFs
185197
downregulating
186198
Fibroblasts

.github/actions/spelling/expect.txt

Lines changed: 0 additions & 4 deletions
@@ -25,8 +25,6 @@ gpt
2525
ibrahim
2626
inp
2727
jank
28-
Jeffrey
29-
jeffrey
3028
Karpathy
3129
Kiril
3230
lang
@@ -63,5 +61,3 @@ upstreaming
6361
usecases
6462
USINGSTDCPP
6563
vedant
66-
Zhang
67-
zhang
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
1+
---
2+
title: "Enhance and Develop GeneROOT Infrastructure"
3+
layout: post
4+
excerpt: "Continuing the GeneROOT project: expanding the benchmark suite, optimizing indexing, evaluating compression algorithms, and bringing more SAMtools features to RAMtools."
5+
sitemap: false
6+
author: Jeffrey Zhang
7+
permalink: blogs/generoot_jeffrey_zhang_blog/
8+
banner_image: /images/blog/generoot_project_banner.png
9+
date: 2026-05-27
10+
tags: c++ genome bioinformatics root rntuple
11+
---
12+
13+
## Introduction
14+
15+
My name is Jeffrey Zhang, and I'm a third-year B.S. undergraduate student studying Physics at Nagoya
16+
University, Japan. I'll be working on extending the GeneROOT infrastructure, building directly on the foundation
17+
laid by Aditya Pandey during his GSoC 2025 work on using ROOT in genome
18+
sequencing.
19+
20+
**Mentors**: Martin Vassilev, Vassil Vassilev, Aaron Jomy
21+
22+
## Overview
23+
24+
Large-scale biological data is demanding to store: a single fully sequenced human genome typically occupies $\sim$500 GB, and analyzing many such datasets for research involves data volumes that can exceed a petabyte. Handling data at this scale requires robust underlying software infrastructure. To meet this challenge, the GeneROOT project draws on CERN's extensive expertise in managing massive physics datasets through its columnar ROOT software framework, and aims to adapt this framework specifically for processing biological data.
25+
26+
During the [2025 GeneROOT GSoC](https://compiler-research.org/blogs/GSoC25_aditya_pandey_final_blog/) project, Aditya Pandey established the RNTuple data model for genome sequences. It currently supports region queries, conversion from SAM to RNTuple, and a benchmark comparison against the industry-standard CRAM format on a single test sample `HG00154` from the 1000 Genomes Project.
27+
28+
However, the results reveal several limitations. The benchmark suite relies on a single low-coverage sample with hard-coded file paths, which is insufficient for a credible comparison with tools such as SAMtools and CRAM. In terms of performance, RNTuple's index lookup itself performs a linear scan that does not scale to production-sized datasets. In terms of functionality, RAMtools cannot currently export records back to SAM, has no merge operation to complement the chromosome splitter, no sort, and no statistics tools. These gaps leave RAMtools as a proof of concept rather than a usable pipeline component.
29+
30+
My project builds on that foundation by expanding the benchmark suite, optimizing indexing, evaluating compression algorithms, and bringing more SAMtools features to RAMtools.
31+
32+
## Technical Implementation
33+
34+
The work breaks into five tasks:
35+
36+
1. **Benchmark on heavy bioinformatics datasets.** Refactor the benchmark
37+
suite, replace hard-coded paths with a `benchmark_config.h` and
38+
CLI-driven dataset selection, run against well-known reference samples
39+
(`HG001` through `HG007`), and capture more metrics such as memory usage in addition to
40+
timing metrics.
41+
42+
2. **Cross-format comparison.** Extend the `system()`-call approach already
43+
used in `chromosome_split_benchmark.cxx` so all benchmark scripts measure
44+
SAM, BAM, and CRAM against RAMtools/RNTuple on the same datasets.
45+
46+
3. **Genomic compression algorithms.** Evaluate modern quality-score
47+
compression schemes (Crumble, QVZ, CALQ, P-block), extend the
48+
`EQualCompressionBits` enum in `RAMNTupleRecord.h`, and add the most
49+
effective candidates as new quality policies.
50+
51+
4. **Indexing and search optimizations.** `GetRowsInRange()` currently does
52+
an O(N) linear scan; I'll replace it with an O(log N) binary search over
53+
a sorted `fIndex` (eliminating the redundant `fIndexMap`/`RebuildMap`
54+
pair), make `kPositionInterval` and `kMappedInterval` configurable
55+
parameters, and implement a no-index columnar query fallback in
56+
`RAMNTupleView.cxx`, similar to the legacy TTree-based `ramview_no_index.cxx`.
57+
58+
5. **Add common SAMtools features to RAMtools.** Add `ramntuplestats`,
59+
`ramntupleidxstat`, and `ramntupleflagstat`; complete `ramntupleview`
60+
with N-record, region-filtering, and selective-column output; and add
61+
`ramntuplesplit`, `ramntuplemerge`, and `ramntuplesort`.
62+
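As a rough illustration of the binary-search idea in task 4, the sketch below looks up a genomic region in a sorted index using `std::lower_bound`. The `IndexEntry` layout, the `EntryLess` ordering, and the `FindIndexRange` helper are hypothetical names chosen for this example; they are not the actual `fIndex` structure or `GetRowsInRange()` signature in RAMtools. The sketch also assumes that only alignment start positions are indexed, so reads that begin before the region and overlap it would need extra handling.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical index entry (an assumption, not the RAMtools layout):
// one alignment start position and the row that stores the record.
struct IndexEntry {
   std::int32_t refId;     // reference sequence (chromosome) id
   std::int32_t position;  // 0-based alignment start
   std::uint64_t row;      // row number of the record
};

// Order entries by (refId, position) so that each chromosome forms
// one contiguous, position-sorted run inside the index.
inline bool EntryLess(const IndexEntry &a, const IndexEntry &b)
{
   if (a.refId != b.refId)
      return a.refId < b.refId;
   return a.position < b.position;
}

// Return the half-open range [first, last) of offsets into `index`
// whose entries lie on `refId` with start <= position < end.
// `index` must already be sorted with EntryLess.
// Each lookup is a binary search, so the cost is O(log N).
std::pair<std::size_t, std::size_t>
FindIndexRange(const std::vector<IndexEntry> &index, std::int32_t refId,
               std::int32_t start, std::int32_t end)
{
   const IndexEntry lo{refId, start, 0};
   const IndexEntry hi{refId, end, 0};
   const auto first = std::lower_bound(index.begin(), index.end(), lo, EntryLess);
   const auto last = std::lower_bound(first, index.end(), hi, EntryLess);
   return {static_cast<std::size_t>(first - index.begin()),
           static_cast<std::size_t>(last - index.begin())};
}
```

The rows to read are then `index[first].row` up to, but not including, `index[last].row`. An empty result is reported as `first == last`, which also covers regions on a chromosome that has no entries.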
63+
## Goals
64+
65+
By the end of the coding period I aim to have:
66+
67+
- A reproducible benchmark suite that runs against multiple
68+
genomic datasets with custom commands and outputs.
69+
- Quantitative cross-format comparisons against SAM, BAM, and CRAM.
70+
- A measurable storage-efficiency improvement on `QUAL` data using modern
71+
compression algorithms.
72+
- Faster region queries that can scale better to production-sized datasets.
73+
- RAMtools support for some of the most commonly used SAMtools functionality, such as `stats` and `view`.
74+
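The `system()`-call approach from task 2 can be sketched as a small timing helper that runs an external command and records its wall-clock duration. `CommandTiming` and `TimeCommand` are hypothetical names for this illustration, not code from `chromosome_split_benchmark.cxx`, and the measurement includes shell and process start-up time.

```cpp
#include <chrono>
#include <cstdlib>
#include <string>

// Result of one timed external command (hypothetical helper type).
struct CommandTiming {
   int status;      // raw return value of std::system
   double seconds;  // wall-clock duration in seconds
};

// Run `command` through the system shell and measure how long it takes.
// steady_clock is used because it is monotonic and unaffected by
// system clock adjustments during the run.
CommandTiming TimeCommand(const std::string &command)
{
   const auto begin = std::chrono::steady_clock::now();
   const int status = std::system(command.c_str());
   const auto end = std::chrono::steady_clock::now();
   return {status, std::chrono::duration<double>(end - begin).count()};
}
```

A cross-format run could then time, for example, `samtools view -c sample.bam chr1` and the equivalent RAMtools query on the same data, where `sample.bam` is a placeholder file name. Memory usage is not captured by this helper and would need a separate mechanism.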
75+
The combined effect should move RAMtools from a working proof of concept
76+
toward a more usable component of a real genomics pipeline.