Skip to content

Commit 0444b16

Browse files
committed
feat: rebuild catalog with sha512t24u digests
Rebuild all 14 reference catalog entries using dict + FASTA + assembly report inputs. The FASTA provides sha512t24u digests (88% coverage), the dict provides SAM metadata (AS, UR, SP, AN), and the assembly report provides sequence roles and aliases. - Add hg38_p12_ucsc to build script (was missing) - Fix hg19_ucsc chr3 MD5 (old dict was generated from wrong FASTA) - Build script now takes refs_dir as a required argument
1 parent 66c3747 commit 0444b16

21 files changed

Lines changed: 327531 additions & 101139 deletions

catalogs/README.md

Lines changed: 206 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,206 @@
1+
# Reference Catalog Build Guide
2+
3+
This document describes how to build and maintain the human reference genome catalog
4+
used by ref-solver.
5+
6+
## Overview
7+
8+
Each reference in the catalog is built from three inputs:
9+
10+
1. **Sequence dictionary (`.dict`)** — provides contig names, lengths, MD5 checksums,
11+
and SAM metadata (AS, SP, UR, AN tags)
12+
2. **FASTA file** — provides sha512t24u digests (and MD5 for validation)
13+
3. **NCBI assembly report** — provides sequence roles, aliases, and cross-reference
14+
between naming conventions
15+
16+
The build script (`build_catalog.sh`) runs `ref-solver catalog build` for each
17+
reference, then merges the per-reference JSON files into `human_references.json`.
18+
19+
## Reference Data Location
20+
21+
Reference FASTA and dict files live on scratch storage:
22+
23+
```text
24+
/Volumes/scratch-00001/data/references/
25+
Homo_sapiens_assembly38/ # Broad GRCh38 analysis set
26+
GRCh38_reference_genome/ # 1000 Genomes GRCh38
27+
hs38DH/ # bwakit GRCh38 + ALT + decoy + HLA
28+
hg38/ # UCSC hg38
29+
hg38_p12_ucsc/ # UCSC hg38 with p12 patches
30+
GRCh38.p13/ # NCBI GRCh38.p13
31+
grch38_dragen/ # Illumina DRAGEN GRCh38
32+
grch38_dragen_altmasked/ # Illumina DRAGEN GRCh38 ALT-masked
33+
grch38_gdc/ # NCI GDC GRCh38.d1.vd1
34+
Homo_sapiens_assembly19/ # Broad b37
35+
hs37d5/ # 1000 Genomes GRCh37 + decoy
36+
hg19/ # UCSC hg19
37+
grch37_ncbi/ # NCBI GRCh37.p13
38+
chm13v2/ # T2T-CHM13v2.0
39+
```
40+
41+
## Assembly Reports
42+
43+
Assembly reports are stored in `catalogs/sources/` and map between naming conventions
44+
(UCSC, RefSeq, GenBank, sequence name) for each assembly version.
45+
46+
### Which report for which reference
47+
48+
**Critical**: Each reference must use the assembly report matching its patch level.
49+
Using the wrong patch report causes missing AN tags for patch-specific contigs.
50+
51+
| Report | Patch | References |
52+
|--------|-------|------------|
53+
| `GCF_000001405.40_GRCh38.p14_assembly_report.txt` | GRCh38.p14 | Broad, 1kg, hs38DH, hg38, DRAGEN, DRAGEN-altmasked, GDC |
54+
| `GCF_000001405.39_GRCh38.p13_assembly_report.txt` | GRCh38.p13 | grch38_ncbi (FASTA is p13) |
55+
| `GCF_000001405.38_GRCh38.p12_assembly_report.txt` | GRCh38.p12 | hg38_p12_ucsc |
56+
| `GCF_000001405.25_GRCh37.p13_assembly_report.txt` | GRCh37.p13 | b37, hs37d5, hg19, grch37_ncbi |
57+
| `GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt` | T2T-CHM13v2.0 | chm13v2 |
58+
59+
References without patches (Broad, 1kg, hg38, DRAGEN, etc.) can use the latest p14
60+
report since primary assembly contigs don't change between patches.
61+
62+
### Additional assembly reports for non-primary contigs
63+
64+
| Report | Assembly | Contigs |
65+
|--------|----------|---------|
66+
| `GCA_000786075.2_hs38d1_assembly_report.txt` | hs38d1 decoys | KN707\* (387) and JTFH\* (1998) decoy contigs |
67+
| `GCF_002402265.1_ASM240226v1_assembly_report.txt` | EBV | NC_007605.1 / chrEBV (1 contig) |
68+
69+
These are used when adding AN tags to dict files (see below) but are not directly
70+
consumed by `ref-solver catalog build`.
71+
72+
## Building Dict Files
73+
74+
Each dict file should have complete SAM metadata on every `@SQ` line:
75+
76+
- **SN** — sequence name (from `samtools dict`)
77+
- **LN** — sequence length
78+
- **M5** — MD5 checksum
79+
- **AS** — assembly identifier (GRCh38, GRCh37, or T2T-CHM13v2.0)
80+
- **SP** — species (`Homo sapiens`)
81+
- **UR** — canonical remote download URL for the FASTA
82+
- **AN** — alternate contig names from assembly reports
83+
84+
### Step 1: Create base dict
85+
86+
```bash
87+
samtools dict <reference.fasta> > <reference.dict>
88+
```
89+
90+
This provides SN, LN, M5. The UR tag will be the local FASTA path (overridden later).
91+
92+
### Step 2: Set AS, SP, UR tags
93+
94+
Update all `@SQ` lines with the correct assembly, species, and remote URL. A simple
95+
Python script can iterate over `@SQ` lines and replace/add the AS, SP, and UR tags.
96+
The UR value should match the `--download-url` in `build_catalog.sh`.
97+
98+
| Reference | AS value |
99+
|-----------|----------|
100+
| All GRCh38-based | GRCh38 |
101+
| All GRCh37-based | GRCh37 |
102+
| T2T-CHM13v2.0 | T2T-CHM13v2.0 |
103+
104+
### Step 3: Add AN tags with fgbio
105+
106+
Use `fgbio CollectAlternateContigNames` to add alternate contig names from assembly
107+
reports. Install fgbio via the pixi.toml in the repo root (`pixi run fgbio ...`).
108+
109+
```bash
110+
pixi run fgbio CollectAlternateContigNames \
111+
--existing <dict> \
112+
--assembly-report <assembly_report> \
113+
--output <dict> \
114+
--primary <primary_column> \
115+
--alternates <alt1> <alt2> <alt3> \
116+
--sequence-roles AssembledMolecule UnlocalizedScaffold UnplacedScaffold \
117+
FixPatch NovelPatch AltScaffold
118+
```
119+
120+
The `--primary` flag must match the naming convention used by the dict's SN values:
121+
122+
| Dict naming convention | `--primary` | `--alternates` | Examples |
123+
|----------------------|-------------|----------------|----------|
124+
| UCSC (`chr1`) | `UcscName` | `AssignedMolecule GenBankAccession RefSeqAccession` | hg38, hg19, Broad, DRAGEN, GDC |
125+
| Numeric (`1`) | `AssignedMolecule` | `UcscName GenBankAccession RefSeqAccession` | b37, hs37d5 |
126+
| RefSeq (`NC_000001.11`) | `RefSeqAccession` | `AssignedMolecule GenBankAccession UcscName` | GRCh38.p13, grch37_ncbi, chm13v2 |
127+
128+
### Mixed naming conventions (b37, hs37d5)
129+
130+
The b37 and hs37d5 dicts use `AssignedMolecule` for chromosomes (`1`, `2`, ..., `MT`)
131+
but `GenBankAccession` for scaffolds (`GL000207.1`, etc.). No single `--primary` column
132+
matches all contigs. Use a two-pass approach:
133+
134+
1. Pass 1: `--primary AssignedMolecule` (picks up chromosomes: 1-22, X, Y, MT)
135+
2. Pass 2: `--primary GenBankAccession` (picks up GL* scaffolds)
136+
137+
### Decoy contigs (hs38d1)
138+
139+
Decoy contigs (KN707\*, JTFH\*) come from the hs38d1 assembly (`GCA_000786075.2`), not
140+
the main GRCh38 assembly. The dict SN names use UCSC-style format
141+
(`chrUn_KN707606v1_decoy`) but the assembly report uses GenBank accessions
142+
(`KN707606.1`). There is no direct column match, so fgbio cannot be used directly.
143+
144+
Instead, a Python script maps between the naming conventions:
145+
- Strip `chrUn_` prefix and `_decoy` suffix from the SN
146+
- Convert version format: `KN707606v1` -> `KN707606.1`
147+
- Look up in the hs38d1 assembly report's GenBank column
148+
- Build AN from SequenceName (`decoy00416`) and GenBank accession (`KN707606.1`)
149+
150+
### EBV
151+
152+
EBV (`chrEBV` / `NC_007605`) has its own assembly report
153+
(`GCF_002402265.1_ASM240226v1`). For UCSC-named dicts: `AN:NC_007605.1,AJ507799.2`.
154+
For RefSeq/numeric-named dicts: `AN:AJ507799.2,chrEBV`.
155+
156+
### HLA alleles
157+
158+
HLA contigs (`HLA-A*01:01:01:01`, etc.) come from the IPD-IMGT/HLA database. There
159+
is no NCBI assembly report for these — they cannot get AN tags. This affects Broad,
160+
1kg, hs38DH, and DRAGEN-altmasked (525 contigs each).
161+
162+
### Viral contigs (GDC only)
163+
164+
The GDC reference includes 199 viral sequences (HPV, CMV, HBV, HCV, HIV). These have
165+
no NCBI assembly reports and cannot get AN tags.
166+
167+
## Building the Catalog
168+
169+
### Prerequisites
170+
171+
- Release binary: `cargo build --release`
172+
- Reference FASTA/dict files on scratch storage
173+
- Assembly reports in `catalogs/sources/`
174+
175+
### Run the build
176+
177+
```bash
178+
./catalogs/build_catalog.sh /Volumes/scratch-00001/data/references
179+
```
180+
181+
This runs `ref-solver catalog build` for each reference with:
182+
- `--download-url` — overrides local dict UR with the canonical remote URL
183+
- `--species "Homo sapiens"` — sets the species on all contigs
184+
- `--assembly` — sets the assembly enum and propagates to contig assembly fields
185+
- `--assembly-report-url` — stored as metadata on the reference
186+
187+
The script merges all per-reference JSON files into `catalogs/human_references.json`,
188+
which is embedded into the binary at compile time.
189+
190+
### After rebuilding
191+
192+
```bash
193+
cargo build --release # Embed new catalog
194+
cargo ci-test # Verify tests pass
195+
cargo ci-fmt # Check formatting
196+
cargo ci-lint # Check clippy
197+
```
198+
199+
## Adding a New Reference
200+
201+
1. Obtain the FASTA and create a dict with `samtools dict`
202+
2. Update the dict with correct AS/SP/UR tags
203+
3. Add AN tags using fgbio with the appropriate assembly report
204+
4. Add a new entry to `build_catalog.sh` following the existing pattern
205+
5. Add the output JSON filename to the `jq -s` merge command at the bottom
206+
6. Run the build and verify with `cargo ci-test`

0 commit comments

Comments
 (0)