|
| 1 | +# Reference Catalog Build Guide |
| 2 | + |
| 3 | +This document describes how to build and maintain the human reference genome catalog |
| 4 | +used by ref-solver. |
| 5 | + |
| 6 | +## Overview |
| 7 | + |
| 8 | +Each reference in the catalog is built from three inputs: |
| 9 | + |
| 10 | +1. **Sequence dictionary (`.dict`)** — provides contig names, lengths, MD5 checksums, |
| 11 | + and SAM metadata (AS, SP, UR, AN tags) |
| 12 | +2. **FASTA file** — provides sha512t24u digests (and MD5 for validation) |
| 13 | +3. **NCBI assembly report** — provides sequence roles, aliases, and cross-reference |
| 14 | + between naming conventions |
| 15 | + |
| 16 | +The build script (`build_catalog.sh`) runs `ref-solver catalog build` for each |
| 17 | +reference, then merges the per-reference JSON files into `human_references.json`. |
| 18 | + |
| 19 | +## Reference Data Location |
| 20 | + |
| 21 | +Reference FASTA and dict files live on scratch storage: |
| 22 | + |
| 23 | +```text |
| 24 | +/Volumes/scratch-00001/data/references/ |
| 25 | + Homo_sapiens_assembly38/ # Broad GRCh38 analysis set |
| 26 | + GRCh38_reference_genome/ # 1000 Genomes GRCh38 |
| 27 | + hs38DH/ # bwakit GRCh38 + ALT + decoy + HLA |
| 28 | + hg38/ # UCSC hg38 |
| 29 | + hg38_p12_ucsc/ # UCSC hg38 with p12 patches |
| 30 | + GRCh38.p13/ # NCBI GRCh38.p13 |
| 31 | + grch38_dragen/ # Illumina DRAGEN GRCh38 |
| 32 | + grch38_dragen_altmasked/ # Illumina DRAGEN GRCh38 ALT-masked |
| 33 | + grch38_gdc/ # NCI GDC GRCh38.d1.vd1 |
| 34 | + Homo_sapiens_assembly19/ # Broad b37 |
| 35 | + hs37d5/ # 1000 Genomes GRCh37 + decoy |
| 36 | + hg19/ # UCSC hg19 |
| 37 | + grch37_ncbi/ # NCBI GRCh37.p13 |
| 38 | + chm13v2/ # T2T-CHM13v2.0 |
| 39 | +``` |
| 40 | + |
| 41 | +## Assembly Reports |
| 42 | + |
| 43 | +Assembly reports are stored in `catalogs/sources/` and map between naming conventions |
| 44 | +(UCSC, RefSeq, GenBank, sequence name) for each assembly version. |
| 45 | + |
| 46 | +### Which report for which reference |
| 47 | + |
| 48 | +**Critical**: Each reference must use the assembly report matching its patch level. |
| 49 | +Using the wrong patch report causes missing AN tags for patch-specific contigs. |
| 50 | + |
| 51 | +| Report | Patch | References | |
| 52 | +|--------|-------|------------| |
| 53 | +| `GCF_000001405.40_GRCh38.p14_assembly_report.txt` | GRCh38.p14 | Broad, 1kg, hs38DH, hg38, DRAGEN, DRAGEN-altmasked, GDC | |
| 54 | +| `GCF_000001405.39_GRCh38.p13_assembly_report.txt` | GRCh38.p13 | grch38_ncbi (FASTA is p13) | |
| 55 | +| `GCF_000001405.38_GRCh38.p12_assembly_report.txt` | GRCh38.p12 | hg38_p12_ucsc | |
| 56 | +| `GCF_000001405.25_GRCh37.p13_assembly_report.txt` | GRCh37.p13 | b37, hs37d5, hg19, grch37_ncbi | |
| 57 | +| `GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt` | T2T-CHM13v2.0 | chm13v2 | |
| 58 | + |
| 59 | +References without patches (Broad, 1kg, hg38, DRAGEN, etc.) can use the latest p14 |
| 60 | +report since primary assembly contigs don't change between patches. |
| 61 | + |
| 62 | +### Additional assembly reports for non-primary contigs |
| 63 | + |
| 64 | +| Report | Assembly | Contigs | |
| 65 | +|--------|----------|---------| |
| 66 | +| `GCA_000786075.2_hs38d1_assembly_report.txt` | hs38d1 decoys | KN707\* (387) and JTFH\* (1998) decoy contigs | |
| 67 | +| `GCF_002402265.1_ASM240226v1_assembly_report.txt` | EBV | NC_007605.1 / chrEBV (1 contig) | |
| 68 | + |
| 69 | +These are used when adding AN tags to dict files (see below) but are not directly |
| 70 | +consumed by `ref-solver catalog build`. |
| 71 | + |
| 72 | +## Building Dict Files |
| 73 | + |
| 74 | +Each dict file should have complete SAM metadata on every `@SQ` line: |
| 75 | + |
| 76 | +- **SN** — sequence name (from `samtools dict`) |
| 77 | +- **LN** — sequence length |
| 78 | +- **M5** — MD5 checksum |
| 79 | +- **AS** — assembly identifier (GRCh38, GRCh37, or T2T-CHM13v2.0) |
| 80 | +- **SP** — species (`Homo sapiens`) |
| 81 | +- **UR** — canonical remote download URL for the FASTA |
| 82 | +- **AN** — alternate contig names from assembly reports |
| 83 | + |
| 84 | +### Step 1: Create base dict |
| 85 | + |
| 86 | +```bash |
| 87 | +samtools dict <reference.fasta> > <reference.dict> |
| 88 | +``` |
| 89 | + |
| 90 | +This provides SN, LN, M5. The UR tag will be the local FASTA path (overridden later). |
| 91 | + |
| 92 | +### Step 2: Set AS, SP, UR tags |
| 93 | + |
| 94 | +Update all `@SQ` lines with the correct assembly, species, and remote URL. A simple |
| 95 | +Python script can iterate over `@SQ` lines and replace/add the AS, SP, and UR tags. |
| 96 | +The UR value should match the `--download-url` in `build_catalog.sh`. |
| 97 | + |
| 98 | +| Reference | AS value | |
| 99 | +|-----------|----------| |
| 100 | +| All GRCh38-based | GRCh38 | |
| 101 | +| All GRCh37-based | GRCh37 | |
| 102 | +| T2T-CHM13v2.0 | T2T-CHM13v2.0 | |
| 103 | + |
| 104 | +### Step 3: Add AN tags with fgbio |
| 105 | + |
| 106 | +Use `fgbio CollectAlternateContigNames` to add alternate contig names from assembly |
| 107 | +reports. Install fgbio via the pixi.toml in the repo root (`pixi run fgbio ...`). |
| 108 | + |
| 109 | +```bash |
| 110 | +pixi run fgbio CollectAlternateContigNames \ |
| 111 | + --existing <dict> \ |
| 112 | + --assembly-report <assembly_report> \ |
| 113 | + --output <dict> \ |
| 114 | + --primary <primary_column> \ |
| 115 | + --alternates <alt1> <alt2> <alt3> \ |
| 116 | + --sequence-roles AssembledMolecule UnlocalizedScaffold UnplacedScaffold \ |
| 117 | + FixPatch NovelPatch AltScaffold |
| 118 | +``` |
| 119 | + |
| 120 | +The `--primary` flag must match the naming convention used by the dict's SN values: |
| 121 | + |
| 122 | +| Dict naming convention | `--primary` | `--alternates` | Examples | |
| 123 | +|----------------------|-------------|----------------|----------| |
| 124 | +| UCSC (`chr1`) | `UcscName` | `AssignedMolecule GenBankAccession RefSeqAccession` | hg38, hg19, Broad, DRAGEN, GDC | |
| 125 | +| Numeric (`1`) | `AssignedMolecule` | `UcscName GenBankAccession RefSeqAccession` | b37, hs37d5 | |
| 126 | +| RefSeq (`NC_000001.11`) | `RefSeqAccession` | `AssignedMolecule GenBankAccession UcscName` | GRCh38.p13, grch37_ncbi, chm13v2 | |
| 127 | + |
| 128 | +### Mixed naming conventions (b37, hs37d5) |
| 129 | + |
| 130 | +The b37 and hs37d5 dicts use `AssignedMolecule` for chromosomes (`1`, `2`, ..., `MT`) |
| 131 | +but `GenBankAccession` for scaffolds (`GL000207.1`, etc.). No single `--primary` column |
| 132 | +matches all contigs. Use a two-pass approach: |
| 133 | + |
| 134 | +1. Pass 1: `--primary AssignedMolecule` (picks up chromosomes: 1-22, X, Y, MT) |
| 135 | +2. Pass 2: `--primary GenBankAccession` (picks up GL* scaffolds) |
| 136 | + |
| 137 | +### Decoy contigs (hs38d1) |
| 138 | + |
| 139 | +Decoy contigs (KN707\*, JTFH\*) come from the hs38d1 assembly (`GCA_000786075.2`), not |
| 140 | +the main GRCh38 assembly. The dict SN names use UCSC-style format |
| 141 | +(`chrUn_KN707606v1_decoy`) but the assembly report uses GenBank accessions |
| 142 | +(`KN707606.1`). There is no direct column match, so fgbio cannot be used directly. |
| 143 | + |
| 144 | +Instead, a Python script maps between the naming conventions: |
| 145 | +- Strip `chrUn_` prefix and `_decoy` suffix from the SN |
| 146 | +- Convert version format: `KN707606v1` -> `KN707606.1` |
| 147 | +- Look up in the hs38d1 assembly report's GenBank column |
| 148 | +- Build AN from SequenceName (`decoy00416`) and GenBank accession (`KN707606.1`) |
| 149 | + |
| 150 | +### EBV |
| 151 | + |
| 152 | +EBV (`chrEBV` / `NC_007605`) has its own assembly report |
| 153 | +(`GCF_002402265.1_ASM240226v1`). For UCSC-named dicts: `AN:NC_007605.1,AJ507799.2`. |
| 154 | +For RefSeq/numeric-named dicts: `AN:AJ507799.2,chrEBV`. |
| 155 | + |
| 156 | +### HLA alleles |
| 157 | + |
| 158 | +HLA contigs (`HLA-A*01:01:01:01`, etc.) come from the IPD-IMGT/HLA database. There |
| 159 | +is no NCBI assembly report for these — they cannot get AN tags. This affects Broad, |
| 160 | +1kg, hs38DH, and DRAGEN-altmasked (525 contigs each). |
| 161 | + |
| 162 | +### Viral contigs (GDC only) |
| 163 | + |
| 164 | +The GDC reference includes 199 viral sequences (HPV, CMV, HBV, HCV, HIV). These have |
| 165 | +no NCBI assembly reports and cannot get AN tags. |
| 166 | + |
| 167 | +## Building the Catalog |
| 168 | + |
| 169 | +### Prerequisites |
| 170 | + |
| 171 | +- Release binary: `cargo build --release` |
| 172 | +- Reference FASTA/dict files on scratch storage |
| 173 | +- Assembly reports in `catalogs/sources/` |
| 174 | + |
| 175 | +### Run the build |
| 176 | + |
| 177 | +```bash |
| 178 | +./catalogs/build_catalog.sh /Volumes/scratch-00001/data/references |
| 179 | +``` |
| 180 | + |
| 181 | +This runs `ref-solver catalog build` for each reference with: |
| 182 | +- `--download-url` — overrides local dict UR with the canonical remote URL |
| 183 | +- `--species "Homo sapiens"` — sets the species on all contigs |
| 184 | +- `--assembly` — sets the assembly enum and propagates to contig assembly fields |
| 185 | +- `--assembly-report-url` — stored as metadata on the reference |
| 186 | + |
| 187 | +The script merges all per-reference JSON files into `catalogs/human_references.json`, |
| 188 | +which is embedded into the binary at compile time. |
| 189 | + |
| 190 | +### After rebuilding |
| 191 | + |
| 192 | +```bash |
| 193 | +cargo build --release # Embed new catalog |
| 194 | +cargo ci-test # Verify tests pass |
| 195 | +cargo ci-fmt # Check formatting |
| 196 | +cargo ci-lint # Check clippy |
| 197 | +``` |
| 198 | + |
| 199 | +## Adding a New Reference |
| 200 | + |
| 201 | +1. Obtain the FASTA and create a dict with `samtools dict` |
| 202 | +2. Update the dict with correct AS/SP/UR tags |
| 203 | +3. Add AN tags using fgbio with the appropriate assembly report |
| 204 | +4. Add a new entry to `build_catalog.sh` following the existing pattern |
| 205 | +5. Add the output JSON filename to the `jq -s` merge command at the bottom |
| 206 | +6. Run the build and verify with `cargo ci-test` |
0 commit comments