fulcrumgenomics
diff --git a/‎catalogs/README.md‎
Lines changed: 206 additions & 0 deletions b/‎catalogs/README.md‎
Lines changed: 206 additions & 0 deletions
@@ -0,0 +1,206 @@
+# Reference Catalog Build Guide
+
+This document describes how to build and maintain the human reference genome catalog
+used by ref-solver.
+
+## Overview
+
+Each reference in the catalog is built from three inputs:
+
+1. **Sequence dictionary (`.dict`)** — provides contig names, lengths, MD5 checksums,
+   and SAM metadata (AS, SP, UR, AN tags)
+2. **FASTA file** — provides sha512t24u digests (and MD5 for validation)
+3. **NCBI assembly report** — provides sequence roles, aliases, and cross-reference
+   between naming conventions
+
+The build script (`build_catalog.sh`) runs `ref-solver catalog build` for each
+reference, then merges the per-reference JSON files into `human_references.json`.
+
+## Reference Data Location
+
+Reference FASTA and dict files live on scratch storage:
+
+```text
+/Volumes/scratch-00001/data/references/
+  Homo_sapiens_assembly38/    # Broad GRCh38 analysis set
+  GRCh38_reference_genome/    # 1000 Genomes GRCh38
+  hs38DH/                     # bwakit GRCh38 + ALT + decoy + HLA
+  hg38/                       # UCSC hg38
+  hg38_p12_ucsc/              # UCSC hg38 with p12 patches
+  GRCh38.p13/                 # NCBI GRCh38.p13
+  grch38_dragen/              # Illumina DRAGEN GRCh38
+  grch38_dragen_altmasked/    # Illumina DRAGEN GRCh38 ALT-masked
+  grch38_gdc/                 # NCI GDC GRCh38.d1.vd1
+  Homo_sapiens_assembly19/    # Broad b37
+  hs37d5/                     # 1000 Genomes GRCh37 + decoy
+  hg19/                       # UCSC hg19
+  grch37_ncbi/                # NCBI GRCh37.p13
+  chm13v2/                    # T2T-CHM13v2.0
+```
+
+## Assembly Reports
+
+Assembly reports are stored in `catalogs/sources/` and map between naming conventions
+(UCSC, RefSeq, GenBank, sequence name) for each assembly version.
+
+### Which report for which reference
+
+**Critical**: Each reference must use the assembly report matching its patch level.
+Using the wrong patch report causes missing AN tags for patch-specific contigs.
+
+| Report | Patch | References |
+|--------|-------|------------|
+| `GCF_000001405.40_GRCh38.p14_assembly_report.txt` | GRCh38.p14 | Broad, 1kg, hs38DH, hg38, DRAGEN, DRAGEN-altmasked, GDC |
+| `GCF_000001405.39_GRCh38.p13_assembly_report.txt` | GRCh38.p13 | grch38_ncbi (FASTA is p13) |
+| `GCF_000001405.38_GRCh38.p12_assembly_report.txt` | GRCh38.p12 | hg38_p12_ucsc |
+| `GCF_000001405.25_GRCh37.p13_assembly_report.txt` | GRCh37.p13 | b37, hs37d5, hg19, grch37_ncbi |
+| `GCF_009914755.1_T2T-CHM13v2.0_assembly_report.txt` | T2T-CHM13v2.0 | chm13v2 |
+
+References without patches (Broad, 1kg, hg38, DRAGEN, etc.) can use the latest p14
+report since primary assembly contigs don't change between patches.
+
+### Additional assembly reports for non-primary contigs
+
+| Report | Assembly | Contigs |
+|--------|----------|---------|
+| `GCA_000786075.2_hs38d1_assembly_report.txt` | hs38d1 decoys | KN707\* (387) and JTFH\* (1998) decoy contigs |
+| `GCF_002402265.1_ASM240226v1_assembly_report.txt` | EBV | NC_007605.1 / chrEBV (1 contig) |
+
+These are used when adding AN tags to dict files (see below) but are not directly
+consumed by `ref-solver catalog build`.
+
+## Building Dict Files
+
+Each dict file should have complete SAM metadata on every `@SQ` line:
+
+- **SN** — sequence name (from `samtools dict`)
+- **LN** — sequence length
+- **M5** — MD5 checksum
+- **AS** — assembly identifier (GRCh38, GRCh37, or T2T-CHM13v2.0)
+- **SP** — species (`Homo sapiens`)
+- **UR** — canonical remote download URL for the FASTA
+- **AN** — alternate contig names from assembly reports
+
+### Step 1: Create base dict
+
+```bash
+samtools dict <reference.fasta> > <reference.dict>
+```
+
+This provides SN, LN, M5. The UR tag will be the local FASTA path (overridden later).
+
+### Step 2: Set AS, SP, UR tags
+
+Update all `@SQ` lines with the correct assembly, species, and remote URL. A simple
+Python script can iterate over `@SQ` lines and replace/add the AS, SP, and UR tags.
+The UR value should match the `--download-url` in `build_catalog.sh`.
+
+| Reference | AS value |
+|-----------|----------|
+| All GRCh38-based | GRCh38 |
+| All GRCh37-based | GRCh37 |
+| T2T-CHM13v2.0 | T2T-CHM13v2.0 |
+
+### Step 3: Add AN tags with fgbio
+
+Use `fgbio CollectAlternateContigNames` to add alternate contig names from assembly
+reports. Install fgbio via the pixi.toml in the repo root (`pixi run fgbio ...`).
+
+```bash
+pixi run fgbio CollectAlternateContigNames \
+  --existing <dict> \
+  --assembly-report <assembly_report> \
+  --output <dict> \
+  --primary <primary_column> \
+  --alternates <alt1> <alt2> <alt3> \
+  --sequence-roles AssembledMolecule UnlocalizedScaffold UnplacedScaffold \
+                   FixPatch NovelPatch AltScaffold
+```
+
+The `--primary` flag must match the naming convention used by the dict's SN values:
+
+| Dict naming convention | `--primary` | `--alternates` | Examples |
+|----------------------|-------------|----------------|----------|
+| UCSC (`chr1`) | `UcscName` | `AssignedMolecule GenBankAccession RefSeqAccession` | hg38, hg19, Broad, DRAGEN, GDC |
+| Numeric (`1`) | `AssignedMolecule` | `UcscName GenBankAccession RefSeqAccession` | b37, hs37d5 |
+| RefSeq (`NC_000001.11`) | `RefSeqAccession` | `AssignedMolecule GenBankAccession UcscName` | GRCh38.p13, grch37_ncbi, chm13v2 |
+
+### Mixed naming conventions (b37, hs37d5)
+
+The b37 and hs37d5 dicts use `AssignedMolecule` for chromosomes (`1`, `2`, ..., `MT`)
+but `GenBankAccession` for scaffolds (`GL000207.1`, etc.). No single `--primary` column
+matches all contigs. Use a two-pass approach:
+
+1. Pass 1: `--primary AssignedMolecule` (picks up chromosomes: 1-22, X, Y, MT)
+2. Pass 2: `--primary GenBankAccession` (picks up GL* scaffolds)
+
+### Decoy contigs (hs38d1)
+
+Decoy contigs (KN707\*, JTFH\*) come from the hs38d1 assembly (`GCA_000786075.2`), not
+the main GRCh38 assembly. The dict SN names use UCSC-style format
+(`chrUn_KN707606v1_decoy`) but the assembly report uses GenBank accessions
+(`KN707606.1`). There is no direct column match, so fgbio cannot be used directly.
+
+Instead, a Python script maps between the naming conventions:
+- Strip `chrUn_` prefix and `_decoy` suffix from the SN
+- Convert version format: `KN707606v1` -> `KN707606.1`
+- Look up in the hs38d1 assembly report's GenBank column
+- Build AN from SequenceName (`decoy00416`) and GenBank accession (`KN707606.1`)
+
+### EBV
+
+EBV (`chrEBV` / `NC_007605`) has its own assembly report
+(`GCF_002402265.1_ASM240226v1`). For UCSC-named dicts: `AN:NC_007605.1,AJ507799.2`.
+For RefSeq/numeric-named dicts: `AN:AJ507799.2,chrEBV`.
+
+### HLA alleles
+
+HLA contigs (`HLA-A*01:01:01:01`, etc.) come from the IPD-IMGT/HLA database. There
+is no NCBI assembly report for these — they cannot get AN tags. This affects Broad,
+1kg, hs38DH, and DRAGEN-altmasked (525 contigs each).
+
+### Viral contigs (GDC only)
+
+The GDC reference includes 199 viral sequences (HPV, CMV, HBV, HCV, HIV). These have
+no NCBI assembly reports and cannot get AN tags.
+
+## Building the Catalog
+
+### Prerequisites
+
+- Release binary: `cargo build --release`
+- Reference FASTA/dict files on scratch storage
+- Assembly reports in `catalogs/sources/`
+
+### Run the build
+
+```bash
+./catalogs/build_catalog.sh /Volumes/scratch-00001/data/references
+```
+
+This runs `ref-solver catalog build` for each reference with:
+- `--download-url` — overrides local dict UR with the canonical remote URL
+- `--species "Homo sapiens"` — sets the species on all contigs
+- `--assembly` — sets the assembly enum and propagates to contig assembly fields
+- `--assembly-report-url` — stored as metadata on the reference
+
+The script merges all per-reference JSON files into `catalogs/human_references.json`,
+which is embedded into the binary at compile time.
+
+### After rebuilding
+
+```bash
+cargo build --release    # Embed new catalog
+cargo ci-test            # Verify tests pass
+cargo ci-fmt             # Check formatting
+cargo ci-lint            # Check clippy
+```
+
+## Adding a New Reference
+
+1. Obtain the FASTA and create a dict with `samtools dict`
+2. Update the dict with correct AS/SP/UR tags
+3. Add AN tags using fgbio with the appropriate assembly report
+4. Add a new entry to `build_catalog.sh` following the existing pattern
+5. Add the output JSON filename to the `jq -s` merge command at the bottom
+6. Run the build and verify with `cargo ci-test`