Fix per-sample genotype parsing to fall back from PL to GL to GT#76
Conversation
There was a problem hiding this comment.
Pull request overview
Updates SVDcalculator::ReadVcf genotype parsing to correctly handle per-sample missing FORMAT subfields by attempting PL, then GL, then GT for each sample, rather than selecting one strategy for the whole marker.
Changes:
- Look up PL/GL/GT FORMAT indices once per marker and perform per-sample fallback parsing (PL > GL > GT).
- Fix GT parsing to treat diploid genotypes as 2 alleles after splitting on
/or|. - Clarify intent with comments and simplify indexing (direct
[0..2]access for PL/GL triples).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
|
|
||
| if (!parsed) nMissingGenoSamples++; | ||
|
|
||
| if ((phred11 < 0) || (phred12 < 0) || (phred22 < 0)) { |
There was a problem hiding this comment.
When parsed remains false, the code increments nMissingGenoSamples but still continues to run the phred validation/clamping and then assigns perMarkerGeno[i] based on whatever phred11/12/22 were from a previous sample (they are not reinitialized inside the sample loop). This will silently turn missing genotypes into non-missing calls and can leak state across samples. Consider reinitializing phred11/12/22 at the start of each sample and, if !parsed, leave perMarkerGeno[i] as -1 and continue before computing minGeno.
| if (!parsed && idxPL >= 0) { | ||
| phred.ReplaceTokens(pMarker->asSampleValues[idxPL + i * formatLength], ","); | ||
| if (phred.Length() == 3 && phred[0] != ".") { | ||
| phred11 = phred[idx11].AsInteger(); | ||
| phred12 = phred[idx12].AsInteger(); | ||
| phred22 = phred[idx22].AsInteger(); | ||
| phred11 = phred[0].AsInteger(); | ||
| phred12 = phred[1].AsInteger(); | ||
| phred22 = phred[2].AsInteger(); | ||
| parsed = true; |
There was a problem hiding this comment.
The PL parsing only checks phred[0] != "." before converting all three fields to integers. In this codebase String::AsInteger() returns 0 when conversion fails (e.g., for "."), so partial-missing values like "0,.,." would be mis-parsed as valid likelihoods. Consider requiring all three tokens to be present and non-"." before calling AsInteger() (or using the checked AsInteger(long&) overload).
| if (!parsed && idxGL >= 0) { | ||
| phred.ReplaceTokens(pMarker->asSampleValues[idxGL + i * formatLength], ","); | ||
| if (phred.Length() == 3 && phred[0] != ".") { | ||
| phred11 = | ||
| static_cast<int>(-10. * phred[idx11].AsDouble()); | ||
| static_cast<int>(-10. * phred[0].AsDouble()); | ||
| phred12 = | ||
| static_cast<int>(-10. * phred[idx12].AsDouble()); | ||
| static_cast<int>(-10. * phred[1].AsDouble()); | ||
| phred22 = | ||
| static_cast<int>(-10. * phred[idx22].AsDouble()); | ||
| static_cast<int>(-10. * phred[2].AsDouble()); |
There was a problem hiding this comment.
The GL parsing has the same partial-missing issue as PL: it only checks phred[0] != "." before converting phred[1]/phred[2]. Since String::AsDouble()/AsInteger() return 0 on failed conversion, GL values like "-0.1,.,." could be treated as real data. Consider validating all three tokens are non-missing before converting (or using checked conversion APIs).
| if (!parsed && idxGT >= 0) { | ||
| StringArray alleles; | ||
| alleles.ReplaceTokens(pMarker->asSampleValues[idxGT + i * formatLength], "|/"); | ||
| if (alleles.Length() == 2 && alleles[0] != ".") { |
There was a problem hiding this comment.
GT parsing checks alleles[0] != "." but not alleles[1]. In this codebase String::AsInteger() returns 0 on non-integer input, so genotypes like "./1" or "0/." will be mis-parsed as valid (often as 0/1). Consider requiring both alleles to be non-missing before converting, and leaving the genotype missing otherwise.
| if (alleles.Length() == 2 && alleles[0] != ".") { | |
| if (alleles.Length() == 2 && alleles[0] != "." && alleles[1] != ".") { |
| // Look up FORMAT field indices for genotype data. A marker's FORMAT | ||
| // may contain any combination of PL, GL, and GT. Individual samples | ||
| // may have missing values for some fields, so we try each per-sample | ||
| // in priority order: PL > GL > GT. | ||
| int idxPL = pMarker->asFormatKeys.Find("PL"); | ||
| int idxGL = pMarker->asFormatKeys.Find("GL"); | ||
| int idxGT = pMarker->asFormatKeys.Find("GT"); | ||
| if (idxPL < 0 && idxGL < 0 && idxGT < 0) { | ||
| throw VcfFileException("Cannot recognize GT, GL or PL key in FORMAT field"); |
There was a problem hiding this comment.
This logic changes genotype parsing semantics (per-sample fallback PL > GL > GT), but there is no automated coverage exercising --RefVCF / SVDcalculator::ReadVcf. Since the repo already uses CTest integration tests, consider adding a small VCF fixture and a regression test that runs VerifyBamID --RefVCF ... and asserts expected behavior for samples with missing PL/GL but present GT (and for GT-only markers).
|
@tfenne thanks for the PR! I think the updates and bug fixes make sense. Would you mind to address @copilot reviews before I approve and merge? thanks! |
Two issues in SVDcalculator::ReadVcf genotype parsing: 1. The code selected a single FORMAT field strategy (PL, GL, or GT) per marker and used it for all samples. If PL was present in the FORMAT header but a specific sample had PL=., that sample was counted as missing even when a valid GT was available for it. This is common in GATK-generated multi-sample call sets where homref genotypes computed from ref blocks in gVCF may not get PLs. 2. The GT branch checked for 3 tokens after splitting the genotype by "|/", but a diploid genotype like "0/1" splits into 2 tokens, not 3. This meant every GT-only sample was always counted as missing, and markers quickly exceeded the 20% missing-rate threshold. Fix: look up PL, GL, and GT indices per marker, then for each sample try them in priority order (PL > GL > GT), falling back when a field is missing. The GT branch now correctly checks for 2 tokens and uses a separate `alleles` variable rather than reusing `phred`.
Previously only the first token was checked for "." before converting all values. Partially-missing fields like "0,.,." (PL/GL) or "./1" (GT) would silently mis-parse because AsInteger()/AsDouble() return 0 for ".". Now all tokens in the triplet (PL, GL) or both alleles (GT) must be non-missing for the sample to be considered parsed.
When a sample's genotype could not be parsed (all of PL, GL, GT missing), the code incremented the missing counter but still fell through to phred validation, clamping, and genotype assignment — using whatever phred11/12/22 values were left from the previous sample. This silently turned missing genotypes into copies of their neighbor. Fix by moving phred variable declarations inside the per-sample loop so they are always zero-initialized, and adding a `continue` after incrementing the missing counter so unparsed samples retain their default perMarkerGeno value of -1.
Adds a test VCF with 12 rows covering PL-only, GL-only, GT-only, PL/GL/GT fallback, partial-missing fields (e.g. "50,.,."), phased genotypes, and priority ordering. Each row encodes EXPECTED_AF and EXPECTED_N_MISSING in the INFO field. The test calls ReadVcf directly and validates computed AF and missing counts match expectations.
220dd05 to
8ef3290
Compare
|
Thanks @Griffan - I've rebased this PR branch on top of the merged commits on I wanted to keep the test relatively simple and tightly-scoped, so I've added a small test VCF with a wide variety of format fields and sample missingness, and verify that for each variant ReadVcf can read it and compute the correct number of missing samples and the correct allele frequency. And thank you for merging my other PRs! |
I encountered two issues in SVDcalculator::ReadVcf genotype parsing:
The code selected a single FORMAT field strategy (PL, GL, or GT) per marker and used it for all samples. If PL was present in the FORMAT header but a specific sample had PL=., that sample was counted as missing even when a valid GT was available for it. This is common in GATK-generated multi-sample call sets where homref genotypes computed from ref blocks in gVCF may not get PLs, but samples with non-
0/0genotypes do.The GT branch checked for 3 tokens after splitting the genotype by "|/", but a diploid genotype like "0/1" splits into 2 tokens, not 3. This meant every GT-only sample was always counted as missing, and markers quickly exceeded the 20% missing-rate threshold. I think the check for a 3-way split was just a copy/paste mistake from the code above where it splits the PL and GL fields.
Fix: look up PL, GL, and GT indices per marker, then for each sample try them in priority order (PL > GL > GT), falling back when a field is missing. The GT branch now correctly checks for 2 tokens and uses a separate
allelesvariable rather than reusingphred.