Skip to content

Fix per-sample genotype parsing to fall back from PL to GL to GT#76

Merged
Griffan merged 4 commits into
Griffan:masterfrom
tfenne:fix-genotype-parsing-fallback
Apr 12, 2026
Merged

Fix per-sample genotype parsing to fall back from PL to GL to GT#76
Griffan merged 4 commits into
Griffan:masterfrom
tfenne:fix-genotype-parsing-fallback

Conversation

@tfenne
Copy link
Copy Markdown
Contributor

@tfenne tfenne commented Apr 9, 2026

I encountered two issues in SVDcalculator::ReadVcf genotype parsing:

  1. The code selected a single FORMAT field strategy (PL, GL, or GT) per marker and used it for all samples. If PL was present in the FORMAT header but a specific sample had PL=., that sample was counted as missing even when a valid GT was available for it. This is common in GATK-generated multi-sample call sets where homref genotypes computed from ref blocks in gVCF may not get PLs, but samples with non-0/0 genotypes do.

  2. The GT branch checked for 3 tokens after splitting the genotype by "|/", but a diploid genotype like "0/1" splits into 2 tokens, not 3. This meant every GT-only sample was always counted as missing, and markers quickly exceeded the 20% missing-rate threshold. I think the check for a 3-way split was just a copy/paste mistake from the code above where it splits the PL and GL fields.

Fix: look up PL, GL, and GT indices per marker, then for each sample try them in priority order (PL > GL > GT), falling back when a field is missing. The GT branch now correctly checks for 2 tokens and uses a separate alleles variable rather than reusing phred.

@Griffan Griffan self-requested a review April 11, 2026 22:18
@Griffan Griffan closed this Apr 11, 2026
@Griffan Griffan reopened this Apr 11, 2026
@Griffan Griffan requested a review from Copilot April 11, 2026 23:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates SVDcalculator::ReadVcf genotype parsing to correctly handle per-sample missing FORMAT subfields by attempting PL, then GL, then GT for each sample, rather than selecting one strategy for the whole marker.

Changes:

  • Look up PL/GL/GT FORMAT indices once per marker and perform per-sample fallback parsing (PL > GL > GT).
  • Fix GT parsing to treat diploid genotypes as 2 alleles after splitting on / or |.
  • Clarify intent with comments and simplify indexing (direct [0..2] access for PL/GL triples).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread SVDcalculator.cpp
Comment on lines 156 to 160
}

if (!parsed) nMissingGenoSamples++;

if ((phred11 < 0) || (phred12 < 0) || (phred22 < 0)) {
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When parsed remains false, the code increments nMissingGenoSamples but still continues to run the phred validation/clamping and then assigns perMarkerGeno[i] based on whatever phred11/12/22 were from a previous sample (they are not reinitialized inside the sample loop). This will silently turn missing genotypes into non-missing calls and can leak state across samples. Consider reinitializing phred11/12/22 at the start of each sample and, if !parsed, leave perMarkerGeno[i] as -1 and continue before computing minGeno.

Copilot uses AI. Check for mistakes.
Comment thread SVDcalculator.cpp
Comment on lines +109 to +115
if (!parsed && idxPL >= 0) {
phred.ReplaceTokens(pMarker->asSampleValues[idxPL + i * formatLength], ",");
if (phred.Length() == 3 && phred[0] != ".") {
phred11 = phred[idx11].AsInteger();
phred12 = phred[idx12].AsInteger();
phred22 = phred[idx22].AsInteger();
phred11 = phred[0].AsInteger();
phred12 = phred[1].AsInteger();
phred22 = phred[2].AsInteger();
parsed = true;
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PL parsing only checks phred[0] != "." before converting all three fields to integers. In this codebase String::AsInteger() returns 0 when conversion fails (e.g., for "."), so partial-missing values like "0,.,." would be mis-parsed as valid likelihoods. Consider requiring all three tokens to be present and non-"." before calling AsInteger() (or using the checked AsInteger(long&) overload).

Copilot uses AI. Check for mistakes.
Comment thread SVDcalculator.cpp
Comment on lines +120 to +128
if (!parsed && idxGL >= 0) {
phred.ReplaceTokens(pMarker->asSampleValues[idxGL + i * formatLength], ",");
if (phred.Length() == 3 && phred[0] != ".") {
phred11 =
static_cast<int>(-10. * phred[idx11].AsDouble());
static_cast<int>(-10. * phred[0].AsDouble());
phred12 =
static_cast<int>(-10. * phred[idx12].AsDouble());
static_cast<int>(-10. * phred[1].AsDouble());
phred22 =
static_cast<int>(-10. * phred[idx22].AsDouble());
static_cast<int>(-10. * phred[2].AsDouble());
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GL parsing has the same partial-missing issue as PL: it only checks phred[0] != "." before converting phred[1]/phred[2]. Since String::AsDouble()/AsInteger() return 0 on failed conversion, GL values like "-0.1,.,." could be treated as real data. Consider validating all three tokens are non-missing before converting (or using checked conversion APIs).

Copilot uses AI. Check for mistakes.
Comment thread SVDcalculator.cpp Outdated
if (!parsed && idxGT >= 0) {
StringArray alleles;
alleles.ReplaceTokens(pMarker->asSampleValues[idxGT + i * formatLength], "|/");
if (alleles.Length() == 2 && alleles[0] != ".") {
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GT parsing checks alleles[0] != "." but not alleles[1]. In this codebase String::AsInteger() returns 0 on non-integer input, so genotypes like "./1" or "0/." will be mis-parsed as valid (often as 0/1). Consider requiring both alleles to be non-missing before converting, and leaving the genotype missing otherwise.

Suggested change
if (alleles.Length() == 2 && alleles[0] != ".") {
if (alleles.Length() == 2 && alleles[0] != "." && alleles[1] != ".") {

Copilot uses AI. Check for mistakes.
Comment thread SVDcalculator.cpp
Comment on lines +82 to +90
// Look up FORMAT field indices for genotype data. A marker's FORMAT
// may contain any combination of PL, GL, and GT. Individual samples
// may have missing values for some fields, so we try each per-sample
// in priority order: PL > GL > GT.
int idxPL = pMarker->asFormatKeys.Find("PL");
int idxGL = pMarker->asFormatKeys.Find("GL");
int idxGT = pMarker->asFormatKeys.Find("GT");
if (idxPL < 0 && idxGL < 0 && idxGT < 0) {
throw VcfFileException("Cannot recognize GT, GL or PL key in FORMAT field");
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic changes genotype parsing semantics (per-sample fallback PL > GL > GT), but there is no automated coverage exercising --RefVCF / SVDcalculator::ReadVcf. Since the repo already uses CTest integration tests, consider adding a small VCF fixture and a regression test that runs VerifyBamID --RefVCF ... and asserts expected behavior for samples with missing PL/GL but present GT (and for GT-only markers).

Copilot uses AI. Check for mistakes.
@Griffan
Copy link
Copy Markdown
Owner

Griffan commented Apr 11, 2026

@tfenne thanks for the PR! I think the updates and bug fixes make sense. Would you mind to address @copilot reviews before I approve and merge? thanks!

tfenne added 4 commits April 12, 2026 04:21
Two issues in SVDcalculator::ReadVcf genotype parsing:

1. The code selected a single FORMAT field strategy (PL, GL, or GT)
   per marker and used it for all samples. If PL was present in the
   FORMAT header but a specific sample had PL=., that sample was
   counted as missing even when a valid GT was available for it. This
   is common in GATK-generated multi-sample call sets where homref
   genotypes computed from ref blocks in gVCF may not get PLs.

2. The GT branch checked for 3 tokens after splitting the genotype by
   "|/", but a diploid genotype like "0/1" splits into 2 tokens, not
   3. This meant every GT-only sample was always counted as missing,
   and markers quickly exceeded the 20% missing-rate threshold.

Fix: look up PL, GL, and GT indices per marker, then for each sample
try them in priority order (PL > GL > GT), falling back when a field
is missing. The GT branch now correctly checks for 2 tokens and uses
a separate `alleles` variable rather than reusing `phred`.
Previously only the first token was checked for "." before converting
all values. Partially-missing fields like "0,.,." (PL/GL) or "./1" (GT)
would silently mis-parse because AsInteger()/AsDouble() return 0 for
".". Now all tokens in the triplet (PL, GL) or both alleles (GT) must
be non-missing for the sample to be considered parsed.
When a sample's genotype could not be parsed (all of PL, GL, GT
missing), the code incremented the missing counter but still fell
through to phred validation, clamping, and genotype assignment — using
whatever phred11/12/22 values were left from the previous sample. This
silently turned missing genotypes into copies of their neighbor.

Fix by moving phred variable declarations inside the per-sample loop so
they are always zero-initialized, and adding a `continue` after
incrementing the missing counter so unparsed samples retain their
default perMarkerGeno value of -1.
Adds a test VCF with 12 rows covering PL-only, GL-only, GT-only,
PL/GL/GT fallback, partial-missing fields (e.g. "50,.,."), phased
genotypes, and priority ordering. Each row encodes EXPECTED_AF and
EXPECTED_N_MISSING in the INFO field. The test calls ReadVcf directly
and validates computed AF and missing counts match expectations.
@tfenne tfenne force-pushed the fix-genotype-parsing-fallback branch from 220dd05 to 8ef3290 Compare April 12, 2026 10:51
@tfenne
Copy link
Copy Markdown
Contributor Author

tfenne commented Apr 12, 2026

Thanks @Griffan - I've rebased this PR branch on top of the merged commits on master, and added three commits, one each to address copilot's various points.

I wanted to keep the test relatively simple and tightly-scoped, so I've added a small test VCF with a wide variety of format fields and sample missingness, and verify that for each variant ReadVcf can read it and compute the correct number of missing samples and the correct allele frequency.

And thank you for merging my other PRs!

Copy link
Copy Markdown
Owner

@Griffan Griffan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you very much!

@Griffan Griffan merged commit f730b7c into Griffan:master Apr 12, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants