Skip to content

Add sample_nodes_by_ploidy function#3157

Merged
jeromekelleher merged 1 commit into
tskit-dev:mainfrom
benjeffery:split-samples
May 12, 2025
Merged

Add sample_nodes_by_ploidy function#3157
jeromekelleher merged 1 commit into
tskit-dev:mainfrom
benjeffery:split-samples

Conversation

@benjeffery
Copy link
Copy Markdown
Member

No description provided.

@benjeffery benjeffery force-pushed the split-samples branch 4 times, most recently from 7bedfae to e8fa19a Compare May 7, 2025 12:06
@codecov
Copy link
Copy Markdown

codecov Bot commented May 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.59%. Comparing base (6542fc2) to head (a12aab9).
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3157   +/-   ##
=======================================
  Coverage   89.58%   89.59%           
=======================================
  Files          28       28           
  Lines       31885    31895   +10     
  Branches     5855     5857    +2     
=======================================
+ Hits        28565    28575   +10     
  Misses       1888     1888           
  Partials     1432     1432           
Flag Coverage Δ
c-tests 86.66% <ø> (ø)
lwt-tests 80.38% <ø> (ø)
python-c-tests 88.18% <ø> (ø)
python-tests 98.79% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
python/tskit/trees.py 98.84% <100.00%> (+<0.01%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Can we do a follow-up commit where we use the new functions in the vcf() method? (Just to see if we're implementing the right interface?)

Comment thread python/tskit/trees.py Outdated
def sample_nodes_by_ploidy(self, ploidy):
"""
Returns an 2D array of node IDs, where each row has length `ploidy`.
This is useful when inidividuals are not defined in the tree sequence
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This is useful when inidividuals are not defined in the tree sequence
This is useful when individuals are not defined in the tree sequence

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sprinkle these in to prove it's not AI. Fixed.

@benjeffery
Copy link
Copy Markdown
Member Author

benjeffery commented May 7, 2025

I've used the new functions in the VCF code - I've tried to keep the tests unchanged with a couple of exceptions:

  • A more descriptive error when samples % ploidy != 0
  • Mixed sample and non-sample nodes are now allowed.
  • It is now allowed that some sample nodes are not associated with individuals.

@benjeffery
Copy link
Copy Markdown
Member Author

Huh, test_vcf.py is passing, but I didn't check test_cli.py locally. Digging in.

@benjeffery
Copy link
Copy Markdown
Member Author

benjeffery commented May 7, 2025

Hmm, some of the error semantics in the existing code are a bit odd. I've replicated them for compatibility. For example an individual with no nodes is fine, but not if that individual was specifically asked for. This was only tested in the CLI code.

@jeromekelleher
Copy link
Copy Markdown
Member

I think it's OK to relax the input requirements a bit now. Making things that used to raise an error work as expected now seems like forward progress to me, and it seems unlikely we'll break much code by doing that (people don't tend to rely on errors much). We just need to document the changes.

@benjeffery
Copy link
Copy Markdown
Member Author

Ok, I've simplified things so that the error is only triggered when all the individuals specified have no nodes.

@jeromekelleher
Copy link
Copy Markdown
Member

What's the output for an individual with no nodes then, missing data?

We should document these corner cases now, in the vcf method or it'll be forgotten again

@benjeffery
Copy link
Copy Markdown
Member Author

I've added some changes to the docs and a changelog.

Copy link
Copy Markdown
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we've got the semantics quite right here - it would be surprising if I specified individuals=[0, 1] and I got back a VCF with zero samples in it. In this case, I think we should raise an error.

Comment thread python/tskit/trees.py Outdated
that is a sample and another that is not) in the data model will
result in an error by default. However, such individuals can be
excluded using the ``individuals`` argument.
using the ``individuals`` argument. And individual specified that
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo

Comment thread python/CHANGELOG.rst Outdated

**Changes**

- Allow mixed sample and non-sample nodes to be output by ```write_vcf``,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Allow mixed sample and non-sample nodes to be output by ```write_vcf``,
- Allow mixed sample and non-sample nodes to be output by ```write_vcf``.

Comment thread python/CHANGELOG.rst Outdated
Comment on lines +8 to +9
and for individuals without nodes to be specified for output, although
they will not appear in the VCF.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
and for individuals without nodes to be specified for output, although
they will not appear in the VCF.
Individuals without nodes no longer raise an error and are instead ignored,
except in the case where they are explicitly specified in the individuals argument.

@benjeffery benjeffery force-pushed the split-samples branch 2 times, most recently from 7d64905 to 754a765 Compare May 8, 2025 12:32
@benjeffery
Copy link
Copy Markdown
Member Author

Fixed up.

Comment thread python/CHANGELOG.rst Outdated
Comment thread python/tskit/trees.py Outdated
result in an error by default. However, such individuals can be
excluded using the ``individuals`` argument.
using the ``individuals`` argument. An individual specified that
is not associated with any nodes will be ignored and not included in the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not true any more

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was - I added a test in 266f0cb to check.

@jeromekelleher
Copy link
Copy Markdown
Member

I'm confused. I thought the semantics was:

  • If there's an individual in the individuals table that has no nodes, and ts.vcf(individuals=None), then that individual gets ignored in the output
  • If that individual is references in ts.vcf(individuals=some_list) then an error is raised

That's what the release notes say to me, but the docstring is not.

@benjeffery
Copy link
Copy Markdown
Member Author

The code and docstring are "An individual specified that is not associated with any nodes will be ignored and not included in the output." The changelog before revisions was "allow individuals without nodes to be specified for output, although they will not appear in the VCF" which is the same thing. I didn't read your revision close enough to realise it meant the opposite. Happy to change the code if you think the erroring on nodeless, specified, individual is better.

@jeromekelleher
Copy link
Copy Markdown
Member

Yes please

Comment thread python/tskit/trees.py Outdated
This is useful when individuals are not defined in the tree sequence
so `TreeSequence.individuals_nodes` cannot be used. The samples are
placed in the array in the order which they are found in the node
table. Ploidy must be a multiple of the number of sample nodes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
table. Ploidy must be a multiple of the number of sample nodes.
table. The number of sample nodes must be a multiple of ploidy.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@benjeffery benjeffery force-pushed the split-samples branch 4 times, most recently from 9a8eaec to 7bcf8a6 Compare May 8, 2025 22:34
@benjeffery
Copy link
Copy Markdown
Member Author

Ok, have tried to be really careful this time! It is now an error again to specify nodeless individuals. I've also realised that it needs to be stated in the changelog that it is now allowed to have sample nodes that have no individual.

Copy link
Copy Markdown
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the semantics are still pretty weird - two of the tests look like the same thing to me, with different errors. See suggested new semantics.

Comment thread python/tests/test_vcf.py Outdated
with pytest.raises(ValueError, match="is not associated with any nodes"):
ts.as_vcf(
allow_position_zero=True,
individuals=list(range(ts.num_individuals)) + [ts.num_individuals - 1],
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last bit is redundant isn't it?

Comment thread python/tskit/vcf.py Outdated
"Sample nodes must either all be associated with individuals "
"or not associated with any individuals"
)
# If there are no individuals, or all the individuals are not associated with
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make this simpler - just

if ploidy is not None and individuals is not None:
    raise ValueError("...")
if individuals is None:

    if ts.num_individuals > 0:
         individuals_nodes = ts.individuals_nodes
    else:
         ploidy = 1 if ploidy is None else ploidy
         individuals_nodes = ts.sample_nodes_by_ploidy(ploidy)

else:
      individuals_nodes = ts.individuals_nodes[individuals]

@benjeffery
Copy link
Copy Markdown
Member Author

As discussed, I've removed the VCF code changes here to reduce this to solely the split_samples_by_ploidy function.

@jeromekelleher jeromekelleher added this pull request to the merge queue May 12, 2025
Merged via the queue into tskit-dev:main with commit c914b8b May 12, 2025
19 checks passed
@benjeffery benjeffery deleted the split-samples branch May 12, 2025 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants