Use .arg instead of .trees #3152

Closed
benjeffery wants to merge 1 commit into tskit-dev:main from benjeffery:arg

@benjeffery
Member

From a suggestion by @hyanwong on Slack. This could be advantageous as some folks don't realise tskit is an ARG library.

@codecov

codecov Bot commented May 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.58%. Comparing base (b6d7eab) to head (7b0cf06).
Report is 7 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3152   +/-   ##
=======================================
  Coverage   89.58%   89.58%           
=======================================
  Files          28       28           
  Lines       31841    31841           
  Branches     5849     5849           
=======================================
  Hits        28524    28524           
  Misses       1887     1887           
  Partials     1430     1430           
Flag Coverage Δ
c-tests 86.66% <ø> (ø)
lwt-tests 80.38% <ø> (ø)
python-c-tests 88.14% <ø> (ø)
python-tests 98.80% <ø> (ø)

Files with missing lines Coverage Δ
python/tskit/trees.py 98.87% <ø> (ø)

Member

@jeromekelleher jeromekelleher left a comment

I'm +1 on this, but we should document the change somewhere clearly (probably in several places), and make sure that there's wide support in the community before merging

Comment thread docs/file-formats.md
-is not enforced in any way), and we will sometimes refer to them as ".trees"
+By convention, these files are given the `.arg` suffix (although this
+is not enforced in any way), and we will sometimes refer to them as ".arg"
 files. We also refer to them as "tree sequence files".
Member

Should add a "note" call out here or something noting that we changed the convention from .trees to .arg, also with some reassurances that you can keep any .trees or .ts files if you want, it doesn't make any actual difference.

@jeromekelleher
Member

Were there any references to a ".ts" file?

@jeromekelleher
Member

I guess we'd want to scan downstream repos like msprime and tszip etc to make sure the docs use this new convention (I wouldn't bother changing the actual code in test suites or anything, though)
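
A scan of this kind could be sketched roughly as follows. This is only an illustration using the Python standard library: the helper name `find_trees_mentions`, the set of documentation suffixes, and the regular expression are all assumptions, not part of tskit or any downstream project.

```python
import re
from pathlib import Path

# Assumption: which file types count as "docs" in a downstream repo.
DOC_SUFFIXES = {".md", ".rst", ".txt"}
# Matches ".trees" only when it is not followed by another word character,
# so "out.trees" and '".trees"' match but ".treeseq" does not.
TREES_SUFFIX = re.compile(r"\.trees\b")


def find_trees_mentions(root):
    """Hypothetical helper: list docs lines under `root` that mention '.trees'.

    Returns a list of (path, line_number, stripped_line) tuples.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in DOC_SUFFIXES:
            continue  # skip directories and non-documentation files (e.g. test code)
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if TREES_SUFFIX.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling `find_trees_mentions("path/to/checkout")` on a local clone would list candidate lines to review by hand; source files such as `.py` are deliberately skipped, matching the suggestion not to touch test suites.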

@bhaller

bhaller commented May 2, 2025

Hmm. This should be discussed before being merged, yes? @petrelharp you'll probably be interested. I see several caveats/objections here:

  • The .trees format is a specific file format, for tskit, not a general ARG format. The .arg suffix might be confusing for that reason; it sounds generic (like ".txt" for text), but in fact it is specific (like ".pdf" for text that is specifically in the PDF format).

  • Other software presumably represents ARGs in other ways; calling ours ".arg" would lead to an expectation that our format is some sort of universal interchange format (again, like ".txt"), or that tskit is the only way to represent an ARG, etc.

  • SLiM uses .trees all over (and indeed, ".trees" was first invented and used over on the SLiM side of things!), so if this change is going to be made, it involves SLiM as well. Probably it involves several other software packages in the tskit ecosystem too, right?

My preference is pretty strongly for .trees. I think it's descriptive and specific in a way that .arg isn't; and I don't really think changing the filename suffix that we use is going to really make much difference to how people perceive tskit and tree sequences; and I don't really want to change all the places that I talk about .trees files (in the manual, in the recipes, in the workshop including in voice recordings that I'd have to re-record, etc.). This seems like a lot of work for very little payoff – or perhaps even, I think, negative payoff.

@benjeffery
Member Author

> Hmm. This should be discussed before being merged, yes?

Absolutely! Thought the best way to trigger discussion was a PR to show what would be needed.

@benjeffery
Member Author

> Were there any references to a ".ts" file?

Not that I could see in this repo.

@jeromekelleher
Member

Thanks @bhaller! First:

> and make sure that there's wide support in the community before merging

We're definitely not going to push this through without discussion and broad agreement

A quick response on the fundamental point:

> The .trees format is a specific file format, for tskit, not a general ARG format. The .arg suffix might be confusing for that reason; it sounds generic (like ".txt" for text), but in fact it is specific (like ".pdf" for text that is specifically in the PDF format).

Well, we would argue that yes it is a general ARG format. We wrote a paper making this very point at great length and detail. So, this initial reaction alone (a confusion about this basic point from someone deep in the community) makes me feel like the change is worth making. We want tskit to be seen as "the" ARG library, and this is a useful step in that direction. There is no other "general ARG format".

Please do read the paper - there's a lot in there which I don't really want to rehearse here.

At the end of the day, this is just a change in the suggested convention used for naming files. We don't have to be exclusive about it - we can just say that by convention tskit files are given the extension ".arg", ".trees" or ".ts". If you want to keep the documentation on SLiM using .trees that's totally fine.

@bhaller

bhaller commented May 2, 2025

> > and make sure that there's wide support in the community before merging
>
> We're definitely not going to push this through without discussion and broad agreement

Great.

> A quick response on the fundamental point:
>
> > The .trees format is a specific file format, for tskit, not a general ARG format. The .arg suffix might be confusing for that reason; it sounds generic (like ".txt" for text), but in fact it is specific (like ".pdf" for text that is specifically in the PDF format).
>
> Well, we would argue that yes it is a general ARG format. We wrote a paper making this very point at great length and detail. So, this initial reaction alone (a confusion about this basic point from someone deep in the community) makes me feel like the change is worth making. We want tskit to be seen as "the" ARG library, and this is a useful step in that direction. There is no other "general ARG format".
>
> Please do read the paper - there's a lot in there which I don't really want to rehearse here.

OK. On vacation right now, with very little time before I need to leave my hotel; can't read the paper right now. :-> But it certainly seems like there are lots of other groups proposing lots of other file formats for information that is quite similar. It seems presumptuous to say "WE are _the_ ARG library, forevermore". Maybe we'd like to establish ourselves as being that, sure; that's a goal to aspire to. But simply taking the crown for ourselves seems like it might rub many people the wrong way.

> At the end of the day, this is just a change in the suggested convention used for naming files. We don't have to be exclusive about it - we can just say that by convention tskit files are given the extension ".arg", ".trees" or ".ts". If you want to keep the documentation on SLiM using .trees that's totally fine.

But then that's just confusing. People will constantly ask "so, what's the difference between these different file formats?" Apart from ".JPEG" vs. ".jpeg" vs. ".jpg" (which is already annoying and confusing), I can't think of a case where a single file format on disk is given multiple distinct file extensions that actually all mean the same thing. I think that's a recipe for confusion. And I don't think it really means that I won't need to change workshop materials, re-record lectures, etc., because I'll need to change my materials to try to avoid that confusion.

@bhaller

bhaller commented May 2, 2025

For now, gotta go, will check it again this evening.

@jeromekelleher
Member

Please do read the paper Ben - there's an extensive case built up to address exactly the points you're raising here. There's no hurry in responding, we're not going to merge.

@kitchensjn

One thing to note is that ARGweaver outputs a ".arg" file with a different format, which could be confusing, especially if anyone is converting between tools.

@molpopgen
Member

I don't have strong feelings here as long as a (specific) suffix is not a requirement. I'd probably continue to use .trees due to muscle memory.

The issue that @kitchensjn brings up about what to do w/outputs from multiple tools in the arg-o-sphere is a real problem.

@molpopgen
Member

> Were there any references to a ".ts" file?

I have used .tables (tables.dump output), as well as .ts and .trees (tree_sequence.dump output) at various points in my own workflows.

@petrelharp
Contributor

I wasn't initially compelled by @bhaller's basic point. However, thinking more - I think a related point is that the name .arg is maybe too generic. Like one wouldn't want to use the .img suffix for one's image format (never mind that means something else), since there are many ways of storing images. Plus, just .arg is less identifiable and discoverable. This makes sense to me.

Nonetheless, I still kinda like the proposal, since as a relatively small field we want to actually settle on a single format and not have everyone inventing their own. Plus, the way tskit stores things is closer to "ARG" than it is to "sequence of trees".

Another option would be to come up with a suffix like maybe .tsarg? Then it'd have "ARG" in it, but also have it be more recognizable as being related to tskit?

@benjeffery
Member Author

As we didn't have consensus on this PR, I'll close it. Thanks for all your input.

@benjeffery benjeffery closed this May 12, 2025

6 participants