Download pseudos and make artifacts on-the-fly in make_artifacts.jl #16

Closed
azadoks wants to merge 3 commits into JuliaMolSim:master from azadoks:on-the-fly

@azadoks
Collaborator

@azadoks azadoks commented Apr 17, 2026

This is kind of hacky but works!

This change currently breaks the fixed Dojo v0.5 family.
I'll work on fixing that by hosting only the modified pseudos in the repo and writing a builder function in add_pseudodojo.jl.

I see two benefits to doing it this way:

  1. Working with the repo is much nicer (no huge size on clone/checkout, no huge commits for new families)
  2. Won't run into problems with storage requirements on the repo when adding many versions / file formats / variants of new families

And a few drawbacks:

  1. Obviously, the pseudos are no longer stored here, and we're reliant on the pseudo owners/maintainers/distributors to keep their links alive
  2. I'm relying on the add_*.jl convention and the CLI provided by the scripts
  3. I'm calling the add_*.jl scripts via the shell, which requires precompilation on every call

Release sizes are still capped to the Git LFS limit, but that would have been a problem in any case.

@azadoks azadoks requested a review from mfherbst April 17, 2026 10:01
@mfherbst
Member

Sorry, this commit is so huge (due to all the removed files) that I'm unable to see what you actually did. Could you point me to the relevant changes, ideally with direct links to the lines in the files in your fork? GitHub understandably has issues if you remove 12M lines of code in one commit.

Given the above, take what I write with a grain of salt:
When setting up this repo I also thought about simply calling the add scripts during the artifact build. I decided against it to keep the mechanics as simple as possible. My point is that managing such a pseudo repo takes a lot of time, effort and responsibility (and it's not fun science!), and the Julia community is small, so we should really make sure not to put load on future us.

I see your point about storage, but to me there is a clear benefit to having a "locked-in" version in a repo like this. In some of the parsing we do quite a lot (and make decisions) that should be reproducible. If all this happens automagically in a CI run, it gets very hard to figure out what went wrong if all of a sudden you get a different number when seemingly using the same pseudos. Broken magic here can therefore have a huge impact on scientific outcomes, which calls for some care and, in my opinion, a human in the loop.

My main concern is your drawback 1. Given the state of the pseudo ecosystem I think it is very likely, close to 100%, that a repo will just disappear in the future. We definitely need resilience against that.

Is storage such a big issue? Can this not be solved by using multiple Git subrepos that we control?
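
One way to make the reproducibility concern above concrete, whichever storage option is chosen, is to pin a checksum for every pseudopotential archive and refuse to build an artifact when the bytes differ. The following is only a minimal sketch under assumptions: `verify_sha256` and `fetch_pinned` are hypothetical helper names that are not part of this repo or PR, and any URL or digest passed in would be a placeholder. It uses only the `SHA` and `Downloads` standard libraries.

```julia
using SHA        # standard library: sha256
using Downloads  # standard library: Downloads.download

# Hypothetical helper (not from this repo): compare a file's SHA-256
# against a pinned hex digest and fail loudly on mismatch.
function verify_sha256(path::AbstractString, expected::AbstractString)
    actual = bytes2hex(open(sha256, path))
    actual == lowercase(expected) ||
        error("checksum mismatch for $path: expected $expected, got $actual")
    return path
end

# Hypothetical helper (not from this repo): download `url` to `dest`
# and verify it against the pinned digest before anything else uses it.
function fetch_pinned(url::AbstractString, expected::AbstractString;
                      dest::AbstractString = tempname())
    Downloads.download(url, dest)
    return verify_sha256(dest, expected)
end
```

A check like this only detects silent upstream changes. It does not help once a link disappears, so it would complement a controlled mirror or subrepo rather than replace it.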

@azadoks
Collaborator Author

azadoks commented Apr 20, 2026

I guess storage is not the main issue for me per se but rather the pain of dealing with a repo with so many large files.

I definitely agree that we should guard ourselves against repos disappearing (see, e.g., old versions of the full GBRV table).

In this case maybe the best response is, as you say, subrepos.

@azadoks azadoks closed this Apr 20, 2026