database: treat database as read-only after creation and use shared locks#315
database: treat database as read-only after creation and use shared locks#315fischeti wants to merge 6 commits into
Conversation
a6ac0ee to
9a16348
Compare
|
Actually, I am not sure anymore if the locks are needed. I have the impression that git should be able to handle concurrency itself of a shared git database i.e. objects are immutable and do not need to be locked, while other things like refs seem to have fine-grained lock files: https://stackoverflow.com/questions/19962024/locking-strategy-of-git-to-achieve-concurrency Edit: I guess the |
I think we should preserve locks, and I like your solution to speed things up (still need to review the details). The main issue I'm attempting to solve is with multiple repos fetching the same bare ref. Concurrent fetches racing on the same ref would produce |
|
Yes, you are right🤓 I just realized that git locks are not blocking but just error out |
Tim's commit e3e9241 disables auto-gc only when bender creates a brand new bare database. Databases populated by bender 0.31.x or earlier still run with git's default gc.auto=6700, so an auto-gc could fire from a fetch and prune objects referenced by a concurrent `git clone --shared` checkout via alternates. Apply `git config gc.auto 0` in the Fetch branch too: idempotent, runs under the same exclusive lock, and ensures every database the new bender ever fetches into is safe -- not only ones it initialized itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shared db lock in `checkout_git` is keyed by the dependency URL and shared with other projects, so it intentionally does not serialize two bender invocations that resolve to the same working-tree path (e.g. two `bender checkout` runs against the same project root). Shared lockers don't block each other, so both invocations could interleave `remove_dir_all`, `git clone --shared`, and `git checkout` on the same directory and corrupt it. Add a per-checkout-path exclusive lock as the first I/O in `checkout_git`. The lock file lives as `<path>.lock` -- a sibling of the checkout dir, so it stays with the project and is independent of where the bare database is stored. Unrelated deps and the same dep across different projects still resolve to distinct paths and run in parallel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tim's e3e9241 dropped the bender-tmp- filter from the RevisionNotOnUpstream check on the assumption that the synthetic tags were no longer created. They still are -- `dependency_versions` creates them during update operations to make commits reachable for `git rev-list --all`, and databases populated by bender 0.31.x also contain leftover tmp tags from the old checkout flow. Without the filter the warning was silently suppressed whenever such a tag pointed at the revision being checked out, even when no real upstream ref actually tracked the commit. Restore the explicit filter. Marked as a TODO -- the longer-term fix is to either stop creating these tags or sweep them after use; until then the filter keeps the warning correct on every database shape bender can encounter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
micprog
left a comment
There was a problem hiding this comment.
I added some additional commits on top to add additional protections and improve reporting. Offline we discussed cleaning up some of the sloppy comments, other than that LGTM 👍
In #307 we introduced database locks and separaring database from checkout in order to share the database as a cache cross projects. This is also relevant in the context of CI, since the git databases are only fetched once and can then be reused. However, the exclusive locks on the databases essentially prevent any kind of parallelization and every invocation of bender becomes serialized across CI jobs.
This PR aims to relax the locks to use exclusive locks only when writing to the database, and use shared locks when reading from it.
The flow of fetching and checking out a git repository was slightly adapted to reduce the amount of write operations on a shared database:
git initandgit fetch [<rev>]are write operations and still acquire an exclusive locktmp-<hash>in the database and the later checkout withgit clone --branch tmp-<hash>has been replaced withgit clone --shared --no-checkout && git checkout <rev>git gcsince revisions might not be tracked anymore by refs i.e. the previoustmp-<hash>tags. This is maybe overly catious.Furthermore, file locking is done more fine grained. Instead of locking the whole
git_databasecall exclusively, only the part that requires writes (git initandgit fetch) acquire an exclusive lock (if database is not ready yet).