jobs/index: Add SquashIndexViaApi job #13517
Conversation
Store the GitHub API base URL in a `Url` field instead of hardcoding `https://api.github.com` inside `_request`, and route all requests through `base_url.join(...)`. `new()` still defaults to the production URL; a private `with_base_url` constructor is introduced so upcoming mockito-based tests can point the client at a local mock server. No behavior change for production callers.
Add `mockito` as a dev-dependency and introduce a `tests` module that points `RealGitHubClient` at a local mock server via `with_base_url`. A first test covers `get_user`, exercising the bearer-auth path and verifying that requests hit the configured base URL. This paves the way for covering additional endpoints without reaching out to the real GitHub API.
These let us read a repository's current ref state and commit/tree SHAs over the GitHub REST API instead of via a local git clone. They take no authentication since the crates.io index repos are public and the 60/hour unauthenticated limit is more than sufficient for this use case. A new `examples/fetch_ref.rs` binary exercises both methods end-to-end against `api.github.com`.
Adds `create_commit`, `create_ref`, and `update_ref` to `GitHubClient`, all authenticated via an `&AccessToken`. `update_ref` strips a leading `refs/` when building the URL; `create_ref` sends the fully qualified ref in the body. Writes share a new private `_mutate` helper that mirrors the `apply_auth` closure shape of `_request`.
Adds a `github: Arc<dyn GitHubClient>` field to `Environment` and plumbs it through the background worker binary and the test harness. `App.github` changes from `Box` to `Arc` so the same client instance can flow into both `App` and `Environment` in tests without cloning the underlying mock. No consumer yet; `SquashIndexViaApi` will use it in the next commit.
Collapses the index via GitHub's REST Git Database API (`get_ref` -> `get_commit` -> `create_commit` -> `create_ref` -> drift re-check -> `update_ref`) instead of shelling out to `git push`. Avoids the pack generation that has been OOMing the production worker, since the server already has the objects. The existing `SquashIndex` job stays in place as a fallback; both share the `repository` queue so index-writing jobs serialize even though this one does not touch the local bare repo. Two integration tests cover the happy path and the drift-check bail.
Mirrors the existing `SquashIndex` subcommand so operators can manually enqueue the API-driven squash via `crates-admin enqueue-job squash_index_via_api`.
As mentioned in https://rust-lang.zulipchat.com/#narrow/channel/318791-t-crates-io/topic/index.20squashing/near/590951284, I've successfully tested this PR on our staging environment. The staging index is significantly smaller than the production index, but I'm fairly confident that the same logic will also work on production. If not, we can still fall back to the original background job for now.
There's quite a lot here. I think I understand how it all fits together, but I don't think I'd want to have to pass an exam that deep dived into it.
If there's one thing that just worries me a tad here, it's the possibility of GitHub itself drifting in terms of API behaviour. The test coverage here is excellent, but relies on mocks, and this feels like a part of the GitHub API that is lower level than most users ever touch. I don't practically know how we'd test it without significant effort — we'd presumably need a smoke test type thing with a real repo in a known state that we could manipulate, and it's probably not worth it, but I like to write down my vague worries just in case.
Anyway, since this has been tested on staging already, I'm happy for us to give it a go.
```rust
let squash_start = Instant::now();
let input = CreateCommit {
    message: &message,
    tree: &tree_sha,
    parents: &[],
};
let new_commit = github.create_commit(&owner, &repo, &input, &auth).await?;
```
It took me a few read-throughs to understand exactly why this works — the fact tree objects are logically separate from commits is one of those things that you (well, I, at least) don't tend to think about much as an explicit part of the data model that underpins Git.
Perhaps we could have a brief comment somewhere in here explaining the theory for the benefit of future crates.io maintainers? (Or, more likely, me in six months.)
I've added 097eeba which should hopefully make this clearer :)
Co-authored-by: Adam Harvey <adam@adamharvey.name>
Move the explanatory doc comment onto the struct itself and expand it to spell out how blobs, trees and commits relate in Git's object model, so a future reader can see at a glance why reusing the existing tree SHA on a fresh parentless commit avoids any blob or pack work.
The existing `SquashIndex` job has been OOMing the production worker because `git push` regenerates the full pack on every squash. This adds an alternative job that drives the same squash through GitHub's API (`get_ref` -> `get_commit` -> `create_commit` -> `create_ref` -> drift re-check -> `update_ref`), so the server can reuse objects it already has instead of receiving a fresh pack.

The original `SquashIndex` job stays in place as a fallback. Both share the `repository` queue so index-writing jobs continue to serialize, even though the new job does not write to the local bare repo.

Along the way the `crates_io_github` crate gains:

- a `parse_github_slug()` helper
- `RealGitHubClient` plus a mockito-based test harness
- `get_ref()` / `get_commit()` trait methods (unauthenticated, sufficient for the public index repos)
- `create_commit()` / `create_ref()` / `update_ref()` write methods authenticated via `AccessToken`

`Environment` now carries an `Arc<dyn GitHubClient>` so jobs can talk to GitHub directly, and `crates-admin enqueue-job` exposes the new job for manual operator use.

Related: `ArchiveIndexBranch` background job #13472