Skip to content

jobs/index: Add SquashIndexViaApi job#13517

Merged
Turbo87 merged 11 commits into
rust-lang:mainfrom
Turbo87:squash-via-api
Apr 28, 2026
Merged

jobs/index: Add SquashIndexViaApi job#13517
Turbo87 merged 11 commits into
rust-lang:mainfrom
Turbo87:squash-via-api

Conversation

@Turbo87
Copy link
Copy Markdown
Member

@Turbo87 Turbo87 commented Apr 27, 2026

The existing SquashIndex job has been OOMing the production worker because git push regenerates the full pack on every squash. This adds an alternative job that drives the same squash through GitHub's API (get_ref -> get_commit -> create_commit -> create_ref -> drift re-check -> update_ref), so the server can reuse objects it already has instead of receiving a fresh pack.

The original SquashIndex job stays in place as a fallback. Both share the repository queue so index-writing jobs continue to serialize, even though the new job does not write to the local bare repo.

Along the way the crates_io_github crate gains:

  • parse_github_slug() helper
  • A configurable base URL on RealGitHubClient plus a mockito-based test harness
  • get_ref() / get_commit() trait methods (unauthenticated, sufficient for the public index repos)
  • create_commit() / create_ref() / update_ref() write methods authenticated via AccessToken

Environment now carries an Arc<dyn GitHubClient> so jobs can talk to GitHub directly, and crates-admin enqueue-job exposes the new job for manual operator use.

Related

Turbo87 added 9 commits April 27, 2026 20:19
Store the GitHub API base URL in a `Url` field instead of hardcoding
`https://api.github.com` inside `_request`, and route all requests
through `base_url.join(...)`. `new()` still defaults to the production
URL; a private `with_base_url` constructor is introduced so upcoming
mockito-based tests can point the client at a local mock server.

No behavior change for production callers.
Add `mockito` as a dev-dependency and introduce a `tests` module that
points `RealGitHubClient` at a local mock server via `with_base_url`.
A first test covers `get_user`, exercising the bearer-auth path and
verifying that requests hit the configured base URL.

This paves the way for covering additional endpoints without reaching
out to the real GitHub API.
These let us read a repository's current ref state and commit/tree SHAs
over the GitHub REST API instead of via a local git clone. They take no
authentication since the crates.io index repos are public and the
60/hour unauthenticated limit is more than sufficient for this use
case.

A new `examples/fetch_ref.rs` binary exercises both methods end-to-end
against `api.github.com`.
Adds `create_commit`, `create_ref`, and `update_ref` to `GitHubClient`,
all authenticated via an `&AccessToken`. `update_ref` strips a leading
`refs/` when building the URL; `create_ref` sends the fully qualified
ref in the body. Writes share a new private `_mutate` helper that
mirrors the `apply_auth` closure shape of `_request`.
Adds a `github: Arc<dyn GitHubClient>` field to `Environment` and
plumbs it through the background worker binary and the test harness.
`App.github` changes from `Box` to `Arc` so the same client instance
can flow into both `App` and `Environment` in tests without cloning
the underlying mock.

No consumer yet; `SquashIndexViaApi` will use it in the next commit.
Collapses the index via GitHub's REST Git Database API (`get_ref` ->
`get_commit` -> `create_commit` -> `create_ref` -> drift re-check ->
`update_ref`) instead of shelling out to `git push`. Avoids the pack
generation that has been OOMing the production worker, since the
server already has the objects.

The existing `SquashIndex` job stays in place as a fallback; both
share the `repository` queue so index-writing jobs serialize even
though this one does not touch the local bare repo.

Two integration tests cover the happy path and the drift-check bail.
Mirrors the existing `SquashIndex` subcommand so operators can
manually enqueue the API-driven squash via
`crates-admin enqueue-job squash_index_via_api`.
@Turbo87 Turbo87 added the C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear label Apr 27, 2026
@Turbo87
Copy link
Copy Markdown
Member Author

Turbo87 commented Apr 27, 2026

as mentioned in https://rust-lang.zulipchat.com/#narrow/channel/318791-t-crates-io/topic/index.20squashing/near/590951284, I've successfully tested this PR on our staging environment. the staging index is significantly smaller than the production index, but I'm fairly confident that the same logic should also work on production. if not, we can still fall back to the original background job for now.

@Turbo87 Turbo87 requested a review from a team April 27, 2026 18:31
Copy link
Copy Markdown
Contributor

@LawnGnome LawnGnome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's quite a lot here. I think I understand how it all fits together, but I don't think I'd want to have to pass an exam that deep dived into it.

If there's one thing that just worries me a tad here, it's the possibility of GitHub itself drifting in terms of API behaviour. The test coverage here is excellent, but relies on mocks, and this feels like a part of the GitHub API that is lower level than most users ever touch. I don't practically know how we'd test it without significant effort — we'd presumably need a smoke test type thing with a real repo in a known state that we could manipulate, and it's probably not worth it, but I like to write down my vague worries just in case.

Anyway, since this has been tested on staging already, I'm happy for us to give it a go.

View changes since this review

Comment thread src/worker/jobs/index/squash.rs Outdated
Comment thread src/worker/jobs/index/squash.rs Outdated
Comment on lines +137 to +143
let squash_start = Instant::now();
let input = CreateCommit {
message: &message,
tree: &tree_sha,
parents: &[],
};
let new_commit = github.create_commit(&owner, &repo, &input, &auth).await?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a few read-throughs to understand exactly why this works — the fact tree objects are logically separate from commits is one of those things that you (well, I, at least) don't tend to think about much as an explicit part of the data model that underpins Git.

Perhaps we could have a brief comment somewhere in here explaining the theory for the benefit of future crates.io maintainers? (Or, more likely, me in six months.)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added 097eeba which should hopefully make this clearer :)

Turbo87 and others added 2 commits April 28, 2026 08:21
Co-authored-by: Adam Harvey <adam@adamharvey.name>
Move the explanatory doc comment onto the struct itself and expand it
to spell out how blobs, trees and commits relate in Git's object model,
so a future reader can see at a glance why reusing the existing tree
SHA on a fresh parentless commit avoids any blob or pack work.
@Turbo87 Turbo87 enabled auto-merge April 28, 2026 06:28
@Turbo87 Turbo87 merged commit 542b8e5 into rust-lang:main Apr 28, 2026
13 checks passed
@Turbo87 Turbo87 deleted the squash-via-api branch April 28, 2026 06:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-backend ⚙️ C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants