feat(pool-manager): rework stuck-runner cleanup to eliminate OOMs #5
Merged
Conversation
… cut DB load
### Why
* A production incident (`System.OutOfMemoryException` during `CheckForStuckRunners`) showed that we were
loading **all** runners and their full `Lifecycle` collections into memory, then letting EF Core’s
change-tracker churn on tens of thousands of entities.
* Each cleanup cycle recreated that pressure, eventually crashing the autoscaler pod and stopping the host.
### What
* **Query trimmed to SQL only**
* Replaced `Include(...).AsEnumerable().Where(...)` with a *pure* LINQ-to-Entities
filter:
`LastState == Created && CreatedTime < now-10min`.
* Removed unconditional `Include(x => x.Lifecycle)`.
* Added `AsNoTracking()` and a **projection** to an anonymous type
(`Select(r ⇒ new { … })`) so no full `Runner` entities are tracked.
* **Context lifetime & tracking**
* Scoped `ActionsRunnerContext` with `await using` – guarantees disposal after
the method exits.
* Disabled auto-detection of changes only for this batch
(`ChangeTracker.AutoDetectChangesEnabled = false`) to minimise
change-tracker work.
* **Batch insert of lifecycle events**
* Accumulate new `RunnerLifecycle` rows in an in-memory list and call
`AddRange` once instead of adding per runner.
* Single `SaveChangesAsync()` at the end → one DB round-trip.
* **Queue deletion without materialising collections**
* For every stuck runner enqueue a `DeleteRunnerTask` directly using the
projected key data (no need for full entity).
* **Logging**
* Added explicit warning log per stuck runner **and** kept the original message
structure for easy grepping.
### Result
* `CheckForStuckRunners` now pulls only the small “actually stuck” set into
memory and never tracks existing `Lifecycle` rows.
**IN CLAUDE WE TRUST**
Contributor
Author
Noticed this and saw the pod and restarted many many times. I can't do .NET but thought I'd let Claude have a crack and let you know :) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
System.OutOfMemoryExceptionduringCheckForStuckRunners) showed that we were loading all runners and their fullLifecyclecollections into memory, then letting EF Core’s change-tracker churn on tens of thousands of entities.What
Include(...).AsEnumerable().Where(...)with a pure LINQ-to-Entities filter:LastState == Created && CreatedTime < now-10min.Include(x => x.Lifecycle).AsNoTracking()and a projection to an anonymous type (Select(r ⇒ new { … })) so no fullRunnerentities are tracked.ActionsRunnerContextwithawait using– guarantees disposal after the method exits.ChangeTracker.AutoDetectChangesEnabled = false) to minimise change-tracker work.RunnerLifecyclerows in an in-memory list and callAddRangeonce instead of adding per runner.SaveChangesAsync()at the end → one DB round-trip.DeleteRunnerTaskdirectly using the projected key data (no need for full entity).Result
CheckForStuckRunnersnow pulls only the small “actually stuck” set into memory and never tracks existingLifecyclerows.IN CLAUDE WE TRUST