Commit 5b57ac9

feat(pool-manager): rework stuck-runner cleanup to eliminate OOMs and cut DB load
### Why

* A production incident (`System.OutOfMemoryException` during `CheckForStuckRunners`) showed that we were loading **all** runners and their full `Lifecycle` collections into memory, then letting EF Core's change tracker churn on tens of thousands of entities.
* Each cleanup cycle recreated that pressure, eventually crashing the autoscaler pod and stopping the host.

### What

* **Query trimmed to SQL only**
  * Replaced `Include(...).AsEnumerable().Where(...)` with a *pure* LINQ-to-Entities filter: `LastState == Created && CreatedTime < now-10min`.
  * Removed unconditional `Include(x => x.Lifecycle)`.
  * Added `AsNoTracking()` and a **projection** to an anonymous type (`Select(r => new { … })`) so no full `Runner` entities are tracked.
* **Context lifetime & tracking**
  * Scoped `ActionsRunnerContext` with `await using`, which guarantees disposal when the method exits.
  * Disabled auto-detection of changes only for this batch (`ChangeTracker.AutoDetectChangesEnabled = false`) to minimise change-tracker work.
* **Batch insert of lifecycle events**
  * Accumulate new `RunnerLifecycle` rows in an in-memory list and call `AddRange` once instead of adding per runner.
  * Single `SaveChangesAsync()` at the end → one DB round-trip.
* **Queue deletion without materialising collections**
  * For every stuck runner, enqueue a `DeleteRunnerTask` directly from the projected key data (no need for the full entity).
* **Logging**
  * Added an explicit warning log per stuck runner **and** kept the original message structure for easy grepping.

### Result

* `CheckForStuckRunners` now pulls only the small "actually stuck" set into memory and never tracks existing `Lifecycle` rows.

**IN CLAUDE WE TRUST**
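To make the "query trimmed to SQL only" point concrete, here is a minimal sketch of the same filter-then-project shape, written against plain `IQueryable<T>` so it runs without a database. `RunnerRow`, `StuckRunner`, `StuckRunnerQuery` and the member types are hypothetical stand-ins for illustration, not the real `Runner` entity or the code in `PoolManager.cs`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins: only the members the query touches, with assumed types.
public enum RunnerStatus { Created, Provisioned, Failure }

public sealed record RunnerRow(
    long RunnerId, long CloudServerId, string Hostname,
    RunnerStatus LastState, DateTime CreatedTime);

// Small projection type: just the data needed to delete and log a runner.
public sealed record StuckRunner(long RunnerId, long CloudServerId, string Hostname);

public static class StuckRunnerQuery
{
    public static readonly TimeSpan StuckAfter = TimeSpan.FromMinutes(10);

    // Where and Select are composed on IQueryable before ToList(), so a database
    // provider gets the chance to translate both into the SQL it sends.
    // With an in-memory source (AsQueryable) the same expression runs locally.
    public static List<StuckRunner> FindStuck(IQueryable<RunnerRow> runners, DateTime nowUtc)
    {
        var cutoff = nowUtc - StuckAfter;
        return runners
            .Where(r => r.LastState == RunnerStatus.Created && r.CreatedTime < cutoff)
            .Select(r => new StuckRunner(r.RunnerId, r.CloudServerId, r.Hostname))
            .ToList();
    }
}
```

Calling `StuckRunnerQuery.FindStuck(rows.AsQueryable(), DateTime.UtcNow)` on an in-memory list returns only rows in `Created` state that are older than the cutoff. One detail this makes visible: with `<`, a runner created exactly at the cutoff is not yet treated as stuck, whereas the old `CreatedTime + 10min > now` skip check would have processed it. The difference is a single instant wide and should not matter in practice.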
1 parent 14dad2f commit 5b57ac9

1 file changed

Lines changed: 33 additions & 17 deletions

PoolManager.cs

@@ -146,34 +146,50 @@ private async Task CheckForStuckRunners(List<GithubTargetConfiguration> targetCo
     {
         // check the database for runners that are in "created" state for more then 5 minutes.

-        var db = new ActionsRunnerContext();
-        foreach(var stuckRunner in db.Runners.Include(x => x.Lifecycle).AsEnumerable().Where(x => x.LastState == RunnerStatus.Created))
+        await using var db = new ActionsRunnerContext();
+        var cutoffTime = DateTime.UtcNow - TimeSpan.FromMinutes(10);
+
+        // Query stuck runners without loading lifecycle collections
+        var stuckRunners = await db.Runners
+            .AsNoTracking()
+            .Where(x => x.LastState == RunnerStatus.Created && x.CreatedTime < cutoffTime)
+            .Select(x => new { x.RunnerId, x.CloudServerId, x.Hostname, x.Cloud })
+            .ToListAsync();
+
+        if (stuckRunners.Count == 0)
+            return;
+
+        // Process stuck runners and create lifecycle entries
+        var lifecycleEntries = new List<RunnerLifecycle>();
+
+        foreach(var stuckRunner in stuckRunners)
         {
-
-            // check if runner is old enough to be stuck
-            if (stuckRunner.CreatedTime + TimeSpan.FromMinutes(10) > DateTime.UtcNow)
-                continue;
+            // Add to deletion queue
+            _queues.DeleteTasks.Enqueue(new DeleteRunnerTask
+            {
+                ServerId = stuckRunner.CloudServerId,
+                RunnerDbId = stuckRunner.RunnerId
+            });

-            // Note stuckness in lifecycle and add runner to deletion queue
-            stuckRunner.Lifecycle.Add(new RunnerLifecycle
+            // Create lifecycle entry for batch insert
+            lifecycleEntries.Add(new RunnerLifecycle
             {
+                RunnerId = stuckRunner.RunnerId,
                 Event = "Stuck in provisioning. Killing.",
                 EventTimeUtc = DateTime.UtcNow,
                 Status = RunnerStatus.Failure
             });
-
-            _queues.DeleteTasks.Enqueue(new DeleteRunnerTask
-            {
-                ServerId = stuckRunner.CloudServerId,
-                RunnerDbId = stuckRunner.RunnerId
-            });

             _logger.LogWarning($"Killing Runner stuck in provisioning: {stuckRunner.Hostname} on {stuckRunner.Cloud}");
-
         }

-        // write to DB
-        await db.SaveChangesAsync();
+        // Batch insert lifecycle entries without change tracking
+        if (lifecycleEntries.Count > 0)
+        {
+            db.ChangeTracker.AutoDetectChangesEnabled = false;
+            db.RunnerLifecycles.AddRange(lifecycleEntries);
+            await db.SaveChangesAsync();
+        }
     }

     private async Task ProcessStats(List<GithubTargetConfiguration> targetConfig)
