Commit 9dd0b39
fix: optimize sandbox cleaning performance (#796)
* fix: optimize sandbox cleaning query performance
Split the single OR-based deletion query into two targeted queries:
1. Assigned but orphaned sandboxes — fast with Location key + NOT EXISTS
(~4.7s on prod vs 140s before)
2. Unassigned sandboxes older than 15 days — benefits from new
idx_assigned_lastaccesstime composite index
Additional changes:
- Add composite index on (Assigned, LastAccessTime) for the unassigned
query
- Fetch distinct SEName values upfront to enable the Location unique key
- Replace non-sargable days_since(LastAccessTime) with a pre-computed
threshold (LastAccessTime < NOW() - 15 days)
- Increase default batch_size from 500 to 5000
- Chunk S3 bulk deletes into groups of 1000 (S3 API limit)
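Two of the changes above can be illustrated with a minimal sketch, which is not the code from this commit. The helper names `access_threshold` and `chunked` are hypothetical. The first computes the cutoff once in Python so the query can compare the bare LastAccessTime column against a constant; the second splits a key list into groups no larger than the 1000-key limit of an S3 bulk delete request.

```python
from collections.abc import Iterator, Sequence
from datetime import datetime, timedelta, timezone

# S3 DeleteObjects accepts at most 1000 keys per request.
S3_MAX_KEYS_PER_DELETE = 1000


def access_threshold(now: datetime, max_age_days: int = 15) -> datetime:
    """Return the cutoff to bind into `LastAccessTime < :threshold`.

    Comparing the raw column to a constant keeps the predicate sargable,
    unlike wrapping the column in a function such as days_since(...).
    """
    return now - timedelta(days=max_age_days)


def chunked(
    keys: Sequence[str], size: int = S3_MAX_KEYS_PER_DELETE
) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `keys`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(keys), size):
        yield keys[start : start + size]
```

With 2500 keys this yields three chunks of 1000, 1000 and 500 keys.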
* fix: split workers across both deletion query types
Half the workers prefer orphaned sandboxes first, half prefer
unassigned first. This ensures both query types make progress
concurrently instead of one starving the other.
* fix: use cursor-based pagination for sandbox deletion queries
Track the last SBId seen per worker and add SBId > cursor to each
query. This avoids re-scanning millions of already-checked rows on
every batch, turning O(total_rows) scans into O(batch_size) seeks.
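The cursor idea can be sketched as follows, assuming a simplified table. The table name `sandboxes`, its two columns, and the function `fetch_batches` are illustrative assumptions, and SQLite stands in for the production database. Each query seeks past the last SBId already seen instead of rescanning from the start.

```python
import sqlite3
from collections.abc import Iterator


def fetch_batches(conn: sqlite3.Connection, batch_size: int) -> Iterator[list[int]]:
    """Yield batches of unassigned SBIds using keyset (cursor) pagination."""
    last_seen = 0  # cursor: highest SBId returned so far
    while True:
        rows = conn.execute(
            "SELECT SBId FROM sandboxes "
            "WHERE Assigned = 0 AND SBId > ? "
            "ORDER BY SBId LIMIT ?",
            (last_seen, batch_size),
        ).fetchall()
        if not rows:
            return
        ids = [row[0] for row in rows]
        yield ids
        last_seen = ids[-1]  # next query starts after this primary key
```

Because SBId is the primary key, the `SBId > ?` predicate lets the database jump straight to the next unseen row.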
* fix: simplify to single worker with parallel S3 deletions
Replace multi-worker approach with a single sequential query loop
and larger batches (50k). S3 delete chunks are now sent concurrently
via TaskGroup. This eliminates lock contention between workers while
keeping S3 throughput high.
* fix: separate row selection from locking to avoid mass locking
Split the SELECT FOR UPDATE into two steps:
1. SELECT without locking to find candidate SBIds (scans many rows)
2. SELECT FOR UPDATE SKIP LOCKED WHERE SBId IN (...) to lock only
the exact matching rows by primary key
Previously FOR UPDATE locked every row examined during the scan
(82M rows for a 50k result), causing 143MB of lock memory and
12+ minute DELETE times. Now only the returned rows are locked.
* fix: remove row locking from sandbox cleanup to avoid lock memory bloat
Split the single long transaction into three phases: SELECT candidates
(short transaction, no locks), S3 delete (no transaction), and DB DELETE
(short transaction). This eliminates the SELECT FOR UPDATE that locked
every row examined during the primary key scan.
* fix: chunk DB deletes into 1000-row batches with concurrent execution
A single DELETE of 50k rows caused MySQL to lock 30M rows during the
index scan. Split into 1000-row chunks running up to 10 concurrently
via asyncio.Semaphore, keeping each transaction short.
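The bounded-concurrency pattern described here might look like the sketch below. `delete_rows_chunked` and its `delete_chunk` callback are hypothetical names; the callback is assumed to run one short DELETE transaction and return the number of rows removed.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence


async def delete_rows_chunked(
    sb_ids: Sequence[int],
    delete_chunk: Callable[[Sequence[int]], Awaitable[int]],
    chunk_size: int = 1000,
    max_concurrency: int = 10,
) -> int:
    """Run `delete_chunk` over fixed-size chunks, at most N at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(chunk: Sequence[int]) -> int:
        async with semaphore:  # blocks while max_concurrency chunks are in flight
            return await delete_chunk(chunk)

    chunks = [sb_ids[i : i + chunk_size] for i in range(0, len(sb_ids), chunk_size)]
    deleted = await asyncio.gather(*(run_one(chunk) for chunk in chunks))
    return sum(deleted)
```

A 50k-row batch becomes 50 chunks, of which no more than 10 hold a transaction open at any moment.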
* refactor: add per-phase timing to sandbox cleanup logging
Log duration of each phase (SELECT, S3 delete, DB delete) per batch
so bottlenecks are immediately visible. Demote per-chunk DB delete
log to DEBUG to reduce noise.
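One way to collect per-phase durations is a small context manager, sketched here under the assumption that timings are gathered in a plain dict. The names `timed_phase` and `log_batch` are illustrative, not taken from the commit.

```python
import logging
import time
from collections.abc import Iterator
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def timed_phase(name: str, timings: dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the phase raises, so failures are still timed.
        timings[name] = time.perf_counter() - start


def log_batch(timings: dict[str, float]) -> None:
    """Emit one INFO line per batch summarising every phase."""
    summary = ", ".join(f"{phase}={secs:.2f}s" for phase, secs in timings.items())
    logger.info("Sandbox cleanup batch: %s", summary)
```

A batch would wrap each phase, for example `with timed_phase("s3_delete", timings): ...`, and call `log_batch(timings)` once at the end.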
* fix: increase S3 connection pool to 50 for parallel bulk deletes
The default max_pool_connections=10 serialized 50 concurrent delete
requests into 5 rounds, adding ~4s latency per round against Ceph.
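The arithmetic behind this change, as a small worked example. The function name is hypothetical, and the model assumes every request in a round takes about the same time.

```python
import math


def delete_rounds(concurrent_requests: int, pool_size: int) -> int:
    """Number of sequential rounds needed when only `pool_size`
    requests can hold a connection at the same time."""
    return math.ceil(concurrent_requests / pool_size)


# 50 requests over 10 connections: 5 rounds, so roughly 5 * 4s = 20s per batch.
rounds_before = delete_rounds(50, 10)
# 50 requests over 50 connections: a single round.
rounds_after = delete_rounds(50, 50)
```

In botocore the pool size is set through `Config(max_pool_connections=...)`.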
* fix: make S3 connection pool size configurable
Add s3_max_pool_connections setting (default 50) to SandboxStoreSettings,
configurable via DIRACX_SANDBOX_STORE_S3_MAX_POOL_CONNECTIONS.
* fix: stop iterating when batch returns fewer than batch_size candidates
Avoids redundant SELECT queries when the previous batch already
exhausted all matching rows.
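The early-exit condition can be sketched like this, with a hypothetical `fetch_batch` callback standing in for the SELECT. A short batch means the query has run out of matching rows, so no further query is issued.

```python
from collections.abc import Callable, Iterator


def iterate_batches(
    fetch_batch: Callable[[int], list[int]], batch_size: int
) -> Iterator[list[int]]:
    """Yield batches until one comes back shorter than `batch_size`."""
    while True:
        batch = fetch_batch(batch_size)
        if batch:
            yield batch
        if len(batch) < batch_size:
            return  # source exhausted: skip the extra empty SELECT
```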
* fix: handle S3 partial delete failures and pass se_name directly
s3_bulk_delete_with_retry now returns failed keys instead of raising,
and _s3_delete_chunk_with_retry retries only the failed keys. The
caller skips DB deletion for sandboxes whose S3 objects were not
removed, preventing dark data.
Also removes the per-batch SELECT DISTINCT SEName query by passing
settings.se_name directly, and updates the stale docstring.
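The retry-only-failed-keys behaviour might be sketched as below. This is a simplified synchronous illustration, not the actual `s3_bulk_delete_with_retry`. The `delete_objects` callback is an assumption: it is expected to return a dict shaped like the S3 DeleteObjects response, where per-key failures appear under `"Errors"` with a `"Key"` field.

```python
from collections.abc import Callable, Sequence


def bulk_delete_with_retry(
    delete_objects: Callable[[Sequence[str]], dict],
    keys: Sequence[str],
    max_attempts: int = 3,
) -> list[str]:
    """Return the keys that still failed after `max_attempts` tries."""
    remaining = list(keys)
    for _ in range(max_attempts):
        if not remaining:
            break
        response = delete_objects(remaining)
        # Retry only what the store reported as failed.
        remaining = [error["Key"] for error in response.get("Errors", [])]
    return remaining


def deletable_from_db(candidates: Sequence[str], failed: Sequence[str]) -> list[str]:
    """Keep only sandboxes whose S3 object was actually removed."""
    failed_set = set(failed)
    return [key for key in candidates if key not in failed_set]
```

Rows for the returned keys stay in the database, so the next cleanup run can retry them instead of leaving orphaned objects behind.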
* fix: remove idx_assigned_lastaccesstime index from schema
This index is not needed for the current cleanup queries.
* fix: make sandbox cleaning parameters configurable via settings
Move hardcoded batch_size, delete_chunk_size, and semaphore concurrency
values into SandboxStoreSettings so they can be tuned per deployment.
* fix: remove redundant default values from settings docstrings
The default values were duplicated in the docstring prose when they are
already visible from the field definitions and the generated docs.
1 parent c94dad5 commit 9dd0b39
6 files changed
Lines changed: 270 additions & 93 deletions
File tree
- diracx-core/src/diracx/core
- diracx-db/src/diracx/db/sql/sandbox_metadata
- diracx-logic
- src/diracx/logic/jobs
- tests/jobs
- docs/admin/reference