Commit 2557e6f
Fix CloudFetch goroutine leak that retains Arrow buffers after Close (#357)
## Summary
Fixes #356.
Under high CloudFetch concurrency (≥6 simultaneous downloads), in-flight
`cloudFetchDownloadTask` goroutines could leak when the consumer closed
the iterator before draining all results. Each leaked goroutine pinned a
downloaded chunk in the Go heap, producing the multi-GiB heap plateau
described in the issue, which was released only on process restart.
## Root cause
`cloudFetchDownloadTask.Run` sends the download result on an
**unbuffered** channel without honoring context cancellation:
```go
cft.resultChan <- cloudFetchDownloadTaskResult{data: bytes.NewReader(buf), ...}
```
Sequence that triggers the leak:
1. `cloudIPCStreamIterator.Next` schedules `MaxDownloadThreads` (default
10) tasks concurrently.
2. The consumer dequeues task 1, gets its result, returns.
3. Tasks 2..N have completed their HTTP read in parallel and are now
**blocked** on the unbuffered send, holding their downloaded buffer.
4. The consumer abandons the iterator (timeout, error, early close,
etc.) and calls `iterator.Close()`.
5. `Close` calls `task.cancel()` on each remaining task. But cancelling a
context does **not** unblock a plain channel send that never selects on
`ctx.Done()`, so the goroutines stay blocked forever, retaining their buffers.
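The behavior in step 5 can be reproduced outside the driver. The following is a minimal sketch, not the driver's code: `blockedSenders` and its 1 MiB stand-in buffer are hypothetical names chosen for illustration, and the goroutines are leaked on purpose.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// blockedSenders starts n goroutines that each perform a plain send on an
// unbuffered channel, cancels the context, and reports how many of those
// sends are still blocked afterwards. Nothing ever receives from the channel,
// so the goroutines are leaked deliberately for the demonstration.
func blockedSenders(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	results := make(chan []byte) // unbuffered, like resultChan
	var delivered atomic.Int32

	for i := 0; i < n; i++ {
		go func() {
			buf := make([]byte, 1<<20) // stand-in for a downloaded chunk
			results <- buf             // plain send: the context is never consulted
			delivered.Add(1)
		}()
	}

	cancel()
	<-ctx.Done()                      // the context is definitely cancelled here
	time.Sleep(50 * time.Millisecond) // give the senders time to react, if they could
	return n - int(delivered.Load())
}

func main() {
	fmt.Printf("%d of 4 senders still blocked after cancel\n", blockedSenders(4))
}
```

Because no receiver ever arrives, every sender stays parked on the send and keeps its buffer reachable, regardless of the cancelled context.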
In v1.7.1 (the version the reporter is on) the goroutine had already
decoded the bytes into Arrow records *before* the send, so the leaked
memory was Arrow-allocator buffers — matching the stack trace in the
issue:
```
(*cloudFetchDownloadTask).Run.func1
getArrowRecords → (*ipc.Reader).Next → newRecord → loadArray
→ loadBinary → buffer → (*ipcSource).buffer → NewResizableBuffer
→ (*Buffer).Resize → (*GoAllocator).Allocate
```
In the current code (v1.11.0) the decode happens later in
`batchIterator.Next`, so the leak is the raw decompressed `buf` instead
— same shape, smaller per-goroutine retention, same plateau pattern.
## Fix
Route every channel send through a helper that selects on `ctx.Done()`:
```go
func (cft *cloudFetchDownloadTask) sendResult(result cloudFetchDownloadTaskResult) {
	select {
	case cft.resultChan <- result:
	case <-cft.ctx.Done():
	}
}
```
`cloudIPCStreamIterator.Close` already calls `task.cancel()` for every
queued task, so cancellation now unblocks any goroutine stuck on the send,
lets it exit, and allows its buffer to be garbage-collected.
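A standalone sketch of the same pattern shows the effect. The names `result`, `sendOrAbandon`, and `runAndCancel` are hypothetical and only mirror the shape of `sendResult`. They are not part of the driver.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

// result is a stand-in for cloudFetchDownloadTaskResult.
type result struct{ data []byte }

// sendOrAbandon delivers r on ch unless ctx is cancelled first.
// It reports whether the value was actually delivered.
func sendOrAbandon(ctx context.Context, ch chan<- result, r result) bool {
	select {
	case ch <- r:
		return true
	case <-ctx.Done():
		return false
	}
}

// runAndCancel starts n senders, consumes exactly one result, cancels the
// context, and waits for every sender to exit. It returns how many results
// were delivered.
func runAndCancel(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	ch := make(chan result) // unbuffered, like resultChan
	var wg sync.WaitGroup
	var delivered atomic.Int32

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if sendOrAbandon(ctx, ch, result{data: make([]byte, 1024)}) {
				delivered.Add(1)
			}
		}()
	}

	<-ch      // the consumer takes the first result only
	cancel()  // what Close does for each remaining task
	wg.Wait() // returns because every blocked send is abandoned
	return int(delivered.Load())
}

func main() {
	fmt.Println("delivered:", runAndCancel(10))
}
```

With a plain send in place of the `select`, `wg.Wait()` in this sketch would never return. With the `select`, the remaining senders observe `ctx.Done()` and exit.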
## Test plan
- [x] New unit test
`TestCloudFetchIterator_CloseReleasesInFlightDownloads` reproduces the
leak: spawns `MaxDownloadThreads` concurrent downloads, releases them
after the iterator has consumed only the first, then calls `Close()` and
asserts that no `cloudFetchDownloadTask.Run` goroutines remain.
- Fails on `main` (~9 leaked goroutines after `Close`).
- Passes with this change.
- [x] Full `go test ./...` passes locally.
- [x] `go vet` and `gofmt` clean.
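One way to assert that no goroutines of a given function remain is to scan a full stack dump. The sketch below illustrates that idea under stated assumptions; it is not the test added in this change, and `parkedWorker`, `countGoroutinesIn`, and `parkAndRelease` are hypothetical names.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"time"
)

// parkedWorker is a hypothetical stand-in for a download task's Run method.
func parkedWorker(started chan<- struct{}, release <-chan struct{}) {
	started <- struct{}{} // report that the goroutine is running
	<-release             // park until released
}

// countGoroutinesIn returns how many goroutines have name in their stack.
func countGoroutinesIn(name string) int {
	buf := make([]byte, 1<<20)
	buf = buf[:runtime.Stack(buf, true)] // true: dump every goroutine
	count := 0
	// Goroutine stanzas in the dump are separated by blank lines.
	for _, stanza := range strings.Split(string(buf), "\n\n") {
		if strings.Contains(stanza, name) {
			count++
		}
	}
	return count
}

// parkAndRelease parks n workers, counts them, releases them, and then polls
// until none remain or a two-second deadline passes.
func parkAndRelease(n int) (before, after int) {
	started := make(chan struct{})
	release := make(chan struct{})
	for i := 0; i < n; i++ {
		go parkedWorker(started, release)
	}
	for i := 0; i < n; i++ {
		<-started // every worker is now inside parkedWorker
	}
	before = countGoroutinesIn("parkedWorker")

	close(release)
	deadline := time.Now().Add(2 * time.Second)
	for {
		after = countGoroutinesIn("parkedWorker")
		if after == 0 || time.Now().After(deadline) {
			return before, after
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	before, after := parkAndRelease(3)
	fmt.Println("before release:", before, "after release:", after)
}
```

Polling with a deadline matters here: a goroutine that has been unblocked may still appear in the dump for a short time before it finishes returning.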
## Who is affected
Any user with CloudFetch enabled (default since v1.7.0) whose query
context can be cancelled or whose result set can be abandoned mid-stream.
In practice, that covers most users running large CloudFetch queries with
timeouts.
This pull request and its description were written by Isaac.
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
1 parent f4d9992 commit 2557e6f
2 files changed
Lines changed: 111 additions & 3 deletions