Commit 2a76760
Fix race condition between chunk release and download error handling (#1407)
## Description
Fixes #1367 — Race condition between `CHUNK_RELEASED` and
`DOWNLOAD_FAILED` causes invalid state transition warnings during Arrow
Cloud Fetch operations.
### Root Cause
When chunk download threads are running in parallel and their
connections are cut off by `IdleConnectionEvictor` (due to no
socket-level data movement caused by extreme resource pressure in
low-resource environments), the downloads fail. When the consumer
reaches a failed chunk, `rs.next()` throws a `DatabricksSQLException`.
This can trigger `rs.close()`: explicitly from the user's error
handling code, implicitly through try-with-resources, or automatically by
frameworks that close resources on error. The user can also call
`rs.close()` independently, before consuming all data.
In either case, `doClose()` calls `shutdownNow()` on the download thread
pool and releases all chunks to `CHUNK_RELEASED`. But `shutdownNow()`
only sends an interrupt signal — it doesn't wait for threads to stop.
Download threads that are still processing their errors may try to set
`DOWNLOAD_FAILED` after their chunks have already been released. The
transition `CHUNK_RELEASED → DOWNLOAD_FAILED` is invalid in the chunk
lifecycle state machine, so the driver logs an invalid state transition warning.
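The claim that `shutdownNow()` only signals threads can be checked in isolation. The following is a minimal sketch, not driver code: the class name, latches, and the 60-second sleep are all hypothetical stand-ins. The worker cannot leave its catch block until the main thread allows it, so the method can only return if `shutdownNow()` came back while the worker was still in its error path.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical demo (not driver code): shutdownNow() interrupts workers but does not wait. */
final class ShutdownNowDoesNotWait {

  /** Waits on a latch, retrying if interrupted, then restores the interrupt flag. */
  static void await(CountDownLatch latch) {
    boolean interrupted = false;
    while (true) {
      try {
        latch.await();
        break;
      } catch (InterruptedException e) {
        interrupted = true;
      }
    }
    if (interrupted) {
      Thread.currentThread().interrupt();
    }
  }

  /** Returns true if the pool had not terminated after shutdownNow() returned. */
  static boolean poolStillRunningAfterShutdownNow() {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);
    CountDownLatch inErrorPath = new CountDownLatch(1);
    CountDownLatch mayFinish = new CountDownLatch(1);

    pool.execute(() -> {
      started.countDown();
      try {
        Thread.sleep(60_000); // stands in for a retry sleep
      } catch (InterruptedException e) {
        inErrorPath.countDown(); // error path begins after the interrupt
        await(mayFinish);        // simulate slow error handling
      }
    });

    await(started);
    pool.shutdownNow();          // sends the interrupt and returns
    await(inErrorPath);          // worker is now inside its catch block
    boolean stillRunning = !pool.isTerminated();
    mayFinish.countDown();       // let the worker finish
    return stillRunning;
  }

  public static void main(String[] args) {
    System.out.println("still running after shutdownNow(): "
        + poolStillRunningAfterShutdownNow());
  }
}
```

If `shutdownNow()` blocked until workers exited, this sketch would deadlock instead of returning.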
### Race Condition Sequence
```
Thread A (consumer/main)            Thread B (download)
─────────────────────────           ─────────────────────
                                    downloading chunk [7]...
                                    connection killed by evictor
                                    IOException caught
                                    retries exhausted
                                        ↓
rs.next() → gets error              (processing error...)
rs.close()
  → doClose()
    → shutdownNow()                 (interrupt sent, but not yet processed)
    → chunk[7].releaseChunk()
      → CHUNK_RELEASED ✓
                                    → setStatus(DOWNLOAD_FAILED)
                                      CHUNK_RELEASED → DOWNLOAD_FAILED ✗
                                      Invalid transition!
```
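To make the rejected transition concrete, here is a minimal sketch of a status guard. It is not the driver's state machine: only `CHUNK_RELEASED` and `DOWNLOAD_FAILED` come from this description, while `DOWNLOAD_IN_PROGRESS`, the allowed-transition table, and the class itself are assumptions made for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of a chunk status guard; the transition table is assumed. */
final class ChunkLifecycleSketch {

  enum Status {
    DOWNLOAD_IN_PROGRESS, // hypothetical name, not taken from the driver
    DOWNLOAD_FAILED,      // named in the PR description
    CHUNK_RELEASED;       // named in the PR description

    /** Assumed transitions: a released chunk is terminal. */
    Set<Status> allowedNext() {
      switch (this) {
        case DOWNLOAD_IN_PROGRESS:
          return EnumSet.of(DOWNLOAD_FAILED, CHUNK_RELEASED);
        case DOWNLOAD_FAILED:
          return EnumSet.of(CHUNK_RELEASED);
        default:
          return EnumSet.noneOf(Status.class);
      }
    }
  }

  private Status status = Status.DOWNLOAD_IN_PROGRESS;

  /** Applies the transition if allowed; otherwise logs and leaves the status unchanged. */
  synchronized boolean setStatus(Status next) {
    if (!status.allowedNext().contains(next)) {
      System.out.println("Invalid state transition: " + status + " -> " + next);
      return false;
    }
    status = next;
    return true;
  }

  synchronized Status status() {
    return status;
  }

  public static void main(String[] args) {
    ChunkLifecycleSketch chunk = new ChunkLifecycleSketch();
    chunk.setStatus(Status.CHUNK_RELEASED);  // Thread A releases first
    chunk.setStatus(Status.DOWNLOAD_FAILED); // Thread B arrives late
    // prints "Invalid state transition: CHUNK_RELEASED -> DOWNLOAD_FAILED"
  }
}
```

Under these assumed rules the reverse order, `DOWNLOAD_FAILED` followed by `CHUNK_RELEASED`, is accepted, which is why ordering the two steps removes the warning.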
### Fix
Added `awaitTermination(3, TimeUnit.SECONDS)` after `shutdownNow()` in
both `RemoteChunkProvider.doClose()` and
`StreamingChunkProvider.close()`. Download threads now get up to 3
seconds to finish their error handling path (catch → finally →
setStatus) before chunks are released.
The 3-second timeout is a conservative upper bound. After
`shutdownNow()` interrupts threads, they exit their retry sleep
(`Thread.sleep(1500)`) immediately via `InterruptedException` and
process the error path in milliseconds. `awaitTermination()` returns as
soon as all threads finish — it does not wait the full 3 seconds.
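The shape of the fix can be sketched outside the driver. This is an illustration under stated assumptions, not the code in `RemoteChunkProvider` or `StreamingChunkProvider`: the helper name, the event list, and the choice to release chunks even if the wait times out are all hypothetical. Only `shutdownNow()`, `awaitTermination(3, TimeUnit.SECONDS)`, and the 1500 ms retry sleep come from the description above.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the close ordering: interrupt, wait briefly, then release. */
final class CloseWithAwaitSketch {

  /** Returns true if all workers exited within the 3-second bound. */
  static boolean shutdownThenRelease(ExecutorService pool, Runnable releaseChunks) {
    pool.shutdownNow();
    boolean terminated = false;
    try {
      // Returns as soon as workers exit; 3 seconds is only the upper bound.
      terminated = pool.awaitTermination(3, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
    }
    releaseChunks.run(); // assumption: release happens even after a timeout
    return terminated;
  }

  /** Runs one simulated download thread and records the order of events. */
  static List<String> demo() {
    List<String> events = new CopyOnWriteArrayList<>();
    ExecutorService pool = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);

    pool.execute(() -> {
      started.countDown();
      try {
        Thread.sleep(1500); // retry sleep, cut short by the interrupt
      } catch (InterruptedException e) {
        events.add("DOWNLOAD_FAILED"); // stands in for setStatus(DOWNLOAD_FAILED)
      }
    });

    try {
      started.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    shutdownThenRelease(pool, () -> events.add("CHUNK_RELEASED"));
    return events;
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [DOWNLOAD_FAILED, CHUNK_RELEASED]
  }
}
```

Because the worker records its failure before the pool terminates, the release step always runs second in this sketch.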
### Files Changed
| File | Change |
|------|--------|
| `RemoteChunkProvider.java` | Added `awaitTermination(3s)` after `shutdownNow()` in `doClose()`, added INFO log for close lifecycle |
| `StreamingChunkProvider.java` | Added `awaitTermination(3s)` after `shutdownNow()` in `close()` |
| `NEXT_CHANGELOG.md` | Added changelog entry |
## Testing
### Reproduction (Docker + iptables)
Reproduced the error/warning logs from issue #1367 using:
1. Docker container with network throttled to 5 Mbit/s (slow chunk
downloads)
2. `iptables` `REJECT` rule with `--reject-with tcp-reset` to kill
connections mid-download (simulating `IdleConnectionEvictor` behavior)
3. Calling `rs.close()` while download threads process errors
4. Temporary `MAX_RETRIES=0` + uninterruptible sleep before
`setStatus(DOWNLOAD_FAILED)` to widen the race window
**Before fix** — driver logs showed:
```
WARNING: Invalid state transition for chunk [1]: CHUNK_RELEASED -> DOWNLOAD_FAILED
WARNING: Invalid state transition for chunk [2]: CHUNK_RELEASED -> DOWNLOAD_FAILED
```
**After fix** — no invalid state transitions. `awaitTermination()`
waited for threads to finish before releasing chunks.
Verified for both `RemoteChunkProvider` and `StreamingChunkProvider`.
## Additional Notes to the Reviewer
- The `IdleConnectionEvictor` operates on a connection pool that is
shared between the Thrift transport and chunk downloads
(`DatabricksHttpClient` via `DatabricksHttpClientFactory`). In
resource-constrained environments, CPU starvation causes the Java layer
to stop reading from sockets, making active download connections appear
idle to the evictor.
- `shutdownNow()` is best-effort per Java docs — it cannot guarantee
threads will stop. The interrupt flag is set, but threads only respond
when they hit an interruptible operation (`Thread.sleep`, I/O). Code
after the interrupt (catch blocks, finally blocks, `setStatus()` calls)
is pure computation that doesn't check interrupts.
- The same race pattern existed in both `RemoteChunkProvider` (standard
Cloud Fetch) and `StreamingChunkProvider` (opt-in via
`EnableStreamingChunkProvider`). Both are fixed.
- `RemoteChunkProviderV2` (incubator) downloads chunks sequentially
without a thread pool, so it is not affected.
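The note on interrupts being best-effort can be illustrated with a short sketch. The class and method names are hypothetical. It sets the interrupt flag on the current thread, runs pure computation (which is unaffected), and only observes the interrupt at the next interruptible call.

```java
/** Hypothetical demo: an interrupt is only observed at an interruptible operation. */
final class InterruptIsCooperative {

  static String run() {
    StringBuilder log = new StringBuilder();

    Thread.currentThread().interrupt(); // set the interrupt flag on this thread

    long sum = 0;
    for (int i = 0; i < 1_000; i++) {
      sum += i; // pure computation: never checks the flag, runs to completion
    }
    log.append("computed=").append(sum);

    try {
      Thread.sleep(10); // interruptible: throws at once and clears the flag
      log.append(",slept");
    } catch (InterruptedException e) {
      log.append(",interrupted-at-sleep");
    }
    return log.toString();
  }

  public static void main(String[] args) {
    System.out.println(run()); // prints computed=499500,interrupted-at-sleep
  }
}
```

The same applies to catch and finally blocks in a download thread: once the `InterruptedException` has been thrown, the remaining error handling runs without further interruption.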
---------
Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>
Co-authored-by: Samikshya Chand <148681192+samikshya-db@users.noreply.github.com>
1 parent 25169a4
4 files changed
Lines changed: 42 additions & 1 deletion
File tree
- src/main/java/com/databricks/jdbc/api/impl/arrow
Per-file line changes (diff content not captured):
- File 1: 1 addition (new line 13)
- File 2: 5 additions & 0 deletions (new lines 244-248)
- File 3: 23 additions & 1 deletion
- File 4: 13 additions & 0 deletions