Describe the bug
When a DataFusion query is cancelled mid-stream, the Java client's per-call gRPC queue retains parsed ArrowMessage objects that were never consumed. Their backing ArrowBufs stay accounted against the flight native pool indefinitely. After roughly 13 cancellations (40 GiB native limit → 5% = 2 GiB flight pool), the pool is full and every subsequent query — including a trivial count() — fails with:
StreamException[errorCode=CANCELLED, message=Failed to read message.]
Caused by: org.apache.arrow.memory.OutOfMemoryException:
Unable to allocate buffer of size 16 due to memory limit. Current allocation: 2147483648
The error stack pinpoints where buffers are allocated but never released:
Caused by: org.apache.arrow.memory.OutOfMemoryException
at org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:330)
at org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:298)
at org.apache.arrow.flight.ArrowMessage.frame(ArrowMessage.java:322)
at org.apache.arrow.flight.ArrowMessage$ArrowMessageHolderMarshaller.parse(ArrowMessage.java:575)
at io.grpc.MethodDescriptor.parseResponse(MethodDescriptor.java:284)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:661)
Buffers are allocated by the gRPC marshaller as soon as bytes arrive on the wire, before the application calls flightStream.next(). They live in gRPC's internal MessagesAvailable queue, owned by the per-call ClientStreamListenerImpl.
FlightTransportResponse.cancel() / close() (plugins/arrow-flight-rpc/.../FlightTransportResponse.java:156-179) call flightStream.cancel() and flightStream.close(). These tear down the FlightStream's own state, but they do not — and cannot — reach into gRPC's per-call queue to release messages that were parsed but never delivered to the application. When gRPC drops those messages, no release() runs, so the ArrowBufs stay accounted against the parent allocator.
The leak is on the Java allocator side, not on the DataFusion side.
It occurs on the coordinator, where the Flight client polls for new batches: the client allocator never releases in-flight batches that belong to queries cancelled by admission control.
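To make the accounting problem concrete, here is a minimal toy model of it in plain Java. It does not use Arrow or gRPC: `Pool`, `Buf`, and `Call` are hypothetical stand-ins for the allocator, an ArrowBuf, and the per-call queue, and the 128 MiB per cancelled call is a made-up figure, so the model's cancellation count will not match the roughly 13 observed. It only illustrates that dropping a queue of already-allocated buffers without calling release() fills the pool, while releasing them on cancel does not.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of the leak; hypothetical stand-ins, not Arrow or gRPC classes. */
final class FlightPoolLeakModel {
    static final long MIB = 1024L * 1024L;
    // 5% of a 40 GiB native limit = 2 GiB (2147483648 bytes), as in the report.
    static final long FLIGHT_POOL = 40L * 1024 * MIB * 5 / 100;

    /** Stand-in for a bounded allocator: tracks bytes accounted against a limit. */
    static final class Pool {
        final long limit;
        long allocated;
        Pool(long limit) { this.limit = limit; }

        Buf buffer(long size) {
            if (allocated + size > limit) {
                throw new IllegalStateException("Unable to allocate buffer of size " + size
                        + " due to memory limit. Current allocation: " + allocated);
            }
            allocated += size;
            return new Buf(this, size);
        }
    }

    /** Stand-in for a buffer; release() returns its bytes to the pool exactly once. */
    static final class Buf {
        private final Pool pool;
        private final long size;
        private boolean released;
        Buf(Pool pool, long size) { this.pool = pool; this.size = size; }

        void release() {
            if (!released) {
                released = true;
                pool.allocated -= size;
            }
        }
    }

    /** Stand-in for one call: buffers are allocated on arrival, before any consumer asks. */
    static final class Call {
        private final Pool pool;
        private final Queue<Buf> parsedButUndelivered = new ArrayDeque<>();
        Call(Pool pool) { this.pool = pool; }

        void onBytesArrived(long size) { parsedButUndelivered.add(pool.buffer(size)); }

        /** Reported behaviour: the queue is dropped and no release() runs. */
        void cancelDroppingQueue() { parsedButUndelivered.clear(); }

        /** Expected behaviour: every undelivered buffer is released before the queue is dropped. */
        void cancelReleasingQueue() {
            Buf b;
            while ((b = parsedButUndelivered.poll()) != null) {
                b.release();
            }
        }
    }

    /** Runs cancelled calls until an allocation fails; returns how many succeeded, or -1 if all did. */
    static int cancellationsUntilFull(Pool pool, int attempts, long bytesPerCall, boolean release) {
        for (int i = 0; i < attempts; i++) {
            Call call = new Call(pool);
            try {
                call.onBytesArrived(bytesPerCall);
            } catch (IllegalStateException e) {
                return i;
            }
            if (release) {
                call.cancelReleasingQueue();
            } else {
                call.cancelDroppingQueue();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Pool leaky = new Pool(FLIGHT_POOL);
        int n = cancellationsUntilFull(leaky, 100, 128 * MIB, false);
        System.out.println("dropping queue: full after " + n + " cancellations, allocated=" + leaky.allocated);

        Pool clean = new Pool(FLIGHT_POOL);
        int m = cancellationsUntilFull(clean, 100, 128 * MIB, true);
        System.out.println("releasing queue: result=" + m + ", allocated=" + clean.allocated);
    }
}
```

In the dropping variant the pool ends up fully accounted, so even a 16-byte allocation is refused, which mirrors the OutOfMemoryException above. Whether the real fix belongs in the marshaller, the FlightStream, or the plugin is not something this model can answer.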
Related component
Search:Performance
To Reproduce
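Run several heavy queries and cancel each one mid-stream. Once enough cancellations have filled the flight pool (roughly 13 with the limits above), every subsequent query fails.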
Expected behavior
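Cancelling or closing a stream should release every ArrowBuf allocated for that call, including messages that were parsed but never delivered to the application, so the memory accounted against the flight pool is returned.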