Skip to content

Introduce abstract base classes for Arrow result handling#881

Merged
jayantsing-db merged 6 commits into
databricks:mainfrom
jayantsing-db:jayantsing-db/re-http-client
Jul 11, 2025
Merged

Introduce abstract base classes for Arrow result handling#881
jayantsing-db merged 6 commits into
databricks:mainfrom
jayantsing-db:jayantsing-db/re-http-client

Conversation

@jayantsing-db
Copy link
Copy Markdown
Collaborator

@jayantsing-db jayantsing-db commented Jul 6, 2025

Re-created from #850

Description

This PR is the first of two and splits the changes originally proposed in #634.

Key changes:

  • Create AbstractRemoteChunkProvider and AbstractArrowResultChunk as unified base implementations for chunk management
  • Add state machine (ArrowResultChunkStateMachine) to handle chunk lifecycle transitions
  • Improve thread synchronization between main thread and IO/download threads

The new async implementation (2nd PR) will provide an alternative path for handling large Arrow datasets with improved scalability, while maintaining the existing synchronous approach for backwards compatibility.

Below is a class diagram that provides a clearer understanding and concise summary of the class structure:

class-diagram

Testing

  • Fake service tests
  • Unit tests
  • Multi-DBR tests
  • Local testing in a highly concurrent environment

Additional Notes to the Reviewer

Benchmarking document: https://docs.google.com/document/d/1MvKeSnrQVKFGdkuaSPXFCrbb4TDaUqFwHrIW2rxy3zI/edit?usp=sharing

…d in databricks#634.

Key changes:
- Create AbstractRemoteChunkProvider and AbstractArrowResultChunk as unified base implementations for chunk management
- Add state machine (ArrowResultChunkStateMachine) to handle chunk lifecycle transitions
- Improve thread synchronization between main thread and IO/download threads

The new async implementation (2nd PR) will provide an alternative path for handling large Arrow datasets with improved scalability, while maintaining the existing synchronous approach for backwards compatibility.

Below is a class diagram that provides a clearer understanding and concise summary of the class structure:

![class-diagram](https://github.com/user-attachments/assets/e9503d47-5895-439a-9d39-f4963da3e5df)

- Fake service tests
- Unit tests
- Multi-DBR tests
- Local testing in a highly concurrent environment

Benchmarking document: https://docs.google.com/document/d/1MvKeSnrQVKFGdkuaSPXFCrbb4TDaUqFwHrIW2rxy3zI/edit?usp=sharing
stateMachine.transition(targetStatus);
} catch (DatabricksParsingException e) {
LOGGER.warn(
"Failed to transition to state [%s] from state [%s] for chunk [%d] and statement [%s]. Stack trace: %s",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it mean that in case of invalid transition, it will be no-op, and user will not see any error? Won't this cause any subsequent failure or data inconsistency?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes user won't see any error. i can fail it straight away but i am afraid the current code (which this PR doesn't change) might already have invalid transitions (tech debt). So to not affect any existing workloads, I am just logging this. I have already fixed as many invalid transitions as possible using existing tests but there are chances few remain (because of possible coverage gap). let me know your thoughts.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i will fail invalid transitions behind a safe-flag or a connection flag in the next PR.

Copy link
Copy Markdown
Collaborator Author

@jayantsing-db jayantsing-db Jul 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vikrantpuppala has the same concern of us not throwing error and consuming it. is it okay if throw the error in a separate PR and separate release behind a private flag? That way easier to manager/revert?

JdbcLoggerFactory.getLogger(AbstractArrowResultChunk.class);

protected static final Integer SECONDS_BUFFER_FOR_EXPIRY = 60;
protected static final long CHUNK_READY_TIMEOUT_SECONDS = 30;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make this configurable?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure


// Initialize valid state transitions
static {
VALID_TRANSITIONS.put(PENDING, Set.of(URL_FETCHED, CHUNK_RELEASED));
Copy link
Copy Markdown
Collaborator

@gopalldb gopalldb Jul 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't go from pending to failure or cancelled?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in thrift client. When using thrift client, chunks start with status URL_FETCHED. But in SEA, PENDING to DOWNLOAD_FAILED is possible. I will add it. This also indicates a lack of test coverage for SEA client.

Copy link
Copy Markdown
Collaborator

@gopalldb gopalldb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure that this is tested through all tests

@jayantsing-db jayantsing-db merged commit 5798a0d into databricks:main Jul 11, 2025
10 of 12 checks passed
@jayantsing-db jayantsing-db deleted the jayantsing-db/re-http-client branch July 11, 2025 08:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants