
Support partitionBy in VortexSparkDataSource #7218

Merged
robert3005 merged 5 commits into develop from rk/partitionby
Apr 1, 2026

Conversation


@robert3005 robert3005 commented Mar 31, 2026

Support partitionBy in spark writer

Signed-off-by: Robert Kruszewski <github@robertk.io>
@robert3005 robert3005 added the changelog/feature A new feature label Mar 31, 2026
@robert3005 robert3005 requested a review from a10y March 31, 2026 15:52

codspeed-hq Bot commented Mar 31, 2026

Merging this PR will not alter performance

✅ 1106 untouched benchmarks
⏩ 1522 skipped benchmarks [1]


Comparing rk/partitionby (4b0c14b) with develop (3ea259e)


Footnotes

  1. 1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived in CodSpeed to remove them from the performance reports.

Comment on lines +159 to +173
private String getPartitionPath(InternalRow row) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < resolvedTransforms.length; i++) {
        if (i > 0) {
            sb.append("/");
        }
        ResolvedTransform rt = resolvedTransforms[i];
        sb.append(URLEncoder.encode(rt.directoryKey, StandardCharsets.UTF_8));
        sb.append("=");

        String value = evaluateTransform(rt, row);
        sb.append(URLEncoder.encode(value, StandardCharsets.UTF_8));
    }
    return sb.toString();
}
Contributor
is this code really not in spark somewhere

Contributor Author
I spent a long time looking; let me look again. In all fairness, none of this logic is datasource-specific, and I am confused why Spark doesn't have shared handling. It could also be that we would have to be a FileSource; while that is initially simpler, it makes some things harder in the long term.

Contributor Author
All of this logic exists only for file datasources, but then you're married to Hadoop. I think we are fine to reimplement it.
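Since the path-building logic above is being reimplemented rather than reused from Spark, a minimal standalone sketch of the same Hive-style `key=value` path construction may help illustrate the behavior. The class and method names here are hypothetical, not from the actual VortexSparkDataSource code; note that `URLEncoder` applies form-style encoding (space becomes `+`), matching the snippet in the diff rather than Hive's percent-encoding convention.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionPathSketch {
    // Builds a Hive-style partition path such as "date=2026-04-01/city=New+York".
    // Keys and values are form-encoded so that '/', '=', and spaces cannot
    // corrupt the directory structure.
    static String partitionPath(Map<String, String> partitionValues) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : partitionValues.entrySet()) {
            if (sb.length() > 0) {
                sb.append('/');
            }
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8));
            sb.append('=');
            sb.append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("date", "2026-04-01");
        values.put("city", "New York");
        System.out.println(partitionPath(values)); // date=2026-04-01/city=New+York
    }
}
```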

@robert3005
Contributor Author

We don't implement filter pushdown yet, so even though we can read and write partitions, we don't prune them. Also, we don't remove the partition columns from the data yet.
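For the read side, recovering partition values from a Hive-style directory path is the inverse of the write-side encoding. A hypothetical sketch (names are illustrative, not from the codebase), assuming the same `URLEncoder` form-style encoding was used when writing:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionPathParser {
    // Parses a path like "date=2026-04-01/city=New+York" back into an ordered
    // key/value map, undoing the form-encoding applied when the path was written.
    static Map<String, String> parse(String partitionPath) {
        Map<String, String> values = new LinkedHashMap<>();
        if (partitionPath.isEmpty()) {
            return values;
        }
        for (String segment : partitionPath.split("/")) {
            int eq = segment.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("not a key=value segment: " + segment);
            }
            String key = URLDecoder.decode(segment.substring(0, eq), StandardCharsets.UTF_8);
            String value = URLDecoder.decode(segment.substring(eq + 1), StandardCharsets.UTF_8);
            values.put(key, value);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parse("date=2026-04-01/city=New+York"));
        // {date=2026-04-01, city=New York}
    }
}
```

Partition pruning would then amount to evaluating pushed-down predicates against these decoded values before opening any files in a directory.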

@robert3005
Contributor Author

This PR needs more work: we need to remove partition values from the data.

Signed-off-by: Robert Kruszewski <github@robertk.io>
@robert3005 robert3005 merged commit 8060ae0 into develop Apr 1, 2026
60 checks passed
@robert3005 robert3005 deleted the rk/partitionby branch April 1, 2026 13:55
lwwmanning pushed a commit that referenced this pull request Apr 1, 2026
Support partitionBy in spark reader/writer

---------

Signed-off-by: Robert Kruszewski <github@robertk.io>
Signed-off-by: Will Manning <will@willmanning.io>
@kesavkolla

It would be nice to see full support for Hive-style partitioning with Vortex, for both read/write and filter pushdown.

@robert3005
Contributor Author

This PR added everything but filter pushdown. We don't have filter pushdown in the Spark data source at all right now.


Labels

changelog/feature A new feature


3 participants