Skip to content

perf: reduce GC pressure in protobuf serialization#3242

Merged
andygrove merged 2 commits into
apache:mainfrom
andygrove:reduce-gc-pressure-protobuf-serialization
Jan 22, 2026
Merged

perf: reduce GC pressure in protobuf serialization#3242
andygrove merged 2 commits into
apache:mainfrom
andygrove:reduce-gc-pressure-protobuf-serialization

Conversation

@andygrove
Copy link
Copy Markdown
Member

Replace ByteArrayOutputStream with direct CodedOutputStream serialization to eliminate unnecessary allocations during query plan serialization.

This optimization:

  • Pre-allocates exact buffer size using getSerializedSize()
  • Eliminates ByteArrayOutputStream's internal buffer resizing
  • Removes defensive array copying from toByteArray()
  • Applies to 5 hot paths called per-partition during query execution

For a query with 1000 partitions, this eliminates 5000+ unnecessary allocations and array copies, significantly reducing GC pressure.

Changes:

  • operators.scala: getCometIterator() and convertBlock()
  • CometNativeWriteExec.scala: serializedPlanOpt() and doExecute()
  • ParquetFilters.scala: createNativeFilters()

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Replace ByteArrayOutputStream with direct CodedOutputStream serialization
to eliminate unnecessary allocations during query plan serialization.

This optimization:
- Pre-allocates exact buffer size using getSerializedSize()
- Eliminates ByteArrayOutputStream's internal buffer resizing
- Removes defensive array copying from toByteArray()
- Applies to 5 hot paths called per-partition during query execution

For a query with 1000 partitions, this eliminates 5000+ unnecessary
allocations and array copies, significantly reducing GC pressure.

Changes:
- operators.scala: getCometIterator() and convertBlock()
- CometNativeWriteExec.scala: serializedPlanOpt() and doExecute()
- ParquetFilters.scala: createNativeFilters()
@andygrove andygrove marked this pull request as ready for review January 22, 2026 19:29
outputStream.close()
outputStream.toByteArray
val size = expr.getSerializedSize
val bytes = new Array[Byte](size)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size of Int type, so I think we're safe here.

nativeOp.writeTo(out)
out.close()
SerializedPlan(Some(out.toByteArray))
val size = nativeOp.getSerializedSize
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) We could extract some of the shared code to a utility method but not urgent.

Copy link
Copy Markdown
Contributor

@hsiang-c hsiang-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jan 22, 2026

Codecov Report

❌ Patch coverage is 78.57143% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.05%. Comparing base (f09f8af) to head (0a6f97d).
⚠️ Report is 865 commits behind head on main.

Files with missing lines Patch % Lines
.../apache/spark/sql/comet/CometNativeWriteExec.scala 45.45% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3242      +/-   ##
============================================
+ Coverage     56.12%   60.05%   +3.92%     
- Complexity      976     1429     +453     
============================================
  Files           119      170      +51     
  Lines         11743    15782    +4039     
  Branches       2251     2606     +355     
============================================
+ Hits           6591     9478    +2887     
- Misses         4012     4983     +971     
- Partials       1140     1321     +181     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks @andygrove!

@andygrove
Copy link
Copy Markdown
Member Author

Thanks for the reviews @hsiang-c and @mbutrovich. I will merge this when CI is green.

@andygrove andygrove merged commit 48776fe into apache:main Jan 22, 2026
118 checks passed
@andygrove andygrove deleted the reduce-gc-pressure-protobuf-serialization branch January 22, 2026 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants