Skip to content

Extract DeletionVector logic from PuffinFile#3491

Open
ebyhr wants to merge 3 commits into
apache:mainfrom
ebyhr:ebi/puffin-refactoring
Open

Extract DeletionVector logic from PuffinFile#3491
ebyhr wants to merge 3 commits into
apache:mainfrom
ebyhr:ebi/puffin-refactoring

Conversation

@ebyhr

@ebyhr ebyhr commented Jun 13, 2026

Copy link
Copy Markdown
Member

Rationale for this change

PuffinFile handles two tasks: format parsing (magic bytes, footer, blobs) and deletion vector domain logic (bitmap deserialization and PyArrow conversion).
This will become problematic when we introduce support for the NDV apache-datasketches-theta-v1 blob in the future.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes - PuffinFile class user needs to call DeletionVector.

@ebyhr ebyhr force-pushed the ebi/puffin-refactoring branch 3 times, most recently from cf9a2ce to 374d25c Compare June 14, 2026 11:49

@sungwy sungwy left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this PR @ebyhr and I agree this is a good cleanup.

I added some comments

Comment thread pyiceberg/table/deletion_vector.py Outdated
Comment thread pyiceberg/table/puffin.py
@ebyhr

ebyhr commented Jun 22, 2026

Copy link
Copy Markdown
Member Author

Thanks for your review! Addressed comments.

Comment thread pyiceberg/table/puffin.py Outdated
@sungwy sungwy requested a review from kevinjqliu June 22, 2026 21:47
@sungwy

sungwy commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Thanks for this contribution @ebyhr . I've approved it, and I'll leave it open for a day for any additional reviews before we merge.

@moomindani

Copy link
Copy Markdown

Thanks for splitting this out — the read path looks behavior-preserving and the separation is clean.

For the write side, we'd like to coordinate before building on top of this: once it lands we're planning to rebase #3474 (the DV writer, currently PuffinWriter → to be renamed DeletionVectorWriter) on top of it, moving the writer along with its _serialize_bitmaps helper and the DV-specific constants into deletion_vector.py and reusing MAX_JAVA_SIGNED / PROPERTY_REFERENCED_DATA_FILE from here.

Given that, would it make sense to decide now whether the write/serialize path should live on DeletionVector itself (mirroring from_puffin_file / to_vector with a serialize counterpart) or as a separate DeletionVectorWriter class?

Our preference would be to mirror the read-side split — keep the DV bitmap serialization on DeletionVector (as the counterpart to from_puffin_file / to_vector) and leave the generic Puffin-file assembly as a format-level writer in puffin.py, so the write path scales to the future NDV blob the same way the read path now does. That said, we're happy either way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants