
refactor: replace subxet with xet-session API #1

Merged: davanstrien merged 3 commits into feature/hf-bucket-sink from feature/hf-bucket-xet-session on Mar 6, 2026

@davanstrien
Owner

Summary

  • Replace subxet crate (low-level XetClient/XetWriter) with the official xet-session API (XetSession/UploadCommit/SingleFileCleaner)
  • Fix nested tokio runtime panic by wrapping blocking xet-core calls in spawn_blocking
  • Add UploadCommit::commit() call required to persist data to XET storage
  • Serialize env-var-mutating token tests with a mutex to prevent race conditions

Details

Reduces code by ~71 lines and switches from an opaque internal crate to three well-defined xet-core public API crates (xet-session, xet-data, xet-utils).

| File | Lines before | Lines after | Delta |
|---|---|---|---|
| streaming_upload.rs | 251 | 211 | -40 |
| xet_upload.rs | 132 | 91 | -41 |
| mod.rs | 275 | 293 | +18 (test fix) |

Test plan

  • All 31 Rust unit tests pass (cargo test --features hf_bucket_sink -p polars-io)
  • cargo check --features hf_bucket_sink -p polars-stream compiles
  • Zero stale API references (subxet/BucketWriter/XetWriter) in crates/ and py-polars/
  • Python E2E: 4/4 pass (smoke 3-row, 50-row roundtrip, 10K medium, 10M large)
  • Scan→filter→transform→sink ETL pipeline (CoderForge-Preview, local)
  • HF Jobs stress test: FineWeb-Edu 10BT → filter → sink (17.1 min, 980 MB peak RSS)
  • CI wheels: Linux x64 success (ARM64 OOM — known runner limitation)
  • Fresh wheels uploaded to davanstrien/polars-hf-bucket-sink-wheels

🤖 Generated with Claude Code

davanstrien and others added 3 commits March 5, 2026 19:13
Swap the low-level subxet dependency (XetClient/XetWriter) for the
official xet-session API (XetSession/UploadCommit/SingleFileCleaner)
from huggingface/xet-core.

- xet_upload.rs: remove BucketWriter, add create_xet_session()
- streaming_upload.rs: use SingleFileCleaner instead of XetWriter,
  remove AbortOnDropHandle (48 lines)
- mod.rs: update upload_and_register_file() to use xet-session

No changes to batch.rs or hf_bucket_sink.rs (ComputeNode).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

XetSession internally creates its own tokio runtime, which panics when
built from within an existing async runtime. Wrap session creation,
upload_file, and commit() in spawn_blocking to run on tokio's blocking
thread pool.

Also add the required UploadCommit::commit() call after
SingleFileCleaner::finish() — without it, data is not persisted to XET
storage and the batch API returns "File not found".
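
The finish-then-commit requirement can be illustrated with a dependency-free sketch. Everything here (`Store`, `PendingCommit`, `lookup`, `upload`) is hypothetical and only models the two-phase behaviour the commit message describes; it is not the xet-session API.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for remote storage.
#[derive(Default)]
struct Store {
    files: HashMap<String, Vec<u8>>,
}

// Hypothetical two-phase upload: finished files are only staged, and
// become visible in the store once `commit` is called.
struct PendingCommit<'a> {
    store: &'a mut Store,
    staged: Vec<(String, Vec<u8>)>,
}

impl<'a> PendingCommit<'a> {
    fn new(store: &'a mut Store) -> Self {
        PendingCommit { store, staged: Vec::new() }
    }

    // Stage a finished file; nothing is written to the store yet.
    fn finish_file(&mut self, name: &str, data: Vec<u8>) {
        self.staged.push((name.to_string(), data));
    }

    // Persist every staged file and return how many were written.
    fn commit(self) -> usize {
        let PendingCommit { store, staged } = self;
        let count = staged.len();
        for (name, data) in staged {
            store.files.insert(name, data);
        }
        count
    }
}

// Mimics a later registration step that needs the file to exist.
fn lookup(store: &Store, name: &str) -> Result<usize, &'static str> {
    store.files.get(name).map(|data| data.len()).ok_or("File not found")
}

fn upload(store: &mut Store, name: &str, data: &[u8], call_commit: bool) {
    let mut pending = PendingCommit::new(store);
    pending.finish_file(name, data.to_vec());
    if call_commit {
        pending.commit();
    }
    // If `commit` was skipped, `pending` is dropped here and the staged data is lost.
}
```

With `call_commit` set to `false`, a subsequent `lookup` returns `Err("File not found")`, which mirrors the failure mode this commit fixes.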

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ition

The three token extraction tests (token_from_env_var, token_from_cached_file,
token_missing_returns_error) mutate shared env vars (HF_TOKEN, HF_HOME) and
fail intermittently when run in parallel. Add a TOKEN_TEST_LOCK mutex to
serialize them.
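
A minimal sketch of this serialization, using only the standard library. `TOKEN_TEST_LOCK`, `HF_TOKEN`, and the test names come from the commit message; `lock_env`, `set_token`, `with_token`, and `token_from_env` are hypothetical helpers, and the real tests in `mod.rs` may be structured differently.

```rust
use std::env;
use std::sync::{Mutex, MutexGuard};

// Global lock held by every test that mutates process-wide env vars.
static TOKEN_TEST_LOCK: Mutex<()> = Mutex::new(());

fn lock_env() -> MutexGuard<'static, ()> {
    // A panicking test poisons the mutex; recover the guard so the
    // remaining tests can still run.
    TOKEN_TEST_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}

// Hypothetical stand-in for the real token lookup.
fn token_from_env() -> Result<String, String> {
    env::var("HF_TOKEN").map_err(|_| "HF_TOKEN is not set".to_string())
}

fn set_token(value: Option<&str>) {
    // SAFETY: callers hold TOKEN_TEST_LOCK, so cooperating tests do not
    // touch the environment concurrently. The `unsafe` block is required
    // in the 2024 edition and merely redundant in earlier editions.
    unsafe {
        match value {
            Some(v) => env::set_var("HF_TOKEN", v),
            None => env::remove_var("HF_TOKEN"),
        }
    }
}

// Run `body` with HF_TOKEN set (or unset) while holding the lock, then
// restore the previous value.
fn with_token<T, F: FnOnce() -> T>(value: Option<&str>, body: F) -> T {
    let _guard = lock_env();
    let previous = env::var("HF_TOKEN").ok();
    set_token(value);
    let out = body();
    set_token(previous.as_deref());
    out
}

#[test]
fn token_from_env_var() {
    let token = with_token(Some("hf_example"), token_from_env);
    assert_eq!(token, Ok("hf_example".to_string()));
}

#[test]
fn token_missing_returns_error() {
    assert!(with_token(None, token_from_env).is_err());
}
```

The lock only protects tests that opt in through `with_token`; an alternative with the same effect is running the suite with `cargo test -- --test-threads=1`, at the cost of serializing every test.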

Also update XET_SESSION_REFACTOR.md with Session A validation results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
davanstrien merged commit 65df255 into feature/hf-bucket-sink on Mar 6, 2026
18 of 24 checks passed
davanstrien deleted the feature/hf-bucket-xet-session branch on March 6, 2026 15:43
github-actions Bot commented Mar 6, 2026

The uncompressed lib size after this PR is 53.6980 MB.
