
Update xtool to add GC simulation functionality #759

Open
hoytak wants to merge 8 commits into main from hoytak/260317-xet-cli

Conversation

@hoytak (Collaborator) commented Mar 27, 2026

This PR adds new functionality to the xtool binary for directly uploading, downloading, and inspecting files against a CAS endpoint or remote repo — useful for development, debugging, and scripting without going through git-xet or huggingface_hub.

The binary adds an xtool file command group that exposes four new subcommands:

upload — upload one or more files (or stdin) and emit file metadata (hash, size, sha256).
download — download by xet hash to a file or stdout, with optional source and write byte ranges.
scan — dry-run dedup/compression analysis without uploading data.
dump-reconstruction — fetch and display reconstruction metadata as JSON.

Endpoint resolution follows the same conventions as the session API, but driven by arguments passed on the command line. --endpoint overrides HF_ENDPOINT, which defaults to https://huggingface.co; the endpoint can also be a local directory, in which case a LocalClient is used. Token resolution tries --token first, then HF_TOKEN.
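The resolution order described above can be sketched as follows (helper names are hypothetical, not the actual xtool code):

```rust
use std::env;

// Hypothetical sketch of the resolution order: CLI flag wins,
// then the environment variable, then the built-in default.
fn resolve_endpoint(cli_endpoint: Option<&str>) -> String {
    cli_endpoint
        .map(str::to_owned)
        .or_else(|| env::var("HF_ENDPOINT").ok())
        .unwrap_or_else(|| "https://huggingface.co".to_string())
}

// Token resolution: --token first, then HF_TOKEN; None if neither is set.
fn resolve_token(cli_token: Option<&str>) -> Option<String> {
    cli_token.map(str::to_owned).or_else(|| env::var("HF_TOKEN").ok())
}

fn main() {
    // A --endpoint value takes precedence over HF_ENDPOINT and the default.
    println!("{}", resolve_endpoint(Some("/tmp/local-cas")));
    println!("{:?}", resolve_token(Some("hf_example_token")));
}
```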

All config values can be overridden with -c KEY=VALUE.
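A minimal sketch of how a single KEY=VALUE override might be parsed (parse_override is a hypothetical helper; the real option handling in xtool may differ):

```rust
// Split a "-c KEY=VALUE" argument into its key and value parts,
// rejecting arguments with no '=' or an empty key.
fn parse_override(arg: &str) -> Result<(String, String), String> {
    match arg.split_once('=') {
        Some((key, value)) if !key.is_empty() => Ok((key.to_string(), value.to_string())),
        _ => Err(format!("expected KEY=VALUE, got {arg:?}")),
    }
}

fn main() {
    let (k, v) = parse_override("cas.prefix=default").unwrap();
    println!("{k} -> {v}");
    assert!(parse_override("no-equals-sign").is_err());
}
```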

# Upload a single file and see its hash/size/sha256
xtool file upload mydata.bin

# Upload multiple files at once
xtool file upload file1.bin file2.bin file3.bin

# Upload from stdin
cat model.safetensors | xtool file upload -

# Upload and write results to JSON
xtool file upload --output results.json *.bin

# Download a file by hash
xtool file download abc123...def -o restored.bin 

# Download a byte range to stdout
xtool file download abc123...def --source-range 0..4096

# Download a range and write it into a specific offset of an existing file
xtool file download abc123...def -o target.bin --source-range 1024..2048 --write-range 0..1024

# Dry-run scan to check dedup/compression ratio without uploading
xtool file scan --recursive ./dataset/

# Inspect reconstruction metadata for a file hash
xtool file dump-reconstruction abc123...def

# Use a local CAS directory instead of a remote endpoint
xtool --endpoint /tmp/local-cas file upload data.bin

Note

Medium Risk
Adds a new xtool CLI surface with endpoint/token resolution and CAS upload/download behaviors, which can affect developer workflows and remote interactions if misconfigured. Also changes logging build behavior for wasm/non-wasm targets, which could impact cross-platform compilation.

Overview
Moves the xtool binary from xet_data into hf-xet (xet_pkg) and significantly expands it into a structured CLI with direct CAS mode and Hub-resolved mode (JWT + refresh URL support), including global --config KEY=VALUE overrides.

Adds a new xtool file command group for upload (including stdin and optional JSON output), download (stdout/file with source/write byte ranges), scan (dry-run dedup/compression metrics), and dump-reconstruction (JSON reconstruction query), plus updated dedup and query behaviors and a comprehensive integration test suite for the new CLI.

Separately, updates xet_runtime logging to gate file-logging constants/tracing-appender usage to non-wasm targets and to return a clear error when file logging is requested on wasm.

Reviewed by Cursor Bugbot for commit b0796d5. Bugbot is set up for automated code reviews on this repo.

@hoytak hoytak force-pushed the hoytak/260317-xet-cli branch from 235bdab to 20f5361 on April 23, 2026 at 10:38
@hoytak hoytak changed the title from "Draft: xet CLI utility for file upload, download, and inspection" to "Update xtool to add GC simulation functionality" on Apr 23, 2026
@hoytak hoytak marked this pull request as ready for review April 23, 2026 12:28
Comment thread xet_pkg/src/bin/xtool/dedup.rs
Comment thread xet_pkg/src/bin/xtool/endpoint.rs
@rajatarya (Collaborator)

Won't these changes make garbage collection harder, @sirahd, since the xorbs/shards aren't known to the Xet backend?

I understand the benefits (I just read the description); maybe the right approach is to allow this only for a localhost CAS?

let hub_client = hub_info.build_hub_client(ctx)?;
let file_hash = MerkleHash::from_hex(&args.hash)?;
let ret = query_reconstruction_hub(ctx, file_hash, args.bytes_range, hub_client).await?;
eprintln!("{ret:?}");

Hub query outputs debug format to stderr, not stdout

High Severity

In hub_query::run, the Hub mode path writes the result using eprintln!("{ret:?}") — debug format to stderr — while the direct mode delegates to query::run which writes pretty JSON to stdout via println!("{json}"). This means xtool query in Hub mode produces no stdout output at all and writes a Rust debug representation to stderr instead of machine-parseable JSON. Users piping or capturing stdout will get nothing.
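The suggested fix reduces to a simple routing rule, sketched here with a hypothetical emit helper (not the PR's code): machine-parseable payload to stdout, status to stderr.

```rust
// Two kinds of output a CLI command can produce.
enum Output {
    Payload(String), // machine-parseable JSON, for pipes and capture
    Status(String),  // progress / informational messages
}

// Route each kind to the right stream so `xtool query ... | jq` works.
// Returns which stream was used, for illustration.
fn emit(out: Output) -> &'static str {
    match out {
        Output::Payload(json) => {
            println!("{json}"); // structured result goes to stdout
            "stdout"
        }
        Output::Status(msg) => {
            eprintln!("{msg}"); // status stays on stderr
            "stderr"
        }
    }
}

fn main() {
    emit(Output::Payload(r#"{"terms": []}"#.to_string()));
    emit(Output::Status("query finished".to_string()));
}
```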


Reviewed by Cursor Bugbot for commit c03ef6e.

Comment thread xet_pkg/src/bin/xtool/dedup.rs
Comment thread xet_pkg/src/bin/xtool/main.rs
@hoytak (Collaborator, Author) commented Apr 24, 2026

Gotcha, I'm open to that adjustment and to not allowing uploads in this way. Essentially, the problem would be that all files uploaded this way would be ephemeral, i.e. they would get garbage collected quickly, so another option would be to simply warn that this is the case.

For now, though, I can make uploads to a remote CAS error out.

@hoytak (Collaborator, Author) commented Apr 24, 2026

Updated this PR to disallow uploads to huggingface endpoints; downloads and queries still work as expected.

@cursor (Bot) left a comment
Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).



)
{
anyhow::bail!("Uploading files allowed only through huggingface hub APIs.");
}

Upload guard doesn't block direct-mode remote CAS uploads

High Severity

The upload restriction only checks cli.is_hub_mode(), so file upload in direct mode with a remote endpoint (e.g. xtool --endpoint https://some-cas-server file upload data.bin) is still allowed. Per the PR discussion, the intent was to block uploads to any remote CAS — not just hub-mode uploads — because orphaned xorbs would be garbage collected. The guard needs to also reject uploads when the resolved endpoint is not a local:// path.
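The stricter guard this finding calls for might look like the following sketch (uploads_allowed is a hypothetical helper; the actual endpoint representation in xtool differs):

```rust
use std::path::Path;

// Allow uploads only when the resolved endpoint is local: either a
// local:// URL or an existing local directory. Any remote endpoint,
// hub-mode or direct-mode, is rejected.
fn uploads_allowed(endpoint: &str) -> bool {
    endpoint.starts_with("local://") || Path::new(endpoint).is_dir()
}

fn main() {
    // A remote CAS server in direct mode is rejected, closing the gap.
    assert!(!uploads_allowed("https://some-cas-server"));
    // A local:// endpoint is still permitted.
    assert!(uploads_allowed("local:///tmp/cas"));
}
```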


.map(|e| e.path().to_path_buf())
})
.collect()
}

Recursive scan doesn't filter git files unlike dedup

Low Severity

collect_files in stats.rs does not filter .git, .gitignore, or .gitattributes entries during recursive walks, while the equivalent walk_files in dedup.rs does. Running scan --recursive and dedup --recursive on the same directory will silently process different file sets, producing inconsistent results.
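A shared filter along the lines of what walk_files in dedup.rs does (the entry set is taken from this comment; is_git_metadata is a hypothetical name) could look like:

```rust
use std::path::Path;

// True when any path component is a git metadata entry that the
// recursive walk should skip, so scan and dedup see the same file set.
fn is_git_metadata(path: &Path) -> bool {
    path.components().any(|c| {
        matches!(
            c.as_os_str().to_str(),
            Some(".git" | ".gitignore" | ".gitattributes")
        )
    })
}

fn main() {
    assert!(is_git_metadata(Path::new("repo/.git/config")));
    assert!(is_git_metadata(Path::new("repo/.gitattributes")));
    assert!(!is_git_metadata(Path::new("repo/data.bin")));
}
```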


/// Download a file by its xet hash.
Download(DownloadArgs),
/// Dry-run dedup and compression analysis (no upload).
Scan(ScanArgs),
@seanses (Collaborator)
It seems this is a duplicate of the "dedup" command above.

/// Dry-run dedup and compression analysis (no upload).
Scan(ScanArgs),
/// Show reconstruction metadata for a file hash (JSON).
DumpReconstruction(DumpReconstructionArgs),
@seanses (Collaborator)
And this is a duplicate of the "query" command above.

@rajatarya (Collaborator) left a comment

Big +1 on bringing xtool into xet_pkg with first-class direct/hub mode resolution and a real test surface (upload/download/scan/query roundtrips backed by a LocalClient). Tests exercising both single-file and multi-file paths against local:// deliver real value as smoke tests for the session/commit machinery. The EndpointConfig::resolve split (direct vs Hub-JWT) reads cleanly, and the operation_for_command mapping is the right shape.

@seanses already caught the top-level Dedup / Query vs file scan / file dump-reconstruction duplication on main.rs:91/:93 — that's the biggest UX wart and worth resolving before merge. My instinct would be to keep the top-level forms and drop the file aliases (since xtool dedup and xtool query are how the existing scripting probably calls in), but either direction is fine as long as one is picked.

A few items to consider beyond what cursor + @seanses flagged:

  1. Stdout vs stderr policy is inconsistent across commands — stats uses println!, upload/download use eprintln!, hub_query was flagged by cursor for eprintln! debug output. Worth nailing down a one-paragraph rule (e.g. "binary file payload to stdout, structured JSON via --output, status/info to stderr") and applying it uniformly. See inline on upload.rs:43.

  2. commit.abort() semantics on partial upload failure — when one of N uploads fails, the current path aborts the commit but the user has already seen success messages for the other N-1. Are those uploaded xorbs cleaned up by abort, or does the user end up with orphan content in CAS? See inline on upload.rs:117.

  3. --write-range without --source-range still streams the full file from CAS and discards the bytes outside the window. For a 100GB file with --write-range 0..1024, that's 100GB of network for 1KB of writes. See inline on download.rs:108.

Once the duplicate-command structure is decided + the high-severity cursor findings (global -c collision on dedup.rs:44, scan requesting upload JWT on main.rs:159, upload guard not blocking direct-mode remote on main.rs:211) are addressed, I think this is in good shape.

std::fs::write(output_path, json)?;
} else if !cli.quiet {
for meta in &results {
eprintln!(
@rajatarya (Collaborator)

Stdout vs stderr policy worth nailing down at the tool level. Cursor flagged hub_query.rs:33 for routing structured output to stderr; here upload writes its human-readable per-file summary to stderr; stats.rs:36-39 writes to stdout. A reader trying to script around xtool won't have a consistent rule.

A reasonable convention for a dev tool like this:

  • download to stdout when -o is omitted → file payload goes to stdout (already correct)
  • --output FILE → JSON results to that file, nothing to stdout
  • Otherwise → human-readable summary to stdout (so xtool file upload mydata.bin | jq '...' works on the JSON-y form), status/progress to stderr

If you go that route, this eprintln! would become a println!, and download.rs:136 would stay on stderr (status line). What do you think?

}
}

if had_error {
@rajatarya (Collaborator)

What does commit.abort() do to xorbs that successful files (the N-1 that finalized before file N failed) have already pushed up? If they're left in CAS, then a user who uploads 100 files, sees 99 success lines, hits one error, gets a non-zero exit, and has no obvious cleanup story for the partial state. If abort deletes them, then the user might be surprised that their successful uploads were rolled back.

Either answer is fine, but it's worth being explicit in the docs / --help. A --continue-on-error flag (or --all-or-nothing as the explicit opt-in) would also let scripts choose the semantics they want.
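The two policies can be sketched as follows (upload_all and its signature are hypothetical; the real code runs each file through the session/commit machinery):

```rust
// Upload each file, either stopping at the first failure (the caller
// would run commit.abort() at that point) or continuing and reporting
// per-file results. Returns (succeeded, failed) file lists.
fn upload_all(
    files: &[&str],
    continue_on_error: bool,
    upload: impl Fn(&str) -> Result<(), String>,
) -> (Vec<String>, Vec<String>) {
    let (mut ok, mut failed) = (Vec::new(), Vec::new());
    for f in files {
        match upload(f) {
            Ok(()) => ok.push(f.to_string()),
            Err(_) => {
                failed.push(f.to_string());
                if !continue_on_error {
                    break; // stop-at-first-error semantics
                }
            }
        }
    }
    (ok, failed)
}

fn main() {
    let fails_b = |f: &str| -> Result<(), String> {
        if f == "b" { Err("boom".to_string()) } else { Ok(()) }
    };
    // Stop-on-error: "c" is never attempted after "b" fails.
    let (ok, failed) = upload_all(&["a", "b", "c"], false, fails_b);
    assert_eq!((ok.len(), failed.len()), (1, 1));
    // Continue-on-error: "c" still uploads.
    let (ok, failed) = upload_all(&["a", "b", "c"], true, fails_b);
    assert_eq!((ok.len(), failed.len()), (2, 1));
}
```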

File::create(output_path).with_context(|| format!("failed to create {}", output_path.display()))?
};

if let Some(ref wr) = write_range {
@rajatarya (Collaborator)

When --write-range 0..1024 is set without --source-range, the stream still pulls every byte from the source and we just discard everything past write_limit. For a 100 GB file with a 1 KB write window, that's a lot of wasted bandwidth.

Would it make sense to default source_range = Some(write_range) when --source-range isn't given? Or at minimum, document that callers who want a bounded fetch need to pass both. (The --write-range doc says "Seeks to the start offset before writing" which implies it's purely a sink-side concern, but the bandwidth implication is the part that bites.)
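The suggested default reduces to a one-line fallback, sketched here with a hypothetical helper:

```rust
// A byte range over the source or destination file.
type ByteRange = std::ops::Range<u64>;

// When --source-range is absent, fall back to the --write-range window
// so a bounded write implies a bounded fetch from CAS.
fn effective_source_range(
    source: Option<ByteRange>,
    write: Option<ByteRange>,
) -> Option<ByteRange> {
    source.or(write)
}

fn main() {
    // --write-range 0..1024 alone now fetches only that window.
    assert_eq!(effective_source_range(None, Some(0..1024)), Some(0..1024));
    // An explicit --source-range still wins.
    assert_eq!(
        effective_source_range(Some(1024..2048), Some(0..1024)),
        Some(1024..2048)
    );
}
```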
