Update xtool to add GC simulation functionality #759
Conversation
Force-pushed 235bdab to 20f5361
Won't these changes make garbage collection harder, @sirahd, since xorbs/shards aren't known to the Xet backend? I understand the benefits (just read the description), but maybe the right approach is to allow this only for a localhost CAS?
```rust
let hub_client = hub_info.build_hub_client(ctx)?;
let file_hash = MerkleHash::from_hex(&args.hash)?;
let ret = query_reconstruction_hub(ctx, file_hash, args.bytes_range, hub_client).await?;
eprintln!("{ret:?}");
```
Hub query outputs debug format to stderr, not stdout
High Severity
In hub_query::run, the Hub mode path writes the result using eprintln!("{ret:?}") — debug format to stderr — while the direct mode delegates to query::run which writes pretty JSON to stdout via println!("{json}"). This means xtool query in Hub mode produces no stdout output at all and writes a Rust debug representation to stderr instead of machine-parseable JSON. Users piping or capturing stdout will get nothing.
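A minimal sketch of one way to fix this, with hypothetical names (`render_result` is illustrative, not the actual xtool function): keep the machine-readable payload on stdout and status text on stderr, so piping and capture both behave.

```rust
// Hypothetical sketch: separate the payload from status so `xtool query`
// pipes cleanly. Returns (stdout_text, stderr_text).
fn render_result(json: &str) -> (String, String) {
    // Payload goes to stdout; a human-readable status line goes to stderr.
    (format!("{json}\n"), "query complete\n".to_string())
}

fn main() {
    let (out, err) = render_result("{\"hash\":\"abc\"}");
    print!("{out}");  // machine-parseable JSON, survives `| jq`
    eprint!("{err}"); // status text, invisible to stdout consumers
}
```

With this split, the Hub mode path would call `println!` on the JSON just like the direct mode path, instead of `eprintln!("{ret:?}")`.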
Reviewed by Cursor Bugbot for commit c03ef6e.
Gotcha, I'm open to that adjustment and to not allowing uploads in this way. Essentially, the problem is that all files uploaded this way would be ephemeral, i.e. they would get garbage-collected quickly, so another option would be simply to warn that this is the case. For now, though, I can make uploads to remote CAS error out.
Updated this PR to disallow uploads to huggingface endpoints; downloads and queries still work as expected.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit b0796d5.
```rust
)
{
    anyhow::bail!("Uploading files allowed only through huggingface hub APIs.");
}
```
Upload guard doesn't block direct-mode remote CAS uploads
High Severity
The upload restriction only checks cli.is_hub_mode(), so file upload in direct mode with a remote endpoint (e.g. xtool --endpoint https://some-cas-server file upload data.bin) is still allowed. Per the PR discussion, the intent was to block uploads to any remote CAS — not just hub-mode uploads — because orphaned xorbs would be garbage collected. The guard needs to also reject uploads when the resolved endpoint is not a local:// path.
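A rough sketch of a guard along the lines the finding suggests (names hypothetical, not the real xtool code): reject uploads whenever the resolved endpoint is not a local CAS, regardless of hub vs direct mode.

```rust
use std::path::Path;

/// Hypothetical guard: allow uploads only when the resolved endpoint is a
/// local CAS (a `local://` URL or an existing directory), so http(s)
/// endpoints are rejected in both hub mode and direct mode.
fn uploads_allowed(endpoint: &str) -> bool {
    endpoint.starts_with("local://") || Path::new(endpoint).is_dir()
}

fn main() {
    assert!(!uploads_allowed("https://some-cas-server"));
    assert!(uploads_allowed("local:///tmp/cas"));
}
```

The exact check would depend on how `EndpointConfig::resolve` represents the endpoint; checking the resolved value rather than `cli.is_hub_mode()` is the point.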
Reviewed by Cursor Bugbot for commit b0796d5.
```rust
            .map(|e| e.path().to_path_buf())
    })
    .collect()
}
```
Recursive scan doesn't filter git files unlike dedup
Low Severity
collect_files in stats.rs does not filter .git, .gitignore, or .gitattributes entries during recursive walks, while the equivalent walk_files in dedup.rs does. Running scan --recursive and dedup --recursive on the same directory will silently process different file sets, producing inconsistent results.
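One way to keep the two walks consistent would be a shared filter predicate that both `collect_files` and `walk_files` apply; a small sketch (the helper name is hypothetical):

```rust
/// Hypothetical shared filter so `scan --recursive` and `dedup --recursive`
/// process identical file sets.
fn is_git_entry(file_name: &str) -> bool {
    matches!(file_name, ".git" | ".gitignore" | ".gitattributes")
}

fn main() {
    let files = ["data.bin", ".gitignore", "src", ".git"];
    let kept: Vec<&str> = files.iter().copied().filter(|f| !is_git_entry(f)).collect();
    assert_eq!(kept, ["data.bin", "src"]);
}
```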
Reviewed by Cursor Bugbot for commit b0796d5.
```rust
/// Download a file by its xet hash.
Download(DownloadArgs),
/// Dry-run dedup and compression analysis (no upload).
Scan(ScanArgs),
```
It seems this is a duplicate of the "dedup" command above.
```rust
/// Dry-run dedup and compression analysis (no upload).
Scan(ScanArgs),
/// Show reconstruction metadata for a file hash (JSON).
DumpReconstruction(DumpReconstructionArgs),
```
And this is a duplicate of the "query" command above.
rajatarya left a comment
Big +1 on bringing xtool into xet_pkg with first-class direct/hub mode resolution and a real test surface (upload/download/scan/query roundtrips backed by a `LocalClient`). Tests exercising both single-file and multi-file paths against `local://` deliver real value as smoke tests for the session/commit machinery. The `EndpointConfig::resolve` split (direct vs Hub-JWT) reads cleanly, and the `operation_for_command` mapping is the right shape.
@seanses already caught the top-level Dedup / Query vs file scan / file dump-reconstruction duplication on main.rs:91/:93 — that's the biggest UX wart and worth resolving before merge. My instinct would be to keep the top-level forms and drop the `file` aliases (since `xtool dedup` and `xtool query` are probably how existing scripting calls in), but either direction is fine as long as one is picked.
A few items to consider beyond what cursor + @seanses flagged:
- Stdout vs stderr policy is inconsistent across commands — `stats` uses `println!`, `upload`/`download` use `eprintln!`, and `hub_query` was flagged by cursor for `eprintln!` debug output. Worth nailing down a one-paragraph rule (e.g. "binary file payload to stdout, structured JSON via `--output`, status/info to stderr") and applying it uniformly. See inline on `upload.rs:43`.
- `commit.abort()` semantics on partial upload failure — when one of N uploads fails, the current path aborts the commit but the user has already seen success messages for the other N-1. Are those uploaded xorbs cleaned up by `abort`, or does the user end up with orphan content in CAS? See inline on `upload.rs:117`.
- `--write-range` without `--source-range` still streams the full file from CAS and discards the bytes outside the window. For a 100 GB file with `--write-range 0..1024`, that's 100 GB of network for 1 KB of writes. See inline on `download.rs:108`.
Once the duplicate-command structure is decided + the high-severity cursor findings (global -c collision on dedup.rs:44, scan requesting upload JWT on main.rs:159, upload guard not blocking direct-mode remote on main.rs:211) are addressed, I think this is in good shape.
```rust
    std::fs::write(output_path, json)?;
} else if !cli.quiet {
    for meta in &results {
        eprintln!(
```
Stdout vs stderr policy worth nailing down at the tool level. Cursor flagged hub_query.rs:33 for routing structured output to stderr; here upload writes its human-readable per-file summary to stderr; stats.rs:36-39 writes to stdout. A reader trying to script around xtool won't have a consistent rule.
A reasonable convention for a dev tool like this:
- `download` to stdout when `-o` is omitted → file payload goes to stdout (already correct)
- `--output FILE` → JSON results to that file, nothing to stdout
- Otherwise → human-readable summary to stdout (so `xtool file upload mydata.bin | jq '...'` works on the JSON-y form), status/progress to stderr

If you go that route, this `eprintln!` would become a `println!`, and `download.rs:136` would stay on stderr (status line). What do you think?
```rust
    }
}

if had_error {
```
What does commit.abort() do to xorbs that successful files (the N-1 that finalized before file N failed) have already pushed up? If they're left in CAS, then a user who uploads 100 files, sees 99 success lines, hits one error, gets a non-zero exit, and has no obvious cleanup story for the partial state. If abort deletes them, then the user might be surprised that their successful uploads were rolled back.
Either answer is fine, but it's worth being explicit in the docs / --help. A --continue-on-error flag (or --all-or-nothing as the explicit opt-in) would also let scripts choose the semantics they want.
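A sketch of what the explicit opt-in could look like — all names here (`OnError`, `upload_all`, the closure standing in for the per-file upload) are hypothetical, not the PR's actual code:

```rust
/// Hypothetical: make partial-failure semantics an explicit choice.
enum OnError {
    Abort,    // stop at the first failed file
    Continue, // keep going, report a summary at the end
}

/// Returns (succeeded, failed) counts; `upload` stands in for the real
/// per-file upload call.
fn upload_all<F: Fn(&str) -> Result<(), String>>(
    files: &[&str],
    mode: OnError,
    upload: F,
) -> (usize, usize) {
    let (mut ok, mut failed) = (0usize, 0usize);
    for f in files {
        match upload(*f) {
            Ok(()) => ok += 1,
            Err(_) => {
                failed += 1;
                if matches!(mode, OnError::Abort) {
                    break; // caller would then run commit.abort()
                }
            }
        }
    }
    (ok, failed)
}

fn main() {
    let flaky = |f: &str| if f == "bad" { Err("io".to_string()) } else { Ok(()) };
    assert_eq!(upload_all(&["a", "bad", "b"], OnError::Abort, flaky), (1, 1));
    assert_eq!(upload_all(&["a", "bad", "b"], OnError::Continue, flaky), (2, 1));
}
```

Whichever default is chosen, documenting what `commit.abort()` does to the already-finalized N-1 files is the important part.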
```rust
    File::create(output_path).with_context(|| format!("failed to create {}", output_path.display()))?
};

if let Some(ref wr) = write_range {
```
When --write-range 0..1024 is set without --source-range, the stream still pulls every byte from the source and we just discard everything past write_limit. For a 100 GB file with a 1 KB write window, that's a lot of wasted bandwidth.
Would it make sense to default source_range = Some(write_range) when --source-range isn't given? Or at minimum, document that callers who want a bounded fetch need to pass both. (The --write-range doc says "Seeks to the start offset before writing" which implies it's purely a sink-side concern, but the bandwidth implication is the part that bites.)
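The defaulting could be a one-liner at range-resolution time; a sketch under the assumption that ranges are half-open `(start, end)` byte pairs (the helper name is hypothetical):

```rust
/// Hypothetical: when only `--write-range` is given, default the source
/// fetch window to the write window so a 1 KB write doesn't stream the
/// whole file from CAS.
fn effective_source_range(
    source_range: Option<(u64, u64)>,
    write_range: Option<(u64, u64)>,
) -> Option<(u64, u64)> {
    source_range.or(write_range)
}

fn main() {
    // `--write-range 0..1024` alone now bounds the fetch to 1024 bytes.
    assert_eq!(effective_source_range(None, Some((0, 1024))), Some((0, 1024)));
    // An explicit `--source-range` still wins.
    assert_eq!(effective_source_range(Some((0, 10)), Some((0, 1024))), Some((0, 10)));
}
```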


This PR adds new functionality to the xtool binary for directly uploading, downloading, and inspecting files against a CAS endpoint or remote repo — useful for development, debugging, and scripting without going through git-xet or huggingface_hub.
The binary adds `xtool file` and exposes four new subcommands:

- `upload` — upload one or more files (or stdin) and emit file metadata (hash, size, sha256).
- `download` — download by xet hash to a file or stdout, with optional source and write byte ranges.
- `scan` — dry-run dedup/compression analysis without uploading data.
- `dump-reconstruction` — fetch and display reconstruction metadata as JSON.
Endpoint resolution currently follows the same conventions as the session API, but using arguments passed in: `--endpoint` overrides `HF_ENDPOINT`, which defaults to `https://huggingface.co`; the endpoint can also be a local directory, in which case a `LocalClient` is used. Token resolution uses `--token`, then `HF_TOKEN`.
All config values can be overridden with -c KEY=VALUE.
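The resolution order described above can be sketched as follows (the function and its signature are illustrative, not the actual xtool code; the env value is passed in explicitly to keep the sketch testable):

```rust
/// Hypothetical sketch of endpoint resolution:
/// `--endpoint` > `HF_ENDPOINT` > the default Hub URL.
fn resolve_endpoint(cli_endpoint: Option<&str>, hf_endpoint_env: Option<&str>) -> String {
    cli_endpoint
        .or(hf_endpoint_env)
        .unwrap_or("https://huggingface.co")
        .to_string()
}

fn main() {
    // --endpoint wins over the environment.
    assert_eq!(
        resolve_endpoint(Some("local:///tmp/cas"), Some("https://hub.example")),
        "local:///tmp/cas"
    );
    // HF_ENDPOINT is used when no flag is given.
    assert_eq!(resolve_endpoint(None, Some("https://hub.example")), "https://hub.example");
    // Fall back to the default.
    assert_eq!(resolve_endpoint(None, None), "https://huggingface.co");
}
```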
Note
Medium Risk
Adds a new `xtool` CLI surface with endpoint/token resolution and CAS upload/download behaviors, which can affect developer workflows and remote interactions if misconfigured. Also changes logging build behavior for wasm/non-wasm targets, which could impact cross-platform compilation.

Overview

Moves the `xtool` binary from `xet_data` into `hf-xet` (`xet_pkg`) and significantly expands it into a structured CLI with direct CAS mode and Hub-resolved mode (JWT + refresh URL support), including global `--config KEY=VALUE` overrides.

Adds a new `xtool file` command group for upload (including stdin and optional JSON output), download (stdout/file with source/write byte ranges), scan (dry-run dedup/compression metrics), and dump-reconstruction (JSON reconstruction query), plus updated `dedup` and `query` behaviors and a comprehensive integration test suite for the new CLI.

Separately, updates `xet_runtime` logging to gate file-logging constants/`tracing-appender` usage to non-wasm targets and to return a clear error when file logging is requested on wasm.

Reviewed by Cursor Bugbot for commit b0796d5. Bugbot is set up for automated code reviews on this repo.