Commit a77f908
[Hub] Dedupe file entries by xet hash within a shard (#2134)
## Summary
- Track xet file hashes already written into the current shard's
file-info section and skip duplicates so the shard's file list behaves
like a set rather than a list.
- The dedup set is scoped to the entire `uploadShards` call so it
persists across shard flushes — a duplicate in a later shard is also
skipped.
- File events are still yielded to callers for every input path, so
consumers (e.g. `commit.ts`) keep getting per-path progress and `path ->
xet hash` mapping for duplicates.

Today the only dedup is sha256-based and lives in `createXorbs`. If a file
has no sha256 (or only one of the duplicates does), both entries get
written into the shard's file list. This change handles that case.
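
To make the gap concrete, here is a minimal sketch of why a sha256-keyed filter lets duplicates through while an xet-hash-keyed one does not. It is a simplified model, not the code in `createXorbs` or `uploadShards`: the `Candidate` type, both helper functions, and the sample paths and hashes are hypothetical.

```typescript
// Hypothetical shape: sha256 is optional, the xet hash is always known.
interface Candidate {
  path: string;
  sha256?: string;
  xetHash: string;
}

// Simplified model of sha256-based dedup: a file without a sha256
// cannot be matched against anything, so it is always kept.
function dedupeBySha256(files: Candidate[]): Candidate[] {
  const seen = new Set<string>();
  return files.filter((file) => {
    if (file.sha256 === undefined) return true;
    if (seen.has(file.sha256)) return false;
    seen.add(file.sha256);
    return true;
  });
}

// Dedup keyed on the xet hash: the first occurrence wins.
function dedupeByXetHash(files: Candidate[]): Candidate[] {
  const seen = new Set<string>();
  return files.filter((file) => {
    if (seen.has(file.xetHash)) return false;
    seen.add(file.xetHash);
    return true;
  });
}

// Three paths with identical content; only the first carries a sha256.
const inputs: Candidate[] = [
  { path: "a.bin", sha256: "s1", xetHash: "x1" },
  { path: "b.bin", xetHash: "x1" },
  { path: "c.bin", xetHash: "x1" },
];

console.log(dedupeBySha256(inputs).length); // 3
console.log(dedupeByXetHash(inputs).length); // 1
```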
## Test plan
- [x] `npx vitest run src/utils/uploadShards.spec.ts` — added a case
that feeds three paths through `uploadShards` with the mocked
`createXorbs` yielding the same xet hash, and asserts: each path still
gets a `file` event, but only one file entry ends up in the resulting
shard.
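
The behaviour the test asserts can be approximated without vitest or the real `uploadShards`. The sketch below is a toy model under stated assumptions: `modelUploadShards`, its `maxEntriesPerShard` flush rule, and the event shape are all hypothetical, and the real function may decide when to flush a shard differently. It only illustrates the two properties described above: one `file` event per input path, and a seen-set that outlives individual shard flushes.

```typescript
// Hypothetical event shape, loosely following the `file` events in the summary.
interface FileEvent {
  event: "file";
  path: string;
  xetHash: string;
}

// Toy model: each shard is just a list of xet hashes, flushed once it
// holds `maxEntriesPerShard` entries (an assumed, simplified flush rule).
function modelUploadShards(
  files: { path: string; xetHash: string }[],
  maxEntriesPerShard: number,
): { events: FileEvent[]; shards: string[][] } {
  // Declared outside the flush logic so it persists across shards.
  const seenXetHashes = new Set<string>();
  const events: FileEvent[] = [];
  const shards: string[][] = [];
  let currentShard: string[] = [];

  for (const file of files) {
    if (!seenXetHashes.has(file.xetHash)) {
      seenXetHashes.add(file.xetHash);
      currentShard.push(file.xetHash);
      if (currentShard.length === maxEntriesPerShard) {
        shards.push(currentShard);
        currentShard = [];
      }
    }
    // Every path still gets an event, duplicate or not.
    events.push({ event: "file", path: file.path, xetHash: file.xetHash });
  }

  if (currentShard.length > 0) shards.push(currentShard);
  return { events, shards };
}

// Mirrors the test plan: three paths, one xet hash.
const sameHash = modelUploadShards(
  [
    { path: "a.bin", xetHash: "x1" },
    { path: "b.bin", xetHash: "x1" },
    { path: "c.bin", xetHash: "x1" },
  ],
  10,
);
console.log(sameHash.events.length); // 3
console.log(sameHash.shards); // [ [ 'x1' ] ]
```

With `maxEntriesPerShard` set to 1 and inputs hashing to `x1`, `x2`, `x1`, the model produces two shards and still skips the second `x1`, because the set is not reset on flush.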
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Changes shard serialization behavior by skipping duplicate file
entries based on Xet hash, which could impact upload correctness and
downstream consumers if the dedupe scope/ordering assumptions are wrong.
>
> **Overview**
> Makes `uploadShards` treat the shard file-info section like a set by
tracking seen Xet file hashes and **skipping writing duplicate file
entries** while still yielding `file` events for every input path.
>
> Adds a new unit test that feeds multiple paths producing the same Xet
hash and asserts only one file entry is stored in the uploaded shard,
while all paths still receive `file` events.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
24884a1. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
Co-authored-by: Eliott C. <coyotte508@gmail.com>

1 parent ac45210 commit a77f908
2 files changed
Lines changed: 66 additions & 0 deletions
| Changed file | Added lines (new numbering) | Additions |
|---|---|---|
| First file | 94-104, 181-229 | 60 |
| Second file | 90, 176-180 | 6 |

(Diff bodies were not captured; only the line ranges survive.)