Commit aa4839b

docs
1 parent e169cba commit aa4839b

4 files changed

Lines changed: 189 additions & 22 deletions

apps/zipsync/README.md

Lines changed: 39 additions & 13 deletions
Original file line number | Diff line number | Diff line change
@@ -1,22 +1,48 @@
11
# @rushstack/zipsync
22

3-
zipsync is a tool to pack and unpack zip archives. It is designed as a single-purpose tool to pack and unpack build cache entries.
3+
zipsync is a focused tool for packing and unpacking build cache entries using a constrained subset of the ZIP format for high performance. It optimizes the common scenario where most files already exist in the target location and are unchanged.
44

5-
## Implementation
5+
## Goals & Rationale
66

7-
### Unpack
7+
- **Optimize partial unpack**: Most builds reuse the majority of previously produced outputs. Skipping rewrites preserves filesystem and page cache state.
8+
- **Only write when needed**: Files whose size and SHA-1 already match the archive are skipped, so fewer write syscalls are issued.
9+
- **Integrated cleanup**: Removes the need for a separate `rm -rf` pass; extra files and empty directories are pruned automatically.
10+
- **ZIP subset**: Archives stay readable by standard ZIP tooling, which keeps them compatible with malware scanners.
11+
- **Fast inspection**: The central directory can be enumerated without inflating the entire archive (unlike tar+gzip).
812

9-
- Read the zip central directory record at the end of the zip file and enumerate zip entries
10-
- Parse the zipsync metadata file in the archive. This contains the SHA-1 hashes of the files
11-
- Enumerate the target directories, cleanup any files or folders that aren't in the archive
12-
- If a file exists with matching size + SHA‑1, skip writing; else unpack it
13+
## How It Works
1314

14-
### Pack
15+
### Pack Flow
1516

16-
- Enumerate the target directories.
17-
- For each file compute a SHA-1 hash for the zipsync metadata file, and the CRC32 (required by zip format), then compress it if needed. Write the headers and file contents to the zip archive.
18-
- Write the metadata file to the zip archive and the zip central directory record.
17+
```
18+
for each file F
19+
write LocalFileHeader(F)
20+
stream chunks:
21+
read -> hash + crc + maybe compress -> write
22+
finalize compressor
23+
write DataDescriptor(F)
24+
add metadata entry (same pattern)
25+
write central directory records
26+
```
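
The pack flow above can be illustrated with a minimal sketch of its two less obvious steps: folding each chunk into a running SHA-1 and CRC-32 in a single pass, then encoding the trailing data descriptor once the totals are known. The names `updateCrc32`, `digestChunks`, and `encodeDataDescriptor` are hypothetical and are not taken from the zipsync source; compression is omitted and the descriptor uses the 32-bit (non-ZIP64) layout.

```typescript
import { Buffer } from 'node:buffer';
import { createHash } from 'node:crypto';

// Standard CRC-32 lookup table (reflected polynomial 0xEDB88320), as used by ZIP.
const CRC_TABLE: Uint32Array = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  }
  CRC_TABLE[n] = c >>> 0;
}

// Fold one chunk into a running CRC-32 value (pass 0 for the first chunk).
function updateCrc32(crc: number, chunk: Uint8Array): number {
  let c = ~crc >>> 0;
  for (let i = 0; i < chunk.length; i++) {
    c = CRC_TABLE[(c ^ chunk[i]) & 0xff] ^ (c >>> 8);
  }
  return ~c >>> 0;
}

interface StreamedDigest {
  crc32: number;
  sha1Hash: string; // hex encoding is an assumption of this sketch
  size: number;
}

// Single pass over the chunks: every chunk updates SHA-1, CRC-32, and the byte count.
function digestChunks(chunks: readonly Uint8Array[]): StreamedDigest {
  const sha1 = createHash('sha1');
  let crc32 = 0;
  let size = 0;
  for (let i = 0; i < chunks.length; i++) {
    sha1.update(chunks[i]);
    crc32 = updateCrc32(crc32, chunks[i]);
    size += chunks[i].byteLength;
  }
  return { crc32, sha1Hash: sha1.digest('hex'), size };
}

// Data descriptor: signature, CRC-32, compressed size, uncompressed size (little-endian).
function encodeDataDescriptor(crc32: number, compressedSize: number, uncompressedSize: number): Buffer {
  const descriptor = Buffer.alloc(16);
  descriptor.writeUInt32LE(0x08074b50, 0);
  descriptor.writeUInt32LE(crc32, 4);
  descriptor.writeUInt32LE(compressedSize, 8);
  descriptor.writeUInt32LE(uncompressedSize, 12);
  return descriptor;
}

// Usage: two chunks of a stored (uncompressed) file.
const digest = digestChunks([Buffer.from('1234'), Buffer.from('56789')]);
const descriptorBytes = encodeDataDescriptor(digest.crc32, digest.size, digest.size);
console.log(digest.crc32.toString(16), descriptorBytes.length); // cbf43926 16
```

Because the descriptor carries the CRC and sizes after the data, the local file header can be written before the file has been read, which is what makes the single-pass approach possible.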
1927

20-
## Constraints
28+
### Unpack Flow
2129

22-
Though archives created by zipsync can be used by other zip compatible programs, the opposite is not the case. zipsync only implements a subset of zip features to achieve greater performance.
30+
```
31+
load archive -> parse central dir -> read metadata
32+
scan filesystem & delete extraneous entries
33+
for each entry (except metadata):
34+
if unchanged (sha1 matches) => skip
35+
else extract (decompress if needed)
36+
```
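
The skip decision in the unpack flow above can be sketched as follows. This is only an illustration under stated assumptions: `needsExtraction` and `ExpectedFile` are hypothetical names, the hash is assumed to be hex-encoded, and the file is read in one call for brevity, whereas the real implementation may stream it.

```typescript
import { createHash } from 'node:crypto';
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Mirrors the per-file metadata shape ({ size, sha1Hash }); the interface name is hypothetical.
interface ExpectedFile {
  size: number;
  sha1Hash: string;
}

// Returns true when the file on disk is missing or differs from the archived version.
function needsExtraction(targetPath: string, expected: ExpectedFile): boolean {
  let stat: fs.Stats;
  try {
    stat = fs.statSync(targetPath);
  } catch {
    return true; // Nothing on disk yet: must extract.
  }
  // A size mismatch is a cheap early exit that avoids hashing the file.
  if (!stat.isFile() || stat.size !== expected.size) {
    return true;
  }
  const actualSha1 = createHash('sha1').update(fs.readFileSync(targetPath)).digest('hex');
  return actualSha1 !== expected.sha1Hash;
}

// Usage: a temporary file containing "abc".
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'zipsync-sketch-'));
const demoFile = path.join(demoDir, 'a.txt');
fs.writeFileSync(demoFile, 'abc');
const demoExpected: ExpectedFile = { size: 3, sha1Hash: 'a9993e364706816aba3e25717850c26c9cd0d89d' };
console.log(needsExtraction(demoFile, demoExpected)); // false
console.log(needsExtraction(path.join(demoDir, 'missing.txt'), demoExpected)); // true
```

Skipping the write when both checks pass leaves the existing file untouched, which is the behavior the "Optimize partial unpack" goal depends on.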
37+
38+
## Why ZIP (vs tar + gzip)
39+
40+
Pros for this scenario:
41+
42+
- The central directory enables cheap listing without decompressing the entire payload.
43+
- Widely understood / tooling-friendly (system explorers, scanners, CI tooling).
44+
- Per-file compression keeps selective unpack simple (no need to inflate all bytes).
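
To make the cheap-listing point concrete, here is a minimal sketch of locating the end-of-central-directory (EOCD) record by scanning backward from the end of the archive. `locateEndOfCentralDirectory` and its result type are hypothetical names rather than the zipsync helpers, and the sketch assumes a single-disk archive without ZIP64 extensions.

```typescript
import { Buffer } from 'node:buffer';

const EOCD_SIGNATURE = 0x06054b50;
const EOCD_MIN_SIZE = 22; // Fixed-size part of the record, excluding the trailing comment.

interface EndOfCentralDirectorySketch {
  totalEntries: number;
  centralDirSize: number;
  centralDirOffset: number;
}

function locateEndOfCentralDirectory(zip: Buffer): EndOfCentralDirectorySketch {
  // The trailing comment can be up to 65535 bytes, so the record is searched for from the end.
  const lowest = Math.max(0, zip.length - EOCD_MIN_SIZE - 0xffff);
  for (let pos = zip.length - EOCD_MIN_SIZE; pos >= lowest; pos--) {
    if (zip.readUInt32LE(pos) === EOCD_SIGNATURE) {
      return {
        totalEntries: zip.readUInt16LE(pos + 10),
        centralDirSize: zip.readUInt32LE(pos + 12),
        centralDirOffset: zip.readUInt32LE(pos + 16)
      };
    }
  }
  throw new Error('End of central directory record not found');
}

// Usage: the smallest valid ZIP file is a bare EOCD record describing zero entries.
const emptyZip = Buffer.alloc(EOCD_MIN_SIZE);
emptyZip.writeUInt32LE(EOCD_SIGNATURE, 0);
console.log(locateEndOfCentralDirectory(emptyZip).totalEntries); // 0
```

Only this record and the central directory it points to need to be read to enumerate entries; no file data has to be inflated.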
45+
46+
Trade-offs:
47+
48+
- Tar+gzip can exploit cross-file redundancy for better compressed size in datasets with many similar files.
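
The size trade-off can be observed with a rough experiment: compress many near-identical files individually (ZIP-style) and as one concatenated stream (approximating tar+gzip, without real tar headers). The helper names and the synthetic data below are assumptions made for illustration only.

```typescript
import { Buffer } from 'node:buffer';
import * as zlib from 'node:zlib';

// Deterministic pseudo-random bytes (simple LCG) so each file compresses poorly on its own.
function pseudoRandomBytes(length: number, seed: number): Buffer {
  const out = Buffer.alloc(length);
  let state = seed >>> 0;
  for (let i = 0; i < length; i++) {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    out[i] = state >>> 24;
  }
  return out;
}

interface SizeComparison {
  perFileTotal: number; // Sum of independently deflated files (ZIP-style).
  singleStreamTotal: number; // One gzip stream over the concatenation (tar+gzip-style).
}

function compareCompressedSizes(files: readonly Buffer[]): SizeComparison {
  let perFileTotal = 0;
  for (let i = 0; i < files.length; i++) {
    perFileTotal += zlib.deflateRawSync(files[i]).length;
  }
  const singleStreamTotal = zlib.gzipSync(Buffer.concat(files)).length;
  return { perFileTotal, singleStreamTotal };
}

// Usage: 20 files sharing the same 1000-byte body and differing only in the last byte.
const sharedBody = pseudoRandomBytes(1000, 42);
const similarFiles: Buffer[] = [];
for (let i = 0; i < 20; i++) {
  similarFiles.push(Buffer.concat([sharedBody, Buffer.from([i])]));
}
console.log(compareCompressedSizes(similarFiles));
```

With per-file compression each entry starts from an empty dictionary, so the shared body is paid for in every file; the single stream can reference the earlier copies instead.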

apps/zipsync/src/pack.ts

Lines changed: 53 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -33,20 +33,31 @@ import {
3333
METADATA_FILENAME
3434
} from './zipSyncUtils';
3535

36+
/**
37+
* File extensions for which additional DEFLATE/ZSTD compression is unlikely to help.
38+
* Used by the 'auto' compression heuristic to avoid wasting CPU on data that is already
39+
* compressed (images, media, existing archives, fonts, etc.).
40+
*/
3641
const LIKELY_COMPRESSED_EXTENSION_REGEX: RegExp =
3742
/\.(?:zip|gz|tgz|bz2|xz|7z|rar|jpg|jpeg|png|gif|webp|avif|mp4|m4v|mov|mkv|webm|mp3|ogg|aac|flac|pdf|woff|woff2)$/;
3843

44+
/**
45+
* Map zip compression method code -> incremental zlib mode label
46+
*/
3947
const zlibPackModes: Record<ZipMetaCompressionMethod, IncrementalZlibMode | undefined> = {
4048
[ZSTD_COMPRESSION]: 'zstd-compress',
4149
[DEFLATE_COMPRESSION]: 'deflate',
4250
[STORE_COMPRESSION]: undefined
4351
} as const;
4452

53+
/**
54+
* Public facing CLI option -> actual zip method used for a file we decide to compress.
55+
*/
4556
const zipSyncCompressionOptions: Record<ZipSyncOptionCompression, ZipMetaCompressionMethod> = {
4657
store: STORE_COMPRESSION,
4758
deflate: DEFLATE_COMPRESSION,
4859
zstd: ZSTD_COMPRESSION,
49-
auto: DEFLATE_COMPRESSION // 'auto' is handled specially in the code
60+
auto: DEFLATE_COMPRESSION
5061
} as const;
5162

5263
/**
@@ -82,6 +93,18 @@ export interface IZipSyncPackResult {
8293
metadata: IMetadata;
8394
}
8495

96+
/**
97+
* Create a zipsync archive by enumerating target directories, then streaming each file into the
98+
* output zip using the local file header + (optional compressed data) + data descriptor pattern.
99+
*
100+
* Performance characteristics:
101+
* - Single pass per file (no read-then-compress-then-write buffering). CRC32 + SHA-1 are computed
102+
* while streaming so the metadata JSON can later be used for selective unpack.
103+
* - Data descriptor usage (bit 3) allows writing headers before we know sizes or CRC32.
104+
* - A single timestamp (captured once) is applied to all entries for determinism.
105+
* - Metadata entry is added as a normal zip entry at the end (before central directory) so legacy
106+
* tools can still list/extract it, while zipsync can quickly parse file hashes.
107+
*/
85108
export function pack({
86109
archivePath,
87110
targetDirectories: rawTargetDirectories,
@@ -95,7 +118,7 @@ export function pack({
95118

96119
markStart('pack.total');
97120
terminal.writeDebugLine('Starting pack');
98-
// Pass 1: enumerate
121+
// Pass 1: enumerate files with a queue to avoid deep recursion
99122
markStart('pack.enumerate');
100123

101124
const filePaths: string[] = [];
@@ -140,7 +163,7 @@ export function pack({
140163
terminal.writeLine(`Found ${filePaths.length} files to pack (enumerated)`);
141164
markEnd('pack.enumerate');
142165

143-
// Pass 2: read + hash + compress
166+
// Pass 2: stream each file: write local header, then read chunks -> hash + (maybe) compress -> write, then data descriptor.
144167
markStart('pack.prepareEntries');
145168
const bufferSize: number = 1 << 25; // 32 MiB
146169
const inputBuffer: Buffer<ArrayBuffer> = Buffer.allocUnsafeSlow(bufferSize);
@@ -150,6 +173,9 @@ export function pack({
150173
using zipFile: IDisposableFileHandle = getDisposableFileHandle(archivePath, 'w');
151174
let currentOffset: number = 0;
152175
// Use this function to do any write to the zip file, so that we can track the current offset.
176+
/**
177+
* Write a raw chunk to the archive file descriptor, updating current offset.
178+
*/
153179
function writeChunkToZip(chunk: Uint8Array, lengthBytes: number = chunk.byteLength): void {
154180
let offset: number = 0;
155181
while (lengthBytes > 0 && offset < chunk.byteLength) {
@@ -162,19 +188,35 @@ export function pack({
162188
}
163189
currentOffset += offset;
164190
}
191+
/** Convenience wrapper for writing multiple buffers sequentially. */
165192
function writeChunksToZip(chunks: Uint8Array[]): void {
166193
for (const chunk of chunks) {
167194
writeChunkToZip(chunk);
168195
}
169196
}
170197

171198
const dosDateTimeNow: { time: number; date: number } = dosDateTime(new Date());
199+
/**
200+
* Stream a single file into the archive.
201+
* Steps:
202+
* 1. Decide compression (based on user choice + heuristic).
203+
* 2. Emit local file header (sizes/CRC zeroed because we use a data descriptor).
204+
* 3. Read file in 32 MiB chunks: update SHA-1 + CRC32; optionally feed compressor or write raw.
205+
* 4. Flush compressor (if any) and write trailing data descriptor containing sizes + CRC.
206+
* 5. Return populated entry metadata for later central directory + JSON metadata.
207+
*/
172208
function writeFileEntry(relativePath: string): IFileEntry {
209+
/**
210+
* Basic heuristic: skip re-compressing file types that are already compressed.
211+
*/
173212
function isLikelyAlreadyCompressed(filename: string): boolean {
174213
return LIKELY_COMPRESSED_EXTENSION_REGEX.test(filename.toLowerCase());
175214
}
176215
const fullPath: string = path.join(baseDir, relativePath);
177216

217+
/**
218+
* Read file in large fixed-size buffer; invoke callback for each filled chunk.
219+
*/
178220
const readInputInChunks: (onChunk: (bytesInInputBuffer: number) => void) => void = (
179221
onChunk: (bytesInInputBuffer: number) => void
180222
): void => {
@@ -231,6 +273,9 @@ export function pack({
231273
let uncompressedSize: number = 0;
232274
let compressedSize: number = 0;
233275

276+
/**
277+
* Compressor instance (deflate or zstd) created only if needed.
278+
*/
234279
using incrementalZlib: IIncrementalZlib | undefined = shouldCompress
235280
? createIncrementalZlib(
236281
outputBuffer,
@@ -270,6 +315,7 @@ export function pack({
270315
entry.crc32 = crc32;
271316
entry.sha1Hash = sha1Hash;
272317

318+
// Trailing data descriptor now that final CRC/sizes are known.
273319
writeChunkToZip(writeDataDescriptor(entry));
274320

275321
terminal.writeVerboseLine(
@@ -284,6 +330,7 @@ export function pack({
284330
}
285331

286332
const entries: IFileEntry[] = [];
333+
// Emit all file entries in enumeration order.
287334
for (const relativePath of filePaths) {
288335
entries.push(writeFileEntry(relativePath));
289336
}
@@ -293,6 +340,7 @@ export function pack({
293340

294341
markStart('pack.metadata.build');
295342
const metadata: IMetadata = { version: METADATA_VERSION, files: {} };
343+
// Build metadata map used for selective unpack (size + SHA‑1 per file).
296344
for (const entry of entries) {
297345
metadata.files[entry.filename] = { size: entry.size, sha1Hash: entry.sha1Hash };
298346
}
@@ -306,6 +354,7 @@ export function pack({
306354
let metadataCompressionMethod: ZipMetaCompressionMethod = zipSyncCompressionOptions.store;
307355
let metadataData: Buffer = metadataBuffer;
308356
let metadataCompressedSize: number = metadataBuffer.length;
357+
// Compress metadata (deflate) iff user allowed compression and it helps (>64 bytes & smaller result).
309358
if (compression !== 'store' && metadataBuffer.length > 64) {
310359
const compressed: Buffer = zlib.deflateRawSync(metadataBuffer, { level: 9 });
311360
if (compressed.length < metadataBuffer.length) {
@@ -348,6 +397,7 @@ export function pack({
348397

349398
markStart('pack.write.centralDirectory');
350399
const centralDirOffset: number = currentOffset;
400+
// Emit central directory records.
351401
for (const entry of entries) {
352402
writeChunksToZip(writeCentralDirectoryHeader(entry));
353403
}

apps/zipsync/src/unpack.ts

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -63,6 +63,9 @@ export interface IZipSyncUnpackResult {
6363
otherEntriesDeleted: number;
6464
}
6565

66+
/**
67+
* Unpack a zipsync archive into the provided target directories.
68+
*/
6669
export function unpack({
6770
archivePath,
6871
targetDirectories: rawTargetDirectories,
@@ -76,11 +79,13 @@ export function unpack({
7679
markStart('unpack.total');
7780
terminal.writeDebugLine('Starting unpackZip');
7881

82+
// Read entire archive into memory (build cache entries are expected to be relatively small/medium).
7983
markStart('unpack.read.archive');
8084
const zipBuffer: Buffer = fs.readFileSync(archivePath);
8185
terminal.writeDebugLine(`Archive size=${zipBuffer.length} bytes`);
8286
markEnd('unpack.read.archive');
8387

88+
// Locate & parse central directory so we have random-access metadata for all entries.
8489
markStart('unpack.parse.centralDirectory');
8590
const zipTree: LookupByPath<boolean> = new LookupByPath();
8691
const endOfCentralDir: IEndOfCentralDirectory = findEndOfCentralDirectory(zipBuffer);
@@ -151,6 +156,7 @@ export function unpack({
151156

152157
terminal.writeLine(`Found ${entries.length} files in archive`);
153158

159+
// Ensure root target directories exist (they may be empty initially for cache misses).
154160
for (const targetDirectory of targetDirectories) {
155161
fs.mkdirSync(targetDirectory, { recursive: true });
156162
terminal.writeDebugLine(`Ensured target directory: ${targetDirectory}`);
@@ -165,6 +171,7 @@ export function unpack({
165171

166172
const dirsToCleanup: string[] = [];
167173

174+
// Phase: scan filesystem to delete entries not present in archive and record empty dirs for later removal.
168175
markStart('unpack.scan.existing');
169176
const queue: IDirQueueItem[] = targetDirectories.map((dir) => ({
170177
dir,
@@ -218,6 +225,7 @@ export function unpack({
218225
}
219226
}
220227

228+
// Try to delete now-empty directories (created in previous builds but not in this archive).
221229
for (const dir of dirsToCleanup) {
222230
// Try to remove the directory. If it is not empty, this will throw and we can ignore the error.
223231
if (rmdirSync(dir)) {
@@ -233,6 +241,10 @@ export function unpack({
233241

234242
const bufferSize: number = 1 << 25; // 32 MiB
235243
const outputBuffer: Buffer<ArrayBuffer> = Buffer.allocUnsafeSlow(bufferSize);
244+
/**
245+
* Stream-decompress (or copy) an individual file from the archive into place.
246+
* We allocate a single large output buffer reused for all inflation operations to limit GC.
247+
*/
236248
function extractFileFromZip(targetPath: string, entry: ICentralDirectoryHeaderParseResult): void {
237249
terminal.writeDebugLine(`Extracting file: ${entry.filename}`);
238250
const fileZipBuffer: Buffer = getFileFromZip(zipBuffer, entry);
@@ -275,6 +287,10 @@ export function unpack({
275287
}
276288
}
277289

290+
/**
291+
* Decide whether a file needs extraction by comparing existing file SHA‑1 vs metadata.
292+
* If file is missing or hash differs we extract; otherwise we skip to preserve existing inode/data.
293+
*/
278294
function shouldExtract(targetPath: string, entry: ICentralDirectoryHeaderParseResult): boolean {
279295
if (metadata) {
280296
const metadataFile: { size: number; sha1Hash: string } | undefined = metadata.files[entry.filename];
@@ -300,6 +316,7 @@ export function unpack({
300316

301317
const dirsCreated: Set<string> = new Set<string>();
302318

319+
// Iterate all entries excluding metadata; create parent dirs lazily; selective extraction.
303320
for (const entry of entries) {
304321
if (entry.filename === METADATA_FILENAME) {
305322
continue;
