[feature] zstandard compression for text history #4149

@dgtlmoon

❯ zstandard looks cool. In my case it's important that it can't get corrupted if a write fails, the disk is full, the app is terminated halfway through saving, etc.

● For crash safety with zstd, the trick is one zstd frame per version rather than one long stream:

  • zstd frames are concatenable — cat v1.zst v2.zst v3.zst is a valid zstd stream that decodes to the joined output.
  • Each frame has its own content checksum (enable write_checksum=True).
  • If the app dies mid-append, the last frame is truncated/corrupt. On read, you iterate frames and stop cleanly at the first one that fails — everything before it is
    intact.

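import os
import zstandard as zstd

cctx = zstd.ZstdCompressor(level=19, write_checksum=True)
with open("versions.zst", "ab") as f:
    # compress() returns one complete, checksummed frame per version
    f.write(cctx.compress(new_version_bytes))
    f.flush()
    os.fsync(f.fileno())  # push the frame to disk before returning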

To still get good cross-version compression with independent frames, train a dictionary from a sample of versions once (zstd.train_dictionary) and reuse it for every
frame. Independent frames + shared dict ≈ most of the savings, none of the fragility.

Other crash-safe options to consider:

  • SQLite (WAL mode) + zstd blobs per row. SQLite gives you real transactional atomicity — partial writes simply don't appear. Probably the most bulletproof if you
    don't mind the dependency.
  • One file per version with atomic rename (tmp → os.replace). POSIX rename is atomic; either the new version is fully there or it isn't. Loses some compression vs.
    shared dict, but trivially correct.
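
For the one-file-per-version option, the tmp → `os.replace` pattern might be sketched as below, using only the standard library. `write_version_atomically` is a hypothetical helper; the payload is assumed to be the already-compressed bytes, and the directory fsync is an extra durability step added here that only applies on POSIX systems.

```python
import os
import tempfile


def write_version_atomically(directory, name, payload):
    """Write payload to directory/name so readers see all of it or none of it."""
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # data is on disk before the rename
        final_path = os.path.join(directory, name)
        os.replace(tmp_path, final_path)  # atomic swap into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # do not leave half-written temp files behind
        raise
    if os.name == "posix":
        # Persist the directory entry too, so the rename survives a power loss.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
    return final_path
```

A crash before `os.replace` leaves at most a stray `.tmp` file, which can be cleaned up at startup.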

If "append-only single file" is the hard requirement, frame-per-version with a trained dict + fsync is what I'd reach for.

❯ I've seen SQLite do weird things inside of Docker containers...

● Yeah, that's a real thing. SQLite's locking relies on fcntl semantics that get flaky on bind mounts, overlayfs, NFS, and especially Docker Desktop's host mounts on
macOS/Windows. WAL mode in particular doesn't play well with non-local filesystems, since it coordinates processes through a shared-memory file.

The frame-per-version zstd file dodges all of that — it's just O_APPEND + fsync, which is about as filesystem-portable as it gets. The "is the tail truncated?"
question becomes a simple checksum check rather than a locking/journal recovery dance.

One extra belt-and-suspenders touch if you want it: prefix each frame with a fixed 4-byte length header before writing. zstd frames are self-delimiting so you don't
strictly need it, but it makes the "scan forward, skip a bad frame, recover later good ones" path easier if you ever care about partial recovery beyond the first
corruption point.
