Optimize Streaming Write Performance for large Data Sets #2315

Closed
ChronosMasterOfAllTime wants to merge 4 commits into qax-os:master from ChronosMasterOfAllTime:pr-1-streaming-write-perf

ChronosMasterOfAllTime commented Apr 30, 2026

Description

Replaces #2288 in smaller chunks

Performance and memory optimization of the streaming write path (StreamWriter), the WriteTo/WriteToBuffer output pipeline, and the ZIP compression layer. Profiling against the mzimmerman/excelizetest benchmark suite identified four hot spots — ColumnNumberToName, CoordinatesToCellName, SetRow, and writeCell — which together accounted for the majority of CPU time and nearly all heap allocations per row.

Summary of improvements

| Area | Key metric |
| --- | --- |
| SetRow hot path | 66–79% faster, 94–99% fewer allocations |
| Full pipeline (SetRow + WriteTo) | 67–72% faster, 51–87% less memory |
| XML-escaped strings | 79% faster, 81% less memory |
| Peak memory (50K×100) | 162 MB → 43 MB (−73%) |
| Allocations (50K×100) | 15.1M → 153K (−99%) |
| ZIP compression | klauspost/compress: ~2× faster than stdlib |

Changes

lib.go — precomputed column names

  • Added columnNames: a package-level precomputed lookup table of all 16 384 column name strings (A–XFD), initialized once at startup via an IIFE. ColumnNumberToName now returns a slice element instead of allocating a new string on every call.
  • Optimized CoordinatesToCellName to early-return on the common (non-absolute) path, avoiding concatenation with an empty sign variable.
  • Updated readXML to use the new bufferedWriter.Bytes() method instead of accessing .buf directly.
  • Replaced archive/zip import with github.com/klauspost/compress/zip.
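
As a rough illustration of the precomputed lookup table described above, the sketch below builds all 16,384 names once and serves lookups from a slice. The names `maxColumns` and `columnName` are hypothetical stand-ins chosen for this example; the PR's actual `columnNames` initializer and `ColumnNumberToName` may differ in detail.

```go
package main

import "fmt"

// maxColumns mirrors the 16,384-column limit (A..XFD) cited above.
const maxColumns = 16384

// columnNames is filled once by an immediately invoked function literal,
// so later lookups are a bounds check plus a slice index.
var columnNames = func() []string {
	names := make([]string, maxColumns)
	var buf [3]byte // "XFD" has three letters, so three bytes always suffice
	for i := range names {
		n, pos := i+1, len(buf)
		for n > 0 {
			n-- // shift to 0-based so that 26 maps to "Z" rather than "A@"
			pos--
			buf[pos] = byte('A' + n%26)
			n /= 26
		}
		names[i] = string(buf[pos:])
	}
	return names
}()

// columnName is a hypothetical stand-in for ColumnNumberToName; num is 1-based.
func columnName(num int) (string, error) {
	if num < 1 || num > maxColumns {
		return "", fmt.Errorf("column number %d out of range [1, %d]", num, maxColumns)
	}
	return columnNames[num-1], nil
}

func main() {
	for _, n := range []int{1, 26, 27, 702, 703, 16384} {
		name, _ := columnName(n)
		fmt.Println(n, name)
	}
}
```

The table trades a fixed one-time cost at package initialization for allocation-free lookups on every cell.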

stream.go — hot-loop optimizations

  • SetRow rewrite: precomputes rowStr once per row (previously per cell via CoordinatesToCellName), reuses a single xlsxC struct across the inner loop (fields zeroed per iteration instead of allocating a new struct), and reads column names directly from the columnNames table.
  • writeNumericCell: zero-allocation fast path for int, int8–int64, uint–uint64, float32, float64, and bool. Writes the complete <c> element directly to the buffer using strconv.Append* into a [24]byte scratch field, bypassing xlsxC entirely.
  • writeStringCell: zero-allocation fast path for string and []byte. Writes inline-string XML directly, bypassing xlsxC/xlsxSI/trimCellValue/bstrMarshal/xml.EscapeText. Handles xml:space="preserve" for leading/trailing whitespace. Falls back to slow path for _xHHHH_ escape patterns or strings exceeding TotalCellChars.
  • writeEscaped: custom XML escaper that scans for <>&"\r; fast-path writes directly when no special chars are found, slow-path does character-by-character replacement (still zero-alloc).
  • writeCellStart: helper to deduplicate the <c r="…" opening across fast and slow paths.
  • writeCell: now takes *xlsxC by pointer plus pre-split colName/rowStr, eliminating a ~184-byte struct copy and a string concatenation per cell. Skips xml.EscapeText for numeric/boolean <v> values (digits, ., -, +, E are always safe).
  • setCellValFunc: inlines all integer type cases directly, removing a redundant second type-switch dispatch through setCellIntFunc.
  • marshalAttrs: writes directly to *bufferedWriter (no intermediate strings.Builder). Row option validation split into validateRowOpts so XML is only written after validation passes.
  • parseRowOpts: returns RowOpts by value instead of *RowOpts.
  • streamCellStyle / colStyles []int: column style lookup is now O(1) via a cached slice built once in writeSheetData, replacing a per-cell O(N) linear scan of worksheet.Cols.
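
The fast-path/slow-path split described for writeEscaped can be sketched roughly as follows. This is an illustration only, not the PR's code: it writes to a `bytes.Buffer` rather than `*bufferedWriter`, the `escape` helper is hypothetical, and the replacement entities (for example `&quot;` for the double quote) are assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// writeEscaped is an illustrative sketch of a scan-then-copy XML escaper.
func writeEscaped(buf *bytes.Buffer, s string) {
	// Fast path: nothing to escape, so copy the whole string in one call.
	if !strings.ContainsAny(s, "<>&\"\r") {
		buf.WriteString(s)
		return
	}
	// Slow path: copy clean runs in bulk and substitute each special byte.
	// All special characters are ASCII, so byte indexing is UTF-8 safe.
	last := 0
	for i := 0; i < len(s); i++ {
		var esc string
		switch s[i] {
		case '<':
			esc = "&lt;"
		case '>':
			esc = "&gt;"
		case '&':
			esc = "&amp;"
		case '"':
			esc = "&quot;" // assumed entity; a numeric reference would also work
		case '\r':
			esc = "&#xD;"
		default:
			continue
		}
		buf.WriteString(s[last:i])
		buf.WriteString(esc)
		last = i + 1
	}
	buf.WriteString(s[last:])
}

// escape is a hypothetical convenience wrapper used for demonstration.
func escape(s string) string {
	var b bytes.Buffer
	writeEscaped(&b, s)
	return b.String()
}

func main() {
	fmt.Println(escape("plain text"))
	fmt.Println(escape(`a < b && "c" > d`))
}
```

Because the replacement strings are constants and sub-slicing a string does not copy, neither path allocates beyond the growth of the destination buffer.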

stream.go — bufferedWriter memory architecture

  • Two-phase architecture: below the threshold, all writes go to an in-memory bytes.Buffer. Once the threshold is crossed, the buffer is drained to a temp file exactly once and all subsequent writes flow through a fixed-size bufio.Writer wrapping the file. This bounds peak heap usage to approximately StreamingChunkSize + bioSize regardless of total data size, compared to the previous approach which re-grew a new bytes.Buffer to the threshold size on every flush cycle.
  • Sync() is now a no-op when bio != nil: bufio.Writer flushes internally when its buffer is full, and forcing a flush on every SetRow call (the previous behavior) negated all batching benefit.
  • New methods: Bytes(), Reset(), CopyTo(w io.Writer), WriteInt(int64), WriteUint(uint64), WriteFloat(float64, ...).
  • CopyTo: uses a 256 KiB buffered reader to minimize Pread syscalls when copying from temp files (reduces ~3000 syscalls to ~400 for a 100 MB worksheet).
  • scratch [24]byte: used by WriteInt, WriteUint, WriteFloat to format numbers without heap allocation.
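
A minimal sketch of the two-phase idea, under stated assumptions: `spillWriter`, its field names, and the `demo` helper are hypothetical, and the real bufferedWriter has more methods and option handling (for example the -1 "never spill" setting) than shown here.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"os"
	"strconv"
)

// spillWriter is a hypothetical, simplified stand-in for bufferedWriter.
type spillWriter struct {
	threshold int           // spill once this many bytes are buffered
	bioSize   int           // size of the bufio.Writer used after the spill
	mem       bytes.Buffer  // phase 1: in-memory buffer
	tmp       *os.File      // phase 2: temp file, nil until the spill
	bio       *bufio.Writer // phase 2: fixed-size buffer in front of tmp
	scratch   [24]byte      // reused by WriteInt to format without allocating
}

func (w *spillWriter) Write(p []byte) (int, error) {
	if w.bio != nil {
		return w.bio.Write(p)
	}
	n, _ := w.mem.Write(p) // bytes.Buffer.Write always returns a nil error
	if w.mem.Len() < w.threshold {
		return n, nil
	}
	return n, w.spill()
}

// spill drains the in-memory buffer to a temp file; later writes go via bio.
func (w *spillWriter) spill() error {
	f, err := os.CreateTemp("", "spill-*.xml")
	if err != nil {
		return err
	}
	w.tmp, w.bio = f, bufio.NewWriterSize(f, w.bioSize)
	_, err = w.mem.WriteTo(w.bio)
	w.mem = bytes.Buffer{} // drop the old backing array so it can be collected
	return err
}

// WriteInt formats v into the scratch array (an int64 needs at most 20 bytes).
func (w *spillWriter) WriteInt(v int64) (int, error) {
	return w.Write(strconv.AppendInt(w.scratch[:0], v, 10))
}

// CopyTo replays everything written so far, reading the file in 256 KiB chunks.
func (w *spillWriter) CopyTo(dst io.Writer) (int64, error) {
	if w.bio == nil {
		n, err := dst.Write(w.mem.Bytes())
		return int64(n), err
	}
	if err := w.bio.Flush(); err != nil {
		return 0, err
	}
	if _, err := w.tmp.Seek(0, io.SeekStart); err != nil {
		return 0, err
	}
	var total int64
	buf := make([]byte, 256<<10)
	for {
		n, rerr := w.tmp.Read(buf)
		if n > 0 {
			m, werr := dst.Write(buf[:n])
			total += int64(m)
			if werr != nil {
				return total, werr
			}
		}
		if rerr == io.EOF {
			return total, nil
		}
		if rerr != nil {
			return total, rerr
		}
	}
}

// Close removes the temp file, if one was created.
func (w *spillWriter) Close() error {
	if w.tmp == nil {
		return nil
	}
	name := w.tmp.Name()
	err := w.tmp.Close()
	if rmErr := os.Remove(name); err == nil {
		err = rmErr
	}
	return err
}

// demo writes 0..n-1, one per line, and reports whether the writer spilled.
func demo(threshold, n int) (out string, spilled bool, err error) {
	w := &spillWriter{threshold: threshold, bioSize: 4096}
	defer w.Close()
	for i := 0; i < n; i++ {
		if _, err = w.WriteInt(int64(i)); err != nil {
			return "", false, err
		}
		if _, err = w.Write([]byte{'\n'}); err != nil {
			return "", false, err
		}
	}
	var dst bytes.Buffer
	_, err = w.CopyTo(&dst)
	return dst.String(), w.tmp != nil, err
}

func main() {
	out, spilled, err := demo(16, 10)
	fmt.Printf("%q spilled=%v err=%v\n", out, spilled, err)
}
```

In this sketch the heap cost after the spill is the fixed `bioSize` buffer plus the copy buffer, which is the property the PR describes as bounding peak memory.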

file.go — streaming WriteTo & compression

  • WriteTo rewrite: non-encrypted path now streams the ZIP directly to w via a countWriter wrapper — no intermediate bytes.Buffer. Encrypted path delegates to new writeToWithEncryption, which writes ZIP to a temp file, applies ZIP64 LFH fixup, reads back, encrypts, and writes to w.
  • WriteToBuffer: now calls configureZipCompression and only performs ZIP64 LFH fixup when len(f.zip64Entries) > 0.
  • writeToZip: replaces stream.rawData.Reader() + io.Copy with stream.rawData.CopyTo(fi) (uses the new efficient copy path).
  • writeZip64LFHFile: performs ZIP64 local file header fixup on a temp file (chunk-based, 1 MB reads) instead of requiring an in-memory buffer.
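
The streaming idea behind the non-encrypted WriteTo path can be sketched as a counting wrapper placed between the ZIP writer and the destination. This sketch uses the standard library's archive/zip in place of klauspost/compress/zip, and `writeArchive` and `demoArchive` are hypothetical helpers; only the `countWriter` name comes from the description above, and its fields here are assumed.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// countWriter forwards writes to w and tallies the bytes that were accepted,
// so a WriteTo-style method can return a byte count without buffering.
type countWriter struct {
	w io.Writer
	n int64
}

func (c *countWriter) Write(p []byte) (int, error) {
	n, err := c.w.Write(p)
	c.n += int64(n)
	return n, err
}

// writeArchive streams a ZIP containing the given (name, body) parts straight
// to w and returns the number of bytes that reached w.
func writeArchive(w io.Writer, parts [][2]string) (int64, error) {
	cw := &countWriter{w: w}
	zw := zip.NewWriter(cw)
	for _, p := range parts {
		fw, err := zw.Create(p[0])
		if err != nil {
			return cw.n, err
		}
		if _, err := io.WriteString(fw, p[1]); err != nil {
			return cw.n, err
		}
	}
	err := zw.Close() // flushes the central directory through cw
	return cw.n, err
}

// demoArchive writes a two-part archive and re-opens it to count the entries.
func demoArchive() (reported int64, actual int, files int, err error) {
	var dst bytes.Buffer
	reported, err = writeArchive(&dst, [][2]string{
		{"[Content_Types].xml", "<Types/>"},
		{"xl/worksheets/sheet1.xml", "<worksheet/>"},
	})
	if err != nil {
		return reported, dst.Len(), 0, err
	}
	zr, err := zip.NewReader(bytes.NewReader(dst.Bytes()), int64(dst.Len()))
	if err != nil {
		return reported, dst.Len(), 0, err
	}
	return reported, dst.Len(), len(zr.File), nil
}

func main() {
	reported, actual, files, err := demoArchive()
	fmt.Println(reported, actual, files, err)
}
```

The destination in the demo happens to be a buffer so the result can be re-read; in a real caller `w` could be a file or network connection, and no full copy of the archive would be held in memory.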

excelize.go — options & compression

  • New type Compression int with three constants: CompressionDefault, CompressionNone, CompressionBestSpeed.
  • New fields on Options: StreamingChunkSize int, StreamingBufSize int, Compression Compression. All are zero-by-default (zero → use package constants), so existing callers are completely unaffected. StreamingChunkSize: -1 keeps all data in memory (never spills to disk).
  • configureZipCompression: registers a custom flate.NewWriter compressor on *zip.Writer based on the Compression option.
  • Replaced archive/zip import with github.com/klauspost/compress/zip and github.com/klauspost/compress/flate.
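
One way the Compression option could map onto a per-writer compressor is sketched below, using the standard library's archive/zip and compress/flate instead of the klauspost packages. The constant ordering, the mapping of CompressionNone to `flate.NoCompression`, and the `archiveSize` helper and `samplePayload` value are assumptions made for this example.

```go
package main

import (
	"archive/zip"
	"bytes"
	"compress/flate"
	"fmt"
	"io"
	"strings"
)

// Compression mirrors the option type named above; the zero value is the default.
type Compression int

const (
	CompressionDefault Compression = iota
	CompressionNone
	CompressionBestSpeed
)

// configureZipCompression registers a Deflate compressor on zw whose level
// depends on c. The level mapping is an assumption made for this sketch.
func configureZipCompression(zw *zip.Writer, c Compression) {
	level := flate.DefaultCompression
	switch c {
	case CompressionNone:
		level = flate.NoCompression
	case CompressionBestSpeed:
		level = flate.BestSpeed
	}
	zw.RegisterCompressor(zip.Deflate, func(w io.Writer) (io.WriteCloser, error) {
		return flate.NewWriter(w, level)
	})
}

// samplePayload is a repetitive, worksheet-like string that compresses well.
var samplePayload = strings.Repeat(`<c r="A1"><v>12345</v></c>`, 5000)

// archiveSize writes payload as a single ZIP entry and returns the archive size.
func archiveSize(c Compression, payload string) (int, error) {
	var dst bytes.Buffer
	zw := zip.NewWriter(&dst)
	configureZipCompression(zw, c)
	fw, err := zw.Create("xl/worksheets/sheet1.xml")
	if err != nil {
		return 0, err
	}
	if _, err := io.WriteString(fw, payload); err != nil {
		return 0, err
	}
	if err := zw.Close(); err != nil {
		return 0, err
	}
	return dst.Len(), nil
}

func main() {
	for _, c := range []Compression{CompressionDefault, CompressionBestSpeed, CompressionNone} {
		size, err := archiveSize(c, samplePayload)
		fmt.Println(c, size, err)
	}
}
```

Registering the compressor on the individual `*zip.Writer` rather than globally keeps the choice local to one save operation, which fits an option carried on `Options`.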

templates.go

  • Added StreamingBufSizeDefault = 128 << 10 (128 KiB). Value determined empirically via BenchmarkBioSizeSweep and TestBioSizeIOProfile.

go.mod

  • Added github.com/klauspost/compress v1.18.5 — a high-performance, pure-Go drop-in replacement for archive/zip and compress/flate.

Related Issue

Fixes #876 — High memory when writing 1 million number of rows

Motivation and Context

User-reported profiling showed that generating large worksheets via StreamWriter was dominated by per-cell allocations in the column-name conversion functions and by unbounded bytes.Buffer growth in the write buffer. For a 100-column × 50 000-row sheet (~150 MB of XML), the previous code allocated 162 MB peak and made 15.1 million allocations. The WriteTo path then buffered the entire compressed ZIP in a bytes.Buffer before writing, adding another 50–200 MB of peak memory on top.

How Has This Been Tested

All existing tests pass (go test ./...).

New tests

| Test | What it covers |
| --- | --- |
| TestStreamingWriteTo | Verifies WriteTo streams correctly without password; round-trips 100×10 sheet |
| TestCompressionOption | Generates 500×20 sheet at Default/None/BestSpeed; asserts size ordering; validates all are readable XLSX |
| TestWriteToBufferCompression | Verifies WriteToBuffer respects CompressionNone |
| TestWriteToWithPassword | Round-trips encrypted file via WriteTo with password |
| TestWriteToWithPasswordAndCompression | Combines password encryption with CompressionBestSpeed |
| TestBioSizeIOProfile | Instruments write-syscall counts and bytes at 10 bufio.Writer sizes (4 KiB – 4 MiB) to project performance on different storage tiers |
| BenchmarkBioSizeSweep | Measures ns/op and B/op across 10 bufio.Writer sizes for a 50K×100 sheet |
| BenchmarkStringCellClean/Special | Measures writeEscaped fast path vs slow path |
| BenchmarkCompressionLevels | 50K×20 string sheet at Default/BestSpeed/None, each with disk-spill and in-memory variants (6 sub-benchmarks) |
| BenchmarkStreamWriterLarge/Huge | 10K×50 and 50K×100 integer-cell benchmarks for regression tracking |
| BenchmarkExcelize* (9 sizes) | Full pipeline (build data + SetRow + WriteTo) adapted from mzimmerman/excelizetest |

Benchmark results

Platform: Apple M1 Pro, macOS, Go 1.24, arm64
Methodology: go test -run=^$ -bench=... -benchmem -count=3, median of 3 runs

Streaming write path (SetRow + Flush + Close, no WriteTo)

| Benchmark | master time/op | PR time/op | Δ CPU | master bytes/op | PR bytes/op | Δ Mem | master allocs/op | PR allocs/op | Δ Allocs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| StreamWriter (100×10) | 238.7 µs | 64.2 µs | −73% | 101.9 KB | 86.1 KB | −16% | 2.3K | 147 | −94% |
| StreamWriterLarge (50K×10) | 74.9 ms | 24.2 ms | −68% | 55.0 MB | 42.4 MB | −23% | 1.65M | 33.3K | −98% |
| StreamWriterHuge (50K×100) | 851.9 ms | 271.1 ms | −68% | 162.2 MB | 43.2 MB | −73% | 15.14M | 153.3K | −99% |
| StringCellClean (50K×10) | 181.1 ms | 61.2 ms | −66% | 163.6 MB | 42.5 MB | −74% | 3.65M | 33.3K | −99% |
| StringCellSpecial (50K×10) | 337.7 ms | 70.7 ms | −79% | 223.8 MB | 42.5 MB | −81% | 5.15M | 33.3K | −99% |

Full pipeline (build string data + SetRow + WriteTo to buffer)

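| Benchmark | master time/op | PR time/op | Δ CPU | master bytes/op | PR bytes/op | Δ Mem | master allocs/op | PR allocs/op | Δ Allocs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Excelize 1K×10 | 9.1 ms | 2.7 ms | −70% | 4.9 MB | 2.1 MB | −57% | 86.0K | 16.9K | −80% |
| Excelize 10K×10 | 66.1 ms | 21.5 ms | −67% | 47.4 MB | 23.3 MB | −51% | 833.0K | 133.9K | −84% |
| Excelize 100K×100 | 7.17 s | 2.03 s | −72% | 2.54 GB | 339.7 MB | −87% | 80.29M | 10.30M | −87% |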

Compression options (PR only, 50K×20 string rows, full WriteTo)

| Mode | Time | Memory | Allocs |
| --- | --- | --- | --- |
| Default (temp file) | 383.5 ms | 68.7 MB | 1.15M |
| BestSpeed (temp file) | 285.9 ms | 83.7 MB | 1.15M |
| None (temp file) | 249.7 ms | 213.6 MB | 1.15M |
| Default (in-memory) | 271.3 ms | 194.2 MB | 1.15M |
| BestSpeed (in-memory) | 254.1 ms | 209.1 MB | 1.15M |
| None (in-memory) | 178.8 ms | 339.1 MB | 1.15M |

| Metric | master (sum) | PR (sum) | Δ |
| --- | --- | --- | --- |
| CPU time | 8.69 s | 2.48 s | −71% |
| Memory | 3.20 GB | 536 MB | −83% |
| Allocations | 106.8M | 10.7M | −90% |

In the real-world scenario (the Excelize 100K×100 full pipeline), the results are:

  • 7.17 s → 2.03 s (−72% CPU)
  • 2.54 GB → 340 MB (−87% memory)
  • 80.3M → 10.3M allocs (−87%)

Key takeaways

  • StreamWriterHuge (50K×100): −68% CPU, −73% memory (162 MB → 43 MB), −99% allocs (15.1M → 153K)
  • Excelize 100K×100 full pipeline: 7.17 s → 2.03 s (−72%), 2.54 GB → 340 MB (−87%), 80M → 10M allocs (−87%)
  • XML-escaped strings (50K×10): −79% CPU, −81% memory — the writeEscaped zero-alloc path eliminates per-character allocations
  • Allocation reduction is the biggest win across the board: 94–99% fewer allocs in all streaming benchmarks
  • Memory is now bounded: peak ≈ StreamingChunkSize + bioSize regardless of total data size

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

xuri added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) Apr 30, 2026
@AdamDrewsTR
Contributor

@ChronosMasterOfAllTime Is the plan to then merge in the other features: read performance + file size optimization using shared strings, etc?

@ChronosMasterOfAllTime
Author

> @ChronosMasterOfAllTime Is the plan to then merge in the other features: read performance + file size optimization using shared strings, etc?

Yes, exactly: break the features down piecemeal so it's easier to troubleshoot.

@AdamDrewsTR
Contributor

@xuri Exactly what corruption did you see, and where? I am not able to replicate any issues that don't already exist in master.

@AdamDrewsTR
Contributor

@xuri Please close. I will break this up into more granular pieces.

xuri reopened this May 21, 2026
xuri closed this May 21, 2026

Development

Successfully merging this pull request may close these issues.

High memory when writing 1million number of rows

3 participants