Skip to content

Commit 710d0ee

Browse files
connorsheaclaude
andcommitted
Optimize Sheet#rows_generator hot path
Benchmarked on a generated 20,000-row x 40-col xlsx (mixed shared-string, numeric, and empty cells). Median full-parse time dropped ~1.68x (1.85s -> 1.11s) and allocations dropped ~3x (28.6M -> ~9M), with byte-identical output verified on both default- and prefixed-namespace files across rows, simple_rows, and rows_with_meta_data. Changes, in order of impact: - Resolve the namespace prefix once via a `namespace_resolved` flag instead of re-checking `node.namespaces` on every node. Worksheets use a default namespace, so the old `prefix.empty?` guard never latched and allocated a namespaces hash for every node in the stream. This is the bulk of the wall-clock win. - Hoist `node.name`/`node.node_type` into locals (each was read up to 4x per node in the if/elsif chain) and reorder branches so the hottest nodes (<v>/<c>) are tested first. - Use `node.attribute_hash` instead of `node.attributes` for cell and row nodes. `Reader#attributes` is `attribute_hash.merge(namespaces)`, so it built and merged a namespaces hash on every call; we only need the element's own attributes. This is an allocation/GC win (drops Hash#merge and the per-node namespaces hash) with negligible wall-clock change. Safe because xlsx row/cell elements never declare their own namespaces (the namespace hash is empty for them), so the result is identical -- confirmed byte-for-byte including the self-closing-row + meta-data path. - In fill_in_empty_cells, drop the redundant `.to_a` on the column range and use `delete_suffix` instead of `gsub` to strip the row number. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 39bffae commit 710d0ee

1 file changed

Lines changed: 34 additions & 25 deletions

File tree

lib/creek/sheet.rb

Lines changed: 34 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -101,30 +101,48 @@ def rows_generator(include_meta_data = false, use_simple_rows_format = false)
101101
cell_type = nil
102102
cell_style_idx = nil
103103
@book.files.file.open(path) do |xml|
104-
prefix = ''
104+
namespace_resolved = false
105105
name_row = 'row'
106106
name_c = 'c'
107107
name_v = 'v'
108108
name_t = 't'
109109
Nokogiri::XML::Reader.from_io(xml).each do |node|
110-
if prefix.empty? && node.namespaces.any?
110+
# Resolve the namespace prefix once, from the first element that
111+
# declares the spreadsheetml namespace (the worksheet root). Caching
112+
# this avoids allocating a namespaces hash for every node in the stream.
113+
if !namespace_resolved && node.namespaces.any?
111114
namespace = node.namespaces.detect { |_key, uri| uri == SPREADSHEETML_URI }
112-
prefix = if namespace && namespace[0].start_with?('xmlns:')
113-
namespace[0].delete_prefix('xmlns:') + ':'
114-
else
115-
''
116-
end
117-
name_row = "#{prefix}row"
118-
name_c = "#{prefix}c"
119-
name_v = "#{prefix}v"
120-
name_t = "#{prefix}t"
115+
if namespace
116+
prefix = namespace[0].start_with?('xmlns:') ? namespace[0].delete_prefix('xmlns:') + ':' : ''
117+
name_row = "#{prefix}row"
118+
name_c = "#{prefix}c"
119+
name_v = "#{prefix}v"
120+
name_t = "#{prefix}t"
121+
namespace_resolved = true
122+
end
121123
end
122-
if node.name == name_row && node.node_type == opener
123-
row = node.attributes
124+
125+
node_name = node.name
126+
node_type = node.node_type
127+
128+
if node_type == opener && (node_name == name_v || node_name == name_t)
129+
unless cell.nil?
130+
node.read
131+
cells[cell] = convert(node.value, cell_type, cell_style_idx)
132+
end
133+
elsif node_name == name_c && node_type == opener
134+
# attribute_hash avoids the namespaces lookup + merge that
135+
# Reader#attributes performs on every call; we only need t/s/r.
136+
attributes = node.attribute_hash
137+
cell_type = attributes['t']
138+
cell_style_idx = attributes['s']
139+
cell = attributes['r']
140+
elsif node_name == name_row && node_type == opener
141+
row = node.attribute_hash
124142
row['cells'] = {}
125143
cells = {}
126144
y << (include_meta_data ? row : cells) if node.self_closing?
127-
elsif node.name == name_row && node.node_type == closer
145+
elsif node_name == name_row && node_type == closer
128146
processed_cells = fill_in_empty_cells(cells, row['r'], cell, use_simple_rows_format)
129147
@headers = processed_cells if with_headers && row['r'] == HEADERS_ROW_NUMBER
130148

@@ -138,15 +156,6 @@ def rows_generator(include_meta_data = false, use_simple_rows_format = false)
138156

139157
row['cells'] = processed_cells
140158
y << (include_meta_data ? row : processed_cells)
141-
elsif node.name == name_c && node.node_type == opener
142-
cell_type = node.attributes['t']
143-
cell_style_idx = node.attributes['s']
144-
cell = node.attributes['r']
145-
elsif (node.name == name_v || node.name == name_t) && node.node_type == opener
146-
unless cell.nil?
147-
node.read
148-
cells[cell] = convert(node.value, cell_type, cell_style_idx)
149-
end
150159
end
151160
end
152161
end
@@ -172,8 +181,8 @@ def fill_in_empty_cells(cells, row_number, last_col, use_simple_rows_format)
172181
new_cells = {}
173182
return new_cells if cells.empty?
174183

175-
last_col = last_col.gsub(row_number, '')
176-
('A'..last_col).to_a.each do |column|
184+
last_col = last_col.delete_suffix(row_number)
185+
('A'..last_col).each do |column|
177186
id = cell_id(column, use_simple_rows_format, row_number)
178187
new_cells[id] = cells["#{column}#{row_number}"]
179188
end

0 commit comments

Comments
 (0)