Skip to content

Latest commit

 

History

History
52 lines (36 loc) · 2.6 KB

File metadata and controls

52 lines (36 loc) · 2.6 KB

Generator Expressions for Memory-Efficient Pipelines

Problem

Python programs that process large datasets by building intermediate lists at each pipeline stage. The metric was peak memory usage. A pipeline like [transform(x) for x in data] followed by [x for x in result if predicate(x)] materializes the full list at each step, requiring O(N) memory per stage even when only one element is needed at a time.

What Worked

Replacing list comprehensions with generator expressions in pipeline stages that feed into the next stage or a final reduction. Generators produce one element at a time (lazy evaluation), keeping memory at O(1) regardless of input size.

For a pipeline processing a 50M-row CSV — filter → transform → aggregate — switching intermediate stages to generators reduced peak memory from 4.8 GB to 0.3 GB while maintaining identical throughput.

Experiment Data

Pipeline Style Peak Memory Time (s)
List comprehensions (3 stages) 4.8 GB 38.2
Generator expressions (3 stages) 0.3 GB 37.8
Generator + itertools.chain 0.3 GB 36.1

Code Example

# BAD: materializes entire list at each stage — 3x memory
filtered = [row for row in data if row['status'] == 'active']        # N items in memory
transformed = [compute(row) for row in filtered]                       # N items in memory
result = sum(t['value'] for t in transformed)                         # N items in memory

# GOOD: generators — O(1) memory for intermediate stages
filtered = (row for row in data if row['status'] == 'active')         # lazy
transformed = (compute(row) for row in filtered)                       # lazy
result = sum(t['value'] for t in transformed)                         # streams through

# Built-in functions that accept iterators directly:
# sum(), min(), max(), any(), all(), ''.join(), collections.Counter()
# These consume generators without materializing a list.

# File processing — never load entire file into memory
total = sum(
    len(line)
    for line in open('huge_file.txt')
    if not line.startswith('#')
)

What Didn't Work

  • Generators when you need multiple passes: A generator is exhausted after one iteration. If you need to iterate the data twice (e.g., compute mean then variance), you must either recreate the generator or materialize to a list. For multi-pass algorithms, itertools.tee helps but buffers elements internally, negating memory savings.

Environment

Python 3.8+. Generators are a core language feature, no dependencies needed. Memory measured with tracemalloc.get_traced_memory().