Python programs that process large datasets by building intermediate lists at each pipeline stage. The metric was peak memory usage. A pipeline like [transform(x) for x in data] followed by [x for x in result if predicate(x)] materializes the full list at each step, requiring O(N) memory per stage even when only one element is needed at a time.
Replacing list comprehensions with generator expressions in pipeline stages that feed into the next stage or a final reduction. Generators produce one element at a time (lazy evaluation), keeping memory at O(1) regardless of input size.
For a pipeline processing a 50M-row CSV — filter → transform → aggregate — switching intermediate stages to generators reduced peak memory from 4.8 GB to 0.3 GB while maintaining identical throughput.
| Pipeline Style | Peak Memory | Time (s) |
|---|---|---|
| List comprehensions (3 stages) | 4.8 GB | 38.2 |
| Generator expressions (3 stages) | 0.3 GB | 37.8 |
Generator + itertools.chain |
0.3 GB | 36.1 |
# BAD: materializes entire list at each stage — 3x memory
filtered = [row for row in data if row['status'] == 'active'] # N items in memory
transformed = [compute(row) for row in filtered] # N items in memory
result = sum(t['value'] for t in transformed) # N items in memory
# GOOD: generators — O(1) memory for intermediate stages
filtered = (row for row in data if row['status'] == 'active') # lazy
transformed = (compute(row) for row in filtered) # lazy
result = sum(t['value'] for t in transformed) # streams through
# Built-in functions that accept iterators directly:
# sum(), min(), max(), any(), all(), ''.join(), collections.Counter()
# These consume generators without materializing a list.
# File processing — never load entire file into memory
total = sum(
len(line)
for line in open('huge_file.txt')
if not line.startswith('#')
)- Generators when you need multiple passes: A generator is exhausted after one iteration. If you need to iterate the data twice (e.g., compute mean then variance), you must either recreate the generator or materialize to a list. For multi-pass algorithms,
itertools.teehelps but buffers elements internally, negating memory savings.
Python 3.8+. Generators are a core language feature, no dependencies needed. Memory measured with tracemalloc.get_traced_memory().