Commit dfe6f64

Merge pull request #11 from Alokzh/docs/improve-dataframe-vs-buffer-blogpost
Improve DataFrame vs Circular Buffer comparison
2 parents 62c604e + cbe2a9d commit dfe6f64

1 file changed

Lines changed: 78 additions & 92 deletions

docs/DataFrame vs Buffer.md

@@ -1,11 +1,12 @@
11
# DataFrames vs Circular Buffers: Processing Large Datasets Efficiently in Pharo
22

3-
When processing large datasets in Pharo, developers typically reach for DataFrames and for good reason. They provide a powerful way to manipulate and analyze data, similar to how you would in Python or R. However, there's a hidden problem that many developers overlook: DataFrames can be inefficient and even detrimental to performance when dealing with large files.
3+
When processing large datasets in Pharo, developers typically reach for DataFrames, and for good reason. They provide a powerful way to manipulate and analyze data, following proven designs such as [Pandas in Python](https://pandas.pydata.org/) and [data.frame in R](https://www.r-project.org/); in Pharo, the equivalent is the [DataFrame library](https://github.com/PolyMathOrg/DataFrame). DataFrames excel at complex operations like joins, grouping, statistical analysis, and data exploration. However, not all data processing tasks require loading an entire dataset into memory. For certain problems, especially those involving massive files and streaming computations, a more specialized tool can offer significant performance benefits. This is where a circular buffer shines.
44

5-
In this article, we'll explore why DataFrames can hurt performance, how circular buffers can be a better alternative, and provide practical examples to illustrate the differences. We'll also cover how to generate test data and measure performance effectively.
5+
In this article, we'll explore when each approach shines, provide practical examples with realistic large datasets, and demonstrate how choosing the right data structure can dramatically improve performance for specific use cases.
66

7-
## The Hidden Problem: Why DataFrames Can Hurt Performance
8-
Let me show you a common scenario that many developers face when working with DataFrames. Imagine you have a CSV file with stock prices, and you want to calculate the average price. Here's how you might do it using DataFrames:
7+
## Understanding the Trade-offs
8+
9+
Let's consider a common task: calculating the average price from a CSV file of stock data. With a DataFrame, the approach is straightforward:
910

1011
```smalltalk
1112
stockDataFile := 'stock_data.csv'.
@@ -17,17 +18,14 @@ averagePrice := priceColumn average.
1718
Transcript show: 'Total Average Price: ', averagePrice asString; cr.
1819
```
1920

20-
This works perfectly for small files. But here's what happens when your CSV file grows from 1MB to 100MB to 1GB:
21+
This works beautifully for moderately sized files. But what happens when your dataset isn't just a few megabytes, but grows to 20GB, 100GB, or even larger? In these real-world scenarios, loading the entire file into memory can lead to significant challenges:
2122

2223
**The Problems:**
23-
- **Memory Overhead**: DataFrames load the entire file into memory, which can be **100x larger** than the file size itself. For a 16MB file, you might end up using over 12GB of RAM.
2424
- **Garbage Collection**: As the DataFrame grows, garbage collection kicks in frequently, slowing down your program. This is because DataFrames create many temporary objects that need to be cleaned up.
2525
- **Performance**: Calculating the average involves iterating through potentially millions of rows, which can take a long time. The more data you have, the slower it gets.
26-
- **Crashes**: If the file is larger than your available RAM, you get "Out of Memory" errors, causing your program to crash.
27-
28-
29-
Think of it like this: you want to know the average height of people in a room, so you ask everyone to stand in line and write down everyone's details in a notebook. That's what DataFrames do - they store everything, even when you only need one number.
26+
- **Extreme Memory Pressure**: If a dataset is larger than the available RAM, the operating system starts using the disk as virtual memory. This swapping makes your program very slow, because disk access is orders of magnitude slower than RAM access.
3027

28+
Think of it this way: to find the average height of people in a large crowd, you don't need a detailed notebook with everyone's name, age, and address. You just need to sum their heights and divide by the count. The circular buffer approach is like having a simple calculator and notepad: it keeps only the information essential for the immediate task.
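
The crowd analogy can be sketched in a few lines of Pharo. This is only an illustration under stated assumptions: the `heights` array is a hypothetical stand-in for values that would normally arrive one at a time from a file stream, and only two numbers are kept in memory regardless of how many values pass through.

```smalltalk
| sum count heights average |
sum := 0.
count := 0.

"Hypothetical sample data; in practice each value would be read from a stream"
heights := #(160 170 180).

"Keep only the running sum and the count, never the full dataset"
heights do: [ :height |
    sum := sum + height.
    count := count + 1 ].

"Guard against an empty input before dividing"
average := count = 0
    ifTrue: [ 0 ]
    ifFalse: [ sum / count ].
```

The same pattern applies to the stock price average: read a line, add the price to `sum`, increment `count`, and discard the line.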
3129

3230
## Moving Averages: A Practical Example
3331
Let's take this a step further and calculate moving averages, which is a common task in financial applications. We'll compare the DataFrame approach with the circular buffer approach.
@@ -38,7 +36,7 @@ Let's generate some realistic test data to see how these two approaches perform
3836
| stockDataFile rowCount windowSize |
3937
4038
stockDataFile := 'moving_avg_test.csv'.
41-
rowCount := 500000.
39+
rowCount := 8000000.
4240
windowSize := 100.
4341
4442
"Clean up any old files"
@@ -48,37 +46,41 @@ stockDataFile asFileReference exists ifTrue: [
4846
4947
Transcript show: 'BENCHMARK: Moving Average - DataFrame vs Streaming Buffer'; cr.
5048
51-
"Generate simple test data"
5249
Transcript show: 'Generating test data...'; cr.
5350
stockDataFile asFileReference writeStreamDo: [ :stream |
5451
| currentPrice |
55-
currentPrice := 100.0.
56-
stream nextPutAll: 'S.No.,Price,Low,High'; cr.
57-
52+
currentPrice := 150.0.
53+
54+
stream nextPutAll: 'S.No.,Symbol,Price,Volume,Low,High,OpenInterest'; cr.
55+
5856
1 to: rowCount do: [ :day |
59-
| priceChange newPrice lowPrice highPrice |
60-
priceChange := (Random new next - 0.5) * 2.0. "±$1 change"
61-
newPrice := currentPrice + priceChange.
62-
63-
"Generate realistic Low and High values around the price"
64-
lowPrice := newPrice - (Random new next * 2.0). "Low is below price"
65-
highPrice := newPrice + (Random new next * 2.0). "High is above price"
66-
67-
stream nextPutAll: day asString, ',',
68-
(newPrice roundTo: 0.01) asString, ',',
69-
(lowPrice roundTo: 0.01) asString, ',',
70-
(highPrice roundTo: 0.01) asString; cr.
57+
| priceChange newPrice lowPrice highPrice volume openInterest |
58+
priceChange := (Random new next - 0.5) * 2.0.
59+
newPrice := (currentPrice + priceChange) max: 1.0.
60+
lowPrice := newPrice - (Random new next * 2.0).
61+
highPrice := newPrice + (Random new next * 2.0).
7162
63+
volume := 50000 + (1950000 atRandom) rounded.
64+
openInterest := 1000 + (49000 atRandom) rounded.
65+
66+
stream
67+
nextPutAll: day asString; nextPut: $,;
68+
nextPutAll: 'PHARO-STOCK'; nextPut: $,;
69+
nextPutAll: (newPrice roundTo: 0.01) asString; nextPut: $,;
70+
nextPutAll: volume asString; nextPut: $,;
71+
nextPutAll: (lowPrice roundTo: 0.01) asString; nextPut: $,;
72+
nextPutAll: (highPrice roundTo: 0.01) asString; nextPut: $,;
73+
nextPutAll: openInterest asString; cr.
74+
7275
currentPrice := newPrice.
7376
].
7477
].
7578
Transcript show: 'Generated file: ', (stockDataFile asFileReference size / 1024 / 1024) rounded asString, ' MB'; cr; cr.
7679
```
77-
This code generates a CSV file with 500,000 rows of realistic stock market data, including serial numbers, prices, daily lows, and highs. Each price is a random fluctuation around the previous day's price.
78-
## Performance Testing: The Numbers Tell the Story
7980

80-
> **Important Note for Benchmarking**: Before running these benchmarks, please disable background processes & ensure your system is not under heavy load. This will help ensure accurate and consistent benchmark results.
81+
This code generates a CSV file with 8,000,000 rows of stock data, simulating daily price changes. Each row contains a serial number, the stock symbol, price, volume, low, high, and open interest.
8182

83+
## Performance Testing: The Numbers Tell the Story
8284
Now let's compare the performance of the DataFrame approach with the circular buffer approach for calculating moving averages.
8385

8486
### Test Setup
@@ -87,13 +89,12 @@ Now let's compare the performance of the DataFrame approach with the circular bu
8789
Transcript show: 'Testing DataFrame approach...'; cr.
8890
3 timesRepeat: [ Smalltalk garbageCollect ].
8991
[
90-
| startTime endTime allocatedMemory numberOfScavenges numberOfFullGCs totalGCTime
91-
stockData priceColumn movingAverages |
92+
| startTime endTime memoryBefore memoryAfter gcBefore gcAfter gcTimeBefore gcTimeAfter
93+
stockData priceColumn movingAverages |
9294
93-
allocatedMemory := Smalltalk vm parameterAt: 34.
94-
numberOfScavenges := Smalltalk vm parameterAt: 9.
95-
numberOfFullGCs := Smalltalk vm parameterAt: 7.
96-
totalGCTime := (Smalltalk vm parameterAt: 8) + (Smalltalk vm parameterAt: 10).
95+
memoryBefore := Smalltalk vm parameterAt: 3.
96+
gcBefore := (Smalltalk vm parameterAt: 7) + (Smalltalk vm parameterAt: 9).
97+
gcTimeBefore := (Smalltalk vm parameterAt: 8) + (Smalltalk vm parameterAt: 10).
9798
startTime := Time millisecondClockValue.
9899
99100
"Load entire dataset"
@@ -114,17 +115,14 @@ Transcript show: 'Testing DataFrame approach...'; cr.
114115
].
115116
116117
endTime := Time millisecondClockValue.
117-
allocatedMemory := (Smalltalk vm parameterAt: 34) - allocatedMemory.
118-
numberOfScavenges := (Smalltalk vm parameterAt: 9) - numberOfScavenges.
119-
numberOfFullGCs := (Smalltalk vm parameterAt: 7) - numberOfFullGCs.
120-
totalGCTime := ((Smalltalk vm parameterAt: 8) + (Smalltalk vm parameterAt: 10)) - totalGCTime.
121-
118+
memoryAfter := Smalltalk vm parameterAt: 3.
119+
gcAfter := (Smalltalk vm parameterAt: 7) + (Smalltalk vm parameterAt: 9).
120+
gcTimeAfter := (Smalltalk vm parameterAt: 8) + (Smalltalk vm parameterAt: 10).
122121
Transcript show: 'DataFrame Test Results:'; cr.
123122
Transcript show: ' Time: ', (endTime - startTime) asString, ' ms'; cr.
124-
Transcript show: ' Memory Allocated: ', (allocatedMemory / 1024 / 1024) rounded asString, ' MB'; cr.
125-
Transcript show: ' Scavenges: ', numberOfScavenges asString; cr.
126-
Transcript show: ' Full GCs: ', numberOfFullGCs asString; cr.
127-
Transcript show: ' Total GC Time: ', totalGCTime asString, ' ms'; cr.
123+
Transcript show: ' Memory: ', ((memoryAfter - memoryBefore) / 1024 / 1024) rounded asString, ' MB'; cr.
124+
Transcript show: ' GC Events: ', (gcAfter - gcBefore) asString; cr.
125+
Transcript show: ' GC Time: ', (gcTimeAfter - gcTimeBefore) asString, ' ms'; cr.
128126
Transcript show: ' Moving Averages: ', movingAverages size asString; cr.
129127
Transcript show: ' Final MA: $', (movingAverages last roundTo: 0.01) asString; cr; cr.
130128
@@ -139,35 +137,29 @@ Transcript show: 'Testing Buffer approach...'; cr.
139137
140138
3 timesRepeat: [ Smalltalk garbageCollect ].
141139
[
142-
| startTime endTime allocatedMemory numberOfScavenges numberOfFullGCs totalGCTime
143-
priceBuffer movingAverages processedCount |
144-
145-
allocatedMemory := Smalltalk vm parameterAt: 34.
146-
numberOfScavenges := Smalltalk vm parameterAt: 9.
147-
numberOfFullGCs := Smalltalk vm parameterAt: 7.
148-
totalGCTime := (Smalltalk vm parameterAt: 8) + (Smalltalk vm parameterAt: 10).
140+
| startTime endTime memoryBefore memoryAfter gcBefore gcAfter gcTimeBefore gcTimeAfter priceBuffer movingAverages |
141+
142+
memoryBefore := Smalltalk vm parameterAt: 3.
143+
gcBefore := (Smalltalk vm parameterAt: 7) + (Smalltalk vm parameterAt: 9).
144+
gcTimeBefore := (Smalltalk vm parameterAt: 8) + (Smalltalk vm parameterAt: 10).
149145
startTime := Time millisecondClockValue.
150146
151147
priceBuffer := CTFIFOBuffer new: windowSize.
152148
movingAverages := OrderedCollection new.
153-
processedCount := 0.
154-
149+
155150
stockDataFile asFileReference readStreamDo: [ :fileStream |
156-
| line |
157151
fileStream atEnd ifFalse: [ fileStream nextLine ].
158152
159153
[ fileStream atEnd ] whileFalse: [
154+
| line |
160155
line := fileStream nextLine.
161156
162157
line ifNotEmpty: [
163158
| csvParts price |
164159
csvParts := line splitOn: ','.
165-
price := (csvParts at: 2) asNumber. "Price is 2nd column"
166-
160+
price := (csvParts at: 3) asNumber.
167161
priceBuffer push: price.
168-
processedCount := processedCount + 1.
169-
170-
"Calculate moving average when buffer is full"
162+
171163
priceBuffer isFull ifTrue: [
172164
| bufferSum movingAvg |
173165
bufferSum := 0.
@@ -180,17 +172,15 @@ Transcript show: 'Testing Buffer approach...'; cr.
180172
].
181173
182174
endTime := Time millisecondClockValue.
183-
allocatedMemory := (Smalltalk vm parameterAt: 34) - allocatedMemory.
184-
numberOfScavenges := (Smalltalk vm parameterAt: 9) - numberOfScavenges.
185-
numberOfFullGCs := (Smalltalk vm parameterAt: 7) - numberOfFullGCs.
186-
totalGCTime := ((Smalltalk vm parameterAt: 8) + (Smalltalk vm parameterAt: 10)) - totalGCTime.
175+
memoryAfter := Smalltalk vm parameterAt: 3.
176+
gcAfter := (Smalltalk vm parameterAt: 7) + (Smalltalk vm parameterAt: 9).
177+
gcTimeAfter := (Smalltalk vm parameterAt: 8) + (Smalltalk vm parameterAt: 10).
187178
188179
Transcript show: 'Buffer Results:'; cr.
189180
Transcript show: ' Time: ', (endTime - startTime) asString, ' ms'; cr.
190-
Transcript show: ' Memory Allocated: ', (allocatedMemory / 1024 / 1024) rounded asString, ' MB'; cr.
191-
Transcript show: ' Scavenges: ', numberOfScavenges asString; cr.
192-
Transcript show: ' Full GCs: ', numberOfFullGCs asString; cr.
193-
Transcript show: ' Total GC Time: ', totalGCTime asString, ' ms'; cr.
181+
Transcript show: ' Memory: ', ((memoryAfter - memoryBefore) / 1024 / 1024) rounded asString, ' MB'; cr.
182+
Transcript show: ' GC Events: ', (gcAfter - gcBefore) asString; cr.
183+
Transcript show: ' GC Time: ', (gcTimeAfter - gcTimeBefore) asString, ' ms'; cr.
194184
Transcript show: ' Moving Averages: ', movingAverages size asString; cr.
195185
Transcript show: ' Final MA: $', (movingAverages last roundTo: 0.01) asString; cr; cr.
196186
] value.
@@ -202,53 +192,49 @@ Transcript show: 'Tests Done!'; cr.
202192
```
203193

204194
### Benchmark Results
205-
Here are the results from running this benchmark on a 500,000-row dataset (approximately 15MB file):
195+
196+
Here are the results from running this benchmark on an 8,000,000-row dataset (a file of approximately 450 MB):
197+
206198
| Metric | DataFrame | Circular Buffer | Improvement |
207199
|--------|-----------|-----------------|-------------|
208-
| **Execution Time** | 14,986 ms | 2,136 ms | **7.0x faster** |
209-
| **Memory Allocated** | 12,914 MB | 785 MB | **16.4x less memory** |
210-
| **Scavenges** | 867 | 52 | **94% fewer** |
211-
| **Full GCs** | 4 | 0 | **100% fewer** |
212-
| **Total GC Time** | 3,769 ms | 3 ms | **1,256x less GC overhead** |
213-
| **Results Generated** | 499,901 | 499,901 | Identical accuracy |
200+
| **Execution Time** | ~968,804 ms | ~46,258 ms | **21x faster** |
201+
| **Memory Usage** | ~1,184 MB | ~112 MB | **10.6x less memory** |
202+
| **GC Events** | ~14,943 | ~944 | **94% fewer** |
203+
| **GC Time** | ~826,382 ms | ~405 ms | **2,040x less GC overhead** |
204+
| **Results Generated** | 7,999,901 | 7,999,901 | Identical accuracy |
214205

215206
*Note: Results may vary based on your hardware, Pharo version, and system load*
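
For readers who want to check the "Improvement" column, the factors follow directly from the approximate values in the table above (rounded as reported there):

$$
\frac{968{,}804\ \text{ms}}{46{,}258\ \text{ms}} \approx 20.9, \qquad
\frac{1{,}184\ \text{MB}}{112\ \text{MB}} \approx 10.6, \qquad
1 - \frac{944}{14{,}943} \approx 0.937, \qquad
\frac{826{,}382\ \text{ms}}{405\ \text{ms}} \approx 2{,}040
$$

The result count is also consistent with the window size: a 100-value window over 8,000,000 prices yields $8{,}000{,}000 - 100 + 1 = 7{,}999{,}901$ moving averages.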
216207

217208
**Key Insights:**
218-
- Circular buffers processed the same data **7.0x faster**
219-
- Used **16.4x less memory** despite processing the same amount of data
220-
- Had **94% fewer scavenges and eliminated all full GCs**, leading to smoother performance
221-
- Spent virtually no time on garbage collection (3ms vs 3.8+ seconds)
209+
- Circular buffers processed the same data **21x faster**
210+
- Used **10.6x less memory** despite processing the same amount of data
211+
- Had **94% fewer garbage collection events**, leading to smoother performance
212+
- Spent virtually no time on memory cleanup (405 ms vs. 826+ seconds)
222213
- Produced identical results, proving accuracy isn't compromised
223214

224215
## When to Use Each Approach
216+
Choosing between a DataFrame and a Circular Buffer isn't about which is "better"; it's about picking the right tool for the job.
225217

226218
### Use DataFrames When:
227-
- Your entire dataset comfortably fits in memory (under 100MB typically)
219+
- Your entire dataset comfortably fits in memory
228220
- You need complex operations like joins, group-by, or statistical functions
229221
- You're doing one-time analysis where you explore data interactively
230-
- You need to sort, filter, or query data in complex ways
231222

232223
### Use Circular Buffers When:
233-
- Processing large files (over 100MB)
234-
- Computing simple statistics (averages, sums, counts)
235-
- Building real-time systems that process continuous data streams
236-
- Working with memory-limited environments
237224
- Processing data bigger than your available RAM
225+
- Working with memory-limited environments
226+
- You only need to look at a small, sliding "window" of data at a time (like for a moving average)
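
To make the sliding-window idea concrete, here is a minimal sketch that uses a plain `Array` as a ring of fixed size. It is an illustration only, not the Containers-Buffer API: the variable names and the sample `prices` are hypothetical, and the running sum is one possible design that avoids re-adding every element of the window on each step.

```smalltalk
| windowSize window index count runningSum prices averages |
windowSize := 3.
window := Array new: windowSize withAll: 0.
index := 0.
count := 0.
runningSum := 0.

"Hypothetical sample prices; in practice these would be parsed line by line"
prices := #(10 20 30 40 50).
averages := OrderedCollection new.

prices do: [ :price |
    "Advance to the next slot, wrapping around after the last one"
    index := index \\ windowSize + 1.

    "Replace the oldest value: subtract it from the sum, add the new one"
    runningSum := runningSum - (window at: index) + price.
    window at: index put: price.
    count := count + 1.

    "Emit an average only once the window has been filled"
    count >= windowSize
        ifTrue: [ averages add: runningSum / windowSize ] ].
```

Memory use stays fixed at `windowSize` slots no matter how many prices are processed, which is the property that makes this approach suitable for files larger than RAM.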
238227

239228
## Try It Yourself
240229

241-
1. Generate some test data using the code above
242-
2. Run both approaches on files of different sizes
243-
3. Watch the memory usage and timing differences
244-
4. See how circular buffers handle files bigger than your RAM
230+
1. Generate test data using the code above
231+
2. Run both approaches and compare the results
245232

246233
You'll be surprised at how much faster your programs can run when you choose the right data structure for the job.
247234

248235
## Summary
249-
In this article, we explored the limitations of DataFrames when processing large datasets in Pharo and introduced circular buffers as a more efficient alternative. We demonstrated how circular buffers can handle large files without running out of memory, while also being significantly faster for simple computations like averages and moving averages.
250-
251-
We also provided practical examples of generating test data and measuring performance for both approaches. The key takeaway is that sometimes the simplest solution is the fastest solution, especially when dealing with large datasets.
236+
DataFrames are, and will remain, an essential, powerful, and correct choice for a wide range of data analysis tasks. Their flexibility for interactive exploration is unmatched when working with datasets that fit in memory.
252237

238+
However, when the dataset is massive and we need only simple computations such as moving averages, which can be done with a sliding window without loading everything into memory, a circular buffer can provide significant performance benefits. The key takeaway is to understand your data processing needs and choose the right tool for the job.
253239

254240
Want to explore more? Check out the [Containers-Buffer repository](https://github.com/pharo-containers/Containers-Buffer) to see the complete implementation and examples.
