# DataFrames vs Circular Buffers: Processing Large Datasets Efficiently in Pharo
-
When processing large datasets in Pharo, developers typically reach for DataFrames and for good reason. They provide a powerful way to manipulate and analyze data, similar to how you would in Python or R. However, there's a hidden problem that many developers overlook: DataFrames can be inefficient and even detrimental to performance when dealing with large files.
+
When processing large datasets in Pharo, developers typically reach for DataFrames, and for good reason. They provide a powerful way to manipulate and analyze data, similar to proven solutions like [Pandas in Python](https://pandas.pydata.org/), [data.frame in R](https://www.r-project.org/), or [Pharo's DataFrame implementation](https://github.com/PolyMathOrg/DataFrame). DataFrames excel at complex operations like joins, grouping, statistical analysis, and data exploration. However, not all data processing tasks require loading an entire dataset into memory. For certain problems, especially those involving massive files and streaming computations, a more specialized tool can offer significant performance benefits. This is where a circular buffer shines.
-
In this article, we'll explore why DataFrames can hurt performance, how circular buffers can be a better alternative, and provide practical examples to illustrate the differences. We'll also cover how to generate test data and measure performance effectively.
+
In this article, we'll explore when each approach shines, provide practical examples with realistic large datasets, and demonstrate how choosing the right data structure can dramatically improve performance for specific use cases.
-
## The Hidden Problem: Why DataFrames Can Hurt Performance
-
Let me show you a common scenario that many developers face when working with DataFrames. Imagine you have a CSV file with stock prices, and you want to calculate the average price. Here's how you might do it using DataFrames:
+
## Understanding the Trade-offs
+
+
Let's consider a common task: calculating the average price from a CSV file of stock data. With a DataFrame, the approach is straightforward:
Transcript show: 'Total Average Price: ', averagePrice asString; cr.
```
-
This works perfectly for small files. But here's what happens when your CSV file grows from 1MB to 100MB to 1GB:
+
This works beautifully for moderately sized files. But what happens when your dataset isn't just a few megabytes, but grows to 20GB, 100GB, or even larger? In these real-world scenarios, loading the entire file into memory can lead to significant challenges:
**The Problems:**
-
-**Memory Overhead**: DataFrames load the entire file into memory, which can be **100x larger** than the file size itself. For a 16MB file, you might end up using over 12GB of RAM.
-**Garbage Collection**: As the DataFrame grows, garbage collection kicks in frequently, slowing down your program. This is because DataFrames create many temporary objects that need to be cleaned up.
-**Performance**: Calculating the average involves iterating through potentially millions of rows, which can take a long time. The more data you have, the slower it gets.
-
-**Crashes**: If the file is larger than your available RAM, you get "Out of Memory" errors, causing your program to crash.
-
-
-
Think of it like this: you want to know the average height of people in a room, so you ask everyone to stand in line and write down everyone's details in a notebook. That's what DataFrames do - they store everything, even when you only need one number.
+
-**Extreme Memory Pressure**: If a dataset is larger than the available RAM, the operating system starts using the hard disk as virtual memory. This process will make your program very slow, as disk access is orders of magnitude slower than RAM access.
+
Think of it this way: to find the average height of people in a large crowd, you don't need a detailed notebook with everyone's name, age, and address. You just need to sum their heights and divide by the count. The circular buffer approach is like having a simple calculator and notepad: it keeps only the information essential for the immediate task.
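
The sum-and-count idea can be sketched in a few lines of plain Pharo. This is a minimal illustration rather than code from the article or its repository: the `prices` stream is a hypothetical stand-in for values read row by row from a CSV file, and all variable names are assumptions.

```smalltalk
"Running average: keep only a sum and a count, never the whole dataset."
| prices sum count average |

"Hypothetical stand-in for prices read one row at a time from a file."
prices := #(100.0 102.0 101.0 103.0 99.0) readStream.

sum := 0.
count := 0.
[ prices atEnd ] whileFalse: [
    sum := sum + prices next.     "consume one value, then forget it"
    count := count + 1 ].

"Guard against an empty input before dividing."
average := count = 0
    ifTrue: [ 0 ]
    ifFalse: [ sum / count ].

Transcript show: 'Total Average Price: ', average asString; cr.
```

Memory use here stays constant regardless of how many rows the stream delivers, because each value is discarded as soon as it has been added to the sum.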
## Moving Averages: A Practical Example
Let's take this a step further and calculate moving averages, which is a common task in financial applications. We'll compare the DataFrame approach with the circular buffer approach.
@@ -38,7 +36,7 @@ Let's generate some realistic test data to see how these two approaches perform
This code generates a CSV file with 500,000 rows of realistic stock market data, including serial numbers, prices, daily lows, and highs. Each price is a random fluctuation around the previous day's price.
-
## Performance Testing: The Numbers Tell the Story
-
> **Important Note for Benchmarking**: Before running these benchmarks, please disable background processes & ensure your system is not under heavy load. This will help ensure accurate and consistent benchmark results.
+
This code generates a CSV file with 8,000,000 rows of stock data, simulating daily price changes. Each row contains columns for the stock symbol, price, volume, low, high, and open interest.
+
## Performance Testing: The Numbers Tell the Story
Now let's compare the performance of the DataFrame approach with the circular buffer approach for calculating moving averages.
### Test Setup
@@ -87,13 +89,12 @@ Now let's compare the performance of the DataFrame approach with the circular bu
- Building real-time systems that process continuous data streams
-
- Working with memory-limited environments
- Processing data bigger than your available RAM
+
- Working with memory-limited environments
+
- Looking at only a small, sliding "window" of data at a time (as with a moving average)
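
The sliding-window case can be sketched with a fixed-size `Array` used as a circular buffer. This is an illustrative sketch in plain Pharo and deliberately does not use the Containers-Buffer API; the window size, the sample `prices`, and all variable names are assumptions chosen for the example.

```smalltalk
"Moving average over a sliding window, using a fixed-size Array as a circular buffer."
| windowSize buffer writeIndex filled windowSum prices averages |

windowSize := 3.
buffer := Array new: windowSize withAll: 0.   "holds only the last windowSize values"
writeIndex := 1.                              "slot that will be overwritten next"
filled := 0.                                  "how many slots hold real data so far"
windowSum := 0.
averages := OrderedCollection new.

"Hypothetical sample input; in practice the values would be streamed from a file."
prices := #(10 20 30 40 50).

prices do: [ :price |
    "Subtract the value about to be overwritten, then add the new one."
    windowSum := windowSum - (buffer at: writeIndex) + price.
    buffer at: writeIndex put: price.

    "Advance the write position, wrapping back to 1 after the last slot."
    writeIndex := writeIndex \\ windowSize + 1.
    filled := (filled + 1) min: windowSize.

    "Emit an average only once the window is full."
    filled = windowSize
        ifTrue: [ averages add: windowSum / windowSize ] ].

averages
```

For the five sample prices and a window of three, `averages` ends up holding 20, 30, and 40. Each new value costs one subtraction and one addition, and the buffer never grows beyond `windowSize` elements, however long the input is.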
## Try It Yourself
-
1. Generate some test data using the code above
-
2. Run both approaches on files of different sizes
-
3. Watch the memory usage and timing differences
-
4. See how circular buffers handle files bigger than your RAM
+
1. Generate test data using the code above
+
2. Run both approaches and compare the results
You'll be surprised at how much faster your programs can run when you choose the right data structure for the job.
## Summary
-
In this article, we explored the limitations of DataFrames when processing large datasets in Pharo and introduced circular buffers as a more efficient alternative. We demonstrated how circular buffers can handle large files without running out of memory, while also being significantly faster for simple computations like averages and moving averages.
-
-
We also provided practical examples of generating test data and measuring performance for both approaches. The key takeaway is that sometimes the simplest solution is the fastest solution, especially when dealing with large datasets.
+
DataFrames are, and will remain, an essential, powerful, and correct choice for a wide range of data analysis tasks. Their flexibility for interactive exploration is unmatched when working with datasets that fit in memory.
+
However, when the dataset is massive and you need only simple computations, such as a moving average that can be calculated over a sliding window without loading everything into memory, a circular buffer can provide significant performance benefits. The key takeaway is to understand your data processing needs and choose the right tool for the job.
Want to explore more? Check out the [Containers-Buffer repository](https://github.com/pharo-containers/Containers-Buffer) to see the complete implementation and examples.