Commit c6a7759

Updated DataFrame documentation (#1793)
* Updated DataFrame documentation * Updated documentation navigation
1 parent cc582e3 commit c6a7759

17 files changed

Lines changed: 839 additions & 159 deletions
Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
1+
# Batch Processing
2+
3+
- [⬅️️ Back](/documentation/components/core/core.md)
4+
5+
Batch processing controls how data flows through the DataFrame pipeline, affecting memory usage and performance.
6+
7+
## Batch Size Control
8+
9+
### batchSize() - Control processing chunks
10+
11+
```php
12+
<?php
13+
14+
use function Flow\ETL\DSL\{data_frame, from_array, to_output};
15+
16+
$dataFrame = data_frame()
17+
    ->read(from_array($largeDataset))
18+
    ->batchSize(1000) // Process in batches of 1000 rows
19+
    ->map($expensiveTransformation)
20+
    ->write(to_output())
21+
    ->run();
22+
```
23+
24+
> **Performance Tip**: Optimal batch size depends on your data and available memory. Larger batches reduce I/O
25+
> operations but increase memory usage. Start with 1000-5000 rows and adjust based on your specific use case.
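
To illustrate why batch size bounds memory, here is a minimal plain-PHP sketch of chunked processing. It does not use Flow and is not how `batchSize()` is implemented; `in_batches()` is a hypothetical helper, and `range()` stands in for a real extractor.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Flow): groups any iterable into arrays of at most $size rows.
 */
function in_batches(iterable $rows, int $size): \Generator
{
    if ($size < 1) {
        throw new \InvalidArgumentException('Batch size must be at least 1.');
    }

    $batch = [];

    foreach ($rows as $row) {
        $batch[] = $row;

        // Hand over a full batch and start a new one, so at most $size rows are buffered here.
        if (\count($batch) === $size) {
            yield $batch;
            $batch = [];
        }
    }

    // Emit the final, possibly smaller, batch.
    if ($batch !== []) {
        yield $batch;
    }
}

$batchSizes = [];

foreach (in_batches(range(1, 2500), 1000) as $batch) {
    $batchSizes[] = \count($batch);
}

// $batchSizes is [1000, 1000, 500]
```

Only one batch is buffered by the helper at a time, which is the trade-off the tip describes: a larger `$size` means fewer hand-offs but more rows held in memory at once.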
26+
27+
## Data Collection
28+
29+
### collect() - Load all data into memory
30+
31+
```php
32+
<?php
33+
34+
use function Flow\ETL\DSL\{col, data_frame};

$dataFrame = data_frame()
35+
    ->read($extractor)
36+
    ->filter($condition)
37+
    ->collect() // Collect all filtered data into a single batch
38+
    ->sortBy(col('name')) // The collected data can now be sorted
39+
    ->write($loader)
40+
    ->run();
41+
```
42+
43+
> **⚠️ Memory Warning**: The `collect()` method loads all data into memory at once. This can cause memory exhaustion
44+
> with large datasets. Use only when:
45+
> - You're certain the entire dataset fits comfortably in available memory
46+
> - You need operations that require all data (like sorting)
47+
> - You're working with small to medium datasets
48+
49+
## Memory Management Strategies
50+
51+
### Monitoring Memory Usage
52+
53+
```php
54+
<?php
55+
56+
use function Flow\ETL\DSL\{analyze, data_frame};
57+
58+
$report = data_frame()
59+
    ->read($extractor)
60+
    ->batchSize(1000)
61+
    ->map($transformation)
62+
    ->write($loader)
63+
    ->run(analyze: analyze());
64+
65+
echo "Peak memory usage: " . $report->statistics()->memory->max()->inMb() . " MB\n";
66+
```
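
When the Flow report is not available, peak memory can also be read directly from PHP. The sketch below uses only the built-in `memory_get_peak_usage()`; `bytes_to_mb()` is a hypothetical helper, and it assumes 1 MB = 1024 * 1024 bytes, which may differ from the conversion Flow applies in `inMb()`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: converts a byte count to megabytes.
 * Assumes 1 MB = 1024 * 1024 bytes.
 */
function bytes_to_mb(int $bytes, int $precision = 2): float
{
    return round($bytes / (1024 * 1024), $precision);
}

// true = report memory allocated from the system, not only what emalloc() currently uses.
$peakBytes = memory_get_peak_usage(true);

echo 'Peak memory usage: ' . bytes_to_mb($peakBytes) . " MB\n";
```

Labelling the unit next to the number avoids the ambiguity of printing a megabyte figure as if it were bytes.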

documentation/components/core/building-blocks.md

Lines changed: 6 additions & 18 deletions
@@ -2,11 +2,12 @@
22

33
- [⬅️️ Back](/documentation/components/core/core.md)
44

5-
Entries are the columns of the data frame, they are represented by the [Entry](/src/core/etl/src/Flow/ETL/Row/Entry.php) interface.
5+
Entries are the columns of the [Data Frame](/documentation/components/core/core.md); they are represented by
6+
the [Entry](/src/core/etl/src/Flow/ETL/Row/Entry.php) interface.
67
A group of Entries is called a `Row`; it is represented by the [Row](/src/core/etl/src/Flow/ETL/Row.php) class.
78
A group of Rows is called `Rows`; it is represented by the [Rows](/src/core/etl/src/Flow/ETL/Rows.php) class.
89

9-
Let's look at the following example:
10+
Let's look at the following example:
1011

1112
```php
1213
<?php
@@ -24,7 +25,7 @@ $rows = rows(
2425
```
2526

2627
Rows are the main data structure in Flow ETL; they’re used to represent data in the data frame.
27-
Extractors are yielding Rows and Loaders are saving Rows.
28+
Extractors yield Rows and Loaders save Rows.
2829

2930
The same can be achieved using the following code:
3031

@@ -61,20 +62,7 @@ $rows = array_to_rows([
6162
- [XML](/src/core/etl/src/Flow/ETL/Row/Entry/XMLEntry.php)
6263
- [XMLElement](/src/core/etl/src/Flow/ETL/Row/Entry/XMLElementEntry.php)
6364

64-
Internally flow is using [EntryFactory](/src/core/etl/src/Flow/ETL/Row/EntryFactory.php) to create entries.
65+
Internally, Flow uses [EntryFactory](/src/core/etl/src/Flow/ETL/Row/EntryFactory.php) to create entries.
6566
It will try to detect and create the most appropriate entry type based on the value.
6667

67-
Flow Entries are based on [PHP Types](/src/core/etl/src/Flow/ETL/PHP/Type/Type.php), which are divided into two groups:
68-
69-
- Native
70-
- Array
71-
- Callable
72-
- Enum
73-
- Object
74-
- Resource
75-
- Scalar
76-
- Logical
77-
- List
78-
- Map
79-
- Structure
80-
68+
Flow Entries are based on the [Flow Types Library](/documentation/components/libs/types.md).
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
1+
# Constraints
2+
3+
- [⬅️️ Back](/documentation/components/core/core.md)
4+
5+
Data constraints allow you to apply business rules and data integrity checks to ensure data quality during processing. When a constraint is violated, a `ConstraintViolationException` is thrown with details about the violating row.
6+
7+
## Unique Constraints
8+
9+
Ensure that values in specified columns are unique across the entire dataset.
10+
11+
```php
12+
<?php
13+
14+
use function Flow\ETL\DSL\{data_frame, from_array, constraint_unique, to_output};
15+
16+
$dataFrame = data_frame()
17+
    ->read(from_array([
18+
        ['email' => 'user1@example.com', 'username' => 'user1'],
19+
        ['email' => 'user2@example.com', 'username' => 'user2'],
20+
        ['email' => 'user1@example.com', 'username' => 'user3'], // This row causes a constraint violation
21+
    ]))
22+
    ->constrain(constraint_unique('email'))
23+
    ->write(to_output())
24+
    ->run();
25+
```
26+
27+
## Multiple Column Constraints
28+
29+
Ensure unique combinations across multiple columns:
30+
31+
```php
32+
<?php
33+
34+
use function Flow\ETL\DSL\{constraint_unique, data_frame};
35+
36+
$dataFrame = data_frame()
37+
    ->read($extractor)
38+
    ->constrain(constraint_unique('username', 'tenant_id'))
39+
    ->write($loader)
40+
    ->run();
41+
```
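
The difference between a single-column and a multi-column uniqueness check can be sketched in plain PHP. This is an illustration under stated assumptions, not Flow's implementation of `constraint_unique()`; `first_duplicate()` is a hypothetical helper that works on plain arrays rather than Flow rows.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not Flow's implementation): returns the index of the first row whose
 * combination of $columns repeats an earlier row, or null when every combination is unique.
 *
 * @param array<int, array<string, mixed>> $rows
 * @param array<int, string> $columns
 */
function first_duplicate(array $rows, array $columns): ?int
{
    $seen = [];

    foreach ($rows as $index => $row) {
        // Serialize the selected values so that ('a', 'bc') and ('ab', 'c') produce different keys.
        $key = serialize(array_map(static fn (string $column) => $row[$column] ?? null, $columns));

        if (isset($seen[$key])) {
            return $index;
        }

        $seen[$key] = true;
    }

    return null;
}

$rows = [
    ['username' => 'user1', 'tenant_id' => 1],
    ['username' => 'user1', 'tenant_id' => 2], // same username, different tenant
    ['username' => 'user1', 'tenant_id' => 1], // repeats the first combination
];

$byPair = first_duplicate($rows, ['username', 'tenant_id']); // 2
$byUsername = first_duplicate($rows, ['username']);          // 1
```

A check of this shape has to remember every combination it has seen, so its memory use grows with the number of distinct values, which is worth keeping in mind for very large datasets.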
42+
43+
## Custom Constraints
44+
45+
You can implement custom constraints by creating classes that implement the `Constraint` interface:
46+
47+
```php
48+
<?php
49+
50+
use Flow\ETL\{Constraint, Row};
51+
52+
class AgeRangeConstraint implements Constraint
53+
{
54+
    public function __construct(private int $minAge, private int $maxAge) {}
55+
56+
    public function isSatisfiedBy(Row $row): bool
57+
    {
58+
        $age = $row->get('age')->value();
59+
        return $age >= $this->minAge && $age <= $this->maxAge;
60+
    }
61+
62+
    public function toString(): string
63+
    {
64+
        return "Age must be between {$this->minAge} and {$this->maxAge}";
65+
    }
66+
67+
    public function violation(Row $row): string
68+
    {
69+
        return "Age {$row->get('age')->value()} is outside allowed range";
70+
    }
71+
}
72+
```

documentation/components/core/core.md

Lines changed: 93 additions & 16 deletions
@@ -4,29 +4,106 @@
44
- [📚API Reference](/documentation/api/core)
55
- [📁Files](/documentation/api/core/indices/files.html)
66

7-
A Data Frame is a structured collection of tabular data, similar to a spreadsheet.
8-
It organizes information into rows and columns, making it easy to understand, filter, and transform.
9-
Using a Data Frame, you can quickly merge, clean, or modify data for your ETL processes,
10-
allowing developers to focus more on transformations rather than low-level data handling.
7+
A Data Frame is the core component of Flow PHP's ETL framework. It represents a structured collection of tabular data that can be processed, transformed, and loaded efficiently. Think of it as a programmable spreadsheet that can handle large datasets with minimal memory footprint.
118

12-
Unlike loading an entire dataset at once, a Data Frame processes information in smaller, manageable chunks.
13-
As it moves through the data, it only keeps a limited number of rows in memory at any given time.
14-
This approach helps avoid running out of memory, making it efficient and scalable for handling large datasets.
9+
## Key Features
1510

16-
Simple example of reading from php a array and writing to stdout.
11+
- **Memory Efficient**: Processes data in chunks using generators, avoiding memory exhaustion
12+
- **Lazy Evaluation**: Operations are only executed when needed
13+
- **Immutable**: Each transformation returns a new DataFrame instance
14+
- **Type Safe**: Strict typing throughout with comprehensive schema support
15+
- **Chainable API**: Fluent interface for building complex data pipelines
16+
17+
## Understanding DataFrame Operations
18+
19+
DataFrame methods fall into two categories based on when they execute:
20+
21+
### Lazy Operations (`@lazy`)
22+
23+
These methods build the processing pipeline without executing it immediately:
24+
25+
- **Transformations**: `filter()`, `map()`, `withEntry()`, `select()`, `drop()`, `rename()`
26+
- **Memory-intensive**: `collect()`, `sortBy()`, `groupBy()`, `join()`, `cache()`
27+
- **Processing control**: `batchSize()`, `limit()`, `offset()`, `partitionBy()`
28+
29+
### Trigger Operations (`@trigger`)
30+
31+
These methods execute the entire pipeline and return results:
32+
33+
- **Data retrieval**: `get()`, `getEach()`, `fetch()`, `count()`
34+
- **Output operations**: `run()`, `forEach()`, `printRows()`, `printSchema()`
35+
- **Schema inspection**: `schema()`, `display()`
36+
37+
> **Important**: Build your complete pipeline with lazy operations, then execute once with a trigger operation for optimal performance.
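
The lazy/trigger split can be illustrated with plain PHP generators. This is a conceptual sketch only and makes no claim about Flow's internals; `lazy_filter()` and `lazy_map()` are hypothetical helpers, and `iterator_to_array()` plays the role of a trigger operation.

```php
<?php

declare(strict_types=1);

$log = [];

// Hypothetical lazy steps built on generators; calling them only wires the pipeline together.
function lazy_filter(iterable $rows, callable $predicate): \Generator
{
    foreach ($rows as $row) {
        if ($predicate($row)) {
            yield $row;
        }
    }
}

function lazy_map(iterable $rows, callable $transform): \Generator
{
    foreach ($rows as $row) {
        yield $transform($row);
    }
}

$pipeline = lazy_map(
    lazy_filter([1, 2, 3, 4], static fn (int $n): bool => $n % 2 === 0),
    static function (int $n) use (&$log): int {
        $log[] = "mapped {$n}";

        return $n * 10;
    }
);

// Nothing has executed yet, so the log is still empty.
$callsBeforeTrigger = \count($log);

// Iterating the pipeline executes every step, once, for each row.
$result = iterator_to_array($pipeline, false);
```

Because the steps run only when the pipeline is consumed, triggering it a second time would repeat all the work, which is why the complete pipeline should be built first and executed once.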
38+
39+
## Creating DataFrames
40+
41+
DataFrames are created using the `data_frame()` DSL function and populated with data through extractors. The framework supports various data sources through adapter-specific extractors.
1742

1843
```php
1944
<?php
2045

21-
data_frame()
46+
use function Flow\ETL\DSL\{col, data_frame, from_array, lit, to_output};
47+
48+
$dataFrame = data_frame()
2249
->read(from_array([
23-
['id' => 1],
24-
['id' => 2],
25-
['id' => 3],
26-
['id' => 4],
27-
['id' => 5],
50+
['id' => 1, 'name' => 'John', 'age' => 30],
51+
['id' => 2, 'name' => 'Jane', 'age' => 25],
52+
['id' => 3, 'name' => 'Bob', 'age' => 35],
2853
]))
29-
->collect()
30-
->write(to_stream(__DIR__ . '/output.txt', truncate: false))
54+
->filter(col('age')->greaterThan(lit(25)))
55+
->select('id', 'name')
56+
->write(to_output())
3157
->run();
3258
```
59+
60+
> **Note**: Flow PHP supports many data sources through specialized adapters. See individual adapter documentation for specific extractor usage (CSV, JSON, Parquet, databases, APIs, etc.).
61+
62+
## Memory Management Best Practices
63+
64+
1. **Prefer Generator Methods**: Use `get()`, `getEach()`, `getEachAsArray()` over `fetch()` for large datasets
65+
2. **Avoid Memory-Intensive Operations**: Be cautious with `collect()`, `sortBy()`, `groupBy()`, and `join()` on large datasets
66+
3. **Use Appropriate Batch Sizes**: Start with 1000-5000 rows and adjust based on your memory constraints
67+
4. **Monitor Memory Usage**: Use `run(analyze: true)` to track memory consumption during development
68+
69+
## Performance Optimization
70+
71+
- **Push Operations to Data Source**: When possible, perform filtering, sorting, and joins at the database/file level
72+
- **Minimize Data Movement**: Apply filters early in the pipeline to reduce data volume
73+
- **Cache Strategically**: Only cache expensive operations that will be reused multiple times
74+
- **Avoid Large Offsets**: Use data source pagination instead of DataFrame `offset()` for large skips
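
The effect of applying filters early can be shown with a small plain-PHP sketch that counts how often a transformation runs. The function `expensive()` is a hypothetical stand-in for any costly step, and the numbers are illustrative rather than a benchmark.

```php
<?php

declare(strict_types=1);

/** Hypothetical stand-in for an expensive transformation; counts how often it runs. */
function expensive(int $value, int &$calls): int
{
    $calls++;

    return $value * $value;
}

$input = range(1, 1000);
$isSmall = static fn (int $n): bool => $n <= 10;

// Filter late: transform all 1000 values, then keep those whose source value was small.
$lateCalls = 0;
$late = [];
foreach ($input as $n) {
    $squared = expensive($n, $lateCalls);
    if ($isSmall($n)) {
        $late[] = $squared;
    }
}

// Filter early: discard rows first, so only 10 values reach the transformation.
$earlyCalls = 0;
$early = [];
foreach ($input as $n) {
    if ($isSmall($n)) {
        $early[] = expensive($n, $earlyCalls);
    }
}
```

Both loops produce the same result, but the early filter runs the transformation 10 times instead of 1000.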
75+
76+
## Component Documentation
77+
78+
For detailed information about specific DataFrame operations, see the following component documentation:
79+
80+
### Core Operations
81+
- **[Building Blocks](building-blocks.md)** - Understanding Rows, Entries, and basic data structures
82+
- **[Select/Drop](select-drop.md)** - Column selection and removal
83+
- **[Rename](rename.md)** - Column renaming strategies
84+
- **[Map](map.md)** - Row transformations and data mapping
85+
- **[Filter](filter.md)** - Row filtering and conditions
86+
87+
### Data Processing
88+
- **[Join](join.md)** - DataFrame joining operations
89+
- **[Group By](group-by.md)** - Grouping and aggregation operations
90+
- **[Pivot](pivot.md)** - Transform data from long to wide format
91+
- **[Sort](sort.md)** - Data sorting
92+
- **[Limit](limit.md)** - Result limiting and pagination
93+
- **[Offset](offset.md)** - Skipping rows and pagination
94+
- **[Until](until.md)** - Conditional processing termination
95+
- **[Window Functions](window-functions.md)** - Advanced analytical functions
96+
97+
### Memory & Performance
98+
- **[Batch Processing](batch-processing.md)** - Controlling batch sizes and memory collection
99+
- **[Partitioning](partitioning.md)** - Data partitioning for efficient processing
100+
- **[Caching](caching.md)** - Performance optimization through caching
101+
- **[Data Retrieval](data-retrieval.md)** - Methods for getting processed data
102+
103+
### Data Quality & Validation
104+
- **[Schema](schema.md)** - Schema management and validation
105+
- **[Constraints](constraints.md)** - Data integrity constraints and business rules
106+
- **[Error Handling](error-handling.md)** - Error management strategies
107+
108+
### Output & Display
109+
- **[Display](display.md)** - Data visualization and output
