|
4 | 4 | - [📚API Reference](/documentation/api/core) |
5 | 5 | - [📁Files](/documentation/api/core/indices/files.html) |
6 | 6 |
|
7 | | -A Data Frame is a structured collection of tabular data, similar to a spreadsheet. |
8 | | -It organizes information into rows and columns, making it easy to understand, filter, and transform. |
9 | | -Using a Data Frame, you can quickly merge, clean, or modify data for your ETL processes, |
10 | | -allowing developers to focus more on transformations rather than low-level data handling. |
| 7 | +A Data Frame is the core component of Flow PHP's ETL framework. It represents a structured collection of tabular data that can be processed, transformed, and loaded efficiently. Think of it as a programmable spreadsheet that can handle large datasets with minimal memory footprint. |
11 | 8 |
|
12 | | -Unlike loading an entire dataset at once, a Data Frame processes information in smaller, manageable chunks. |
13 | | -As it moves through the data, it only keeps a limited number of rows in memory at any given time. |
14 | | -This approach helps avoid running out of memory, making it efficient and scalable for handling large datasets. |
| 9 | +## Key Features |
15 | 10 |
|
16 | | -Simple example of reading from php a array and writing to stdout. |
| 11 | +- **Memory Efficient**: Processes data in chunks using generators, avoiding memory exhaustion |
| 12 | +- **Lazy Evaluation**: Operations are only executed when needed |
| 13 | +- **Immutable**: Each transformation returns a new DataFrame instance |
| 14 | +- **Type Safe**: Strict typing throughout with comprehensive schema support |
| 15 | +- **Chainable API**: Fluent interface for building complex data pipelines |
| 16 | + |
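The memory-efficiency and lazy-evaluation points above can be illustrated without the framework. The following is a minimal sketch in plain PHP, not Flow PHP's actual implementation; `in_chunks()` and `source()` are hypothetical helpers introduced only for this illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: lazily groups incoming rows into fixed-size chunks.
 *
 * @param iterable<array<string, mixed>> $rows
 *
 * @return Generator<int, list<array<string, mixed>>>
 */
function in_chunks(iterable $rows, int $size): Generator
{
    $chunk = [];

    foreach ($rows as $row) {
        $chunk[] = $row;

        if (count($chunk) === $size) {
            yield $chunk;
            $chunk = [];
        }
    }

    // Emit the last, possibly smaller, chunk.
    if ($chunk !== []) {
        yield $chunk;
    }
}

/**
 * Hypothetical source: produces rows one at a time instead of building a full array.
 */
function source(int $total): Generator
{
    for ($id = 1; $id <= $total; $id++) {
        yield ['id' => $id];
    }
}

$chunkSizes = [];

foreach (in_chunks(source(5), 2) as $chunk) {
    // Only the current chunk is held in memory at this point.
    $chunkSizes[] = count($chunk);
}

// $chunkSizes is now [2, 2, 1]
```

Nothing is read from `source()` until the `foreach` starts pulling chunks, which is the same idea the DataFrame relies on to keep only a limited number of rows in memory.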
| 17 | +## Understanding DataFrame Operations |
| 18 | + |
| 19 | +DataFrame methods fall into two categories based on when they execute: |
| 20 | + |
| 21 | +### Lazy Operations (`@lazy`) |
| 22 | + |
| 23 | +These methods build the processing pipeline without executing it immediately: |
| 24 | + |
| 25 | +- **Transformations**: `filter()`, `map()`, `withEntry()`, `select()`, `drop()`, `rename()` |
| 26 | +- **Memory-intensive**: `collect()`, `sortBy()`, `groupBy()`, `join()`, `cache()` |
| 27 | +- **Processing control**: `batchSize()`, `limit()`, `offset()`, `partitionBy()` |
| 28 | + |
| 29 | +### Trigger Operations (`@trigger`) |
| 30 | + |
| 31 | +These methods execute the entire pipeline and return results: |
| 32 | + |
| 33 | +- **Data retrieval**: `get()`, `getEach()`, `fetch()`, `count()` |
| 34 | +- **Output operations**: `run()`, `forEach()`, `printRows()`, `printSchema()` |
| 35 | +- **Schema inspection**: `schema()`, `display()` |
| 36 | + |
| 37 | +> **Important**: Build your complete pipeline with lazy operations, then execute once with a trigger operation for optimal performance. |
| 38 | +
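The difference between the two categories can be observed with a side effect. The sketch below assumes Flow PHP is installed through Composer and that the autoloader lives at `vendor/autoload.php` (an assumption about your project layout); the `$calls` counter exists only to show when the callback runs.

```php
<?php

declare(strict_types=1);

use Flow\ETL\Row;
use function Flow\ETL\DSL\{data_frame, from_array};

// Assumption: Composer autoloader location.
require __DIR__ . '/vendor/autoload.php';

$calls = 0;

// Lazy: this only describes the pipeline, the callback is not expected to run yet.
$pipeline = data_frame()
    ->read(from_array([
        ['id' => 1],
        ['id' => 2],
        ['id' => 3],
    ]))
    ->map(function (Row $row) use (&$calls): Row {
        $calls++; // side effect used to observe when execution happens

        return $row;
    });

$callsBeforeTrigger = $calls; // expected to still be 0

// Trigger: fetch() executes the pipeline and returns the processed rows.
$rows = $pipeline->fetch();

$callsAfterTrigger = $calls; // expected to equal the number of rows read
```

If the counter stays at zero until `fetch()` is called, the pipeline was built lazily and executed once by the trigger.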
|
| 39 | +## Creating DataFrames |
| 40 | + |
| 41 | +DataFrames are created using the `data_frame()` DSL function and populated with data through extractors. The framework supports various data sources through adapter-specific extractors. |
17 | 42 |
|
18 | 43 | ```php |
19 | 44 | <?php |
20 | 45 |
|
21 | | -data_frame() |
| 46 | +use function Flow\ETL\DSL\{col, data_frame, from_array, lit, to_output};
| 47 | + |
| 48 | +$dataFrame = data_frame() |
22 | 49 | ->read(from_array([ |
23 | | - ['id' => 1], |
24 | | - ['id' => 2], |
25 | | - ['id' => 3], |
26 | | - ['id' => 4], |
27 | | - ['id' => 5], |
| 50 | + ['id' => 1, 'name' => 'John', 'age' => 30], |
| 51 | + ['id' => 2, 'name' => 'Jane', 'age' => 25], |
| 52 | + ['id' => 3, 'name' => 'Bob', 'age' => 35], |
28 | 53 | ])) |
29 | | - ->collect() |
30 | | - ->write(to_stream(__DIR__ . '/output.txt', truncate: false)) |
| 54 | + ->filter(col('age')->greaterThan(lit(25))) |
| 55 | + ->select('id', 'name') |
| 56 | + ->write(to_output()) |
31 | 57 | ->run(); |
32 | 58 | ``` |
| 59 | + |
| 60 | +> **Note**: Flow PHP supports many data sources through specialized adapters. See individual adapter documentation for specific extractor usage (CSV, JSON, Parquet, databases, APIs, etc.). |
| 61 | +
|
| 62 | +## Memory Management Best Practices |
| 63 | + |
| 64 | +1. **Prefer Generator Methods**: Use `get()`, `getEach()`, `getEachAsArray()` over `fetch()` for large datasets |
| 65 | +2. **Avoid Memory-Intensive Operations**: Be cautious with `collect()`, `sortBy()`, `groupBy()`, and `join()` on large datasets |
| 66 | +3. **Use Appropriate Batch Sizes**: Start with 1000-5000 rows and adjust based on your memory constraints |
| 67 | +4. **Monitor Memory Usage**: Use `run(analyze: true)` to track memory consumption during development |
| 68 | + |
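Practices 1 and 3 can be combined as in the following sketch. It assumes Flow PHP is installed through Composer with the autoloader at `vendor/autoload.php`; the batch size of 2 is deliberately tiny for illustration and is not a recommendation.

```php
<?php

declare(strict_types=1);

use function Flow\ETL\DSL\{data_frame, from_array};

// Assumption: Composer autoloader location.
require __DIR__ . '/vendor/autoload.php';

$ids = [];

// getEachAsArray() returns a generator, so rows are consumed one by one
// instead of being materialized all at once as fetch() would do.
$rows = data_frame()
    ->read(from_array([
        ['id' => 1, 'name' => 'John'],
        ['id' => 2, 'name' => 'Jane'],
        ['id' => 3, 'name' => 'Bob'],
    ]))
    ->batchSize(2) // illustrative value only
    ->getEachAsArray();

foreach ($rows as $row) {
    // Each $row is a plain associative array for a single row.
    $ids[] = $row['id'];
}
```

With a real extractor in place of `from_array()`, the same loop would process a large source without holding the whole result set in memory.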
| 69 | +## Performance Optimization |
| 70 | + |
| 71 | +- **Push Operations to Data Source**: When possible, perform filtering, sorting, and joins at the database/file level |
| 72 | +- **Minimize Data Movement**: Apply filters early in the pipeline to reduce data volume |
| 73 | +- **Cache Strategically**: Only cache expensive operations that will be reused multiple times |
| 74 | +- **Avoid Large Offsets**: Use data source pagination instead of DataFrame `offset()` for large skips |
| 75 | + |
| 76 | +## Component Documentation |
| 77 | + |
| 78 | +For detailed information about specific DataFrame operations, see the following component documentation: |
| 79 | + |
| 80 | +### Core Operations |
| 81 | +- **[Building Blocks](building-blocks.md)** - Understanding Rows, Entries, and basic data structures |
| 82 | +- **[Select/Drop](select-drop.md)** - Column selection and removal |
| 83 | +- **[Rename](rename.md)** - Column renaming strategies |
| 84 | +- **[Map](map.md)** - Row transformations and data mapping |
| 85 | +- **[Filter](filter.md)** - Row filtering and conditions |
| 86 | + |
| 87 | +### Data Processing |
| 88 | +- **[Join](join.md)** - DataFrame joining operations |
| 89 | +- **[Group By](group-by.md)** - Grouping and aggregation operations |
| 90 | +- **[Pivot](pivot.md)** - Transform data from long to wide format |
| 91 | +- **[Sort](sort.md)** - Data sorting |
| 92 | +- **[Limit](limit.md)** - Result limiting and pagination |
| 93 | +- **[Offset](offset.md)** - Skipping rows and pagination |
| 94 | +- **[Until](until.md)** - Conditional processing termination |
| 95 | +- **[Window Functions](window-functions.md)** - Advanced analytical functions |
| 96 | + |
| 97 | +### Memory & Performance |
| 98 | +- **[Batch Processing](batch-processing.md)** - Controlling batch sizes and memory collection |
| 99 | +- **[Partitioning](partitioning.md)** - Data partitioning for efficient processing |
| 100 | +- **[Caching](caching.md)** - Performance optimization through caching |
| 101 | +- **[Data Retrieval](data-retrieval.md)** - Methods for getting processed data |
| 102 | + |
| 103 | +### Data Quality & Validation |
| 104 | +- **[Schema](schema.md)** - Schema management and validation |
| 105 | +- **[Constraints](constraints.md)** - Data integrity constraints and business rules |
| 106 | +- **[Error Handling](error-handling.md)** - Error management strategies |
| 107 | + |
| 108 | +### Output & Display |
| 109 | +- **[Display](display.md)** - Data visualization and output |