
Commit d8f9411

committed
Enhanced documentation on the test for count() recipe
1 parent ca777d6 commit d8f9411

File tree

2 files changed: +142 −14 lines


mkdocs/docs/recipe-count.md

Lines changed: 71 additions & 14 deletions
---
title: Count Recipe - Efficiently Count Rows in Iceberg Tables
---

# Counting Rows in an Iceberg Table

This recipe demonstrates how to use the `count()` function to efficiently count rows in an Iceberg table using PyIceberg. The count operation is optimized for performance by reading file metadata rather than scanning the actual data.

## How Count Works

The `count()` method leverages Iceberg's metadata architecture to provide fast row counts by:

1. **Reading file manifests**: Examines metadata about data files without loading the actual data
2. **Aggregating record counts**: Sums up record counts stored in Parquet file footers
3. **Applying filters at the metadata level**: Pushes down predicates to skip irrelevant files
4. **Handling deletes**: Automatically accounts for delete files
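As an illustration of the metadata-only fast path in steps 1–4, here is a simplified, hypothetical sketch. The `FileTask` shape and helper names are invented for this example; they are not PyIceberg internals:

```python
from dataclasses import dataclass, field

@dataclass
class FileTask:
    """Hypothetical stand-in for a planned scan task (illustration only)."""
    record_count: int                      # taken from the data file's metadata
    delete_files: list = field(default_factory=list)
    residual_is_always_true: bool = True   # is the filter fully decided by metadata?

def metadata_count(tasks):
    """Sum per-file record counts when metadata alone is sufficient."""
    total = 0
    for task in tasks:
        if task.residual_is_always_true and not task.delete_files:
            total += task.record_count  # fast path: no data is read
        else:
            # A real implementation would fall back to reading the file and
            # counting the rows that survive deletes and residual filters.
            raise NotImplementedError("row-level fallback omitted in this sketch")
    return total

print(metadata_count([FileTask(42), FileTask(500_000)]))  # 500042
```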

## Basic Usage

Count all rows in a table:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("default.cities")

# Get total row count
row_count = table.scan().count()
print(f"Total rows in table: {row_count}")
```

## Count with Filters

Count rows matching specific conditions:

```python
from pyiceberg.expressions import GreaterThan, EqualTo, And

# Count rows with population > 1,000,000
large_cities = table.scan().filter(GreaterThan("population", 1000000)).count()
print(f"Large cities: {large_cities}")

# Count rows with specific country and population criteria
filtered_count = table.scan().filter(
    And(EqualTo("country", "Netherlands"), GreaterThan("population", 100000))
).count()
print(f"Dutch cities with population > 100k: {filtered_count}")
```

## Performance Characteristics

The count operation is highly efficient because:

- **No data scanning**: Only reads file metadata, such as the record counts in Parquet footers
- **Parallel processing**: Can process multiple files concurrently
- **Filter pushdown**: Eliminates files that cannot match the filter criteria
- **Cached statistics**: Utilizes pre-computed record counts

## Test Scenarios

Our test suite validates count behavior across different scenarios:

### Basic Counting (test_count_basic)

```python
# Simulates a table with a single file containing 42 records
assert table.scan().count() == 42
```

### Empty Tables (test_count_empty)

```python
# Handles tables with no data files
assert empty_table.scan().count() == 0
```

### Large Datasets (test_count_large)

```python
# Aggregates counts across multiple files (2 files × 500,000 records each)
assert large_table.scan().count() == 1000000
```
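Across all three scenarios, the behavior being validated reduces to summing per-file record counts. The helper below is a standalone sketch of that invariant, not code from the test suite:

```python
# Standalone sketch of the aggregation invariant the scenarios above exercise.
def sum_record_counts(per_file_counts):
    """Aggregate per-file record counts the way count() aggregates scan tasks."""
    return sum(per_file_counts)

assert sum_record_counts([42]) == 42                    # single file
assert sum_record_counts([]) == 0                       # empty table
assert sum_record_counts([500_000] * 2) == 1_000_000    # two large files
```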

## Best Practices

1. **Use count() for data validation**: Verify expected row counts after ETL operations
2. **Combine with filters**: Get targeted counts without full table scans
3. **Monitor table growth**: Track record counts over time for capacity planning
4. **Validate partitions**: Count rows per partition to ensure balanced distribution
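As a sketch of practice 4, a partition-balance check can be layered on top of filtered counts. The numbers below are hard-coded placeholders; in practice they would come from `table.scan().filter(...).count()` per partition value:

```python
# Hypothetical partition-balance check (illustration, not a PyIceberg API).
def max_skew(counts):
    """Ratio of the largest partition to the mean partition size."""
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean

# Placeholder per-partition row counts, e.g. gathered via filtered count() calls.
counts = {"Netherlands": 120_000, "Germany": 420_000, "France": 60_000}
print(round(max_skew(counts), 2))  # 2.1
```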

## Common Use Cases

- **Data quality checks**: Verify ETL job outputs
- **Partition analysis**: Compare record counts across partitions
- **Performance monitoring**: Track table growth and query patterns
- **Cost estimation**: Understand data volume before expensive operations

For more details and complete API documentation, see the [API documentation](api.md#count-rows-in-a-table).

tests/table/test_count.py

Lines changed: 71 additions & 0 deletions
@@ -1,19 +1,63 @@
"""
Unit tests for the DataScan.count() method in PyIceberg.

The count() method is essential for determining the number of rows in an Iceberg table
without having to load the actual data. It works by examining file metadata and task
plans to efficiently calculate row counts across distributed data files.

These tests validate the count functionality across different scenarios:

1. Basic counting with single file tasks
2. Empty table handling (zero records)
3. Large-scale counting with multiple file tasks

The tests use mocking to simulate different table states without requiring actual
Iceberg table infrastructure, ensuring fast and isolated unit tests.
"""
import pytest
from unittest.mock import MagicMock, Mock, patch
from pyiceberg.table import DataScan
from pyiceberg.expressions import AlwaysTrue


class DummyFile:
    """
    Mock representation of an Iceberg data file.

    In real scenarios, this would contain metadata about Parquet files,
    including record counts, file paths, and statistics.
    """

    def __init__(self, record_count):
        self.record_count = record_count


class DummyTask:
    """
    Mock representation of a scan task in Iceberg query planning.

    A scan task represents work to be done on a specific data file,
    including any residual filters and delete files that need to be applied.
    In actual usage, tasks are generated by the query planner based on
    partition pruning and filter pushdown optimizations.
    """

    def __init__(self, record_count, residual=None, delete_files=None):
        self.file = DummyFile(record_count)
        self.residual = residual if residual is not None else AlwaysTrue()
        self.delete_files = delete_files or []


def test_count_basic():
    """
    Test basic count functionality with a single file containing data.

    This test verifies that the count() method correctly aggregates record counts
    from a single scan task. It simulates a table with one data file containing
    42 records and validates that the count method returns the correct total.

    The test demonstrates the typical use case where:
    - A table has one or more data files
    - Each file has metadata containing record counts
    - The count() method aggregates these counts efficiently
    """
    # Create a mock table with the necessary attributes
    table = Mock(spec=DataScan)
@@ -27,7 +71,20 @@ def test_count_basic():

    assert table.count() == 42


def test_count_empty():
    """
    Test count functionality on an empty table.

    This test ensures that the count() method correctly handles empty tables
    that have no data files or scan tasks. It validates that an empty table
    returns a count of 0 without raising any errors.

    This scenario is important for:
    - Newly created tables before any data is inserted
    - Tables where all data has been deleted
    - Tables with restrictive filters that match no data
    """
    # Create a mock table with the necessary attributes
    table = Mock(spec=DataScan)
@@ -40,7 +97,21 @@ def test_count_empty():

    assert table.count() == 0


def test_count_large():
    """
    Test count functionality with multiple files containing large datasets.

    This test validates that the count() method can efficiently handle tables
    with multiple data files and large record counts. It simulates a distributed
    scenario where data is split across multiple files, each containing 500,000
    records, for a total of 1 million records.

    This test covers:
    - Aggregation across multiple scan tasks
    - Handling of large record counts (performance implications)
    - Distributed data scenarios common in big data environments
    """
    # Create a mock table with the necessary attributes
    table = Mock(spec=DataScan)
0 commit comments
