Commit b7cfb4a

Merge branch 'main' into feat/metadata_tables
2 parents fe51b41 + dc6d242

30 files changed: +825 -301 lines

.github/ISSUE_TEMPLATE/iceberg_bug_report.yml

Lines changed: 2 additions & 1 deletion
@@ -9,7 +9,8 @@ body:
       description: What Apache Iceberg version are you using?
       multiple: false
       options:
-        - "0.7.0 (latest release)"
+        - "0.7.1 (latest release)"
+        - "0.7.0"
         - "0.6.1"
         - "0.6.0"
         - "0.5.0"

.markdownlint.yaml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Default state for all rules
+default: true
+
+# MD013/line-length - Line length
+MD013: false
+
+# MD007/ul-indent - Unordered list indentation
+MD007:
+  indent: 4
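The new `.markdownlint.yaml` disables MD013 (line length) and pins MD007's unordered-list indent to 4 spaces. As a rough illustration of what MD007 checks, here is a simplified sketch in Python; it is not markdownlint's actual implementation, and it ignores nesting depth and code blocks:

```python
import re

def check_ul_indent(markdown: str, indent: int = 4) -> list[int]:
    """Toy version of markdownlint's MD007 (ul-indent) rule: flag
    lines whose unordered-list bullet indentation is not a multiple
    of `indent`. Returns the offending 1-based line numbers."""
    bad = []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = re.match(r"^( *)[-*+] ", line)
        if match and len(match.group(1)) % indent != 0:
            bad.append(lineno)
    return bad

doc = "- top level\n    - indented four spaces\n  - indented two spaces\n"
print(check_ul_indent(doc))  # [3] -- the two-space item violates indent: 4
```

With `indent: 4` in the config, nested bullets indented two spaces (the markdownlint default) would be reported and, since the hook runs with `--fix`, re-indented.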

.pre-commit-config.yaml

Lines changed: 4 additions & 10 deletions
@@ -46,17 +46,11 @@ repos:
     hooks:
       - id: pycln
         args: [--config=pyproject.toml]
-  - repo: https://github.com/executablebooks/mdformat
-    rev: 0.7.17
+  - repo: https://github.com/igorshubovych/markdownlint-cli
+    rev: v0.41.0
     hooks:
-      - id: mdformat
-        additional_dependencies:
-          - mdformat-black==0.1.1
-          - mdformat-config==0.1.3
-          - mdformat-beautysh==0.1.1
-          - mdformat-admon==1.0.1
-          - mdformat-mkdocs==1.0.1
-          - mdformat-frontmatter==2.0.1
+      - id: markdownlint
+        args: ["--fix"]
   - repo: https://github.com/pycqa/pydocstyle
     rev: 6.3.0
     hooks:

Makefile

Lines changed: 24 additions & 15 deletions
@@ -15,28 +15,37 @@
 # specific language governing permissions and limitations
 # under the License.
 
-install-poetry:
-	pip install poetry==1.8.3
 
-install-dependencies:
-	poetry install -E pyarrow -E hive -E s3fs -E glue -E adlfs -E duckdb -E ray -E sql-postgres -E gcsfs -E sql-sqlite -E daft
+help: ## Display this help
+	@awk 'BEGIN {FS = ":.*##"; printf "\nUsage:\n  make \033[36m\033[0m\n"} /^[a-zA-Z_-]+:.*?##/ { printf "  \033[36m%-20s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) } ' $(MAKEFILE_LIST)
+
+install-poetry: ## Install poetry if the user has not done that yet.
+	@if ! command -v poetry &> /dev/null; then \
+		echo "Poetry could not be found. Installing..."; \
+		pip install --user poetry==1.8.3; \
+	else \
+		echo "Poetry is already installed."; \
+	fi
+
+install-dependencies: ## Install dependencies including dev and all extras
+	poetry install --all-extras
 
 install: | install-poetry install-dependencies
 
-check-license:
+check-license: ## Check license headers
 	./dev/check-license
 
-lint:
+lint: ## lint
 	poetry run pre-commit run --all-files
 
-test:
+test: ## Run all unit tests, can add arguments with PYTEST_ARGS="-vv"
 	poetry run pytest tests/ -m "(unmarked or parametrize) and not integration" ${PYTEST_ARGS}
 
-test-s3:
+test-s3: # Run tests marked with s3, can add arguments with PYTEST_ARGS="-vv"
 	sh ./dev/run-minio.sh
 	poetry run pytest tests/ -m s3 ${PYTEST_ARGS}
 
-test-integration:
+test-integration: ## Run all integration tests, can add arguments with PYTEST_ARGS="-vv"
 	docker compose -f dev/docker-compose-integration.yml kill
 	docker compose -f dev/docker-compose-integration.yml rm -f
 	docker compose -f dev/docker-compose-integration.yml up -d
@@ -50,18 +59,18 @@ test-integration-rebuild:
 	docker compose -f dev/docker-compose-integration.yml rm -f
 	docker compose -f dev/docker-compose-integration.yml build --no-cache
 
-test-adlfs:
+test-adlfs: ## Run tests marked with adlfs, can add arguments with PYTEST_ARGS="-vv"
 	sh ./dev/run-azurite.sh
 	poetry run pytest tests/ -m adlfs ${PYTEST_ARGS}
 
-test-gcs:
+test-gcs: ## Run tests marked with gcs, can add arguments with PYTEST_ARGS="-vv"
 	sh ./dev/run-gcs-server.sh
 	poetry run pytest tests/ -m gcs ${PYTEST_ARGS}
 
-test-coverage-unit:
+test-coverage-unit: # Run test with coverage for unit tests, can add arguments with PYTEST_ARGS="-vv"
 	poetry run coverage run --source=pyiceberg/ --data-file=.coverage.unit -m pytest tests/ -v -m "(unmarked or parametrize) and not integration" ${PYTEST_ARGS}
 
-test-coverage-integration:
+test-coverage-integration: # Run test with coverage for integration tests, can add arguments with PYTEST_ARGS="-vv"
 	docker compose -f dev/docker-compose-integration.yml kill
 	docker compose -f dev/docker-compose-integration.yml rm -f
 	docker compose -f dev/docker-compose-integration.yml up -d
@@ -72,14 +81,14 @@ test-coverage-integration:
 	docker compose -f dev/docker-compose-integration.yml exec -T spark-iceberg ipython ./provision.py
 	poetry run coverage run --source=pyiceberg/ --data-file=.coverage.integration -m pytest tests/ -v -m integration ${PYTEST_ARGS}
 
-test-coverage: | test-coverage-unit test-coverage-integration
+test-coverage: | test-coverage-unit test-coverage-integration ## Run all tests with coverage including unit and integration tests
 	poetry run coverage combine .coverage.unit .coverage.integration
 	poetry run coverage report -m --fail-under=90
 	poetry run coverage html
 	poetry run coverage xml
 
 
-clean:
+clean: ## Clean up the project Python working environment
 	@echo "Cleaning up Cython and Python cached files"
 	@rm -rf build dist *.egg-info
 	@find . -name "*.so" -exec echo Deleting {} \; -delete
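The new `help` target uses the common self-documenting-Makefile awk idiom: it scans for `target: ## description` lines and prints them as a usage table. The same parsing can be sketched in Python (an illustration of the idiom, not part of this repo; the awk's ANSI colours and `##@` section headers are omitted):

```python
import re

def makefile_help(makefile_text: str) -> dict[str, str]:
    """Collect `target: ## description` pairs from a Makefile, the way
    the awk one-liner in the `help` target does."""
    targets = {}
    for line in makefile_text.splitlines():
        match = re.match(r"^([a-zA-Z_-]+):.*?## (.*)$", line)
        if match:
            targets[match.group(1)] = match.group(2)
    return targets

sample = "lint: ## lint\n\tpoetry run pre-commit run --all-files\ntest-s3:\n"
print(makefile_help(sample))  # {'lint': 'lint'}
```

Note that targets documented with a single `#` (such as `test-s3` and `test-coverage-unit` in the diff above) will not appear in `make help` output, since the pattern requires `##`.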

dev/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ WORKDIR ${SPARK_HOME}
 ENV SPARK_VERSION=3.5.0
 ENV ICEBERG_SPARK_RUNTIME_VERSION=3.5_2.12
 ENV ICEBERG_VERSION=1.6.0
-ENV PYICEBERG_VERSION=0.7.0
+ENV PYICEBERG_VERSION=0.7.1
 
 RUN curl --retry 3 -s -C - https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz -o spark-${SPARK_VERSION}-bin-hadoop3.tgz \
     && tar xzf spark-${SPARK_VERSION}-bin-hadoop3.tgz --directory /opt/spark --strip-components 1 \

mkdocs/docs/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@
 <!-- prettier-ignore-start -->
 
 <!-- markdown-link-check-disable -->
+# Summary
 
 - [Getting started](index.md)
 - [Configuration](configuration.md)

mkdocs/docs/api.md

Lines changed: 20 additions & 19 deletions
@@ -146,6 +146,8 @@ catalog.create_table(
 )
 ```
 
+When the table is created, all IDs in the schema are re-assigned to ensure uniqueness.
+
 To create a table using a pyarrow schema:
 
 ```python
@@ -278,7 +280,7 @@ tbl.overwrite(df)
 
 The data is written to the table, and when the table is read using `tbl.scan().to_arrow()`:
 
-```
+```python
 pyarrow.Table
 city: string
 lat: double
@@ -301,7 +303,7 @@ tbl.append(df)
 
 When reading the table `tbl.scan().to_arrow()` you can see that `Groningen` is now also part of the table:
 
-```
+```python
 pyarrow.Table
 city: string
 lat: double
@@ -340,7 +342,7 @@ tbl.delete(delete_filter="city == 'Paris'")
 In the above example, any records where the city field value equals to `Paris` will be deleted.
 Running `tbl.scan().to_arrow()` will now yield:
 
-```
+```python
 pyarrow.Table
 city: string
 lat: double
@@ -360,7 +362,6 @@ To explore the table metadata, tables can be inspected.
 !!! tip "Time Travel"
     To inspect a tables's metadata with the time travel feature, call the inspect table method with the `snapshot_id` argument.
     Time travel is supported on all metadata tables except `snapshots` and `refs`.
-
     ```python
     table.inspect.entries(snapshot_id=805611270568163028)
     ```
@@ -375,7 +376,7 @@ Inspect the snapshots of the table:
 table.inspect.snapshots()
 ```
 
-```
+```python
 pyarrow.Table
 committed_at: timestamp[ms] not null
 snapshot_id: int64 not null
@@ -403,7 +404,7 @@ Inspect the partitions of the table:
 table.inspect.partitions()
 ```
 
-```
+```python
 pyarrow.Table
 partition: struct<dt_month: int32, dt_day: date32[day]> not null
   child 0, dt_month: int32
@@ -444,7 +445,7 @@ To show all the table's current manifest entries for both data and delete files.
 table.inspect.entries()
 ```
 
-```
+```python
 pyarrow.Table
 status: int8 not null
 snapshot_id: int64 not null
@@ -602,7 +603,7 @@ To show a table's known snapshot references:
 table.inspect.refs()
 ```
 
-```
+```python
 pyarrow.Table
 name: string not null
 type: string not null
@@ -627,7 +628,7 @@ To show a table's current file manifests:
 table.inspect.manifests()
 ```
 
-```
+```python
 pyarrow.Table
 content: int8 not null
 path: string not null
@@ -677,7 +678,7 @@ To show table metadata log entries:
 table.inspect.metadata_log_entries()
 ```
 
-```
+```python
 pyarrow.Table
 timestamp: timestamp[ms] not null
 file: string not null
@@ -700,7 +701,7 @@ To show a table's history:
 table.inspect.history()
 ```
 
-```
+```python
 pyarrow.Table
 made_current_at: timestamp[ms] not null
 snapshot_id: int64 not null
@@ -721,7 +722,7 @@ Inspect the data files in the current snapshot of the table:
 table.inspect.files()
 ```
 
-```
+```python
 pyarrow.Table
 content: int8 not null
 file_path: string not null
@@ -861,7 +862,7 @@ To show only data files or delete files in the current snapshot, use `table.insp
 
 Expert Iceberg users may choose to commit existing parquet files to the Iceberg table as data files, without rewriting them.
 
-```
+```python
 # Given that these parquet files have schema consistent with the Iceberg table
 
 file_paths = [
@@ -941,7 +942,7 @@ with table.update_schema() as update:
 
 Now the table has the union of the two schemas `print(table.schema())`:
 
-```
+```python
 table {
   1: city: optional string
   2: lat: optional double
@@ -1191,7 +1192,7 @@ table.scan(
 
 This will return a PyArrow table:
 
-```
+```python
 pyarrow.Table
 VendorID: int64
 tpep_pickup_datetime: timestamp[us, tz=+00:00]
@@ -1233,7 +1234,7 @@ table.scan(
 
 This will return a Pandas dataframe:
 
-```
+```python
    VendorID tpep_pickup_datetime tpep_dropoff_datetime
 0         2 2021-04-01 00:28:05+00:00 2021-04-01 00:47:59+00:00
 1         1 2021-04-01 00:39:01+00:00 2021-04-01 00:57:39+00:00
@@ -1306,7 +1307,7 @@ ray_dataset = table.scan(
 
 This will return a Ray dataset:
 
-```
+```python
 Dataset(
   num_blocks=1,
   num_rows=1168798,
@@ -1357,7 +1358,7 @@ df = df.select("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime")
 
 This returns a Daft Dataframe which is lazily materialized. Printing `df` will display the schema:
 
-```
+```python
 ╭──────────┬───────────────────────────────┬───────────────────────────────╮
 │ VendorID ┆ tpep_pickup_datetime ┆ tpep_dropoff_datetime │
 ---------
@@ -1375,7 +1376,7 @@ This is correctly optimized to take advantage of Iceberg features such as hidden
 df.show(2)
 ```
 
-```
+```python
 ╭──────────┬───────────────────────────────┬───────────────────────────────╮
 │ VendorID ┆ tpep_pickup_datetime ┆ tpep_dropoff_datetime │
 ---------
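The api.md change above documents that, on `create_table`, all field IDs in the supplied schema are re-assigned to ensure uniqueness. The idea can be sketched in plain Python; this is a conceptual illustration only, not pyiceberg's internal code, and the dict-based schema representation here is hypothetical:

```python
from itertools import count

def reassign_field_ids(fields: list[dict]) -> list[dict]:
    """Give every field a fresh, unique ID regardless of the IDs the
    caller supplied, so duplicate or stale IDs in the input schema
    cannot leak into the created table (conceptual sketch)."""
    fresh = count(1)
    return [{**field, "id": next(fresh)} for field in fields]

# Two fields accidentally sharing an ID are disambiguated on creation.
schema = [{"id": 7, "name": "city"}, {"id": 7, "name": "lat"}]
print(reassign_field_ids(schema))
# [{'id': 1, 'name': 'city'}, {'id': 2, 'name': 'lat'}]
```

This is why round-tripping a schema through `create_table` may return different field IDs than the ones passed in.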

0 commit comments
