Commit 1efa6fe
Use manual Schema for plain replace_table, df.schema for RTAS
Per reviewer feedback: bare replace_table examples and tests should construct an explicit Schema, since that's the natural user-facing API for DDL-only redefinition. RTAS flows keep df.schema since the data and schema are coupled there.
1 parent b4d76c1 commit 1efa6fe

2 files changed

Lines changed: 27 additions & 12 deletions

mkdocs/docs/api.md

Lines changed: 11 additions & 3 deletions
@@ -190,10 +190,18 @@ with catalog.create_table_transaction(identifier="docs_example.bids", schema=sch
 Atomically replace an existing table's schema, partition spec, sort order, location, and properties. The table UUID and history (snapshots, schemas, specs, sort orders, metadata log) are preserved; the current snapshot is cleared (the `main` branch ref is removed). `replace_table` redefines the table in this way; `replace_table_transaction` lets you write new data alongside this change to permit RTAS (replace-table-as-select) workflows.
 
 ```python
-catalog.replace_table(identifier="docs_example.bids", schema=df.schema)
+from pyiceberg.schema import Schema
+from pyiceberg.types import NestedField, LongType, StringType, BooleanType
+
+new_schema = Schema(
+    NestedField(field_id=1, name="datetime", field_type=LongType(), required=False),
+    NestedField(field_id=2, name="symbol", field_type=StringType(), required=False),
+    NestedField(field_id=3, name="active", field_type=BooleanType(), required=False),
+)
+catalog.replace_table(identifier="docs_example.bids", schema=new_schema)
 ```
 
-Where `df` is a PyArrow table (or `Schema`) carrying the new column set. Field IDs from columns whose names appear in the previous schema are reused, so existing data files remain readable when the new schema is a compatible superset. New columns get fresh IDs above `last-column-id`.
+Field IDs from columns whose names appear in the previous schema are reused, so existing data files remain readable when the new schema is a compatible superset. New columns get fresh IDs above `last-column-id`.
 
 Properties passed to `replace_table` are **merged** with the existing table properties (your values override; existing keys you don't pass are preserved). To remove a property as part of the replace, use `replace_table_transaction` and remove it explicitly within the transaction.
 
@@ -208,7 +216,7 @@ with catalog.replace_table_transaction(identifier="docs_example.bids", schema=df
 To upgrade the table's format version as part of the replace, pass `format-version` in `properties`:
 
 ```python
-catalog.replace_table(identifier="docs_example.bids", schema=df.schema, properties={"format-version": "2"})
+catalog.replace_table(identifier="docs_example.bids", schema=new_schema, properties={"format-version": "2"})
 ```
 
 ## Register a table

tests/integration/test_rest_catalog.py

Lines changed: 16 additions & 9 deletions
@@ -26,6 +26,7 @@
 from pyiceberg.catalog.rest import RestCatalog
 from pyiceberg.exceptions import NoSuchViewError
 from pyiceberg.schema import Schema
+from pyiceberg.types import BooleanType, LongType, NestedField, StringType
 from pyiceberg.view.metadata import SQLViewRepresentation, ViewVersion
 
 TEST_NAMESPACE_IDENTIFIER = "TEST NS"
@@ -85,20 +86,26 @@ def test_replace_table_end_to_end_against_rest_server(catalog: Catalog) -> None:
     if catalog.table_exists(identifier):
         catalog.drop_table(identifier)
 
-    pa_table = pa.Table.from_pydict(
-        {"id": [1, 2, 3], "data": ["a", "b", "c"]},
-        schema=pa.schema([pa.field("id", pa.int64()), pa.field("data", pa.large_string())]),
+    original_schema = Schema(
+        NestedField(field_id=1, name="id", field_type=LongType(), required=False),
+        NestedField(field_id=2, name="data", field_type=StringType(), required=False),
+    )
+    original = catalog.create_table(identifier, schema=original_schema)
+    original.append(
+        pa.Table.from_pydict(
+            {"id": [1, 2, 3], "data": ["a", "b", "c"]},
+            schema=pa.schema([pa.field("id", pa.int64()), pa.field("data", pa.large_string())]),
+        )
     )
-    original = catalog.create_table(identifier, schema=pa_table.schema)
-    original.append(pa_table)
     original.refresh()
     original_snapshot_id = original.current_snapshot().snapshot_id  # type: ignore[union-attr]
 
-    new_data = pa.Table.from_pydict(
-        {"id": [10], "name": ["alice"], "active": [True]},
-        schema=pa.schema([pa.field("id", pa.int64()), pa.field("name", pa.large_string()), pa.field("active", pa.bool_())]),
+    new_schema = Schema(
+        NestedField(field_id=1, name="id", field_type=LongType(), required=False),
+        NestedField(field_id=2, name="name", field_type=StringType(), required=False),
+        NestedField(field_id=3, name="active", field_type=BooleanType(), required=False),
     )
-    replaced = catalog.replace_table(identifier, schema=new_data.schema)
+    replaced = catalog.replace_table(identifier, schema=new_schema)
 
     assert replaced.metadata.table_uuid == original.metadata.table_uuid
     assert replaced.current_snapshot() is None
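
Note on the documented semantics: the api.md text in this commit states two rules, name-based field ID reuse and property merging. The sketch below models them in plain Python, applied to the column names used in the integration test. It is an illustration under stated assumptions, not PyIceberg's implementation: `assign_field_ids` and `merge_properties` are hypothetical helpers, only top-level column names are considered, "fresh IDs above `last-column-id`" is read as sequential assignment, and the `owner` property is made up for the example.

```python
from typing import Dict, List, Tuple


def assign_field_ids(
    previous_ids: Dict[str, int],
    new_columns: List[str],
    last_column_id: int,
) -> Tuple[Dict[str, int], int]:
    """Hypothetical model of the documented rule (not PyIceberg code).

    Columns whose names exist in the previous schema keep their old ID;
    every other column gets the next ID above last_column_id.
    Returns the assignment and the updated last-column-id.
    """
    assigned: Dict[str, int] = {}
    next_id = last_column_id
    for name in new_columns:
        if name in previous_ids:
            # Name seen before: reuse the ID so old data files still resolve.
            assigned[name] = previous_ids[name]
        else:
            # New name: allocate a fresh ID above everything used so far.
            next_id += 1
            assigned[name] = next_id
    return assigned, next_id


def merge_properties(existing: Dict[str, str], passed: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical model of the merge: passed values win, other keys are kept."""
    return {**existing, **passed}


# Column names from the integration test: (id, data) replaced by (id, name, active).
ids, last_id = assign_field_ids({"id": 1, "data": 2}, ["id", "name", "active"], last_column_id=2)
print(ids, last_id)  # {'id': 1, 'name': 3, 'active': 4} 4

# "owner" is an illustrative property key, not one taken from the commit.
props = merge_properties({"format-version": "1", "owner": "docs"}, {"format-version": "2"})
print(props)  # {'format-version': '2', 'owner': 'docs'}
```

Under this reading, the `field_id` values written in a hand-built `Schema` act as placeholders for columns that are new by name, which is why the sketch gives `name` and `active` the IDs 3 and 4 rather than the 2 and 3 declared in the test's `new_schema`. The ID of the dropped `data` column is not handed to a new column.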
