What happens?
Filtering on a string column with WHERE length(col) = <integer> via iceberg_scan() returns 0 rows starting in DuckDB 1.4.2. The same query returns the correct results in DuckDB 1.4.1.
The bug does not affect read_parquet() on the same underlying files. It also does not affect equivalent predicates written as WHERE length(col) > N-1 AND length(col) < N+1 or WHERE length(col) IN (N) — both of those return the correct rows via iceberg_scan() in all versions.
To Reproduce
import duckdb
import pyarrow as pa
import pyiceberg
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType
print(f"DuckDB version: {duckdb.__version__}")
print(f"PyIceberg version: {pyiceberg.__version__}")
print(f"PyArrow version: {pa.__version__}")
WAREHOUSE_PATH = "/tmp/iceberg_debug"
NAMESPACE = "ns"
TABLE_NAME = "items"
catalog = SqlCatalog(
"local",
uri="sqlite:///:memory:",
warehouse=f"file://{WAREHOUSE_PATH}",
)
catalog.create_namespace_if_not_exists(NAMESPACE)
schema = Schema(NestedField(1, "item_id", StringType()))
full_table_name = f"{NAMESPACE}.{TABLE_NAME}"
if catalog.table_exists(full_table_name):
catalog.drop_table(full_table_name)
iceberg_table = catalog.create_table(full_table_name, schema=schema)
# Two rows with length=36 (should match), one with 37 and one with 35 (should not match)
iceberg_table.append(pa.table({"item_id": [
"10000000-0000-0000-0000-000000000001", # length=36 ← should match
"20000000-0000-0000-0000-000000000002", # length=36 ← should match
"30000000-0000-0000-0000-0000000000033", # length=37 ← should NOT match
"40000000-0000-0000-0000-00000000004", # length=35 ← should NOT match
]}))
con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg; SET unsafe_enable_version_guessing = true;")
SCAN = f"iceberg_scan('{iceberg_table.location()}')"
print("\n-- Query 1 (broken): equality predicate on length()")
print(con.execute(f"SELECT item_id, length(item_id) AS len FROM {SCAN} WHERE length(item_id) = 36").pl())
print("\n-- Query 2 (workaround): range predicate")
print(con.execute(f"SELECT item_id, length(item_id) AS len FROM {SCAN} WHERE length(item_id) > 35 AND length(item_id) < 37").pl())
print("\n-- Query 3 (workaround): IN predicate")
print(con.execute(f"SELECT item_id, length(item_id) AS len FROM {SCAN} WHERE length(item_id) IN (36)").pl())
parquet_glob = f"{WAREHOUSE_PATH}/{NAMESPACE}/{TABLE_NAME}/data/*.parquet"
print("\n-- Query 4 (workaround): read_parquet with same equality predicate")
print(con.execute(f"SELECT item_id, length(item_id) AS len FROM read_parquet('{parquet_glob}') WHERE length(item_id) = 36").pl())
catalog.purge_table(full_table_name)
catalog.drop_namespace(NAMESPACE)
Output for DuckDB 1.4.1:
-- Query 1 (broken): equality predicate on length()
shape: (2, 2)
┌──────────────────────────────────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000000001 ┆ 36 │
│ 20000000-0000-0000-0000-000000000002 ┆ 36 │
└──────────────────────────────────────┴─────┘
-- Query 2 (workaround): range predicate
┌─────────────────────────────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36 │
│ 20000000-0000-0000-0000-000000… ┆ 36 │
└─────────────────────────────────┴─────┘
-- Query 3 (workaround): IN predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36 │
│ 20000000-0000-0000-0000-000000… ┆ 36 │
└─────────────────────────────────┴─────┘
-- Query 4 (workaround): read_parquet with same equality predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36 │
│ 20000000-0000-0000-0000-000000… ┆ 36 │
└─────────────────────────────────┴─────┘
Output for DuckDB 1.4.2+ (Tested for each version up to and including 1.5.0):
-- Query 1 (broken): equality predicate on length()
shape: (0, 2)
┌─────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════╪═════╡
└─────────┴─────┘
-- Query 2 (workaround): range predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36 │
│ 20000000-0000-0000-0000-000000… ┆ 36 │
└─────────────────────────────────┴─────┘
-- Query 3 (workaround): IN predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36 │
│ 20000000-0000-0000-0000-000000… ┆ 36 │
└─────────────────────────────────┴─────┘
-- Query 4 (workaround): read_parquet with same equality predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36 │
│ 20000000-0000-0000-0000-000000… ┆ 36 │
└─────────────────────────────────┴─────┘
OS:
Ubuntu 24.04 x86_64
DuckDB Package Version:
1.4.2
Python Version:
3.11
Full Name:
Lasse Schier Christiansen
Affiliation:
TV2 Danmark
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Did you include all relevant configuration to reproduce the issue?
What happens?
Filtering on a string column with
WHERE length(col) = <integer>viaiceberg_scan()returns 0 rows starting in DuckDB 1.4.2. The same query returns the correct results in DuckDB 1.4.1.The bug does not affect
read_parquet()on the same underlying files. It also does not affect equivalent predicates written asWHERE length(col) > N-1 AND length(col) < N+1orWHERE length(col) IN (N)— both of those return the correct rows viaiceberg_scan()in all versions.To Reproduce
Output for DuckDB 1.4.1:
Output for DuckDB 1.4.2+ (Tested for each version up to and including 1.5.0):
OS:
Ubuntu 24.04 x86_64
DuckDB Package Version:
1.4.2
Python Version:
3.11
Full Name:
Lasse Schier Christiansen
Affiliation:
TV2 Danmark
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Did you include all relevant configuration to reproduce the issue?