lance-format
diff --git a/docs/src/operations/ddl/create-index.md b/docs/src/operations/ddl/create-index.md
Lines changed: 35 additions & 5 deletions
diff --git a/integration-tests/test_lance_spark.py b/integration-tests/test_lance_spark.py
Lines changed: 81 additions & 46 deletions
@@ -7,7 +7,7 @@ Creates a scalar index on a Lance table to accelerate queries.
 
 ## Overview
 
-The `CREATE INDEX` command builds an index on one or more columns of a Lance table. Indexing can improve the performance of queries that filter on the indexed columns. This operation is performed in a distributed manner, building indexes for each data fragment in parallel.
+The `CREATE INDEX` command builds an index on one or more columns of a Lance table. Indexing can improve the performance of queries that filter on the indexed columns. Depending on the index method, Lance Spark uses either a fragment-parallel build path or a driver-coordinated commit flow that runs after parallel executor builds.
 
 ## Basic Usage
 
@@ -24,13 +24,23 @@ The following index methods are supported:
 
 | Method  | Description                                                                 |
 |---------|-----------------------------------------------------------------------------|
+| `zonemap` | Lightweight min/max index for fragment pruning on a scalar column. |
 | `btree` | B-tree index for efficient range queries and point lookups on scalar columns. |
 | `fts`   | Full-text search (inverted) index for text search on string columns.        |
 
 ## Options
 
 The `CREATE INDEX` command supports options via the `WITH` clause to control index creation. These options are specific to the chosen index method.
 
+### ZoneMap Options
+
+For the `zonemap` method, the following options are supported:
+
+| Option          | Type    | Description                                      |
+|-----------------|---------|--------------------------------------------------|
+| `rows_per_zone` | Long    | The approximate number of rows per zonemap zone. |
+| `num_segments`  | Integer | Target number of index segments (upper bound; clamped to fragment count when larger). Each segment covers a batch of fragments. Defaults to `min(fragment_count, spark.default.parallelism)`. |
+
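
Reviewer note: the `num_segments` rule in the table above (an upper bound clamped to the fragment count, defaulting to `min(fragment_count, spark.default.parallelism)`) can be sketched as a small function. This is only an illustration of the documented rule, not the Lance Spark implementation; the function name `resolve_num_segments`, its parameters, and the validation of non-positive inputs are assumptions made for the example.

```python
from typing import Optional


def resolve_num_segments(
    requested: Optional[int],
    fragment_count: int,
    default_parallelism: int,
) -> int:
    """Hypothetical helper mirroring the documented num_segments rule."""
    if fragment_count <= 0:
        # Assumption: an empty table is rejected in this sketch.
        raise ValueError("fragment_count must be positive")
    if requested is None:
        # Documented default: min(fragment_count, spark.default.parallelism).
        return min(fragment_count, default_parallelism)
    if requested <= 0:
        # Assumption: non-positive requests are rejected in this sketch.
        raise ValueError("num_segments must be positive")
    # Documented clamp: never more segments than fragments.
    return min(requested, fragment_count)


default_choice = resolve_num_segments(None, fragment_count=10, default_parallelism=4)
clamped_choice = resolve_num_segments(16, fragment_count=10, default_parallelism=4)
```

Under these assumptions, a table with 10 fragments and a default parallelism of 4 gets 4 segments when the option is omitted, and a request for 16 segments is clamped to 10.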
 ### BTree Options
 
 For the `btree` method, the following options are supported:
@@ -92,6 +102,15 @@ Create a composite index on multiple columns.
     ALTER TABLE lance.db.logs CREATE INDEX idx_ts_level USING btree (timestamp, level);
     ```
 
+### Lightweight Fragment Pruning
+
+Create a zonemap index when you want lightweight min/max-based fragment pruning:
+
+=== "SQL"
+    ```sql
+    ALTER TABLE lance.db.users CREATE INDEX idx_id_zonemap USING zonemap (id);
+    ```
+
 ### Indexing with Options
 
 Create an index and specify the `zone_size` for the B-tree:
@@ -101,6 +120,15 @@ Create an index and specify the `zone_size` for the B-tree:
     ALTER TABLE lance.db.users CREATE INDEX idx_id_zoned USING btree (id) WITH (zone_size = 2048);
     ```
 
+### Zonemap with Options
+
+Create a zonemap index and specify the approximate number of rows per zone:
+
+=== "SQL"
+    ```sql
+    ALTER TABLE lance.db.users CREATE INDEX idx_id_zonemap USING zonemap (id) WITH (rows_per_zone = 2048);
+    ```
+
 ### Full-Text Search Index
 
 Create an FTS index on a text column:
@@ -133,17 +161,19 @@ The `CREATE INDEX` command returns the following information about the operation
 Consider creating an index when:
 
 - You frequently filter a large table on a specific column.
+- You want lightweight fragment pruning based on per-zone min/max statistics.
 - Your queries involve point lookups or small range scans.
 
 ## How It Works
 
 The `CREATE INDEX` command operates as follows:
 
-1.  **Distributed Index Building**: For each fragment in the Lance dataset, a separate task is launched to build an index on the specified column(s).
-2.  **Metadata Merging**: Once all per-fragment indexes are built, their metadata is collected and merged.
+1.  **Index Build Execution**: Lance Spark chooses an execution path based on the index method. Methods such as `btree`, `fts`, and `zonemap` can build physical index segments in parallel across fragments. `zonemap` publishes those segments directly as one logical index. Range-mode `btree` uses Spark repartitioning and sorted preprocessed data.
+2.  **Metadata Finalization**: Lance Spark merges or commits the resulting index metadata on the driver so the new logical index becomes visible atomically.
 3.  **Transactional Commit**: A new table version is committed with the new index information. The operation is atomic and ensures that concurrent reads are not affected.
 
 ## Notes and Limitations
 
-- **Index Methods**: The `btree` and `fts` methods are supported for scalar index creation.
-- **Index Replacement**: If you create an index with the same name as an existing one, the old index will be replaced by the new one. This is because the underlying implementation uses `replace(true)`.
+- **Index Methods**: The `zonemap`, `btree`, and `fts` methods are supported for scalar index creation.
+- **Zonemap Column Count**: Zonemap indexes currently support a single column only. The generic `CREATE INDEX` grammar accepts a column list, but Lance rejects multi-column zonemap creation.
+- **Index Replacement**: If you create an index with the same name as an existing one, the old index will be replaced by the new one.
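
Reviewer note: to show what "min/max-based pruning" means in the docs above, here is a minimal in-memory sketch of a zonemap. It assumes fixed-size zones and an equality predicate, and it is a conceptual model only, not how Lance stores or evaluates the index. The names `Zone`, `build_zones`, and `zones_matching_equals` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Zone:
    start: int      # offset of the first row covered by the zone
    length: int     # number of rows in the zone
    min_value: int  # smallest column value in the zone
    max_value: int  # largest column value in the zone


def build_zones(values: Sequence[int], rows_per_zone: int) -> List[Zone]:
    """Split a column into zones of rows_per_zone rows and record min/max for each."""
    if rows_per_zone <= 0:
        raise ValueError("rows_per_zone must be positive")
    zones = []
    for start in range(0, len(values), rows_per_zone):
        chunk = values[start:start + rows_per_zone]
        zones.append(Zone(start, len(chunk), min(chunk), max(chunk)))
    return zones


def zones_matching_equals(zones: List[Zone], target: int) -> List[Zone]:
    """Keep only zones whose [min, max] range could contain the target value."""
    return [z for z in zones if z.min_value <= target <= z.max_value]


# Same shape as the integration test: ids 0..99 with rows_per_zone = 8.
zones = build_zones(list(range(100)), rows_per_zone=8)
candidates = zones_matching_equals(zones, 50)
```

With sorted ids, only the zone covering rows 48 to 55 survives the `id = 50` filter, so a scan could skip the other twelve zones. On unsorted data the min/max ranges overlap and fewer zones are pruned, which is why a zonemap is described as lightweight rather than as a replacement for `btree`.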
@@ -848,6 +848,42 @@ def test_create_btree_index_on_string(self, spark):
         assert len(query_result) == 3
         _assert_lance_index_metadata(spark, "default.employees", "idx_dept", "BTREE")
 
+    def test_create_zonemap_index_on_int(self, spark):
+        """Test CREATE INDEX with ZoneMap on integer column."""
+        spark.sql("""
+            CREATE TABLE default.test_table (
+                id INT,
+                name STRING,
+                value DOUBLE
+            )
+        """)
+
+        data = [(i, f"Name{i}", float(i * 10)) for i in range(100)]
+        df = spark.createDataFrame(data, ["id", "name", "value"])
+        df.writeTo("default.test_table").append()
+
+        result = spark.sql("""
+            ALTER TABLE default.test_table
+            CREATE INDEX idx_id_zonemap USING zonemap (id)
+            WITH (rows_per_zone = 8)
+        """).collect()
+
+        assert len(result) == 1
+        assert result[0][1] == "idx_id_zonemap"
+
+        indexes = spark.sql("""
+            SHOW INDEXES IN default.test_table
+        """).collect()
+        zonemap_rows = [row for row in indexes if row["name"] == "idx_id_zonemap"]
+        assert len(zonemap_rows) >= 1
+        assert zonemap_rows[0]["index_type"] == "zonemap"
+
+        query_result = spark.sql("""
+            SELECT * FROM default.test_table WHERE id = 50
+        """).collect()
+        assert len(query_result) == 1
+        assert query_result[0].id == 50
+
     def test_create_fts_index(self, spark):
         """Test CREATE INDEX with full-text search (FTS)."""
         spark.sql("""
@@ -2582,26 +2618,26 @@ class TestStableRowIds:
       _row_created_at_version for all rows in that fragment.
     """
 
-    def test_tblproperties_enable_stable_row_ids(self, spark, test_table):
+    def test_tblproperties_enable_stable_row_ids(self, spark):
         """Test that TBLPROPERTIES enables CDF version columns."""
-        spark.sql(f"""
-            CREATE TABLE {test_table} (
+        spark.sql("""
+            CREATE TABLE default.test_table (
                 id INT,
                 name STRING,
                 value INT
             ) TBLPROPERTIES ('enable_stable_row_ids' = 'true')
         """)
 
-        spark.sql(f"""
-            INSERT INTO {test_table} VALUES
+        spark.sql("""
+            INSERT INTO default.test_table VALUES
             (1, 'Alice', 100),
             (2, 'Bob', 200),
             (3, 'Charlie', 300)
         """)
 
-        result = spark.sql(f"""
+        result = spark.sql("""
             SELECT id, _row_created_at_version, _row_last_updated_at_version
-            FROM {test_table}
+            FROM default.test_table
             ORDER BY id
         """).collect()
 
@@ -2610,30 +2646,30 @@ def test_tblproperties_enable_stable_row_ids(self, spark, test_table):
             assert row._row_created_at_version is not None
             assert row._row_last_updated_at_version is not None
 
-    def test_default_behavior_no_stable_row_ids(self, spark, test_table):
+    def test_default_behavior_no_stable_row_ids(self, spark):
         """Test version columns without enable_stable_row_ids.
 
         Without enable_stable_row_ids the Lance engine still populates
         _row_created_at_version and _row_last_updated_at_version, but
         returns a baseline value of 1 instead of the actual operation version.
         """
-        spark.sql(f"""
-            CREATE TABLE {test_table} (
+        spark.sql("""
+            CREATE TABLE default.test_table (
                 id INT,
                 name STRING,
                 value INT
             )
         """)
 
-        spark.sql(f"""
-            INSERT INTO {test_table} VALUES
+        spark.sql("""
+            INSERT INTO default.test_table VALUES
             (1, 'Alice', 100),
             (2, 'Bob', 200)
         """)
 
-        result = spark.sql(f"""
+        result = spark.sql("""
             SELECT id, _row_created_at_version, _row_last_updated_at_version
-            FROM {test_table}
+            FROM default.test_table
             ORDER BY id
         """).collect()
 
@@ -2644,23 +2680,23 @@ def test_default_behavior_no_stable_row_ids(self, spark, test_table):
             assert row._row_last_updated_at_version == 1
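
Reviewer note: the CDC test that follows filters on `_row_created_at_version` and `_row_last_updated_at_version` against a stored watermark. The sketch below reproduces that filter on plain Python objects so the pattern can be read without a Spark session. `VersionedRow` and `changed_since` are hypothetical names, and the version numbers are taken from the comments in the test (v2 initial load, v3 update, v4 insert) rather than from a real table.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class VersionedRow:
    """In-memory stand-in for a row and its two version columns."""
    id: int
    created_at_version: int
    last_updated_at_version: int


def changed_since(rows: Iterable[VersionedRow], last_processed_version: int) -> List[VersionedRow]:
    """Return rows created or updated after the given watermark."""
    return [
        r for r in rows
        if r.created_at_version > last_processed_version
        or r.last_updated_at_version > last_processed_version
    ]


# Assumed state after: v2 load of ids 1-3, v3 update of id 1, v4 insert of id 4.
table = [
    VersionedRow(1, 2, 3),
    VersionedRow(2, 2, 2),
    VersionedRow(3, 2, 2),
    VersionedRow(4, 4, 4),
]

batch = changed_since(table, last_processed_version=2)
changed_ids = [r.id for r in batch]
# Advance the watermark to the newest version seen in this batch.
next_watermark = max(r.last_updated_at_version for r in batch)
```

In this model the batch contains ids 1 and 4 and the watermark moves to 4, matching the `last_processed_version = 4` step in the test.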
 
     @requires_update_or_merge
-    def test_cdc_incremental_ingestion_pattern(self, spark, test_table):
+    def test_cdc_incremental_ingestion_pattern(self, spark):
         """Test CDC incremental ingestion pipeline pattern.
 
         Simulates a CDC pipeline that tracks the last processed version and
         incrementally processes changes using version tracking columns.
         """
-        spark.sql(f"""
-            CREATE TABLE {test_table} (
+        spark.sql("""
+            CREATE TABLE default.test_table (
                 id INT,
                 name STRING,
                 value INT
             ) TBLPROPERTIES ('enable_stable_row_ids' = 'true')
         """)
 
         # v2: Initial data load
-        spark.sql(f"""
-            INSERT INTO {test_table} VALUES
+        spark.sql("""
+            INSERT INTO default.test_table VALUES
             (1, 'Alice', 100),
             (2, 'Bob', 200),
             (3, 'Charlie', 300)
@@ -2670,7 +2706,7 @@ def test_cdc_incremental_ingestion_pattern(self, spark, test_table):
         last_processed_version = 1
         batch1 = spark.sql(f"""
             SELECT id, name, value, _row_created_at_version, _row_last_updated_at_version
-            FROM {test_table}
+            FROM default.test_table
             WHERE (_row_created_at_version > {last_processed_version})
                OR (_row_last_updated_at_version > {last_processed_version})
             ORDER BY id
@@ -2681,13 +2717,13 @@ def test_cdc_incremental_ingestion_pattern(self, spark, test_table):
         last_processed_version = 2
 
         # v3: Update one row, v4: Insert new row
-        spark.sql(f"UPDATE {test_table} SET value = value + 50 WHERE id = 1")
-        spark.sql(f"INSERT INTO {test_table} VALUES (4, 'David', 400)")
+        spark.sql("UPDATE default.test_table SET value = value + 50 WHERE id = 1")
+        spark.sql("INSERT INTO default.test_table VALUES (4, 'David', 400)")
 
         # CDC Pipeline: Process batch 2 (changes since v2)
         batch2 = spark.sql(f"""
             SELECT id, name, value, _row_created_at_version, _row_last_updated_at_version
-            FROM {test_table}
+            FROM default.test_table
             WHERE (_row_created_at_version > {last_processed_version})
                OR (_row_last_updated_at_version > {last_processed_version})
             ORDER BY id
@@ -2707,12 +2743,12 @@ def test_cdc_incremental_ingestion_pattern(self, spark, test_table):
         last_processed_version = 4
 
         # v5: More updates
-        spark.sql(f"UPDATE {test_table} SET value = value + 100 WHERE id IN (2, 3)")
+        spark.sql("UPDATE default.test_table SET value = value + 100 WHERE id IN (2, 3)")
 
         # CDC Pipeline: Process batch 3 (changes since v4)
         batch3 = spark.sql(f"""
             SELECT id, name, value, _row_created_at_version, _row_last_updated_at_version
-            FROM {test_table}
+            FROM default.test_table
             WHERE (_row_created_at_version > {last_processed_version})
                OR (_row_last_updated_at_version > {last_processed_version})
             ORDER BY id
@@ -2729,12 +2765,12 @@ def test_cdc_incremental_ingestion_pattern(self, spark, test_table):
         last_processed_version = 5
 
         # v6: Update entire table
-        spark.sql(f"UPDATE {test_table} SET value = value * 2")
+        spark.sql("UPDATE default.test_table SET value = value * 2")
 
         # CDC Pipeline: Process batch 4 (changes since v5)
         batch4 = spark.sql(f"""
             SELECT id, name, value, _row_created_at_version, _row_last_updated_at_version
-            FROM {test_table}
+            FROM default.test_table
             WHERE (_row_created_at_version > {last_processed_version})
                OR (_row_last_updated_at_version > {last_processed_version})
             ORDER BY id
@@ -2780,30 +2816,29 @@ def _register_cdf_catalog(self, spark):
 
         return catalog_name
 
-    def test_catalog_level_stable_row_ids(self, spark, test_table):
+    def test_catalog_level_stable_row_ids(self, spark):
         """Test that catalog-level enable_stable_row_ids enables version columns without TBLPROPERTIES."""
         catalog_name = self._register_cdf_catalog(spark)
-        cdf_table = f"{catalog_name}.{test_table}"
 
         try:
             # CREATE TABLE without TBLPROPERTIES — relies on catalog-level default
             spark.sql(f"""
-                CREATE TABLE {cdf_table} (
+                CREATE TABLE {catalog_name}.default.test_table (
                     id INT,
                     name STRING,
                     value INT
                 )
             """)
 
             spark.sql(f"""
-                INSERT INTO {cdf_table} VALUES
+                INSERT INTO {catalog_name}.default.test_table VALUES
                 (1, 'Alice', 100),
                 (2, 'Bob', 200)
             """)
 
             result = spark.sql(f"""
                 SELECT id, _row_created_at_version, _row_last_updated_at_version
-                FROM {cdf_table}
+                FROM {catalog_name}.default.test_table
                 ORDER BY id
             """).collect()
 
@@ -2813,52 +2848,52 @@ def test_catalog_level_stable_row_ids(self, spark, test_table):
                 assert row._row_last_updated_at_version is not None
         finally:
             try:
-                spark.sql(f"DROP TABLE IF EXISTS {cdf_table} PURGE")
+                spark.sql(f"DROP TABLE IF EXISTS {catalog_name}.default.test_table PURGE")
             except Exception as e:
-                print(f"Failed to clean up {cdf_table}: {e}")
+                print(f"Failed to clean up {catalog_name}.default.test_table: {e}")
 
 
     @requires_update_or_merge
-    def test_update_preserves_row_ids(self, spark, test_table):
+    def test_update_preserves_row_ids(self, spark):
         """Test that UPDATE preserves _rowid values when stable row IDs are enabled.
 
         Verifies the native DeltaWriter.update() path with RowIdMeta attachment
         keeps row IDs stable across updates, including multi-fragment scenarios.
         """
-        spark.sql(f"""
-            CREATE TABLE {test_table} (
+        spark.sql("""
+            CREATE TABLE default.test_table (
                 id INT,
                 name STRING,
                 value INT
             ) TBLPROPERTIES ('enable_stable_row_ids' = 'true')
         """)
 
         # Insert across two fragments
-        spark.sql(f"""
-            INSERT INTO {test_table} VALUES
+        spark.sql("""
+            INSERT INTO default.test_table VALUES
             (1, 'Alice', 100),
             (2, 'Bob', 200),
             (3, 'Charlie', 300)
         """)
-        spark.sql(f"""
-            INSERT INTO {test_table} VALUES
+        spark.sql("""
+            INSERT INTO default.test_table VALUES
             (4, 'Dave', 400),
             (5, 'Eve', 500)
         """)
 
         # Capture row IDs before update
-        before = spark.sql(f"""
-            SELECT id, _rowid FROM {test_table} ORDER BY id
+        before = spark.sql("""
+            SELECT id, _rowid FROM default.test_table ORDER BY id
         """).collect()
         row_ids_before = {row.id: row._rowid for row in before}
         assert len(row_ids_before) == 5
 
         # Update rows spanning both fragments
-        spark.sql(f"UPDATE {test_table} SET value = value + 1 WHERE value >= 200")
+        spark.sql("UPDATE default.test_table SET value = value + 1 WHERE value >= 200")
 
         # Capture row IDs after update
-        after = spark.sql(f"""
-            SELECT id, _rowid, value FROM {test_table} ORDER BY id
+        after = spark.sql("""
+            SELECT id, _rowid, value FROM default.test_table ORDER BY id
         """).collect()
         row_ids_after = {row.id: row._rowid for row in after}