diff --git a/docs/get-started/VeloxIceberg.md b/docs/get-started/VeloxIceberg.md
index aea2e89c76f..46ff2e5b567 100644
--- a/docs/get-started/VeloxIceberg.md
+++ b/docs/get-started/VeloxIceberg.md
@@ -124,15 +124,15 @@ The "Gluten Support" column is now ready to be populated with:
 | spark.sql.iceberg.check-ordering | true | Validates the write schema column order matches the table schema order |✅ |
 | spark.sql.iceberg.planning.preserve-data-grouping | false | When true, co-locate scan tasks for the same partition in the same read split, used in Storage Partitioned Joins |✅ |
 | spark.sql.iceberg.aggregate-push-down.enabled | true | Enables pushdown of aggregate functions (MAX, MIN, COUNT) | |
-| spark.sql.iceberg.distribution-mode | See Spark Writes | Controls distribution strategy during writes | ✅ |
+| spark.sql.iceberg.distribution-mode | See Spark Writes | Controls distribution strategy during writes | 🚫 |
 | spark.wap.id | null | Write-Audit-Publish snapshot staging ID | |
 | spark.wap.branch | null | WAP branch name for snapshot commit | |
-| spark.sql.iceberg.compression-codec | Table default | Write compression codec (e.g., zstd, snappy) | |
-| spark.sql.iceberg.compression-level | Table default | Compression level for Parquet/Avro | |
-| spark.sql.iceberg.compression-strategy | Table default | Compression strategy for ORC | |
+| spark.sql.iceberg.compression-codec | Table default | Write compression codec (e.g., zstd, snappy) |✅|
+| spark.sql.iceberg.compression-level | Table default | Compression level for Parquet/Avro |❌|
+| spark.sql.iceberg.compression-strategy | Table default | Compression strategy for ORC |❌|
 | spark.sql.iceberg.data-planning-mode | AUTO | Scan planning mode for data files (AUTO, LOCAL, DISTRIBUTED) | |
 | spark.sql.iceberg.delete-planning-mode | AUTO | Scan planning mode for delete files (AUTO, LOCAL, DISTRIBUTED) | |
-| spark.sql.iceberg.advisory-partition-size | Table default | Advisory size (bytes) used for writing to the Table when Spark's Adaptive Query Execution is enabled. Used to size output files | |
+| spark.sql.iceberg.advisory-partition-size | Table default | Advisory size (bytes) used for writing to the Table when Spark's Adaptive Query Execution is enabled. Used to size output files |❌|
 | spark.sql.iceberg.locality.enabled | false | Report locality information for Spark task placement on executors |✅ |
 | spark.sql.iceberg.executor-cache.enabled | true | Enables cache for executor-side (currently used to cache Delete Files) |❌|
 | spark.sql.iceberg.executor-cache.timeout | 10 | Timeout in minutes for executor cache entries |❌|
@@ -161,14 +161,14 @@ The "Gluten Support" column is now ready to be populated with:
 | Spark option | Default | Description | Gluten Support |
 | --- | --- | --- | --- |
 | write-format | Table write.format.default | File format to use for this write operation; parquet, avro, or orc |⚠️ Parquet only|
-| target-file-size-bytes | As per table property | Overrides this table's write.target-file-size-bytes | |
+| target-file-size-bytes | As per table property | Overrides this table's write.target-file-size-bytes | ✅ |
 | check-nullability | true | Sets the nullable check on fields | |
 | snapshot-property.custom-key | null | Adds an entry with custom-key and corresponding value in the snapshot summary (the snapshot-property. prefix is only required for DSv2) | |
 | fanout-enabled | false | Overrides this table's write.spark.fanout.enabled |✅|
 | check-ordering | true | Checks if input schema and table schema are same | |
 | isolation-level | null | Desired isolation level for Dataframe overwrite operations. null => no checks (for idempotent writes), serializable => check for concurrent inserts or deletes in destination partitions, snapshot => checks for concurrent deletes in destination partitions. | |
 | validate-from-snapshot-id | null | If isolation level is set, id of base snapshot from which to check concurrent write conflicts into a table. Should be the snapshot before any reads from the table. Can be obtained via Table API or Snapshots table. If null, the table's oldest known snapshot is used. | |
-| compression-codec | Table write.(fileformat).compression-codec | Overrides this table's compression codec for this write | |
+| compression-codec | Table write.(fileformat).compression-codec | Overrides this table's compression codec for this write |⚠️ Parquet only|
 | compression-level | Table write.(fileformat).compression-level | Overrides this table's compression level for Parquet and Avro tables for this write | |
 | compression-strategy | Table write.orc.compression-strategy | Overrides this table's compression strategy for ORC tables for this write | |
 | distribution-mode | See Spark Writes for defaults | Override this table's distribution mode for this write |🚫|
@@ -194,26 +194,26 @@ extracted from https://iceberg.apache.org/docs/latest/configuration/
 
 | Property | Default | Description | Gluten Support |
 | --- | --- | --- | --- |
-| write.format.default | parquet | Default file format for the table; parquet, avro, or orc | |
+| write.format.default | parquet | Default file format for the table; parquet, avro, or orc |⚠️ Parquet only|
 | write.delete.format.default | data file format | Default delete file format for the table; parquet, avro, or orc | |
 | write.parquet.row-group-size-bytes | 134217728 (128 MB) | Parquet row group size | |
 | write.parquet.page-size-bytes | 1048576 (1 MB) | Parquet page size |✅|
 | write.parquet.page-row-limit | 20000 | Parquet page row limit | |
 | write.parquet.dict-size-bytes | 2097152 (2 MB) | Parquet dictionary page size | |
-| write.parquet.compression-codec | zstd | Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed | |
+| write.parquet.compression-codec | zstd | Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed |✅|
 | write.parquet.compression-level | null | Parquet compression level | |
 | write.parquet.bloom-filter-enabled.column.col1 | (not set) | Hint to parquet to write a bloom filter for the column: 'col1' | |
 | write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset | |
 | write.parquet.bloom-filter-fpp.column.col1 | 0.01 | The false positive probability for a bloom filter applied to 'col1' (must > 0.0 and < 1.0) | |
 | write.parquet.stats-enabled.column.col1 | (not set) | Controls whether to collect parquet column statistics for column 'col1' | |
-| write.avro.compression-codec | gzip | Avro compression codec: gzip(deflate with 9 level), zstd, snappy, uncompressed | |
-| write.avro.compression-level | null | Avro compression level | |
-| write.orc.stripe-size-bytes | 67108864 (64 MB) | Define the default ORC stripe size, in bytes | |
-| write.orc.block-size-bytes | 268435456 (256 MB) | Define the default file system block size for ORC files | |
-| write.orc.compression-codec | zlib | ORC compression codec: zstd, lz4, lzo, zlib, snappy, none | |
-| write.orc.compression-strategy | speed | ORC compression strategy: speed, compression | |
-| write.orc.bloom.filter.columns | (not set) | Comma separated list of column names for which a Bloom filter must be created | |
-| write.orc.bloom.filter.fpp | 0.05 | False positive probability for Bloom filter (must > 0.0 and < 1.0) | |
+| write.avro.compression-codec | gzip | Avro compression codec: gzip(deflate with 9 level), zstd, snappy, uncompressed |❌|
+| write.avro.compression-level | null | Avro compression level |❌|
+| write.orc.stripe-size-bytes | 67108864 (64 MB) | Define the default ORC stripe size, in bytes |❌|
+| write.orc.block-size-bytes | 268435456 (256 MB) | Define the default file system block size for ORC files |❌|
+| write.orc.compression-codec | zlib | ORC compression codec: zstd, lz4, lzo, zlib, snappy, none |❌|
+| write.orc.compression-strategy | speed | ORC compression strategy: speed, compression |❌|
+| write.orc.bloom.filter.columns | (not set) | Comma separated list of column names for which a Bloom filter must be created |❌|
+| write.orc.bloom.filter.fpp | 0.05 | False positive probability for Bloom filter (must > 0.0 and < 1.0) |❌|
 | write.location-provider.impl | null | Optional custom implementation for LocationProvider | |
 | write.metadata.compression-codec | none | Metadata compression codec; none or gzip | |
 | write.metadata.metrics.max-inferred-column-defaults | 100 | Defines the maximum number of columns for which metrics are collected. Columns are included with a pre-order traversal of the schema: top level fields first; then all elements of the first nested s... | |
@@ -221,15 +221,15 @@ extracted from https://iceberg.apache.org/docs/latest/configuration/
 | write.metadata.metrics.column.col1 | (not set) | Metrics mode for column 'col1' to allow per-column tuning; none, counts, truncate(length), or full | |
 | write.target-file-size-bytes | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes |✅|
 | write.delete.target-file-size-bytes | 67108864 (64 MB) | Controls the size of delete files generated to target about this many bytes | |
-| write.distribution-mode | not set, see engines for specific defaults, for example Spark Writes | Defines distribution of write data: none: don't shuffle rows; hash: hash distribute by partition key ; range: range distribute by partition key or sort key if table has an SortOrder | |
-| write.delete.distribution-mode | (not set) | Defines distribution of write delete data | |
-| write.update.distribution-mode | (not set) | Defines distribution of write update data | |
-| write.merge.distribution-mode | (not set) | Defines distribution of write merge data | |
+| write.distribution-mode | not set, see engines for specific defaults, for example Spark Writes | Defines distribution of write data: none: don't shuffle rows; hash: hash distribute by partition key ; range: range distribute by partition key or sort key if table has an SortOrder |🚫|
+| write.delete.distribution-mode | (not set) | Defines distribution of write delete data |🚫|
+| write.update.distribution-mode | (not set) | Defines distribution of write update data |🚫|
+| write.merge.distribution-mode | (not set) | Defines distribution of write merge data |🚫|
 | write.wap.enabled | false | Enables write-audit-publish writes | |
 | write.summary.partition-limit | 0 | Includes partition-level summary stats in snapshot summaries if the changed partition count is less than this limit | |
 | write.metadata.delete-after-commit.enabled | false | Controls whether to delete the oldest tracked version metadata files after each table commit. See the Remove old metadata files section for additional details | |
 | write.metadata.previous-versions-max | 100 | The max number of previous version metadata files to track | |
-| write.spark.fanout.enabled | false | Enables the fanout writer in Spark that does not require data to be clustered; uses more memory | |
+| write.spark.fanout.enabled | false | Enables the fanout writer in Spark that does not require data to be clustered; uses more memory |✅|
 | write.object-storage.enabled | false | Enables the object storage location provider that adds a hash component to file paths | |
 | write.object-storage.partitioned-paths | true | Includes the partition values in the file path | |
 | write.data.path | table location + /data | Base location for data files | |