Complete configuration settings for IndexTables4Spark.
spark.indextables.indexWriter.heapSize: "100M" (supports "2G", "500M", "1024K")
spark.indextables.indexWriter.batchSize: 10000
spark.indextables.indexWriter.maxBatchBufferSize: "90M" (default: 90MB, prevents native 100MB limit errors)
spark.indextables.indexWriter.threads: 2
// Controls parallelism of tantivy index -> quickwit split conversion
spark.indextables.splitConversion.maxParallelism: <auto> (default: max(1, availableProcessors))
spark.indextables.checkpoint.enabled: true
spark.indextables.checkpoint.interval: 10
spark.indextables.transaction.compression.enabled: true (default)
Controls the native in-process txlog cache. All keys are optional; omitted keys use native defaults.
spark.indextables.transaction.cache.enabled: "true" // Set to "false" to disable all caching
spark.indextables.transaction.cache.ttl.ms: 300000 // Override all cache TTLs at once (ms)
spark.indextables.transaction.cache.version.ttl.ms: 300000 // Per-version-file cache TTL (ms)
spark.indextables.transaction.cache.snapshot.ttl.ms: 600000 // Per-snapshot cache TTL (ms)
spark.indextables.transaction.cache.fileList.ttl.ms: 120000 // Per-file-list cache TTL (ms)
spark.indextables.transaction.cache.metadata.ttl.ms: 1800000 // Protocol/metadata cache TTL (ms)
spark.indextables.transaction.cache.version.capacity: 1000 // Max cached version entries
spark.indextables.transaction.cache.snapshot.capacity: 100 // Max cached snapshot entries
spark.indextables.transaction.cache.fileList.capacity: 50 // Max cached file-list entries
Legacy key: spark.indextables.transaction.cache.expirationSeconds (seconds) is still supported for backward compatibility. If cache.ttl.ms is also set, it takes precedence.
spark.indextables.transaction.maxConcurrentReads: 32 // Max parallel object-store GETs (range: 16–64)
Reduce if hitting S3/Azure rate limits; increase for high-bandwidth connections with many post-checkpoint versions. Values outside the recommended range are passed to the native layer as-is (no clamping).
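As a minimal sketch, the cache and concurrency keys above could be collected in a dictionary and applied to any builder object that exposes a `config(key, value)` method, as `SparkSession.builder` does in PySpark. The helper name `apply_options` and the chosen values are illustrative assumptions, and the sketch assumes these keys are read from the Spark session configuration.

```python
# Illustrative values only; every key below appears in the list above.
TXLOG_OPTIONS = {
    "spark.indextables.transaction.cache.enabled": "true",
    "spark.indextables.transaction.cache.ttl.ms": 300000,
    # Lower end of the recommended 16-64 range, e.g. when hitting rate limits.
    "spark.indextables.transaction.maxConcurrentReads": 16,
}


def apply_options(builder, options):
    """Hypothetical helper: call builder.config(key, value) for each option."""
    for key, value in options.items():
        # Pass every value as a string so numeric settings are handled uniformly.
        builder = builder.config(key, str(value))
    return builder
```

With PySpark installed, `apply_options(SparkSession.builder, TXLOG_OPTIONS).getOrCreate()` would build a session carrying these settings.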
Automatic retry on version conflicts:
spark.indextables.transaction.retry.maxAttempts: 10 (default: 10)
spark.indextables.transaction.retry.baseDelayMs: 100 (default: 100ms, initial backoff delay)
spark.indextables.transaction.retry.maxDelayMs: 5000 (default: 5000ms, maximum backoff cap)
10x faster checkpoint reads, default since Protocol V4. Replaces JSON checkpoints with binary Avro manifests.
spark.indextables.state.format: "avro" (default: "avro", options: "avro", "json")
spark.indextables.state.compression: "zstd" (default: "zstd", options: "zstd", "snappy", "none")
spark.indextables.state.compressionLevel: 3 (default: 3, range 1-22 for zstd)
spark.indextables.state.entriesPerManifest: 50000 (default: 50000)
spark.indextables.state.read.parallelism: 8 (default: 8)
Automatic rewrite to remove tombstones and optimize layout:
spark.indextables.state.compaction.tombstoneThreshold: 0.10 (default: 10%)
spark.indextables.state.compaction.maxManifests: 20 (default: 20)
spark.indextables.state.compaction.afterMerge: true (default: true)
spark.indextables.state.retention.versions: 2 (default: 2, keep N old state versions)
spark.indextables.state.retention.hours: 168 (default: 168 = 7 days)
Enabled by default to prevent transaction log bloat:
spark.indextables.stats.truncation.enabled: true
spark.indextables.stats.truncation.maxLength: 32
Tokens longer than the limit are filtered out (not truncated):
spark.indextables.indexing.text.maxTokenLength: 255 (default: 255, Quickwit-compatible)
// Named constants: "tantivy_max" (65530), "default" (255), "legacy" (40), "min" (1)
// Per-field overrides: spark.indextables.indexing.tokenLength.<field>: <value>
// List-based syntax: spark.indextables.indexing.tokenLength.<value>: "field1,field2,..."
Delta Lake compatible:
spark.indextables.dataSkippingStatsColumns: <column_list> (comma-separated, takes precedence over numIndexedCols)
spark.indextables.dataSkippingNumIndexedCols: 32 (default: 32, -1 for all eligible columns, 0 to disable)
Iceberg-style shuffle before writing. Produces well-sized splits (~1GB) via AQE advisory partition sizes.
spark.indextables.write.optimizeWrite.enabled: false (default: false)
spark.indextables.write.optimizeWrite.targetSplitSize: "1G" (default: 1GB)
spark.indextables.write.optimizeWrite.samplingRatio: 1.1 (default: 1.1)
spark.indextables.write.optimizeWrite.minRowsForEstimation: 10000 (default: 10000)
spark.indextables.write.optimizeWrite.distributionMode: "hash" (default: "hash", options: "hash", "none")
Automatic split consolidation during writes:
spark.indextables.mergeOnWrite.enabled: false (default: false)
spark.indextables.mergeOnWrite.targetSize: "4G" (default: 4G)
Runs merges in background thread, allows indexing to continue:
spark.indextables.mergeOnWrite.async.enabled: true (default: true)
spark.indextables.mergeOnWrite.batchCpuFraction: 0.167 (default: 1/6, fraction of cluster CPUs per batch)
spark.indextables.mergeOnWrite.maxConcurrentBatches: 3 (default: 3)
spark.indextables.mergeOnWrite.minBatchesToTrigger: 1 (default: 1)
spark.indextables.mergeOnWrite.shutdownTimeoutMs: 300000 (default: 5 minutes)
// Threshold formula: threshold = batchSize x minBatchesToTrigger
// Batch size formula: batchSize = max(1, totalClusterCpus x batchCpuFraction)
// Example: 24 CPUs with defaults = batchSize 4, threshold 4 groups to trigger merge
spark.indextables.mergeOnWrite.mergeGroupMultiplier: 2.0 (deprecated, use batchCpuFraction + minBatchesToTrigger)
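The batch size and threshold formulas above can be worked through in a few lines. This is a sketch, not the library's implementation: the function name `merge_threshold` is hypothetical, and truncating the CPU product to a whole number is an assumption chosen to match the documented 24-CPU example.

```python
def merge_threshold(total_cluster_cpus, batch_cpu_fraction=0.167, min_batches_to_trigger=1):
    """Return (batch_size, threshold) following the documented formulas."""
    # batchSize = max(1, totalClusterCpus x batchCpuFraction)
    # Assumption: the product is truncated to a whole number of merge groups.
    batch_size = max(1, int(total_cluster_cpus * batch_cpu_fraction))
    # threshold = batchSize x minBatchesToTrigger
    threshold = batch_size * min_batches_to_trigger
    return batch_size, threshold


print(merge_threshold(24))  # (4, 4), matching the documented example
print(merge_threshold(2))   # (1, 1): the max(1, ...) floor applies on small clusters
```

Raising `min_batches_to_trigger` to 2 on the same 24-CPU cluster would give a threshold of 8 groups under these assumptions.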
spark.indextables.mergeOnWrite.minDiskSpaceGB: 20 (default: 20GB, use 1GB for tests)
spark.indextables.mergeOnWrite.maxConcurrentMergesPerWorker: <auto> (default: auto-calculated based on heap size)
spark.indextables.mergeOnWrite.memoryOverheadFactor: 3.0 (default: 3.0)
Downloads source splits to local disk before merge, uploads merged split after:
spark.indextables.merge.download.maxConcurrencyPerCore: 8 (default: 8)
spark.indextables.merge.download.memoryBudget: "2G" (default: 2GB per executor)
spark.indextables.merge.download.retries: 3 (default: 3 with exponential backoff)
spark.indextables.merge.upload.maxConcurrency: 6 (default: 6)
Automatic cleanup of orphaned files and old transaction logs:
spark.indextables.purgeOnWrite.enabled: false (default: false)
spark.indextables.purgeOnWrite.triggerAfterMerge: true (default: true)
spark.indextables.purgeOnWrite.triggerAfterWrites: 0 (default: 0 = disabled)
spark.indextables.purgeOnWrite.splitRetentionHours: 168 (default: 168 = 7 days)
spark.indextables.purgeOnWrite.txLogRetentionHours: 720 (default: 720 = 30 days)
Reduces S3 requests by 90-95% for read operations:
spark.indextables.read.batchOptimization.enabled: true (default: true)
spark.indextables.read.batchOptimization.profile: "balanced" (options: conservative, balanced, aggressive, disabled)
spark.indextables.read.batchOptimization.maxRangeSize: "16M" (default: 16MB, range: 2MB-32MB)
spark.indextables.read.batchOptimization.gapTolerance: "512K" (default: 512KB, range: 64KB-2MB)
spark.indextables.read.batchOptimization.minDocsForOptimization: 50 (default: 50, range: 10-200)
spark.indextables.read.batchOptimization.maxConcurrentPrefetch: 8 (default: 8, range: 2-32)
spark.indextables.read.adaptiveTuning.enabled: true (default: true)
spark.indextables.read.adaptiveTuning.minBatchesBeforeAdjustment: 5 (default: 5)
Persistent NVMe caching across JVM restarts. Auto-enabled when /local_disk0 detected (Databricks/EMR):
spark.indextables.cache.disk.enabled: <auto> (default: auto-enabled when /local_disk0 detected)
spark.indextables.cache.disk.path: <auto> (default: "/local_disk0/tantivy4spark_slicecache")
spark.indextables.cache.disk.maxSize: "100G" (default: 0 = auto, 2/3 available disk)
spark.indextables.cache.disk.compression: "lz4" (options: lz4, zstd, none)
spark.indextables.cache.disk.minCompressSize: "4K" (default: 4096 bytes)
spark.indextables.cache.disk.manifestSyncInterval: 30 (default: 30 seconds)
spark.indextables.cache.disk.writeQueue.mode: "size" (options: fragment, size)
spark.indextables.cache.disk.writeQueue.capacity: "1G" (size mode: byte limit e.g. "500M", "2G"; fragment mode: slot count e.g. "32")
spark.indextables.cache.disk.dropWritesWhenFull: true (drop query-path writes instead of blocking)
spark.indextables.cache.disk.writeQueue.maxBudget: "0" (default: 0 = auto, 8x initial queue capacity)
spark.indextables.cache.coalesceMaxGap: "512K" (default: 512KB, max gap between parquet byte ranges to coalesce)
spark.indextables.read.defaultLimit: 250 (default: 250, max docs per partition when no LIMIT pushed down)
Auto-selection: 1 split/task for small tables, batched for larger tables:
spark.indextables.read.splitsPerTask: "auto" (default: auto, or numeric value)
spark.indextables.read.maxSplitsPerTask: 8 (default: 8)
spark.indextables.read.aggregate.splitsPerTask: (falls back to read.splitsPerTask)
spark.indextables.read.aggregate.maxSplitsPerTask: (falls back to read.maxSplitsPerTask)
Reduces complexity from O(nf) to O(pf):
spark.indextables.partitionPruning.filterCacheEnabled: true (default: true)
spark.indextables.partitionPruning.indexEnabled: true (default: true)
spark.indextables.partitionPruning.parallelThreshold: 100 (default: 100)
spark.indextables.partitionPruning.selectivityOrdering: true (default: true)
All disabled by default. Enable to allow aggregate pushdown with pattern filters. Note: for text fields, these match individual tokens, not full strings.
spark.indextables.filter.stringPattern.pushdown: false (master switch)
spark.indextables.filter.stringStartsWith.pushdown: false (efficient - uses sorted index terms)
spark.indextables.filter.stringEndsWith.pushdown: false (less efficient - requires term scanning)
spark.indextables.filter.stringContains.pushdown: false (least efficient)
Auto-detects /local_disk0 when available:
spark.indextables.indexWriter.tempDirectoryPath: "/local_disk0/temp" (or auto-detect)
spark.indextables.cache.directoryPath: "/local_disk0/cache" (or auto-detect)
spark.indextables.merge.tempDirectoryPath: "/local_disk0/merge-temp" (or auto-detect)
spark.indextables.purge.defaultRetentionHours: 168 (7 days for splits)
spark.indextables.purge.minRetentionHours: 24 (safety check)
spark.indextables.purge.retentionCheckEnabled: true
spark.indextables.purge.parallelism: <auto>
spark.indextables.purge.maxFilesToDelete: 1000000
spark.indextables.purge.deleteRetries: 3
Bridges tantivy4java's native (Rust) memory allocations with Spark's unified memory manager, giving Spark visibility and control over native memory usage (index writers, merge buffers, caches).
Prerequisites: Requires spark.memory.offHeap.enabled=true and a non-zero spark.memory.offHeap.size. Without these, native memory requests receive 0-byte grants and allocations proceed untracked.
spark.indextables.native.memory.enabled: true (default: true, set false to use tantivy4java's unlimited pool)
// Required Spark settings for native memory tracking:
spark.memory.offHeap.enabled: true
spark.memory.offHeap.size: "4g" (recommended: 2-4x the largest writer heap or merge heap)
Use DESCRIBE INDEXTABLES ENVIRONMENT to verify native memory integration is active (native_memory.configured = true) and monitor usage (native_memory.peak_bytes, native_memory.used_bytes).
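The 2-4x sizing recommendation above can be turned into a quick calculation. The sketch below is illustrative only: `parse_size` and `offheap_range` are hypothetical helper names, and the binary unit suffixes (K = 1024 bytes) are an assumption consistent with the "4K" = 4096 bytes default listed for cache.disk.minCompressSize.

```python
# Assumed binary multipliers for the size suffixes used in this reference.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size string such as "100M", "2G" or "4g" to bytes."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)  # a plain number is treated as a byte count


def offheap_range(heap_sizes):
    """Return (low, high) bytes: 2x and 4x the largest configured heap."""
    largest = max(parse_size(size) for size in heap_sizes)
    return 2 * largest, 4 * largest


# Writer heap of "100M" and companion writer heap of "1G": the 1G heap dominates.
print(offheap_range(["100M", "1G"]))  # (2147483648, 4294967296), i.e. 2 GiB to 4 GiB
```

Under these assumptions, a deployment whose largest heap is 1G would set spark.memory.offHeap.size somewhere between "2g" and "4g".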
spark.indextables.companion.sync.distributedLogRead.enabled: true (default: true)
// When true, BUILD INDEXTABLES COMPANION reads the source table's transaction log
// in a distributed fashion across Spark executors, avoiding driver OOM for tables
// with millions of files. Falls back to single-call path on failure.
spark.indextables.companion.sync.arrowFfi.enabled: true (default: true)
// When true and a WHERE clause provides a PartitionFilter, distributed checkpoint/manifest
// reads use Arrow FFI (zero-copy columnar export) instead of TANT buffer serialization.
// Eliminates per-entry JNI overhead. Set to false to use the TANT buffer path.
spark.indextables.companion.sync.batchSize: <auto> (default: defaultParallelism)
spark.indextables.companion.sync.maxConcurrentBatches: 6 (default: 6)
spark.indextables.companion.writerHeapSize: "1G" (default: 1GB)
spark.indextables.companion.readerBatchSize: 8192 (default: 8192)
spark.indextables.companion.schedulerPool: "indextables-companion" (default)