feat: add vector column support for Lance via arrow FixedSizeList conversion #75
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
    import scala.collection.JavaConverters._

    object LanceArrowUtils {
      val ARROW_FIXED_SIZE_LIST_SIZE_KEY = "arrow.FixedSizeList.size"
Is this an existing parameter or something new? It feels strange to have camel case here. If we can change it, it should use something like `arrow.fixed-size-list.size` or `arrow_fixed_size_list_size`.

It's a new parameter, added to identify which columns need conversion. OK, I can use `arrow.fixed-size-list.size` or `arrow_fixed_size_list_size`.

Changed to `arrow.fixed-size-list.size`.
    String outputPath = "/tmp/vector_test_float32.lance";
    deleteDirectory(new File(outputPath));

    df.write()
This probably covers the DataFrame write case. What about when we just run CREATE TABLE to create an empty table? For example, see https://github.com/lancedb/lance-spark/blob/main/lance-spark-base_2.12/src/test/java/com/lancedb/lance/spark/SparkLanceNamespaceTestBase.java#L135.
I think we can do something like
CREATE TABLE (vector ARRAY<FLOAT>)
TBLPROPERTIES (
'vector.arrow.fixed-size-list.size'='1024'
)
Then you can do the conversion in https://github.com/lancedb/lance-spark/blob/main/lance-spark-base_2.12/src/main/java/com/lancedb/lance/spark/utils/SchemaConverter.java, taking the additional table property values from https://github.com/lancedb/lance-spark/blob/main/lance-spark-base_2.12/src/main/java/com/lancedb/lance/spark/LanceNamespaceSparkCatalog.java#L408.

Good call! I forgot about the SQL statement. Updated to use your approach.
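To make the table-property convention suggested above concrete, here is a minimal sketch of how `<column>.arrow.fixed-size-list.size` entries could be turned into a column-to-size map before schema conversion. The class `VectorTableProperties` and its method are hypothetical names for illustration only; they are not taken from lance-spark, and the actual handling in `SchemaConverter` may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of lance-spark. */
final class VectorTableProperties {
    // Property suffix as proposed in the review thread.
    static final String SIZE_SUFFIX = ".arrow.fixed-size-list.size";

    private VectorTableProperties() {}

    /**
     * Collects every "<column>.arrow.fixed-size-list.size" property and
     * returns a map from column name to the declared list size.
     * Properties without the suffix are ignored.
     */
    static Map<String, Integer> fixedSizeListColumns(Map<String, String> tableProperties) {
        Map<String, Integer> sizes = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : tableProperties.entrySet()) {
            String key = entry.getKey();
            // Skip unrelated keys and keys with an empty column name.
            if (!key.endsWith(SIZE_SUFFIX) || key.length() == SIZE_SUFFIX.length()) {
                continue;
            }
            String column = key.substring(0, key.length() - SIZE_SUFFIX.length());
            String raw = entry.getValue();
            int size;
            try {
                size = Integer.parseInt(raw == null ? "" : raw.trim());
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException(
                    "Invalid fixed-size-list size for column '" + column + "': " + raw, e);
            }
            if (size <= 0) {
                throw new IllegalArgumentException(
                    "Fixed-size-list size for column '" + column + "' must be positive: " + size);
            }
            sizes.put(column, size);
        }
        return sizes;
    }
}
```

With the properties from the CREATE TABLE example, `fixedSizeListColumns` would map the column `vector` to `1024`; a converter could then emit a fixed-size list field for that column instead of a variable-length list.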
    newDF.write().mode("append").saveAsTable("users");
    ```

    ## Writing Vector Columns
I just split the docs to be more detailed, with one operation per page. You can make the changes in https://github.com/lancedb/lance-spark/blob/main/docs/src/operations/ddl/create-table.md and https://github.com/lancedb/lance-spark/blob/main/docs/src/operations/ddl/dataframe-create-table.md.

Updated those separate docs!
Requires lance-format/lance-namespace#187 to pass Spark / build-and-test (17) (pull_request). Addressed all the comments. @jackye1995 PTAL, thanks.

The 0.0.7 namespace jars are released. Make sure you search for all 0.0.6 references and update them to 0.0.7; there are some references in the documentation that should be fixed along the way. Also, can you bump the Spark jar versions so we can publish a release after merging this?
- Updated lance-namespace-glue in docs examples
- Updated lance-namespace-hive3 in docs examples
- Updated LANCE_NS_VERSION in Dockerfile
- Updated lance-namespace-glue test dependency in pom.xml

- Split FixedSizeListVectorTest into two focused test classes:
  - FixedSizeListSQLTest: Tests SQL CREATE TABLE and INSERT operations
  - FixedSizeListDataFrameTest: Tests DataFrame API write/read operations
- Each test now validates both write and read paths
- Added verification to VectorWriteTest to ensure data can be read back
- Removed duplication and improved test organization
ee3c8ae to 828b707
- Updated all pom.xml versions from 0.0.5 to 0.0.6
- Updated lance-spark.version property to 0.0.6
- Updated documentation examples to use 0.0.6
- Updated Dockerfile LANCE_SPARK_VERSION to 0.0.6
- All lance-namespace dependencies already at 0.0.7
- Removed redundant VectorWriteTest (covered by FixedSizeListDataFrameTest)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
828b707 to dbc6b30
- Replace scala.collection.mutable.WrappedArray with scala.collection.Seq
- Use Row.getSeq() method for cross-version compatibility
- Fixes compilation errors in Scala 2.13 builds

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…version (lance-format#75)

This PR adds support for writing Spark DataFrame `ArrayType` columns as Arrow `FixedSizeList` to enable vector embedding storage and indexing in the Lance format. Users can now mark array columns as vectors using metadata, enabling efficient similarity search and vector indexing for ML workloads.

It copies the Spark DataFrame-to-Arrow type converter and adds a custom conversion from Spark Array<Float/Double> to Arrow FixedSizeList.

Co-authored-by: Claude <noreply@anthropic.com>
This PR adds support for writing Spark DataFrame `ArrayType` columns as Arrow `FixedSizeList` to enable vector embedding storage and indexing in the Lance format. Users can now mark array columns as vectors using metadata, enabling efficient similarity search and vector indexing for ML workloads.

It copies the Spark DataFrame-to-Arrow type converter and adds a custom conversion from Spark Array<Float/Double> to Arrow FixedSizeList.
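As a rough illustration of what distinguishes a fixed-size list from a variable-length array, the sketch below flattens a batch of equally sized float vectors into one contiguous values array, which is how Arrow's FixedSizeList layout stores its child values (row `i` occupies positions `i * listSize` to `(i + 1) * listSize - 1`). It is plain Java with no Spark or Arrow dependency; `FixedSizeListSketch` and its methods are hypothetical names, null handling (Arrow's validity bitmap) is left out, and this is not the converter added in this PR.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the fixed-size list layout; not the PR's converter. */
final class FixedSizeListSketch {
    private FixedSizeListSketch() {}

    /**
     * Flattens rows of exactly listSize floats into one contiguous array.
     * Rejects null rows and rows whose length differs from listSize, since
     * every row in a fixed-size list must have the same number of elements.
     */
    static float[] flatten(List<float[]> rows, int listSize) {
        if (listSize <= 0) {
            throw new IllegalArgumentException("listSize must be positive: " + listSize);
        }
        float[] values = new float[Math.multiplyExact(rows.size(), listSize)];
        for (int i = 0; i < rows.size(); i++) {
            float[] row = rows.get(i);
            if (row == null || row.length != listSize) {
                throw new IllegalArgumentException(
                    "Row " + i + " does not have exactly " + listSize + " elements");
            }
            // Row i starts at offset i * listSize; no offsets buffer is needed.
            System.arraycopy(row, 0, values, i * listSize, listSize);
        }
        return values;
    }

    /** Reads row `index` back out of the flattened values array. */
    static float[] rowAt(float[] values, int listSize, int index) {
        return Arrays.copyOfRange(values, index * listSize, (index + 1) * listSize);
    }
}
```

Because every row has the same length, a row's position follows from its index alone, which is why a declared size has to accompany the column (through metadata or table properties) before a plain array column can be written this way.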