feat: add vector column support for Lance via arrow FixedSizeList conversion#75

Merged
LuQQiu merged 13 commits into
lance-format:mainfrom
LuQQiu:lu/fixedSizeList2
Aug 8, 2025

@LuQQiu LuQQiu commented Aug 7, 2025

Contributor

This PR adds support for writing Spark DataFrame ArrayType columns as Arrow FixedSizeList to enable vector embeddings storage and indexing in Lance format. Users can now mark array columns as vectors using metadata, enabling efficient similarity search and vector indexing for ML workloads.

Copied the Spark DataFrame-to-Arrow type converter and added a custom Spark Array<Float/Double> to Arrow FixedSizeList conversion.
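
To illustrate what a FixedSizeList layout implies for the data, here is a minimal JDK-only sketch. It is not the lance-spark implementation, and the class and method names are hypothetical. It checks that every embedding has the declared dimension and packs the rows into one contiguous buffer, mirroring how an Arrow FixedSizeList stores its child values back to back.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of lance-spark.
final class FixedSizeListSketch {

  // Packs rows of equal length into one flat buffer, so that row i
  // occupies indices [i * size, (i + 1) * size).
  static float[] pack(List<float[]> rows, int size) {
    float[] flat = new float[rows.size() * size];
    for (int i = 0; i < rows.size(); i++) {
      float[] row = rows.get(i);
      if (row.length != size) {
        // A fixed-size list cannot hold rows of differing lengths.
        throw new IllegalArgumentException(
            "row " + i + " has " + row.length + " elements, expected " + size);
      }
      System.arraycopy(row, 0, flat, i * size, size);
    }
    return flat;
  }

  public static void main(String[] args) {
    float[] flat = pack(List.of(new float[] {1f, 2f}, new float[] {3f, 4f}), 2);
    System.out.println(Arrays.toString(flat));
  }
}
```

The fixed dimension is what distinguishes a vector column from an ordinary variable-length array: a row with the wrong length is rejected rather than stored.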

@LuQQiu LuQQiu requested a review from jackye1995 August 7, 2025 21:41

github-actions Bot commented Aug 7, 2025

Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@LuQQiu LuQQiu changed the title feat: Add vector column support for ML workloads via FixedSizeList conversion feat: add vector column support for Lance via FixedSizeList conversion Aug 7, 2025
@LuQQiu LuQQiu changed the title feat: add vector column support for Lance via FixedSizeList conversion feat: add vector column support for Lance via arrow FixedSizeList conversion Aug 7, 2025
import scala.collection.JavaConverters._

object LanceArrowUtils {
val ARROW_FIXED_SIZE_LIST_SIZE_KEY = "arrow.FixedSizeList.size"

@jackye1995 jackye1995 Aug 7, 2025

Contributor

Is this an existing parameter, or something new? It feels strange to have camel case here. If we can change it, it should use something like arrow.fixed-size-list.size or arrow_fixed_size_list_size.

Contributor Author

It's a new parameter added to identify which columns need conversion. OK, can do arrow.fixed-size-list.size or arrow_fixed_size_list_size.

Contributor Author

changed to arrow.fixed-size-list.size
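
The role of this metadata key can be sketched in plain Java. This is an illustration under assumptions, not the LanceArrowUtils code: the class name, the string type labels, and the fallback to a regular list when the key is absent or the element type is unsupported are all hypothetical.

```java
import java.util.Map;

// Hypothetical sketch for illustration only; not the LanceArrowUtils code.
final class VectorMetadataSketch {
  static final String SIZE_KEY = "arrow.fixed-size-list.size";

  // Describes the Arrow type an array column would map to under the
  // assumed rule: float/double arrays that carry the size key become a
  // FixedSizeList, everything else stays a variable-length List.
  static String arrowTypeFor(String elementType, Map<String, String> fieldMetadata) {
    String size = fieldMetadata.get(SIZE_KEY);
    boolean supported = elementType.equals("float") || elementType.equals("double");
    if (size == null || !supported) {
      return "List<" + elementType + ">";
    }
    return "FixedSizeList<" + elementType + ">[" + Integer.parseInt(size) + "]";
  }

  public static void main(String[] args) {
    System.out.println(arrowTypeFor("float", Map.of(SIZE_KEY, "128")));
    System.out.println(arrowTypeFor("float", Map.of()));
  }
}
```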

String outputPath = "/tmp/vector_test_float32.lance";
deleteDirectory(new File(outputPath));

df.write()

Contributor

This probably covers the DataFrame write case. What if we just run CREATE TABLE to create an empty table? For example, https://github.com/lancedb/lance-spark/blob/main/lance-spark-base_2.12/src/test/java/com/lancedb/lance/spark/SparkLanceNamespaceTestBase.java#L135.

I think we can do something like

CREATE TABLE (vector ARRAY<FLOAT>)
TBLPROPERTIES (
  'vector.arrow.fixed-size-list.size'='1024'
)

And then you can do the conversion at https://github.com/lancedb/lance-spark/blob/main/lance-spark-base_2.12/src/main/java/com/lancedb/lance/spark/utils/SchemaConverter.java to take the additional table properties values at https://github.com/lancedb/lance-spark/blob/main/lance-spark-base_2.12/src/main/java/com/lancedb/lance/spark/LanceNamespaceSparkCatalog.java#L408
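
The table-property convention suggested above (`<column>.arrow.fixed-size-list.size`) could be read roughly as follows. This is a JDK-only sketch under assumptions, not the actual SchemaConverter or LanceNamespaceSparkCatalog code; the class name, method name, and validation rules are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not the lance-spark SchemaConverter.
final class VectorPropertySketch {
  static final String SUFFIX = ".arrow.fixed-size-list.size";

  // Returns column name -> vector dimension for every property whose key
  // ends with the suffix; all other properties are ignored.
  static Map<String, Integer> vectorColumns(Map<String, String> tableProperties) {
    Map<String, Integer> result = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : tableProperties.entrySet()) {
      String key = entry.getKey();
      if (key.endsWith(SUFFIX) && key.length() > SUFFIX.length()) {
        String column = key.substring(0, key.length() - SUFFIX.length());
        int size = Integer.parseInt(entry.getValue().trim());
        if (size <= 0) {
          throw new IllegalArgumentException(
              "vector size for column " + column + " must be positive: " + size);
        }
        result.put(column, size);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> props =
        Map.of("vector.arrow.fixed-size-list.size", "1024", "owner", "test");
    System.out.println(vectorColumns(props));
  }
}
```

A schema converter could then consult the resulting map while walking the table's columns and emit a fixed-size list for each listed array column.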

Contributor Author

Good call! I forgot about the SQL statement. I have updated the PR to use your approach.

Comment thread docs/src/user-guide/write.md Outdated
newDF.write().mode("append").saveAsTable("users");
```

## Writing Vector Columns

Contributor Author

Updated those in the separate doc!

LuQQiu commented Aug 8, 2025

Contributor Author

Requires lance-format/lance-namespace#187 for Spark / build-and-test (17) (pull_request) to pass.
Spark / build-and-test (17) (pull_request): failing after 1m

Addressed all the comments. @jackye1995 PTAL, thanks.

@LuQQiu LuQQiu changed the title feat: add vector column support for Lance via arrow FixedSizeList conversion feat: add vector column support for Lance via arrow FixedSizeList conversion Aug 8, 2025
@github-actions github-actions Bot added the enhancement New feature or request label Aug 8, 2025
@jackye1995

Contributor

The 0.0.7 namespace jars are released. Make sure you search for all references to 0.0.6 and update them to 0.0.7; there are some references in the documentation that should be fixed along the way.

Also, can you bump the Spark jar versions so we can publish a release after merging this?

LuQQiu added 3 commits August 8, 2025 09:54
- Updated lance-namespace-glue in docs examples
- Updated lance-namespace-hive3 in docs examples
- Updated LANCE_NS_VERSION in Dockerfile
- Updated lance-namespace-glue test dependency in pom.xml
- Split FixedSizeListVectorTest into two focused test classes:
  - FixedSizeListSQLTest: Tests SQL CREATE TABLE and INSERT operations
  - FixedSizeListDataFrameTest: Tests DataFrame API write/read operations
- Each test now validates both write and read paths
- Added verification to VectorWriteTest to ensure data can be read back
- Removed duplication and improved test organization
@LuQQiu LuQQiu force-pushed the lu/fixedSizeList2 branch from ee3c8ae to 828b707 Compare August 8, 2025 17:24
- Updated all pom.xml versions from 0.0.5 to 0.0.6
- Updated lance-spark.version property to 0.0.6
- Updated documentation examples to use 0.0.6
- Updated Dockerfile LANCE_SPARK_VERSION to 0.0.6
- All lance-namespace dependencies already at 0.0.7
- Removed redundant VectorWriteTest (covered by FixedSizeListDataFrameTest)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@LuQQiu LuQQiu force-pushed the lu/fixedSizeList2 branch from 828b707 to dbc6b30 Compare August 8, 2025 18:36
- Replace scala.collection.mutable.WrappedArray with scala.collection.Seq
- Use Row.getSeq() method for cross-version compatibility
- Fixes compilation errors in Scala 2.13 builds

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@LuQQiu LuQQiu merged commit 3fd6143 into lance-format:main Aug 8, 2025
3 checks passed
jiaoew1991 pushed a commit to jiaoew1991/lance-spark that referenced this pull request Sep 16, 2025
…version (lance-format#75)

This PR adds support for writing Spark DataFrame `ArrayType` columns as
Arrow `FixedSizeList` to enable vector embeddings storage and indexing
in Lance format. Users can now mark array columns as vectors using
metadata, enabling efficient similarity search and vector indexing for
ML workloads.

Copied the Spark DataFrame-to-Arrow type converter and added a custom Spark
Array<Float/Double> to Arrow FixedSizeList conversion.

---------

Co-authored-by: Claude <noreply@anthropic.com>
jiaoew1991 pushed a commit to jiaoew1991/lance-spark that referenced this pull request Dec 1, 2025
…version (lance-format#75)
jiaoew1991 pushed a commit to jiaoew1991/lance-spark that referenced this pull request Feb 25, 2026
…version (lance-format#75)