---
layout: default
title: "Chapter 2: Data Modeling"
parent: LanceDB Tutorial
nav_order: 2
---
Welcome to Chapter 2: Data Modeling. In this part of LanceDB Tutorial: Serverless Vector Database for AI, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
In this chapter you will learn to design effective schemas for vector data, work with Lance data types, and model complex data structures.
Proper data modeling is crucial for building efficient vector search applications. This chapter covers schema design, data types, Pydantic models, and best practices for structuring your data in LanceDB.
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from typing import Optional, List
from datetime import datetime

class Article(LanceModel):
    """Schema for news articles."""
    id: str
    title: str
    content: str
    author: str
    published_at: datetime
    category: str
    tags: List[str]
    vector: Vector(384)  # Embedding dimension
    metadata: Optional[dict] = None

# Create table with schema
db = lancedb.connect("./my_lancedb")
table = db.create_table("articles", schema=Article)

# Add data (validated against schema)
article = Article(
    id="article-001",
    title="Introduction to LanceDB",
    content="LanceDB is a vector database...",
    author="John Doe",
    published_at=datetime.now(),
    category="technology",
    tags=["database", "vectors", "ai"],
    vector=[0.1] * 384
)
table.add([article])
```
```python
import pyarrow as pa
import lancedb

# Define schema with PyArrow
schema = pa.schema([
    pa.field("id", pa.string()),
    pa.field("title", pa.string()),
    pa.field("content", pa.string()),
    pa.field("score", pa.float64()),
    pa.field("count", pa.int32()),
    pa.field("is_active", pa.bool_()),
    pa.field("created_at", pa.timestamp("us")),
    pa.field("tags", pa.list_(pa.string())),
    pa.field("vector", pa.list_(pa.float32(), 384)),
    pa.field("metadata", pa.struct([
        pa.field("source", pa.string()),
        pa.field("version", pa.int32())
    ]))
])

db = lancedb.connect("./my_lancedb")
table = db.create_table("pyarrow_table", schema=schema)
```
```python
from lancedb.pydantic import LanceModel, Vector
from datetime import datetime, date
from typing import Optional

class ScalarExample(LanceModel):
    # Numeric types
    integer_field: int
    float_field: float

    # String type
    string_field: str

    # Boolean type
    bool_field: bool

    # Date/Time types
    datetime_field: datetime
    date_field: date

    # Optional fields
    optional_string: Optional[str] = None
    optional_int: Optional[int] = None

    # Vector field
    vector: Vector(384)
```
```python
from lancedb.pydantic import LanceModel, Vector
from typing import List, Dict, Any

class CollectionExample(LanceModel):
    # List of strings
    tags: List[str]

    # List of numbers
    scores: List[float]

    # Nested list
    matrix: List[List[float]]

    # Dictionary (stored as struct)
    metadata: Dict[str, Any]

    # Vector (special list type)
    vector: Vector(384)
```
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from typing import List, Optional
from pydantic import BaseModel

# Nested models (plain BaseModel, not LanceModel)
class Address(BaseModel):
    street: str
    city: str
    country: str
    postal_code: str

class ContactInfo(BaseModel):
    email: str
    phone: Optional[str] = None
    address: Optional[Address] = None

# Main model with nested structures
class Customer(LanceModel):
    id: str
    name: str
    contact: ContactInfo
    purchase_history: List[str]
    vector: Vector(384)

# Usage
db = lancedb.connect("./my_lancedb")
table = db.create_table("customers", schema=Customer)

customer = Customer(
    id="cust-001",
    name="Alice Smith",
    contact=ContactInfo(
        email="alice@example.com",
        phone="+1234567890",
        address=Address(
            street="123 Main St",
            city="San Francisco",
            country="USA",
            postal_code="94102"
        )
    ),
    purchase_history=["order-001", "order-002"],
    vector=[0.1] * 384
)
table.add([customer])
```
```python
from lancedb.pydantic import LanceModel, Vector

# Common embedding dimensions
class SmallEmbedding(LanceModel):
    text: str
    vector: Vector(384)   # all-MiniLM-L6-v2

class MediumEmbedding(LanceModel):
    text: str
    vector: Vector(768)   # BERT-base, all-mpnet-base-v2

class LargeEmbedding(LanceModel):
    text: str
    vector: Vector(1536)  # OpenAI text-embedding-3-small

class XLargeEmbedding(LanceModel):
    text: str
    vector: Vector(3072)  # OpenAI text-embedding-3-large
```
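Dimension choice has a direct storage cost. A quick back-of-envelope sketch (float32 vectors, raw bytes only, ignoring Lance's on-disk encoding and compression):

```python
# Raw vector storage per row and per million rows, assuming float32
# (4 bytes per dimension) and ignoring file-format overhead.
DIMS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

for model, dim in DIMS.items():
    bytes_per_row = dim * 4
    gb_per_million = bytes_per_row * 1_000_000 / 1024**3
    print(f"{model:24s} {dim:5d} dims  {bytes_per_row:6d} B/row  ~{gb_per_million:.2f} GB per 1M rows")
```

Doubling dimensions doubles both storage and distance-computation cost, so pick the smallest model that meets your quality bar.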
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector

class MultiVectorDocument(LanceModel):
    """Document with multiple embedding types."""
    id: str
    title: str
    content: str
    # Different embeddings for different purposes
    title_vector: Vector(384)    # Title embedding
    content_vector: Vector(768)  # Content embedding
    summary_vector: Vector(384)  # Summary embedding

# Search on a specific vector field
db = lancedb.connect("./my_lancedb")
table = db.create_table("multi_vector", schema=MultiVectorDocument)

# Search by title (the query vector must match the column's dimension: 384)
title_query = [0.1] * 384
results = table.search(title_query, vector_column_name="title_vector").limit(10).to_pandas()

# Search by content (768 dimensions to match content_vector)
content_query = [0.1] * 768
results = table.search(content_query, vector_column_name="content_vector").limit(10).to_pandas()
```
```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

# OpenAI embeddings
openai_embed = get_registry().get("openai").create(
    name="text-embedding-3-small"
)

class OpenAIDocument(LanceModel):
    text: str = openai_embed.SourceField()
    vector: Vector(openai_embed.ndims()) = openai_embed.VectorField()

# Sentence Transformers
st_embed = get_registry().get("sentence-transformers").create(
    name="all-MiniLM-L6-v2"
)

class STDocument(LanceModel):
    text: str = st_embed.SourceField()
    vector: Vector(st_embed.ndims()) = st_embed.VectorField()

# Usage - embeddings are created automatically
db = lancedb.connect("./my_lancedb")
table = db.create_table("auto_embed", schema=OpenAIDocument)

# Just add text; the vector is computed automatically
table.add([
    {"text": "Hello world"},
    {"text": "LanceDB is great"},
])

# Search with text (automatically embedded)
results = table.search("greeting").limit(5).to_pandas()
```
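`SourceField`/`VectorField` remove a step you would otherwise write by hand: embedding each record's source text before it reaches the table. A stdlib-only sketch of that manual step, with `fake_embed` as a hypothetical stand-in for a real model:

```python
# What the embedding registry automates: compute a vector for each record's
# source text, then attach it to the record before insertion.
def fake_embed(texts):
    # Hypothetical stand-in for a real embedding model: 4-dim toy vectors
    return [[float(len(t) % 7)] * 4 for t in texts]

def with_vectors(records, embed_fn, source_field="text"):
    texts = [r[source_field] for r in records]
    vectors = embed_fn(texts)
    return [{**r, "vector": v} for r, v in zip(records, vectors)]

rows = with_vectors([{"text": "Hello world"}, {"text": "LanceDB is great"}], fake_embed)
print(rows[0]["vector"])  # [4.0, 4.0, 4.0, 4.0]
```

The registry versions of this pipeline also embed your query string at search time, which is why `table.search("greeting")` works without an explicit vector.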
```python
import numpy as np
from lancedb.embeddings import EmbeddingFunction, get_registry, register
from lancedb.pydantic import LanceModel, Vector

@register("my-custom-embedder")
class CustomEmbedder(EmbeddingFunction):
    # EmbeddingFunction is a Pydantic model: declare configuration
    # as fields rather than in __init__
    model_name: str = "default"

    def ndims(self) -> int:
        return 384

    def compute_source_embeddings(self, texts: list[str]) -> list[list[float]]:
        """Compute embeddings for source texts."""
        # Your embedding logic here
        embeddings = []
        for text in texts:
            # Example: deterministic pseudo-random embedding (not for production!)
            vec = np.random.RandomState(hash(text) % 2**32).rand(self.ndims())
            embeddings.append(vec.tolist())
        return embeddings

    def compute_query_embeddings(self, query: str) -> list[list[float]]:
        """Compute the embedding for a query (returned as a one-element list)."""
        return self.compute_source_embeddings([query])

# Usage
custom_embed = get_registry().get("my-custom-embedder").create()

class CustomDocument(LanceModel):
    text: str = custom_embed.SourceField()
    vector: Vector(custom_embed.ndims()) = custom_embed.VectorField()
```
```python
from lancedb.pydantic import LanceModel, Vector
from typing import Optional, List
from datetime import datetime

class Document(LanceModel):
    """General document storage pattern."""
    # Identity
    id: str
    source: str  # 'web', 'pdf', 'api', etc.

    # Content
    title: str
    content: str
    summary: Optional[str] = None

    # Metadata
    author: Optional[str] = None
    created_at: datetime
    updated_at: Optional[datetime] = None
    tags: List[str] = []

    # Chunking info (for large documents)
    chunk_index: Optional[int] = None
    total_chunks: Optional[int] = None
    parent_id: Optional[str] = None  # Reference to parent document

    # Embedding
    vector: Vector(384)
```
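The chunking fields above imply a splitting step upstream. A minimal stdlib sketch that produces records matching those fields (character-based splitting, with `chunk_size` and `overlap` as illustrative parameters; real pipelines often split on tokens or sentences):

```python
# Produce chunk records that fill the Document pattern's chunking fields
# (chunk_index, total_chunks, parent_id).
def chunk_document(doc_id: str, content: str, chunk_size: int = 1000, overlap: int = 100):
    step = chunk_size - overlap
    pieces = [content[i:i + chunk_size] for i in range(0, len(content), step)]
    return [
        {
            "id": f"{doc_id}-chunk-{i}",
            "parent_id": doc_id,
            "chunk_index": i,
            "total_chunks": len(pieces),
            "content": piece,
        }
        for i, piece in enumerate(pieces)
    ]

records = chunk_document("article-001", "x" * 2500)
print(len(records), records[1]["chunk_index"])  # 3 1
```

Storing `parent_id` lets you reassemble or deduplicate results when several chunks of the same document match a query.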
```python
from lancedb.pydantic import LanceModel, Vector
from typing import Optional, List

class Product(LanceModel):
    """E-commerce product catalog pattern."""
    # Identity
    sku: str
    name: str

    # Description (for vector search)
    description: str
    features: List[str]

    # Categorization
    category: str
    subcategory: Optional[str] = None
    brand: str

    # Pricing
    price: float  # floats are lossy for money; consider integer cents
    currency: str = "USD"

    # Inventory
    in_stock: bool
    quantity: int

    # Media
    image_url: Optional[str] = None
    thumbnail_url: Optional[str] = None

    # Search vectors
    text_vector: Vector(384)                    # Text description embedding
    image_vector: Optional[Vector(512)] = None  # Image embedding
```
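One caveat on the pattern above: a `float` price is convenient but lossy for money. A common alternative (an application-level choice, not a LanceDB requirement) is to store integer cents, parsing with `Decimal` to avoid float rounding:

```python
from decimal import Decimal

def to_cents(price: str) -> int:
    # Parse with Decimal so "19.99" does not pick up float error,
    # then store the exact value as integer cents.
    return int(Decimal(price) * 100)

print(to_cents("19.99"))  # 1999
```

Convert back for display with `Decimal(cents) / 100`; the vector-search columns are unaffected either way.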
```python
from lancedb.pydantic import LanceModel, Vector
from typing import Optional
from datetime import datetime

class ChatMessage(LanceModel):
    """Chat conversation storage pattern."""
    # Identity
    message_id: str
    conversation_id: str
    user_id: str

    # Message content
    role: str  # 'user', 'assistant', 'system'
    content: str

    # Timing
    timestamp: datetime

    # Context
    parent_message_id: Optional[str] = None
    metadata: Optional[dict] = None

    # For semantic search over chat history
    vector: Vector(384)

# Example: find relevant past conversations
def find_similar_conversations(query: str, user_id: str, table, embed_fn):
    query_vector = embed_fn(query)
    results = (
        table.search(query_vector)
        .where(f"user_id = '{user_id}'")
        .limit(10)
        .to_pandas()
    )
    return results
```
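The f-string filter in `find_similar_conversations` is fine for trusted, system-generated IDs, but a `user_id` containing a quote would produce a malformed filter. A small escaping helper (an assumption for illustration, not part of the LanceDB API) keeps the SQL string well-formed:

```python
def sql_quote(value: str) -> str:
    # Escape embedded single quotes SQL-style by doubling them,
    # then wrap the value in quotes for use in a .where() filter.
    return "'" + value.replace("'", "''") + "'"

clause = "user_id = " + sql_quote("o'brien")
print(clause)  # user_id = 'o''brien'
```

For untrusted input, validating IDs against an allowlist pattern before interpolation is safer still.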
```python
from lancedb.pydantic import LanceModel, Vector
from typing import Optional
from datetime import datetime

class MultiModalItem(LanceModel):
    """Multi-modal content (text + image + audio)."""
    id: str
    type: str  # 'text', 'image', 'audio', 'video'

    # Content references
    text_content: Optional[str] = None
    media_url: Optional[str] = None
    thumbnail_url: Optional[str] = None

    # Metadata
    title: str
    description: Optional[str] = None
    duration_seconds: Optional[float] = None
    file_size_bytes: Optional[int] = None
    mime_type: Optional[str] = None

    # Multiple embedding types
    text_vector: Optional[Vector(384)] = None   # Text embedding
    image_vector: Optional[Vector(512)] = None  # CLIP image embedding
    audio_vector: Optional[Vector(256)] = None  # Audio embedding

    # Timestamps
    created_at: datetime
    updated_at: Optional[datetime] = None
```
```python
import lancedb

db = lancedb.connect("./my_lancedb")
table = db.open_table("my_table")

# Add a new column; values are SQL expressions, so quote string literals
table.add_columns({
    "new_column": "'default_value'"
})

# Add a computed column derived from an existing one
table.add_columns({
    "text_length": "length(content)"
})
```
```python
from lancedb.pydantic import LanceModel, Vector
from typing import Optional

# Original schema
class DocumentV1(LanceModel):
    id: str
    content: str
    vector: Vector(384)

# Updated schema with new fields
class DocumentV2(LanceModel):
    id: str
    content: str
    vector: Vector(384)
    # New fields with defaults
    title: Optional[str] = None
    category: Optional[str] = "uncategorized"
    version: int = 2

# Migration approach: copy data into a new table with backfilled defaults
def migrate_table(db, old_name: str, new_name: str):
    old_table = db.open_table(old_name)
    old_data = old_table.to_pandas()

    # Add default values for new columns
    old_data["title"] = None
    old_data["category"] = "uncategorized"
    old_data["version"] = 2

    # Create new table
    new_table = db.create_table(new_name, old_data)
    return new_table
```
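The pandas migration above can also run record-by-record, which is useful when streaming rows instead of loading a whole DataFrame. A stdlib sketch of the same backfill:

```python
# Bring a v1 record up to the DocumentV2 shape by backfilling defaults.
V2_DEFAULTS = {"title": None, "category": "uncategorized", "version": 2}

def upgrade_record(record: dict) -> dict:
    upgraded = {**V2_DEFAULTS, **record}  # existing fields win over defaults
    upgraded["version"] = 2               # always stamp the target version
    return upgraded

old = {"id": "doc-1", "content": "hello", "vector": [0.0] * 384}
new = upgrade_record(old)
print(new["category"], new["version"])  # uncategorized 2
```

Keeping the defaults in one dict makes the migration self-documenting and easy to extend for a future v3.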
```python
from datetime import datetime
from lancedb.pydantic import LanceModel, Vector

# Table names: lowercase with underscores
table_name = "user_documents"
table_name = "product_catalog"
table_name = "chat_messages"

# Column names: lowercase with underscores
class GoodNaming(LanceModel):
    user_id: str
    created_at: datetime
    is_active: bool
    text_vector: Vector(384)

# Avoid
class BadNaming(LanceModel):
    UserID: str            # Avoid CamelCase
    CreatedAt: datetime
    isActive: bool         # Avoid camelCase
    textVector: Vector(384)
```
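When ingesting data whose columns arrive in CamelCase, a small normalizer (an illustrative helper, not part of LanceDB) can rename them to the convention above before you define the schema:

```python
import re

def to_snake_case(name: str) -> str:
    # Split before a capital that follows a lowercase/digit, or before the
    # last capital of an acronym run ("UserID" -> "User_ID"), then lowercase.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "_", name).lower()

for bad in ["UserID", "CreatedAt", "isActive", "textVector"]:
    print(bad, "->", to_snake_case(bad))
# UserID -> user_id, CreatedAt -> created_at,
# isActive -> is_active, textVector -> text_vector
```

Applying this once at the ingestion boundary keeps filters and schema definitions consistent everywhere downstream.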
```python
from datetime import datetime
from lancedb.pydantic import LanceModel, Vector

# Put the vector column last (better for scanning non-vector columns)
class OptimalLayout(LanceModel):
    id: str
    title: str
    content: str
    category: str
    created_at: datetime
    # Vector last
    vector: Vector(384)
```
```python
# Good: equality on indexed columns
results = table.search(query_vector) \
    .where("category = 'news'") \
    .limit(10)

# Good: range queries
results = table.search(query_vector) \
    .where("price >= 10 AND price <= 100") \
    .limit(10)

# Avoid: complex expressions that can't use indexes
# results = table.search(query_vector) \
#     .where("LOWER(title) LIKE '%search%'") \
#     .limit(10)
```

In this chapter, you've learned:
- Schema Definition: Using Pydantic and PyArrow schemas
- Data Types: Scalar, collection, and nested types
- Vector Fields: Dimensions, multiple vectors, and embedding functions
- Design Patterns: Document store, product catalog, chat history
- Schema Evolution: Adding columns and handling migrations
- Best Practices: Naming, layout, and filter optimization
Key takeaways:
- Use Pydantic: Type-safe schemas with validation
- Choose Dimensions Wisely: Match your embedding model
- Multiple Vectors: Different vectors for different search needs
- Auto-Embedding: Built-in embedding functions simplify development
- Plan for Evolution: Design schemas that can grow
Now that you understand data modeling, let's explore Vector Operations in Chapter 3 for advanced similarity search techniques.
Ready for Chapter 3? Vector Operations
Generated for Awesome Code Docs
Most teams struggle here because the hard part is not writing more code but drawing clear boundaries around `Vector`, `Optional`, and `LanceModel` so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without a clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 2: Data Modeling as an operating subsystem inside LanceDB Tutorial: Serverless Vector Database for AI, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around `None`, `lancedb`, and `vector` as a checklist when adapting these patterns to your own repository.
Under the hood, Chapter 2: Data Modeling usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for `Vector`.
- Input normalization: shape incoming data so `Optional` receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through `LanceModel`.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit the logs/metrics needed for debugging and performance tuning.

When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Use the following upstream sources to verify implementation details while reading this chapter:
- Awesome Code Docs (github.com): authoritative reference for this tutorial series.

Suggested trace strategy:
- search upstream code for `Vector` and `Optional` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production