This guide provides step-by-step examples for testing all features of the ChunkingTest application.
# Start PostgreSQL with required extensions (TimescaleDB with pg_textsearch, pgvector, pgvectorscale)
docker-compose up -d
# Create and migrate the database
mix ecto.create
mix ecto.migrate
# Start Ollama with the embedding model
ollama pull nomic-embed-text
# Start IEx
iex -S mix# Basic ingestion
{:ok, doc} = ChunkingTest.ingest("/home/dipayan/Downloads/NIPS-2017-attention-is-all-you-need-Paper.pdf")
# Check the result
IO.inspect(doc.id, label: "Document ID")
IO.inspect(doc.filename, label: "Filename")
IO.inspect(length(doc.chunks), label: "Chunk count")
# View a chunk
doc.chunks |> List.first() |> IO.inspect(label: "First chunk")# Available presets: "precise", "balanced", "contextual", "fixed_512"
{:ok, doc} = ChunkingTest.ingest("/home/dipayan/Downloads/NIPS-2017-attention-is-all-you-need-Paper.pdf", chunking_config: "precise")| Preset | max_chars | overlap | strategy | Use Case |
|---|---|---|---|---|
| precise | 512 | 64 | semantic | Focused retrieval |
| balanced | 1024 | 128 | semantic | General purpose |
| contextual | 2048 | 256 | semantic | More context |
| fixed_512 | 512 | 50 | fixed | Baseline comparison |
files = ["/path/to/doc1.pdf", "/path/to/doc2.txt", "/path/to/doc3.md"]
{:ok, docs} = ChunkingTest.ingest_batch(files)
IO.puts("Ingested #{length(docs)} documents"){:ok, docs} = ChunkingTest.ingest_directory("/path/to/documents/folder"){:ok, embeddings} = ChunkingTest.generate_embeddings(doc.id)
IO.puts("Generated #{length(embeddings)} embeddings")
# Check embedding dimensions
embeddings |> List.first() |> Map.get(:vector) |> length() |> IO.inspect(label: "Dimensions"){:ok, embeddings} = ChunkingTest.generate_embeddings(doc.id, embedding_config: "ollama_nomic")| Preset | Provider | Model | Dimensions |
|---|---|---|---|
| ollama_nomic | ollama | nomic-embed-text | 768 |
| ollama_mxbai | ollama | mxbai-embed-large | 1024 |
ChunkingTest.Embeddings.provider_available?()
# => true or false{:ok, results} = ChunkingTest.search("machine learning", method: :bm25)
# Display results
Enum.each(results, fn r ->
IO.puts("Score: #{Float.round(r.score, 4)}")
IO.puts("Content: #{String.slice(r.chunk.content, 0, 100)}...")
IO.puts("---")
end){:ok, results} = ChunkingTest.search("neural networks",
method: :bm25,
top_k: 5
){:ok, results} = ChunkingTest.search("deep learning concepts", method: :vector)
Enum.each(results, fn r ->
IO.puts("Similarity: #{Float.round(r.score, 4)}")
IO.puts("Content: #{String.slice(r.chunk.content, 0, 100)}...")
IO.puts("---")
end){:ok, results} = ChunkingTest.search("transformers",
method: :vector,
top_k: 10,
min_similarity: 0.7
)# Hybrid search is the default method
{:ok, results} = ChunkingTest.search("attention mechanism")Alpha controls the weight between vector and BM25:
alpha=1.0means pure vector searchalpha=0.0means pure BM25 searchalpha=0.5means equal weight (default)
{:ok, results} = ChunkingTest.search("attention mechanism",
method: :hybrid,
alpha: 0.7, # 70% vector, 30% BM25
top_k: 10
)
Enum.each(results, fn r ->
IO.puts("Hybrid: #{Float.round(r.score, 4)} | BM25: #{r.bm25_score && Float.round(r.bm25_score, 4)} | Vector: #{r.vector_score && Float.round(r.vector_score, 4)}")
IO.puts("Content: #{String.slice(r.chunk.content, 0, 80)}...")
IO.puts("---")
end){:ok, results} = ChunkingTest.search("query",
method: :hybrid,
fusion_method: :rrf,
fusion_k: 60
){:ok, comparison} = ChunkingTest.Search.compare("information retrieval")
# View results per method
Enum.each(comparison.results, fn {method, results} ->
IO.puts("\n=== #{method} ===")
results |> Enum.take(3) |> Enum.each(fn r ->
IO.puts(" #{Float.round(r.score, 4)}: #{String.slice(r.chunk.content, 0, 60)}...")
end)
end)Re-chunk an existing document with a different chunking configuration:
{:ok, doc} = ChunkingTest.rechunk(doc.id, chunking_config: "contextual")
IO.puts("New chunk count: #{length(doc.chunks)}")First, get some chunk IDs from search results:
{:ok, results} = ChunkingTest.search("your test query", method: :hybrid, top_k: 5)
chunk_ids = Enum.map(results, & &1.chunk.id)
# Create ground truth with relevance judgments
# Format: {chunk_id, relevance_score (0.0-1.0), notes}
relevances = [
{Enum.at(chunk_ids, 0), 1.0, "highly relevant"},
{Enum.at(chunk_ids, 1), 0.8, "relevant"},
{Enum.at(chunk_ids, 2), 0.5, "somewhat relevant"},
{Enum.at(chunk_ids, 3), 0.2, "marginally relevant"},
{Enum.at(chunk_ids, 4), 0.0, "not relevant"}
]
{:ok, query} = ChunkingTest.Benchmark.create_ground_truth(
"your test query",
relevances,
description: "Test query for ML concepts",
category: "machine_learning",
difficulty: "medium"
){:ok, results} = ChunkingTest.Benchmark.run()
# View results
Enum.each(results, fn r ->
IO.puts("\n=== #{r.method} ===")
IO.puts("Queries: #{r.query_count}")
IO.puts("MRR: #{Float.round(r.mrr, 4)}")
IO.puts("MAP: #{Float.round(r.map, 4)}")
IO.puts("Avg Latency: #{Float.round(r.latency_stats.avg, 2)}ms")
IO.puts("P@5: #{Float.round(r.precision_at_k[5], 4)}")
IO.puts("R@5: #{Float.round(r.recall_at_k[5], 4)}")
IO.puts("NDCG@5: #{Float.round(r.ndcg_at_k[5], 4)}")
end)No ground truth needed:
{:ok, results} = ChunkingTest.Benchmark.run_single("test query")
Enum.each(results, fn {method, data} ->
case data do
%{latency_ms: latency, results: res} ->
IO.puts("#{method}: #{length(res)} results in #{Float.round(latency, 2)}ms")
{:error, reason} ->
IO.puts("#{method}: ERROR - #{inspect(reason)}")
end
end)# Get config IDs
balanced = ChunkingTest.Repo.get_by(ChunkingTest.Schemas.ChunkingConfig, name: "balanced")
precise = ChunkingTest.Repo.get_by(ChunkingTest.Schemas.ChunkingConfig, name: "precise")
embedding = ChunkingTest.Repo.get_by(ChunkingTest.Schemas.EmbeddingConfig, name: "ollama_nomic")
# Compare different chunking configs with same embedding
config_pairs = [
{balanced.id, embedding.id},
{precise.id, embedding.id}
]
{:ok, comparison} = ChunkingTest.Benchmark.compare_configs(config_pairs)The application logs at important points. To see debug logs:
# Enable debug logging at runtime
Logger.configure(level: :debug)
# Now run operations and watch the logs
{:ok, doc} = ChunkingTest.ingest("/path/to/file.pdf")Or configure in config/config.exs:
config :logger, level: :debugRun this after setting up the database and Ollama to verify everything works:
# 1. Ingest a test file
{:ok, doc} = ChunkingTest.ingest("/path/to/test.pdf")
IO.puts("✓ Ingested: #{doc.filename} (#{length(doc.chunks)} chunks)")
# 2. Generate embeddings
{:ok, embeddings} = ChunkingTest.generate_embeddings(doc.id)
IO.puts("✓ Generated #{length(embeddings)} embeddings")
# 3. Test all search methods
query = "your search query"
{:ok, bm25} = ChunkingTest.search(query, method: :bm25, top_k: 3)
IO.puts("✓ BM25: #{length(bm25)} results")
{:ok, vector} = ChunkingTest.search(query, method: :vector, top_k: 3)
IO.puts("✓ Vector: #{length(vector)} results")
{:ok, hybrid} = ChunkingTest.search(query, method: :hybrid, top_k: 3)
IO.puts("✓ Hybrid: #{length(hybrid)} results")
IO.puts("\n=== All tests passed! ===")ChunkingTest uses Kreuzberg for document extraction, which supports:
- PDF (.pdf)
- Microsoft Office (.docx, .xlsx, .pptx)
- OpenDocument (.odt, .ods, .odp)
- Text files (.txt, .md, .rst)
- Code files (.ex, .py, .js, .ts, etc.)
- HTML (.html, .htm)
- Images with OCR (.png, .jpg, .jpeg, .tiff)
# Check if Ollama is running
ChunkingTest.Embeddings.provider_available?()
# List available models
ChunkingTest.Embeddings.Providers.Ollama.list_models()
# Pull a model if needed
ChunkingTest.Embeddings.Providers.Ollama.pull_model("nomic-embed-text")# Check if PostgreSQL is running
docker-compose ps
# View logs
docker-compose logs db
# Reset database
mix ecto.drop
mix ecto.create
mix ecto.migrate- Verify documents are ingested:
ChunkingTest.Repo.aggregate(ChunkingTest.Schemas.Document, :count) - Verify chunks exist:
ChunkingTest.Repo.aggregate(ChunkingTest.Schemas.Chunk, :count) - For vector search, verify embeddings exist:
ChunkingTest.Repo.aggregate(ChunkingTest.Schemas.Embedding, :count)