Testing Guide for ChunkingTest

This guide provides step-by-step examples for testing all features of the ChunkingTest application.

Prerequisites Setup

# Start PostgreSQL with required extensions (TimescaleDB with pg_textsearch, pgvector, pgvectorscale)
docker-compose up -d

# Create and migrate the database
mix ecto.create
mix ecto.migrate

# Pull the embedding model (the Ollama server must already be running)
ollama pull nomic-embed-text

# Start IEx
iex -S mix

1. Document Ingestion

Single File Ingestion

# Basic ingestion
{:ok, doc} = ChunkingTest.ingest("/home/dipayan/Downloads/NIPS-2017-attention-is-all-you-need-Paper.pdf")

# Check the result
IO.inspect(doc.id, label: "Document ID")
IO.inspect(doc.filename, label: "Filename")
IO.inspect(length(doc.chunks), label: "Chunk count")

# View a chunk
doc.chunks |> List.first() |> IO.inspect(label: "First chunk")

Ingestion with Specific Chunking Config

# Available presets: "precise", "balanced", "contextual", "fixed_512"
{:ok, doc} = ChunkingTest.ingest("/home/dipayan/Downloads/NIPS-2017-attention-is-all-you-need-Paper.pdf", chunking_config: "precise")

Preset      max_chars  overlap  strategy  Use case
precise     512        64       semantic  Focused retrieval
balanced    1024       128      semantic  General purpose
contextual  2048       256      semantic  More context
fixed_512   512        50       fixed     Baseline comparison
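
To make max_chars and overlap concrete, here is a minimal sketch of a fixed-size chunker with overlap. It assumes the "fixed" strategy is a sliding character window, which this guide does not confirm. FixedChunkerSketch is a hypothetical module and not part of ChunkingTest, and the semantic presets presumably split on content boundaries rather than at fixed offsets.

```elixir
defmodule FixedChunkerSketch do
  @moduledoc "Hypothetical helper for illustration; not part of ChunkingTest."

  # Split `text` into windows of at most `max_chars` characters. Each window
  # starts `max_chars - overlap` characters after the previous one, so
  # consecutive chunks share `overlap` characters.
  def chunk(text, max_chars, overlap)
      when is_binary(text) and is_integer(max_chars) and is_integer(overlap) and
             max_chars > 0 and overlap >= 0 and overlap < max_chars do
    do_chunk(text, String.length(text), 0, max_chars, max_chars - overlap, [])
  end

  # Nothing left to read (also covers the empty string).
  defp do_chunk(_text, len, start, _max, _step, acc) when start >= len do
    Enum.reverse(acc)
  end

  defp do_chunk(text, len, start, max, step, acc) do
    piece = String.slice(text, start, max)

    if start + max >= len do
      # This window reached the end of the text, so it is the last chunk.
      Enum.reverse([piece | acc])
    else
      do_chunk(text, len, start + step, max, step, [piece | acc])
    end
  end
end

# A 1200-character text with the fixed_512 settings from the table.
text = String.duplicate("abcdefghij", 120)
chunks = FixedChunkerSketch.chunk(text, 512, 50)
IO.inspect(Enum.map(chunks, &String.length/1), label: "Chunk lengths")
# Chunk lengths: [512, 512, 276]
```

A larger overlap lowers the chance that a sentence is cut between two chunks, at the cost of storing and embedding more duplicated text.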

Batch Ingestion

files = ["/path/to/doc1.pdf", "/path/to/doc2.txt", "/path/to/doc3.md"]
{:ok, docs} = ChunkingTest.ingest_batch(files)

IO.puts("Ingested #{length(docs)} documents")

Directory Ingestion

{:ok, docs} = ChunkingTest.ingest_directory("/path/to/documents/folder")

2. Embedding Generation

Generate Embeddings for a Document

{:ok, embeddings} = ChunkingTest.generate_embeddings(doc.id)

IO.puts("Generated #{length(embeddings)} embeddings")

# Check embedding dimensions
embeddings |> List.first() |> Map.get(:vector) |> length() |> IO.inspect(label: "Dimensions")

Generate with Specific Embedding Config

{:ok, embeddings} = ChunkingTest.generate_embeddings(doc.id, embedding_config: "ollama_nomic")

Preset        Provider  Model              Dimensions
ollama_nomic  ollama    nomic-embed-text   768
ollama_mxbai  ollama    mxbai-embed-large  1024

Check Ollama Availability

ChunkingTest.Embeddings.provider_available?()
# => true or false

3. BM25 Search

Basic BM25 Search

{:ok, results} = ChunkingTest.search("machine learning", method: :bm25)

# Display results
Enum.each(results, fn r ->
  IO.puts("Score: #{Float.round(r.score, 4)}")
  IO.puts("Content: #{String.slice(r.chunk.content, 0, 100)}...")
  IO.puts("---")
end)

BM25 with Options

{:ok, results} = ChunkingTest.search("neural networks",
  method: :bm25,
  top_k: 5
)
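
For readers who want to sanity-check BM25 scores, the following is a rough sketch of the standard BM25 formula over pre-tokenized documents. It is not how ChunkingTest or pg_textsearch computes scores. BM25Sketch is a hypothetical module, and the k1 = 1.2 and b = 0.75 values are common defaults assumed here, as is the tokenization.

```elixir
defmodule BM25Sketch do
  @moduledoc "Hypothetical helper for illustration; not part of ChunkingTest."

  # Assumed parameters: common BM25 defaults, not values read from the app.
  @k1 1.2
  @b 0.75

  # query_terms and doc_terms are lists of tokens; corpus is a non-empty
  # list of token lists that contains the document being scored.
  def score(query_terms, doc_terms, corpus) do
    n = length(corpus)
    avgdl = Enum.sum(Enum.map(corpus, &length/1)) / n
    dl = length(doc_terms)
    freqs = Enum.frequencies(doc_terms)

    Enum.reduce(query_terms, 0.0, fn term, acc ->
      tf = Map.get(freqs, term, 0)
      # Number of documents in the corpus containing the term.
      df = Enum.count(corpus, fn doc -> term in doc end)
      idf = :math.log((n - df + 0.5) / (df + 0.5) + 1.0)
      acc + idf * tf * (@k1 + 1) / (tf + @k1 * (1 - @b + @b * dl / avgdl))
    end)
  end
end

corpus = [
  ["neural", "networks", "learn"],
  ["deep", "neural", "models"],
  ["cooking", "recipes"]
]

query = ["neural", "networks"]

for doc <- corpus do
  IO.puts("#{Enum.join(doc, " ")}: #{Float.round(BM25Sketch.score(query, doc, corpus), 4)}")
end
```

The document matching both query terms scores highest, and the document matching neither scores 0.0.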

4. Vector Search

Basic Vector Search

{:ok, results} = ChunkingTest.search("deep learning concepts", method: :vector)

Enum.each(results, fn r ->
  IO.puts("Similarity: #{Float.round(r.score, 4)}")
  IO.puts("Content: #{String.slice(r.chunk.content, 0, 100)}...")
  IO.puts("---")
end)

Vector Search with Minimum Similarity

{:ok, results} = ChunkingTest.search("transformers",
  method: :vector,
  top_k: 10,
  min_similarity: 0.7
)
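
The min_similarity option can be pictured as a threshold on a similarity score. The sketch below assumes cosine similarity, which this guide does not confirm as the metric ChunkingTest uses. CosineSketch and its functions are hypothetical names, and the two-dimensional vectors are toy data.

```elixir
defmodule CosineSketch do
  @moduledoc "Hypothetical helper for illustration; not part of ChunkingTest."

  # Cosine similarity between two equal-length lists of numbers.
  def similarity(a, b) when length(a) == length(b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    norm_a = :math.sqrt(Enum.reduce(a, 0.0, fn x, acc -> acc + x * x end))
    norm_b = :math.sqrt(Enum.reduce(b, 0.0, fn x, acc -> acc + x * x end))

    # A zero vector has no direction, so treat its similarity as 0.0.
    if norm_a == 0.0 or norm_b == 0.0, do: 0.0, else: dot / (norm_a * norm_b)
  end

  # Score every {id, vector} candidate, drop those below min_similarity,
  # and return the best top_k as {id, score} tuples, highest score first.
  def filter(query_vec, candidates, min_similarity, top_k) do
    candidates
    |> Enum.map(fn {id, vec} -> {id, similarity(query_vec, vec)} end)
    |> Enum.filter(fn {_id, score} -> score >= min_similarity end)
    |> Enum.sort_by(fn {_id, score} -> score end, :desc)
    |> Enum.take(top_k)
  end
end

candidates = [a: [1.0, 0.0], b: [1.0, 1.0], c: [0.0, 1.0]]

[1.0, 0.0]
|> CosineSketch.filter(candidates, 0.7, 10)
|> IO.inspect(label: "Kept")
```

Here :a (similarity 1.0) and :b (about 0.707) pass the 0.7 threshold, while :c (0.0) is dropped.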

5. Hybrid Search

Basic Hybrid Search

# Hybrid search is the default method
{:ok, results} = ChunkingTest.search("attention mechanism")

Hybrid with Custom Alpha

Alpha sets how much weight the vector score gets relative to the BM25 score:

  • alpha=1.0 means pure vector search
  • alpha=0.0 means pure BM25 search
  • alpha=0.5 means equal weight (default)

{:ok, results} = ChunkingTest.search("attention mechanism",
  method: :hybrid,
  alpha: 0.7,  # 70% vector, 30% BM25
  top_k: 10
)

Enum.each(results, fn r ->
  IO.puts("Hybrid: #{Float.round(r.score, 4)} | BM25: #{r.bm25_score && Float.round(r.bm25_score, 4)} | Vector: #{r.vector_score && Float.round(r.vector_score, 4)}")
  IO.puts("Content: #{String.slice(r.chunk.content, 0, 80)}...")
  IO.puts("---")
end)
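
One way to read the alpha weighting is as a linear blend of the two scores. The sketch below assumes both score sets are min-max normalized before blending and that a chunk missing from one result set contributes 0.0 for that side. Neither detail is stated in this guide, and AlphaFusionSketch is a hypothetical module.

```elixir
defmodule AlphaFusionSketch do
  @moduledoc "Hypothetical helper for illustration; not part of ChunkingTest."

  # Rescale a %{id => score} map into the 0.0..1.0 range so that BM25 and
  # vector scores are comparable (assumed normalization scheme).
  def normalize(scores) when map_size(scores) == 0, do: %{}

  def normalize(scores) do
    {min, max} = scores |> Map.values() |> Enum.min_max()
    range = max - min

    Map.new(scores, fn {id, s} ->
      {id, if(range == 0, do: 1.0, else: (s - min) / range)}
    end)
  end

  # fused = alpha * vector + (1 - alpha) * bm25, sorted best first.
  def fuse(vector_scores, bm25_scores, alpha) when alpha >= 0.0 and alpha <= 1.0 do
    v = normalize(vector_scores)
    b = normalize(bm25_scores)

    (Map.keys(v) ++ Map.keys(b))
    |> Enum.uniq()
    |> Enum.map(fn id ->
      {id, alpha * Map.get(v, id, 0.0) + (1.0 - alpha) * Map.get(b, id, 0.0)}
    end)
    |> Enum.sort_by(fn {_id, score} -> score end, :desc)
  end
end

vector = %{"c1" => 0.9, "c2" => 0.5, "c3" => 0.1}
bm25 = %{"c2" => 12.0, "c3" => 4.0, "c4" => 8.0}

AlphaFusionSketch.fuse(vector, bm25, 0.7)
|> Enum.each(fn {id, score} -> IO.puts("#{id}: #{Float.round(score, 4)}") end)
```

With alpha 0.7 the vector-only hit c1 ranks first; lowering alpha toward 0.0 moves the strongest BM25 hit, c2, to the top.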

Hybrid with Reciprocal Rank Fusion

{:ok, results} = ChunkingTest.search("query",
  method: :hybrid,
  fusion_method: :rrf,
  fusion_k: 60
)
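
Reciprocal Rank Fusion ignores raw scores and combines rankings instead: each chunk earns 1 / (k + rank) from every result list it appears in, with rank starting at 1. The sketch below shows that formula with k = 60. RRFSketch is a hypothetical module, and treating fusion_k as the constant k in this formula is an assumption based on the option name.

```elixir
defmodule RRFSketch do
  @moduledoc "Hypothetical helper for illustration; not part of ChunkingTest."

  # ranked_lists is a list of id lists, each ordered best first.
  # Returns {id, score} tuples sorted by fused score, highest first.
  def fuse(ranked_lists, k \\ 60) do
    ranked_lists
    |> Enum.flat_map(fn ids ->
      ids
      |> Enum.with_index(1)
      |> Enum.map(fn {id, rank} -> {id, 1.0 / (k + rank)} end)
    end)
    |> Enum.reduce(%{}, fn {id, s}, acc -> Map.update(acc, id, s, &(&1 + s)) end)
    |> Enum.sort_by(fn {_id, score} -> score end, :desc)
  end
end

bm25_ranking = ["c2", "c4", "c3"]
vector_ranking = ["c1", "c2", "c3"]

RRFSketch.fuse([bm25_ranking, vector_ranking])
|> Enum.each(fn {id, score} -> IO.puts("#{id}: #{Float.round(score, 5)}") end)
```

Chunks that appear in both rankings (c2, c3) outrank chunks that appear in only one, even when the single-list chunk was ranked first there.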

6. Compare Search Methods

{:ok, comparison} = ChunkingTest.Search.compare("information retrieval")

# View results per method
Enum.each(comparison.results, fn {method, results} ->
  IO.puts("\n=== #{method} ===")
  results |> Enum.take(3) |> Enum.each(fn r ->
    IO.puts("  #{Float.round(r.score, 4)}: #{String.slice(r.chunk.content, 0, 60)}...")
  end)
end)

7. Re-chunking Documents

Re-chunk an existing document with a different chunking configuration:

{:ok, doc} = ChunkingTest.rechunk(doc.id, chunking_config: "contextual")

IO.puts("New chunk count: #{length(doc.chunks)}")

8. Benchmarking

Create Ground Truth

First, get some chunk IDs from search results:

{:ok, results} = ChunkingTest.search("your test query", method: :hybrid, top_k: 5)
chunk_ids = Enum.map(results, & &1.chunk.id)

# Create ground truth with relevance judgments
# Format: {chunk_id, relevance_score (0.0-1.0), notes}
# Note: Enum.at/2 returns nil if the search returned fewer than 5 chunks
relevances = [
  {Enum.at(chunk_ids, 0), 1.0, "highly relevant"},
  {Enum.at(chunk_ids, 1), 0.8, "relevant"},
  {Enum.at(chunk_ids, 2), 0.5, "somewhat relevant"},
  {Enum.at(chunk_ids, 3), 0.2, "marginally relevant"},
  {Enum.at(chunk_ids, 4), 0.0, "not relevant"}
]

{:ok, query} = ChunkingTest.Benchmark.create_ground_truth(
  "your test query",
  relevances,
  description: "Test query for ML concepts",
  category: "machine_learning",
  difficulty: "medium"
)
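
If the search may return fewer chunks than there are judgments, the relevance list can be built by zipping instead of indexing, so that no nil chunk ID ends up in the ground truth. This is a small sketch; RelevanceSketch is a hypothetical helper, and the tuple format is the one described in the comment above.

```elixir
defmodule RelevanceSketch do
  @moduledoc "Hypothetical helper for illustration; not part of ChunkingTest."

  # Pair chunk ids with {score, note} judgments in order. Enum.zip/2 stops
  # at the shorter list, so surplus judgments are dropped rather than being
  # attached to a nil id. Raises if a score falls outside 0.0..1.0.
  def build(chunk_ids, judgments) do
    chunk_ids
    |> Enum.zip(judgments)
    |> Enum.map(fn {id, {score, note}} when score >= 0.0 and score <= 1.0 ->
      {id, score, note}
    end)
  end
end

judgments = [
  {1.0, "highly relevant"},
  {0.8, "relevant"},
  {0.5, "somewhat relevant"}
]

# Only two chunk ids are available here, so only two tuples are produced.
RelevanceSketch.build(["chunk-a", "chunk-b"], judgments)
|> IO.inspect(label: "Relevances")
```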

Run Full Benchmark

{:ok, results} = ChunkingTest.Benchmark.run()

# View results
Enum.each(results, fn r ->
  IO.puts("\n=== #{r.method} ===")
  IO.puts("Queries: #{r.query_count}")
  IO.puts("MRR: #{Float.round(r.mrr, 4)}")
  IO.puts("MAP: #{Float.round(r.map, 4)}")
  IO.puts("Avg Latency: #{Float.round(r.latency_stats.avg, 2)}ms")
  IO.puts("P@5: #{Float.round(r.precision_at_k[5], 4)}")
  IO.puts("R@5: #{Float.round(r.recall_at_k[5], 4)}")
  IO.puts("NDCG@5: #{Float.round(r.ndcg_at_k[5], 4)}")
end)
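
As a reference for reading these numbers, here is a sketch of the textbook per-query definitions behind P@k, R@k, reciprocal rank (MRR is its mean over all queries), and NDCG@k. It assumes a chunk counts as relevant when its graded relevance is above 0 and that NDCG uses the graded scores as gains. ChunkingTest.Benchmark may define these differently, and MetricsSketch is a hypothetical module.

```elixir
defmodule MetricsSketch do
  @moduledoc "Hypothetical helper for illustration; not part of ChunkingTest."

  # ranked: chunk ids in result order. rel: %{chunk_id => graded relevance}.

  # Fraction of the top k results that are relevant.
  def precision_at_k(ranked, rel, k) do
    hits = ranked |> Enum.take(k) |> Enum.count(&relevant?(rel, &1))
    hits / k
  end

  # Fraction of all relevant chunks that appear in the top k results.
  def recall_at_k(ranked, rel, k) do
    total = Enum.count(rel, fn {_id, grade} -> grade > 0 end)
    hits = ranked |> Enum.take(k) |> Enum.count(&relevant?(rel, &1))
    if total == 0, do: 0.0, else: hits / total
  end

  # 1 / rank of the first relevant result, or 0.0 if none is relevant.
  def reciprocal_rank(ranked, rel) do
    case Enum.find_index(ranked, &relevant?(rel, &1)) do
      nil -> 0.0
      idx -> 1.0 / (idx + 1)
    end
  end

  # Discounted cumulative gain of the top k, divided by the ideal DCG.
  def ndcg_at_k(ranked, rel, k) do
    gains = ranked |> Enum.take(k) |> Enum.map(&Map.get(rel, &1, 0.0))
    ideal = rel |> Map.values() |> Enum.sort(:desc) |> Enum.take(k)
    idcg = dcg(ideal)
    if idcg == 0.0, do: 0.0, else: dcg(gains) / idcg
  end

  defp dcg(gains) do
    gains
    |> Enum.with_index(1)
    |> Enum.reduce(0.0, fn {gain, i}, acc -> acc + gain / :math.log2(i + 1) end)
  end

  defp relevant?(rel, id), do: Map.get(rel, id, 0.0) > 0
end

rel = %{"a" => 1.0, "b" => 0.5, "c" => 0.0}
ranked = ["c", "a", "x", "b"]

IO.inspect(MetricsSketch.precision_at_k(ranked, rel, 2), label: "P@2")
IO.inspect(MetricsSketch.recall_at_k(ranked, rel, 2), label: "R@2")
IO.inspect(MetricsSketch.reciprocal_rank(ranked, rel), label: "RR")
IO.inspect(Float.round(MetricsSketch.ndcg_at_k(ranked, rel, 4), 4), label: "NDCG@4")
```

In this example the first relevant chunk sits at rank 2, so P@2, R@2, and the reciprocal rank are all 0.5.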

Quick Single-Query Benchmark

No ground truth needed:

{:ok, results} = ChunkingTest.Benchmark.run_single("test query")

Enum.each(results, fn {method, data} ->
  case data do
    %{latency_ms: latency, results: res} ->
      IO.puts("#{method}: #{length(res)} results in #{Float.round(latency, 2)}ms")
    {:error, reason} ->
      IO.puts("#{method}: ERROR - #{inspect(reason)}")
  end
end)

Compare Configurations

# Get config IDs
balanced = ChunkingTest.Repo.get_by(ChunkingTest.Schemas.ChunkingConfig, name: "balanced")
precise = ChunkingTest.Repo.get_by(ChunkingTest.Schemas.ChunkingConfig, name: "precise")
embedding = ChunkingTest.Repo.get_by(ChunkingTest.Schemas.EmbeddingConfig, name: "ollama_nomic")

# Compare different chunking configs with same embedding
config_pairs = [
  {balanced.id, embedding.id},
  {precise.id, embedding.id}
]

{:ok, comparison} = ChunkingTest.Benchmark.compare_configs(config_pairs)

9. Viewing Logs

The application logs key steps as it runs. To see debug-level output:

# Enable debug logging at runtime
Logger.configure(level: :debug)

# Now run operations and watch the logs
{:ok, doc} = ChunkingTest.ingest("/path/to/file.pdf")

Or configure in config/config.exs:

config :logger, level: :debug

10. Quick Full Test Script

Run this after setting up the database and Ollama to verify everything works:

# 1. Ingest a test file
{:ok, doc} = ChunkingTest.ingest("/path/to/test.pdf")
IO.puts("✓ Ingested: #{doc.filename} (#{length(doc.chunks)} chunks)")

# 2. Generate embeddings
{:ok, embeddings} = ChunkingTest.generate_embeddings(doc.id)
IO.puts("✓ Generated #{length(embeddings)} embeddings")

# 3. Test all search methods
query = "your search query"

{:ok, bm25} = ChunkingTest.search(query, method: :bm25, top_k: 3)
IO.puts("✓ BM25: #{length(bm25)} results")

{:ok, vector} = ChunkingTest.search(query, method: :vector, top_k: 3)
IO.puts("✓ Vector: #{length(vector)} results")

{:ok, hybrid} = ChunkingTest.search(query, method: :hybrid, top_k: 3)
IO.puts("✓ Hybrid: #{length(hybrid)} results")

IO.puts("\n=== All tests passed! ===")

Supported File Types

ChunkingTest uses Kreuzberg for document extraction, which supports:

  • PDF (.pdf)
  • Microsoft Office (.docx, .xlsx, .pptx)
  • OpenDocument (.odt, .ods, .odp)
  • Text files (.txt, .md, .rst)
  • Code files (.ex, .py, .js, .ts, etc.)
  • HTML (.html, .htm)
  • Images with OCR (.png, .jpg, .jpeg, .tiff)

Troubleshooting

Ollama Not Available

# Check if Ollama is running
ChunkingTest.Embeddings.provider_available?()

# List available models
ChunkingTest.Embeddings.Providers.Ollama.list_models()

# Pull a model if needed
ChunkingTest.Embeddings.Providers.Ollama.pull_model("nomic-embed-text")

Database Connection Issues

# Check if PostgreSQL is running
docker-compose ps

# View logs
docker-compose logs db

# Reset database
mix ecto.drop
mix ecto.create
mix ecto.migrate

No Search Results

  1. Verify documents are ingested: ChunkingTest.Repo.aggregate(ChunkingTest.Schemas.Document, :count)
  2. Verify chunks exist: ChunkingTest.Repo.aggregate(ChunkingTest.Schemas.Chunk, :count)
  3. For vector search, verify embeddings exist: ChunkingTest.Repo.aggregate(ChunkingTest.Schemas.Embedding, :count)