Testing Guide for ChunkingTest

This guide provides step-by-step examples for testing all features of the ChunkingTest application.

Prerequisites Setup

# Start PostgreSQL with required extensions (TimescaleDB with pg_textsearch, pgvector, pgvectorscale)
docker-compose up -d

# Create and migrate the database
mix ecto.create
mix ecto.migrate

# Pull the embedding model (the Ollama server must already be running)
ollama pull nomic-embed-text

# Start IEx
iex -S mix

1. Document Ingestion

Single File Ingestion

# Basic ingestion
{:ok, doc} = ChunkingTest.ingest("/path/to/NIPS-2017-attention-is-all-you-need-Paper.pdf")

# Check the result
IO.inspect(doc.id, label: "Document ID")
IO.inspect(doc.filename, label: "Filename")
IO.inspect(length(doc.chunks), label: "Chunk count")

# View a chunk
doc.chunks |> List.first() |> IO.inspect(label: "First chunk")

Ingestion with Specific Chunking Config

# Available presets: "precise", "balanced", "contextual", "fixed_512"
{:ok, doc} = ChunkingTest.ingest("/path/to/NIPS-2017-attention-is-all-you-need-Paper.pdf", chunking_config: "precise")

| Preset | max_chars | overlap | strategy | Use Case |
|---|---|---|---|---|
| precise | 512 | 64 | semantic | Focused retrieval |
| balanced | 1024 | 128 | semantic | General purpose |
| contextual | 2048 | 256 | semantic | More context |
| fixed_512 | 512 | 50 | fixed | Baseline comparison |
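
Each preset pairs a maximum chunk size with an overlap. The sketch below shows one way a fixed strategy could turn those two numbers into chunks. It is an illustration only: `OverlapSketch` is a hypothetical module, not part of ChunkingTest, and it assumes sizes are counted in characters (graphemes). The application's actual splitter may behave differently.

```elixir
defmodule OverlapSketch do
  # Hypothetical helper, not ChunkingTest's implementation.
  # Splits text into windows of at most max_chars graphemes. Each window
  # starts (max_chars - overlap) graphemes after the previous one, so
  # consecutive windows share `overlap` graphemes.
  def chunk(text, max_chars, overlap)
      when is_binary(text) and overlap >= 0 and max_chars > overlap do
    do_chunk(String.graphemes(text), max_chars, max_chars - overlap, [])
  end

  # Empty input produces no chunks.
  defp do_chunk([], _max, _step, acc), do: Enum.reverse(acc)

  defp do_chunk(graphemes, max, step, acc) do
    window = graphemes |> Enum.take(max) |> Enum.join()
    acc = [window | acc]

    if length(graphemes) <= max do
      # This window reached the end of the text.
      Enum.reverse(acc)
    else
      do_chunk(Enum.drop(graphemes, step), max, step, acc)
    end
  end
end

OverlapSketch.chunk("abcdefghij", 4, 1)
# => ["abcd", "defg", "ghij"]
```

With the `fixed_512` values (512 and 50), each window would start 462 characters after the previous one under these assumptions.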

Batch Ingestion

files = ["/path/to/doc1.pdf", "/path/to/doc2.txt", "/path/to/doc3.md"]
{:ok, docs} = ChunkingTest.ingest_batch(files)

IO.puts("Ingested #{length(docs)} documents")

Directory Ingestion

{:ok, docs} = ChunkingTest.ingest_directory("/path/to/documents/folder")

2. Embedding Generation

Generate Embeddings for a Document

{:ok, embeddings} = ChunkingTest.generate_embeddings(doc.id)

IO.puts("Generated #{length(embeddings)} embeddings")

# Check embedding dimensions
embeddings |> List.first() |> Map.get(:vector) |> length() |> IO.inspect(label: "Dimensions")

Generate with Specific Embedding Config

{:ok, embeddings} = ChunkingTest.generate_embeddings(doc.id, embedding_config: "ollama_nomic")

| Preset | Provider | Model | Dimensions |
|---|---|---|---|
| ollama_nomic | ollama | nomic-embed-text | 768 |
| ollama_mxbai | ollama | mxbai-embed-large | 1024 |

Check Ollama Availability

ChunkingTest.Embeddings.provider_available?()
# => true or false

3. BM25 Search

Basic BM25 Search

{:ok, results} = ChunkingTest.search("machine learning", method: :bm25)

# Display results
Enum.each(results, fn r ->
  IO.puts("Score: #{Float.round(r.score, 4)}")
  IO.puts("Content: #{String.slice(r.chunk.content, 0, 100)}...")
  IO.puts("---")
end)

BM25 with Options

{:ok, results} = ChunkingTest.search("neural networks",
  method: :bm25,
  top_k: 5
)

4. Vector Search

Basic Vector Search

{:ok, results} = ChunkingTest.search("deep learning concepts", method: :vector)

Enum.each(results, fn r ->
  IO.puts("Similarity: #{Float.round(r.score, 4)}")
  IO.puts("Content: #{String.slice(r.chunk.content, 0, 100)}...")
  IO.puts("---")
end)

Vector Search with Minimum Similarity

{:ok, results} = ChunkingTest.search("transformers",
  method: :vector,
  top_k: 10,
  min_similarity: 0.7
)
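
To make the `min_similarity` and `top_k` options concrete, here is a minimal sketch of threshold filtering. It assumes the score is cosine similarity, which this guide does not confirm. `SimilaritySketch` and its functions are hypothetical names, and the real search most likely runs inside PostgreSQL rather than in Elixir.

```elixir
defmodule SimilaritySketch do
  # Hypothetical helper, not ChunkingTest's implementation.
  # Cosine similarity between two equal-length lists of numbers.
  # Raises ArithmeticError if either vector is all zeros.
  def cosine(a, b) when length(a) == length(b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    dot / (norm(a) * norm(b))
  end

  defp norm(v) do
    v |> Enum.reduce(0.0, fn x, acc -> acc + x * x end) |> :math.sqrt()
  end

  # Scores every {id, vector} candidate against the query vector, keeps those
  # at or above min_similarity, and returns at most top_k, best first.
  def filter(query_vec, candidates, min_similarity, top_k) do
    candidates
    |> Enum.map(fn {id, vec} -> {id, cosine(query_vec, vec)} end)
    |> Enum.filter(fn {_id, score} -> score >= min_similarity end)
    |> Enum.sort_by(fn {_id, score} -> score end, :desc)
    |> Enum.take(top_k)
  end
end

candidates = [a: [1.0, 0.0], b: [0.6, 0.8], c: [0.0, 1.0]]
SimilaritySketch.filter([1.0, 0.0], candidates, 0.5, 10)
# Keeps :a and :b; :c is orthogonal to the query and falls below the threshold.
```

A threshold can return fewer than `top_k` results, or none at all, which is worth keeping in mind when a vector search comes back empty.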

5. Hybrid Search

Basic Hybrid Search

# Hybrid search is the default method
{:ok, results} = ChunkingTest.search("attention mechanism")

Hybrid with Custom Alpha

Alpha sets the balance between the vector score and the BM25 score; higher values favor vector search:

alpha=1.0 means pure vector search
alpha=0.0 means pure BM25 search
alpha=0.5 means equal weight (default)

{:ok, results} = ChunkingTest.search("attention mechanism",
  method: :hybrid,
  alpha: 0.7,  # 70% vector, 30% BM25
  top_k: 10
)

Enum.each(results, fn r ->
  IO.puts("Hybrid: #{Float.round(r.score, 4)} | BM25: #{r.bm25_score && Float.round(r.bm25_score, 4)} | Vector: #{r.vector_score && Float.round(r.vector_score, 4)}")
  IO.puts("Content: #{String.slice(r.chunk.content, 0, 80)}...")
  IO.puts("---")
end)
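
The alpha weighting described above can be sketched as a linear combination of the two scores. This assumes both scores are already normalized to the range 0.0 to 1.0 and that a chunk missing from one result list contributes 0.0 for that side. Both are assumptions of this example, and `AlphaFusionSketch` is a hypothetical module rather than the application's code.

```elixir
defmodule AlphaFusionSketch do
  # Hypothetical helper, not ChunkingTest's implementation.
  # Takes two maps of chunk_id => score (assumed normalized to 0.0..1.0) and
  # returns {chunk_id, hybrid_score} tuples sorted best first, where
  # hybrid_score = alpha * vector_score + (1 - alpha) * bm25_score.
  def fuse(vector_scores, bm25_scores, alpha) when alpha >= 0.0 and alpha <= 1.0 do
    ids = Enum.uniq(Map.keys(vector_scores) ++ Map.keys(bm25_scores))

    ids
    |> Enum.map(fn id ->
      v = Map.get(vector_scores, id, 0.0)
      b = Map.get(bm25_scores, id, 0.0)
      {id, alpha * v + (1.0 - alpha) * b}
    end)
    |> Enum.sort_by(fn {_id, score} -> score end, :desc)
  end
end

vector = %{"c1" => 0.9, "c2" => 0.4}
bm25 = %{"c2" => 1.0, "c3" => 0.5}

# alpha = 0.7 ranks c1 first; alpha = 0.0 (pure BM25) ranks c2 first.
AlphaFusionSketch.fuse(vector, bm25, 0.7)
AlphaFusionSketch.fuse(vector, bm25, 0.0)
```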

Hybrid with Reciprocal Rank Fusion

{:ok, results} = ChunkingTest.search("query",
  method: :hybrid,
  fusion_method: :rrf,
  fusion_k: 60
)
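
Reciprocal Rank Fusion combines result lists by rank position instead of raw score: each list contributes 1 / (k + rank) for every item it contains, with rank starting at 1. The sketch below implements that standard formula and assumes `fusion_k` plays the role of k. `RRFSketch` is a hypothetical module, and the application may differ in details such as tie handling.

```elixir
defmodule RRFSketch do
  # Hypothetical helper, not ChunkingTest's implementation.
  # rankings: a list of ranked id lists, each ordered best first.
  # Returns {id, rrf_score} tuples sorted best first.
  def fuse(rankings, k \\ 60) do
    rankings
    |> Enum.flat_map(fn ranking ->
      ranking
      |> Enum.with_index(1)
      |> Enum.map(fn {id, rank} -> {id, 1.0 / (k + rank)} end)
    end)
    # Sum the contributions of each id across all rankings.
    |> Enum.reduce(%{}, fn {id, score}, acc ->
      Map.update(acc, id, score, &(&1 + score))
    end)
    |> Enum.sort_by(fn {_id, score} -> score end, :desc)
  end
end

bm25_ranking = ["c2", "c1", "c3"]
vector_ranking = ["c1", "c4", "c2"]

RRFSketch.fuse([bm25_ranking, vector_ranking])
# Order of ids: c1, c2, c4, c3
```

Because only ranks matter, this fusion needs no score normalization, which is a common reason to prefer it when BM25 and vector scores sit on different scales.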

6. Compare Search Methods

{:ok, comparison} = ChunkingTest.Search.compare("information retrieval")

# View results per method
Enum.each(comparison.results, fn {method, results} ->
  IO.puts("\n=== #{method} ===")
  results |> Enum.take(3) |> Enum.each(fn r ->
    IO.puts("  #{Float.round(r.score, 4)}: #{String.slice(r.chunk.content, 0, 60)}...")
  end)
end)

7. Re-chunking Documents

Re-chunk an existing document with a different chunking configuration:

{:ok, doc} = ChunkingTest.rechunk(doc.id, chunking_config: "contextual")

IO.puts("New chunk count: #{length(doc.chunks)}")

8. Benchmarking

Create Ground Truth

First, get some chunk IDs from search results:

{:ok, results} = ChunkingTest.search("your test query", method: :hybrid, top_k: 5)
chunk_ids = Enum.map(results, & &1.chunk.id)

# Create ground truth with relevance judgments
# Format: {chunk_id, relevance_score (0.0-1.0), notes}
relevances = [
  {Enum.at(chunk_ids, 0), 1.0, "highly relevant"},
  {Enum.at(chunk_ids, 1), 0.8, "relevant"},
  {Enum.at(chunk_ids, 2), 0.5, "somewhat relevant"},
  {Enum.at(chunk_ids, 3), 0.2, "marginally relevant"},
  {Enum.at(chunk_ids, 4), 0.0, "not relevant"}
]

{:ok, query} = ChunkingTest.Benchmark.create_ground_truth(
  "your test query",
  relevances,
  description: "Test query for ML concepts",
  category: "machine_learning",
  difficulty: "medium"
)

Run Full Benchmark

{:ok, results} = ChunkingTest.Benchmark.run()

# View results
Enum.each(results, fn r ->
  IO.puts("\n=== #{r.method} ===")
  IO.puts("Queries: #{r.query_count}")
  IO.puts("MRR: #{Float.round(r.mrr, 4)}")
  IO.puts("MAP: #{Float.round(r.map, 4)}")
  IO.puts("Avg Latency: #{Float.round(r.latency_stats.avg, 2)}ms")
  IO.puts("P@5: #{Float.round(r.precision_at_k[5], 4)}")
  IO.puts("R@5: #{Float.round(r.recall_at_k[5], 4)}")
  IO.puts("NDCG@5: #{Float.round(r.ndcg_at_k[5], 4)}")
end)
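
For readers unfamiliar with the reported metrics, the sketch below computes precision@k, recall@k, reciprocal rank (MRR is its mean over all queries), and NDCG@k for a single query. It uses the textbook definitions and assumes that any relevance above 0.0 counts as relevant for the binary metrics. `MetricsSketch` is a hypothetical module, so the benchmark's own thresholds and formulas may differ.

```elixir
defmodule MetricsSketch do
  # Hypothetical helper, not ChunkingTest's implementation.
  # ranked: chunk ids in the order a search returned them.
  # relevance: map of chunk_id => graded relevance (0.0..1.0);
  # ids missing from the map count as 0.0.

  def precision_at_k(ranked, relevance, k) do
    hits(ranked, relevance, k) / k
  end

  def recall_at_k(ranked, relevance, k) do
    total = Enum.count(relevance, fn {_id, rel} -> rel > 0 end)
    if total == 0, do: 0.0, else: hits(ranked, relevance, k) / total
  end

  # 1 / position of the first relevant result, or 0.0 if there is none.
  def reciprocal_rank(ranked, relevance) do
    case Enum.find_index(ranked, &relevant?(relevance, &1)) do
      nil -> 0.0
      index -> 1.0 / (index + 1)
    end
  end

  # DCG of the returned order divided by DCG of the ideal order.
  def ndcg_at_k(ranked, relevance, k) do
    gains = ranked |> Enum.take(k) |> Enum.map(&Map.get(relevance, &1, 0.0))
    ideal = relevance |> Map.values() |> Enum.sort(:desc) |> Enum.take(k)
    idcg = dcg(ideal)
    if idcg == 0.0, do: 0.0, else: dcg(gains) / idcg
  end

  defp dcg(gains) do
    gains
    |> Enum.with_index(1)
    |> Enum.reduce(0.0, fn {gain, rank}, acc -> acc + gain / :math.log2(rank + 1) end)
  end

  defp hits(ranked, relevance, k) do
    ranked |> Enum.take(k) |> Enum.count(&relevant?(relevance, &1))
  end

  defp relevant?(relevance, id), do: Map.get(relevance, id, 0.0) > 0
end

relevance = %{"c1" => 1.0, "c2" => 0.5, "c3" => 0.0}
ranked = ["c3", "c1", "c9", "c2"]

MetricsSketch.precision_at_k(ranked, relevance, 2)  # => 0.5
MetricsSketch.recall_at_k(ranked, relevance, 2)     # => 0.5
MetricsSketch.reciprocal_rank(ranked, relevance)    # => 0.5
MetricsSketch.ndcg_at_k(ranked, relevance, 4)
```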

Quick Single-Query Benchmark

This runs one query through each search method and requires no ground truth:

{:ok, results} = ChunkingTest.Benchmark.run_single("test query")

Enum.each(results, fn {method, data} ->
  case data do
    %{latency_ms: latency, results: res} ->
      IO.puts("#{method}: #{length(res)} results in #{Float.round(latency, 2)}ms")
    {:error, reason} ->
      IO.puts("#{method}: ERROR - #{inspect(reason)}")
  end
end)

Compare Configurations

# Get config IDs
balanced = ChunkingTest.Repo.get_by(ChunkingTest.Schemas.ChunkingConfig, name: "balanced")
precise = ChunkingTest.Repo.get_by(ChunkingTest.Schemas.ChunkingConfig, name: "precise")
embedding = ChunkingTest.Repo.get_by(ChunkingTest.Schemas.EmbeddingConfig, name: "ollama_nomic")

# Compare different chunking configs with same embedding
config_pairs = [
  {balanced.id, embedding.id},
  {precise.id, embedding.id}
]

{:ok, comparison} = ChunkingTest.Benchmark.compare_configs(config_pairs)

9. Viewing Logs

The application logs at key points during each operation. To see debug-level output:

# Enable debug logging at runtime
Logger.configure(level: :debug)

# Now run operations and watch the logs
{:ok, doc} = ChunkingTest.ingest("/path/to/file.pdf")

Alternatively, set the level in config/config.exs:

config :logger, level: :debug

10. Quick Full Test Script

Run this after setting up the database and Ollama to verify everything works:

# 1. Ingest a test file
{:ok, doc} = ChunkingTest.ingest("/path/to/test.pdf")
IO.puts("✓ Ingested: #{doc.filename} (#{length(doc.chunks)} chunks)")

# 2. Generate embeddings
{:ok, embeddings} = ChunkingTest.generate_embeddings(doc.id)
IO.puts("✓ Generated #{length(embeddings)} embeddings")

# 3. Test all search methods
query = "your search query"

{:ok, bm25} = ChunkingTest.search(query, method: :bm25, top_k: 3)
IO.puts("✓ BM25: #{length(bm25)} results")

{:ok, vector} = ChunkingTest.search(query, method: :vector, top_k: 3)
IO.puts("✓ Vector: #{length(vector)} results")

{:ok, hybrid} = ChunkingTest.search(query, method: :hybrid, top_k: 3)
IO.puts("✓ Hybrid: #{length(hybrid)} results")

IO.puts("\n=== All tests passed! ===")

Supported File Types

ChunkingTest uses Kreuzberg for document extraction, which supports:

PDF (.pdf)
Microsoft Office (.docx, .xlsx, .pptx)
OpenDocument (.odt, .ods, .odp)
Text files (.txt, .md, .rst)
Code files (.ex, .py, .js, .ts, etc.)
HTML (.html, .htm)
Images with OCR (.png, .jpg, .jpeg, .tiff)

Troubleshooting

Ollama Not Available

# Check if Ollama is running
ChunkingTest.Embeddings.provider_available?()

# List available models
ChunkingTest.Embeddings.Providers.Ollama.list_models()

# Pull a model if needed
ChunkingTest.Embeddings.Providers.Ollama.pull_model("nomic-embed-text")

Database Connection Issues

# Check if PostgreSQL is running
docker-compose ps

# View logs
docker-compose logs db

# Reset the database (this deletes all ingested data)
mix ecto.drop
mix ecto.create
mix ecto.migrate

No Search Results

Verify documents are ingested: ChunkingTest.Repo.aggregate(ChunkingTest.Schemas.Document, :count)
Verify chunks exist: ChunkingTest.Repo.aggregate(ChunkingTest.Schemas.Chunk, :count)
For vector search, verify embeddings exist: ChunkingTest.Repo.aggregate(ChunkingTest.Schemas.Embedding, :count)
