
tars Search Engine Specification

Vision

Transform tars from a simple link manager into a personal search engine powered by PostgreSQL full-text search (pg_textsearch) on TigerData.com's cloud infrastructure.

Project Status

✅ Phase 1: MVP - Database + Search (COMPLETE)

  • PostgreSQL driver (psycopg[binary])
  • tars db init - creates schema with BM25 search index
  • tars db migrate - imports CSV links to database
  • tars db status - shows connection info and stats
  • tars search <query> - BM25 full-text search
  • tars add/list/remove updated to use database when configured

✅ Phase 2: Crawling Pipeline (COMPLETE)

  • Playwright for headless browser crawling
  • Content extraction (title, meta description, body text)
  • tars crawl <url> - crawl specific URL
  • tars crawl --all - re-crawl all links
  • tars crawl --missing - crawl never-crawled links (default)
  • tars crawl --old N - crawl stale links (N days old)
  • Auto-update search index on crawl (via generated column)
  • HTTP status and crawl error tracking

🔲 Phase 3: Enhanced Metadata (NOT STARTED)

  • --tags flag on add and update commands
  • --notes flag on add and update commands
  • --tag filter on search command
  • Tag management: tars tags list, tars tags rename, etc.

🔲 Phase 4: Offline/Sync (NOT STARTED)

  • Extended CSV with sync_status column
  • tars sync command (bidirectional)
  • tars sync --push / --pull (one-way)
  • Conflict resolution (remote wins by default)
  • Graceful offline fallback

🔲 Phase 5: Multi-user (FUTURE)

  • User/auth model
  • Tenant isolation
  • API layer

Current State (Post Phase 1+2)

  • CLI tool with database-backed storage (TigerData PostgreSQL)
  • Commands: add, list, remove, update, clean-list, search, crawl, db
  • Data: URL, title, description, content, timestamps stored in PostgreSQL with BM25 index
  • Local CSV still works as fallback when DATABASE_URL not set

Target Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│   tars CLI      │────▶│  TigerData Cloud │────▶│  pg_textsearch      │
│                 │     │  (PostgreSQL)    │     │  (full-text index)  │
└─────────────────┘     └──────────────────┘     └─────────────────────┘
        │
        ▼
┌─────────────────┐
│  links.csv      │  (local offline/pending queue)
│  (local cache)  │
└─────────────────┘

Data Model

links table

Column        Type         Description                              Status
id            UUID         Primary key
url           TEXT         Unique URL
title         TEXT         Page title (crawled)
description   TEXT         Meta description (crawled)
content       TEXT         Extracted page text (crawled)
notes         TEXT         User-provided notes/description          schema ✅, CLI 🔲
tags          TEXT[]       User-assigned tags for filtering         schema ✅, CLI 🔲
added_at      TIMESTAMPTZ  When link was added
updated_at    TIMESTAMPTZ  Last metadata update
crawled_at    TIMESTAMPTZ  Last successful crawl (NULL if never)
http_status   INTEGER      HTTP response code from crawl
crawl_error   TEXT         Error message if crawl failed
search_text   TEXT         Generated column for BM25 search
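
A minimal sketch of the DDL that tars db init might run, assuming the column set above; the defaults and constraints are illustrative assumptions, and the generated search_text column plus BM25 index are added separately (see the next subsection):

import psycopg

LINKS_DDL = """
CREATE TABLE IF NOT EXISTS links (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),  -- built-in since PG 13
    url         TEXT UNIQUE NOT NULL,
    title       TEXT,
    description TEXT,
    content     TEXT,
    notes       TEXT,
    tags        TEXT[],
    added_at    TIMESTAMPTZ DEFAULT now(),
    updated_at  TIMESTAMPTZ DEFAULT now(),
    crawled_at  TIMESTAMPTZ,        -- NULL until first successful crawl
    http_status INTEGER,
    crawl_error TEXT
);
"""

def db_init(dsn: str) -> None:
    # Commits automatically when the connection context exits cleanly
    with psycopg.connect(dsn) as conn:
        # Extension name taken from this spec; TigerData ships it preinstalled
        conn.execute("CREATE EXTENSION IF NOT EXISTS pg_textsearch")
        conn.execute(LINKS_DDL)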

search_text composition (BM25)

-- Generated column combining all searchable text; URL punctuation is
-- replaced with spaces so URL components tokenize as separate words
ALTER TABLE links ADD COLUMN search_text TEXT GENERATED ALWAYS AS (
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
        COALESCE(url, '') || ' ' ||
        COALESCE(title, '') || ' ' ||
        COALESCE(description, '') || ' ' ||
        COALESCE(content, '') || ' ' ||
        COALESCE(notes, ''),
    '.', ' '), '/', ' '), '-', ' '), '_', ' '), ':', ' ')
) STORED;

-- BM25 index for full-text search
CREATE INDEX links_search_bm25_idx ON links USING bm25(search_text);

Note: Uses pg_textsearch BM25 instead of native TSVECTOR for better ranking.

CLI Commands

Implemented ✅

# Core CRUD
tars add <url>                         # Add link (local + remote)
tars list                              # List all stored links
tars remove <id|url>                   # Remove by index or URL
tars update <url>                      # Update timestamp (CSV only)
tars clean-list                        # Remove duplicates (CSV only)

# Search
tars search <query>                    # BM25 full-text search
tars search <query> -n 20              # Limit results

# Crawling
tars crawl <url>                       # Crawl specific URL
tars crawl --all                       # Crawl all links
tars crawl --missing                   # Crawl never-crawled (default)
tars crawl --old N                     # Crawl stale links (N days)

# Database
tars db init                           # Initialize schema
tars db migrate                        # Import CSV to database
tars db status                         # Show connection info
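
The --all/--missing/--old selection logic reduces to simple predicates on crawled_at; a minimal sketch (the helper name is hypothetical):

import psycopg

def crawl_targets(conn: psycopg.Connection, mode: str = "missing",
                  days: int | None = None) -> list[tuple]:
    """Pick which links a `tars crawl` invocation should visit."""
    if mode == "all":
        cur = conn.execute("SELECT id, url FROM links")
    elif mode == "missing":
        # Never crawled: crawled_at is still NULL
        cur = conn.execute("SELECT id, url FROM links WHERE crawled_at IS NULL")
    else:  # mode == "old": last crawled more than `days` days ago
        cur = conn.execute(
            "SELECT id, url FROM links "
            "WHERE crawled_at < now() - make_interval(days => %s)",
            (days,),
        )
    return cur.fetchall()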

Not Yet Implemented 🔲

# Phase 3: Metadata
tars add <url> --tags tag1,tag2        # Add with tags
tars add <url> --notes "..."           # Add with notes
tars search <query> --tag <tag>        # Search with tag filter

# Phase 4: Sync
tars sync                              # Bidirectional sync
tars sync --push                       # Push local to remote
tars sync --pull                       # Pull remote to local

Configuration

Environment Variables

DATABASE_URL=postgres://user:pass@host.tigerdata.com:5432/dbname
# Or individual variables:
PGHOST=host.tigerdata.com
PGPORT=5432
PGDATABASE=dbname
PGUSER=user
PGPASSWORD=pass
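
psycopg resolves the individual PG* variables through libpq on its own, so the configuration lookup can stay small; a sketch (helper name hypothetical):

import os
import psycopg

def get_connection() -> psycopg.Connection:
    """Prefer DATABASE_URL; otherwise let libpq read PGHOST, PGPORT, etc."""
    dsn = os.environ.get("DATABASE_URL")
    if dsn:
        return psycopg.connect(dsn)
    # With no arguments psycopg falls back to libpq defaults,
    # which include the PG* environment variables above.
    return psycopg.connect()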

Local CSV (offline queue)

The links.csv file serves as:

  1. Offline storage when database is unavailable
  2. Pending queue for changes not yet synced
  3. Backup/export format

Extended CSV columns:

url,title,notes,tags,added_at,updated_at,sync_status

Where sync_status is: synced, pending_add, pending_update, pending_delete
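
Reading the pending queue out of that file is then a filter on sync_status; a sketch with a hypothetical helper:

import csv
from pathlib import Path

def pending_changes(csv_path: Path) -> list[dict]:
    """Rows not yet reflected in the database (anything but 'synced')."""
    with csv_path.open(newline="") as f:
        return [row for row in csv.DictReader(f)
                if row.get("sync_status", "synced") != "synced"]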

Implementation Phases

Phase 1: MVP - Database + Search (Priority)

  1. Add psycopg or asyncpg dependency
  2. Implement tars db init - create schema with pg_textsearch
  3. Implement tars db migrate - migrate existing CSV links to database (sketched below)
  4. Implement tars search <query> - basic full-text search
  5. Update tars add/list/remove to use database

Deliverable: Working search on existing links via TigerData
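
For step 3, the migration can be an idempotent per-row upsert so re-runs are safe; a minimal sketch assuming the schema above (helper name hypothetical):

import csv
from pathlib import Path

import psycopg

def migrate_csv(conn: psycopg.Connection, csv_path: Path) -> int:
    """Import CSV links, skipping URLs that already exist."""
    inserted = 0
    with csv_path.open(newline="") as f:
        for row in csv.DictReader(f):
            cur = conn.execute(
                "INSERT INTO links (url, added_at) "
                "VALUES (%s, COALESCE(%s::timestamptz, now())) "
                "ON CONFLICT (url) DO NOTHING",
                (row["url"], row.get("added_at") or None),
            )
            inserted += cur.rowcount
    conn.commit()
    return inserted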

Phase 2: Crawling Pipeline

  1. Add a headless browser dependency (Playwright, in place of the originally planned httpx + HTML parser such as beautifulsoup4 or selectolax)
  2. Implement content extraction (title, meta description, body text), as sketched below
  3. Implement tars crawl commands
  4. Auto-update search_text on crawl (via the generated column)

Deliverable: Links are crawlable and searchable by content
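
A sketch of the extraction step using Playwright's sync API (the selector strategy and function shape are assumptions):

from playwright.sync_api import sync_playwright

def crawl_page(url: str) -> dict:
    """Fetch a page headlessly; return fields matching the links columns."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        response = page.goto(url, wait_until="domcontentloaded")
        meta = page.locator('meta[name="description"]').first
        result = {
            "title": page.title(),
            "description": (meta.get_attribute("content")
                            if meta.count() else None),
            "content": page.inner_text("body"),
            "http_status": response.status if response else None,
        }
        browser.close()
        return result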

Phase 3: Enhanced Metadata

  1. Add tags and notes support to CLI
  2. Update search to filter by tags (see the sketch below)
  3. Implement tag management commands (tars tags list, etc.)

Deliverable: Rich metadata and filtered search
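
The tag filter itself is ordinary Postgres array containment on the tags column; a sketch (helper name hypothetical):

import psycopg

def ids_with_tag(conn: psycopg.Connection, tag: str) -> list:
    """Candidate links carrying the given tag; a GIN index on tags
    would make this cheap at scale."""
    cur = conn.execute(
        "SELECT id FROM links WHERE tags @> ARRAY[%s]::text[]",
        (tag,),
    )
    return [r[0] for r in cur.fetchall()]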

Phase 4: Offline/Sync

  1. Extend CSV format with sync_status
  2. Implement tars sync workflow
  3. Handle conflict resolution (remote wins by default, sketched below)
  4. Graceful offline fallback

Deliverable: Works offline, syncs when connected
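
Applying the queue with remote-wins semantics maps each sync_status value onto a single statement; a minimal sketch under the extended-CSV assumptions above (helper name hypothetical):

import psycopg

def push_pending(conn: psycopg.Connection, rows: list[dict]) -> None:
    """Push locally queued changes; on conflict the remote row is kept."""
    for row in rows:
        status = row["sync_status"]
        if status == "pending_add":
            # DO NOTHING implements "remote wins" for concurrent adds
            conn.execute(
                "INSERT INTO links (url, notes) VALUES (%s, %s) "
                "ON CONFLICT (url) DO NOTHING",
                (row["url"], row.get("notes")),
            )
        elif status == "pending_update":
            conn.execute(
                "UPDATE links SET notes = %s, updated_at = now() "
                "WHERE url = %s",
                (row.get("notes"), row["url"]),
            )
        elif status == "pending_delete":
            conn.execute("DELETE FROM links WHERE url = %s", (row["url"],))
    conn.commit()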

Phase 5: Multi-user (Future)

  1. Add user/auth model
  2. Tenant isolation in database
  3. API layer for non-CLI clients

Dependencies to Add

[project]
dependencies = [
    "rich>=13.0.0",
    "psycopg[binary]>=3.0",      # PostgreSQL driver
    "playwright>=1.40",          # Headless browser for crawling
    "selectolax>=0.3",           # Fast HTML parser
    # or "beautifulsoup4>=4.12"  # Alternative parser
]

Database Setup (TigerData)

  1. Ensure pg_textsearch extension is enabled (TigerData includes this)
  2. Run tars db init to create schema
  3. Run tars db migrate to import existing links

Search Query Examples

# Basic search
tars search python tutorial
# → Matches links containing "python" and "tutorial"

# Phrase search
tars search "machine learning"
# → Matches exact phrase

# Boolean operators (pg_textsearch native)
tars search "python | rust"
# → Matches either term

# Negation
tars search "python -django"
# → Python but not Django

# With tag filter
tars search react --tag frontend
# → Search "react" only in links tagged "frontend"
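
On the CLI side each of these is one parameterized SELECT. The spec does not pin down pg_textsearch's exact query syntax, so the sketch below substitutes stock Postgres full-text functions as a stand-in; the real implementation ranks via the BM25 index instead:

import psycopg

def search(conn: psycopg.Connection, query: str, limit: int = 10) -> list[tuple]:
    """Rank links against a query; stand-in ranking only (ts_rank, not BM25)."""
    cur = conn.execute(
        """
        SELECT url, title,
               ts_rank(to_tsvector('english', search_text),
                       websearch_to_tsquery('english', %s)) AS rank
        FROM links
        WHERE to_tsvector('english', search_text)
              @@ websearch_to_tsquery('english', %s)
        ORDER BY rank DESC
        LIMIT %s
        """,
        (query, query, limit),
    )
    return cur.fetchall()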

Success Metrics

  • Search returns relevant results in <100ms
  • Crawling extracts meaningful content from 90%+ of URLs
  • Sync handles offline/online transitions gracefully
  • CLI remains fast and responsive

Open Questions

  1. Rate limiting for crawling? (respect robots.txt?)
  2. Content size limits? (truncate very long pages?)
  3. Handling non-HTML content (PDFs, images)?
  4. Search result snippet generation?