Add vector exporter for semantic search embeddings #32

Merged: marcarl merged 5 commits into main from claude/vector-exporter-regulations-LR4IP on Jan 6, 2026

@marcarl (Collaborator) commented Jan 5, 2026

Adds a new 'vector' output format that converts SFS documents to vector
embeddings suitable for semantic search and retrieval. Key features:

  • Applies temporal filtering (like md/html mode) to include only current regulations
  • Intelligent text chunking (by paragraph, chapter, section, or semantic boundaries)
  • OpenAI text-embedding-3-large model (best quality, 3072 dimensions)
  • Multiple backend support: PostgreSQL/pgvector, Elasticsearch, JSON file
  • Integrated into sfs_processor.py with CLI options

New files:

  • exporters/vector/__init__.py - Module entry point
  • exporters/vector/vector_export.py - Main export functionality
  • exporters/vector/chunking.py - Document chunking strategies
  • exporters/vector/embeddings.py - Embedding provider interface
  • exporters/vector/backends/ - Vector store implementations

Usage: python sfs_processor.py --formats vector --vector-backend postgresql
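
For readers unfamiliar with the chunking step listed under the key features, here is a minimal sketch of paragraph-level chunking. It is not the code in exporters/vector/chunking.py: the `Chunk` class, the `chunk_by_paragraph` function, and the `max_chars` parameter are hypothetical names chosen for illustration, and the sketch assumes paragraphs are separated by blank lines.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical container; the PR's DocumentChunk may be shaped differently."""
    document_id: str
    index: int
    text: str


def chunk_by_paragraph(document_id: str, text: str, max_chars: int = 1000) -> List[Chunk]:
    """Split on blank lines, then merge adjacent paragraphs up to max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[Chunk] = []
    buffer = ""
    for paragraph in paragraphs:
        candidate = f"{buffer}\n\n{paragraph}" if buffer else paragraph
        if buffer and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: flush what we have.
            chunks.append(Chunk(document_id, len(chunks), buffer))
            buffer = paragraph
        else:
            buffer = candidate
    if buffer:
        chunks.append(Chunk(document_id, len(chunks), buffer))
    return chunks
```

In this sketch a single paragraph longer than `max_chars` is kept as one oversized chunk rather than being cut mid-sentence; a real implementation would likely need a fallback split to stay within the embedding model's input limit.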

claude and others added 5 commits January 5, 2026 17:25
Adds a new 'vector' output format that converts SFS documents to vector
embeddings suitable for semantic search and retrieval. Key features:

- Applies temporal filtering (like md/html mode) to include only current regulations
- Intelligent text chunking (by paragraph, chapter, section, or semantic boundaries)
- OpenAI text-embedding-3-large model (best quality, 3072 dimensions)
- Multiple backend support: PostgreSQL/pgvector, Elasticsearch, JSON file
- Integrated into sfs_processor.py with CLI options

New files:
- exporters/vector/__init__.py - Module entry point
- exporters/vector/vector_export.py - Main export functionality
- exporters/vector/chunking.py - Document chunking strategies
- exporters/vector/embeddings.py - Embedding provider interface
- exporters/vector/backends/ - Vector store implementations

Usage: python sfs_processor.py --formats vector --vector-backend postgresql

Add documentation for the new vector export format including:
- Overview of vector format in output formats section
- Temporal processing behavior for vector format
- CLI parameters for vector-specific options
- Dedicated section explaining semantic search use cases
- Backend comparison table (JSON, PostgreSQL, Elasticsearch)
- Usage examples with mock and production embeddings

JSON backend now saves vectors to output directory instead of repository root.
Sets backend_config["file_path"] when backend_type is "json".
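
A rough sketch of the behavior this commit describes, under stated assumptions: only `backend_config["file_path"]` and `backend_type` come from the commit message. The function name `resolve_backend_config`, the `output_dir` parameter, and the file name `vectors.json` are hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_backend_config(
    backend_type: str,
    output_dir: str,
    backend_config: Optional[dict] = None,
) -> dict:
    """Return a copy of backend_config with file_path set for the JSON backend."""
    config = dict(backend_config or {})
    if backend_type == "json":
        # Hypothetical file name; the point is that the path lives under
        # the output directory rather than the repository root.
        config["file_path"] = str(Path(output_dir) / "vectors.json")
    return config
```

Other backend types pass through unchanged, so PostgreSQL and Elasticsearch settings are not affected by the output directory.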

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Standardize all metadata field names to English across vector export:
- ikraft_datum → effective_date (when regulation takes effect)
- utfardad_datum → issued_date (when regulation was issued)
- upphor_datum → expiration_date (when regulation expires)
- upphavd → repealed (if regulation is repealed)

Changes:
- Updated VectorRecord and DocumentChunk with English field names
- Modified PostgreSQL schema with English column names
- Updated Elasticsearch index mappings
- Added metadata normalization from Swedish to English
- Enhanced metadata extraction from both frontmatter and selex attributes
- All backends (JSON, PostgreSQL, Elasticsearch) now use consistent English fields
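
The Swedish-to-English normalization can be pictured as a simple key rename. The mapping below is taken from the list in this commit message; the names `FIELD_MAP` and `normalize_metadata` are hypothetical, and the actual implementation may handle more fields or nested structures.

```python
# Swedish source field -> English field, as listed in the commit message.
FIELD_MAP = {
    "ikraft_datum": "effective_date",
    "utfardad_datum": "issued_date",
    "upphor_datum": "expiration_date",
    "upphavd": "repealed",
}


def normalize_metadata(metadata: dict) -> dict:
    """Rename known Swedish keys to English; leave all other keys untouched."""
    return {FIELD_MAP.get(key, key): value for key, value in metadata.items()}
```

Keys that are not in the mapping pass through as they are, so metadata that is already in English is not altered.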

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Change date fields from TEXT to DATE type for proper date handling:
- effective_date: TEXT → DATE
- issued_date: TEXT → DATE
- expiration_date: TEXT → DATE

Elasticsearch already uses correct date type with format "yyyy-MM-dd||strict_date".

This enables proper date queries, sorting, and range filtering in PostgreSQL.
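
To illustrate the kind of range filtering that typed dates make possible, here is a Python analogue that parses `yyyy-MM-dd` strings into `datetime.date` values and checks whether a record is in force on a given day. The helper names `parse_date` and `is_in_force` are hypothetical, and treating the expiration date as the first day a regulation no longer applies is an assumption, not something the PR states.

```python
from datetime import date
from typing import Optional


def parse_date(value: Optional[str]) -> Optional[date]:
    """Parse an ISO yyyy-MM-dd string; empty or missing values become None."""
    return date.fromisoformat(value) if value else None


def is_in_force(record: dict, on: date) -> bool:
    """Return True if the record is not repealed and `on` falls in its validity range."""
    if record.get("repealed"):
        return False
    effective = parse_date(record.get("effective_date"))
    expiration = parse_date(record.get("expiration_date"))
    if effective is not None and effective > on:
        return False
    # Assumption: the expiration date itself is already outside the range.
    if expiration is not None and expiration <= on:
        return False
    return True
```

Records with no effective or expiration date are treated as open-ended on that side of the range.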

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

@marcarl marcarl merged commit 8282dcf into main Jan 6, 2026
5 checks passed
@marcarl marcarl deleted the claude/vector-exporter-regulations-LR4IP branch January 6, 2026 10:17
2 participants