All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
First Production Release!
Benchmarking System:
- Created comprehensive `scripts/benchmark.py` tool for performance testing
- Support for 4 database backends (pgvector, astradb, milvus, chroma)
- Support for 2 embedding providers (OpenAI, Gemini)
- 7 operation types tested: bulk/individual create, vector/metadata search, Query DSL operators, update, delete
- `--skip-slow` flag to skip cloud backends for faster local testing
- Smart Query DSL optimization: 4 operators for slow backends, 10 for fast backends
- Detailed markdown reports with performance metrics
- Performance summary shows tested vs skipped backends clearly
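
A minimal sketch of how a `--skip-slow` flag could split backends into tested and skipped groups. It is not the code in `scripts/benchmark.py`: the names `SLOW_BACKENDS`, `build_parser`, and `select_backends` are hypothetical, and which backends count as "slow" is an assumption made only for illustration.

```python
import argparse

ALL_BACKENDS = ["pgvector", "astradb", "milvus", "chroma"]
# Assumption: treating these two as the cloud ("slow") backends is a guess.
SLOW_BACKENDS = {"astradb", "milvus"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Benchmark flag sketch")
    parser.add_argument(
        "--skip-slow",
        action="store_true",
        help="skip cloud backends for faster local testing",
    )
    return parser


def select_backends(skip_slow: bool) -> tuple[list[str], list[str]]:
    """Return (tested, skipped) so a report can list both groups."""
    tested = [b for b in ALL_BACKENDS if not (skip_slow and b in SLOW_BACKENDS)]
    skipped = [b for b in ALL_BACKENDS if b not in tested]
    return tested, skipped
```

Calling `select_backends(build_parser().parse_args().skip_slow)` would give the two lists a performance summary needs to show tested and skipped backends separately.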
Engine Improvements:
- Added `VectorEngine.drop_collection()` method for collection cleanup
- Better collection lifecycle management
Documentation:
- Added benchmarking section to README.md (102 lines)
- Created comprehensive `docs/benchmarking.md` guide (385 lines)
- Updated `docs/contributing.md` with benchmarking workflow
- Added usage examples and best practices
- Cost estimation and troubleshooting guides
Testing:
- Added 50+ new unit tests
- Test coverage for ABC adapters (82%)
- Test coverage for logger (100%)
- Extended engine tests
- Schema, utils, and Q object coverage tests
- Total: 365 tests passing (from ~300)
Architecture:
- Enhanced ABC base class with unified initialization
- Improved adapter architecture
- Better error reporting in benchmarks
- Truncated error messages in reports for readability
- Collection name defaults now use `api_settings.VECTOR_COLLECTION_NAME` instead of class constant
- Improved Milvus metadata-only search support verification
- Updated all adapter documentation
- Modernized contributing.md with uv, pre-commit, ruff
- Removed `scripts/e2e.py` (replaced with `pytest scripts/tests`)
- Removed `DEFAULT_COLLECTION_NAME` class constant from adapters
- Fixed Milvus tests to verify metadata-only search functionality
- Fixed collection name handling across all adapters
- Better error messages in benchmark reports
- Proper cleanup in benchmark tests
- `DEFAULT_COLLECTION_NAME` class constant removed - use `api_settings.VECTOR_COLLECTION_NAME` in settings instead
- Stricter ChromaDB config validation (prevents conflicting settings)
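
To illustrate the move from a class constant to a settings value, here is a rough sketch of the pattern. `ApiSettings` and `ExampleAdapter` are hypothetical stand-ins, not the project's classes, and the default name `"vector_documents"` is an assumption; only `VECTOR_COLLECTION_NAME` comes from the entry above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ApiSettings:
    # Stand-in for the project's settings object; the default is illustrative.
    VECTOR_COLLECTION_NAME: str = "vector_documents"


api_settings = ApiSettings()


class ExampleAdapter:
    """Hypothetical adapter, not the crossvector class."""

    def __init__(self, collection_name: Optional[str] = None) -> None:
        # Fall back to the shared setting instead of a per-class constant.
        self.collection_name = collection_name or api_settings.VECTOR_COLLECTION_NAME
```

Because the fallback is read when the adapter is constructed, changing the setting affects adapters created afterwards, while an explicit `collection_name` argument still takes precedence.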
- Benchmark results show ~60% reduction in API calls for cloud backends with optimization
- Local testing with `--skip-slow`: ~2-3 minutes vs 10+ minutes
- PgVector: ~6-10 docs/sec bulk create, ~0.5ms metadata queries
- Gemini: 1.5x faster search vs OpenAI for same operations
- Repository URLs and references updated
- Enhanced architecture diagrams
- Improved API documentation
- Fixed all broken links
- Reorganized test structure for better separation between unit and integration tests
  - Moved real backend integration tests from `tests/searches/` to `scripts/tests/`
  - Created `tests/mock/` with in-memory adapter for Query DSL unit testing
  - Added comprehensive integration tests for all 4 backends (AstraDB, ChromaDB, Milvus, PgVector)
  - Integration tests are opt-in and require real backend credentials
- Fixed Milvus operator mapping - Changed `in`/`not in` to uppercase `IN`/`NOT IN` for compliance
- Improved test coverage for Query DSL with mock backend tests
- All backends now consistently support 8 universal operators: `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`, `$in`, `$nin`
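
As an illustration of what the eight universal operators mean, the sketch below evaluates them against an in-memory metadata dictionary. It is not the crossvector Query DSL or its mock adapter: the `matches` function, the `{"field": {"$op": value}}` filter shape, and the handling of missing fields are all assumptions.

```python
import operator
from typing import Any, Callable, Dict

# The eight operators listed above, mapped to plain Python comparisons.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata: Dict[str, Any], where: Dict[str, Dict[str, Any]]) -> bool:
    """Return True when every condition in `where` holds for `metadata`."""
    for field, conditions in where.items():
        if field not in metadata:
            return False  # sketch choice: a missing field never matches
        for op, expected in conditions.items():
            if op not in OPERATORS:
                raise ValueError(f"unsupported operator: {op}")
            if not OPERATORS[op](metadata[field], expected):
                return False
    return True
```

For example, `matches({"score": 7}, {"score": {"$gte": 5, "$lt": 10}})` combines two conditions on one field; a real backend would translate the same filter into its own query syntax instead of evaluating it in Python.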
- Updated GitHub Actions workflow to run only unit tests (`pytest tests/`)
- Integration tests excluded from CI to avoid credential requirements
- Added `integration` pytest marker for manual integration test execution
- Fixed pytest fixture imports in mock tests
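
One way an `integration` marker can be registered is through a `conftest.py` hook, sketched below. Whether the project registers it this way or in its pytest configuration file is not stated here, and the marker description text is an assumption.

```python
# conftest.py (sketch)


def pytest_configure(config):
    # Register the marker so that `pytest --strict-markers` accepts it.
    config.addinivalue_line(
        "markers",
        "integration: needs real backend credentials; run manually",
    )
```

Tests decorated with `@pytest.mark.integration` could then be selected by hand with something like `pytest -m integration scripts/tests/`, while CI keeps running only `pytest tests/`.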
- Updated README.md with opt-in integration test documentation
  - Added `scripts/tests/` usage examples
  - Environment variable setup guide for all backends
  - Static collection naming conventions (`test_crossvector`)
- Documented test separation strategy and rationale
- Fixed missing fixture imports causing 15 test errors in mock tests
- Removed unused variable assignments in CRUD test methods
- Resolved pre-commit hook failures (ruff formatting)
- Major refactoring and architecture improvements
- Enhanced Query DSL design and implementation patterns
- Improved adapter interface consistency across backends
- Bumped package version to 0.1.1.
- Added beta warning and production‑risk notice in README.
- Switched timestamps to float Unix timestamps (`created_timestamp`, `updated_timestamp`).
- Introduced `VECTOR_STORE_TEXT` configuration option.
- Fixed integration tests for AstraDB, ChromaDB, Milvus, and PGVector (table name handling, dimension parameter, score field).
- Updated documentation (README, quickstart, schema, configuration) to reflect new features and usage.
- Adjusted `.markdownlint.yaml` to disable MD060 table-column-style warnings.
- Cleaned up imports and resolved lint errors (ruff E402).
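
The switch to float Unix timestamps noted in this release can be pictured with the short sketch below. Only the field names `created_timestamp` and `updated_timestamp` come from the changelog; `ExampleDocument`, `touch`, and `to_utc` are hypothetical helpers, not the project's schema.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ExampleDocument:
    """Hypothetical record with float Unix timestamps (seconds since the epoch)."""

    text: str
    created_timestamp: float = field(default_factory=time.time)
    updated_timestamp: float = field(default_factory=time.time)

    def touch(self) -> None:
        # Refresh only the update time; the creation time stays fixed.
        self.updated_timestamp = time.time()


def to_utc(ts: float) -> datetime:
    """Convert a float Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)
```

Plain floats store and compare the same way in every backend, and a helper like `to_utc` recovers a readable datetime when one is needed.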
- Changed GitHub organization from `twofarm` to `thewebscraping`
  - Updated all URLs in:
    - `pyproject.toml`
    - `mkdocs.yml`
    - `README.md`
    - `docs/contributing.md`
  - Documentation site URL: https://thewebscraping.github.io/crossvector/
  - Repository URL: https://github.com/thewebscraping/crossvector
- Created `scripts/tests/` directory with comprehensive test scripts for real cloud APIs:
  - `tests/test_astradb.py` - Test AstraDB cloud adapter
  - `tests/test_chroma_cloud.py` - Test ChromaDB Cloud adapter
  - `tests/test_milvus.py` - Test Milvus cloud adapter
  - `tests/test_pgvector.py` - Test PGVector adapter
  - `tests/test_integration.py` - Comprehensive integration test for VectorEngine
  - `tests/__init__.py` - Package initialization
  - `tests/README.md` - Detailed documentation for running tests
- Created comprehensive MkDocs documentation:
  - `mkdocs.yml` - Documentation configuration with Material theme
  - `docs/index.md` - Project overview
  - `docs/installation.md` - Installation guide
  - `docs/quickstart.md` - Quick start guide
  - `docs/configuration.md` - Configuration guide
  - `docs/adapters/databases.md` - Database adapter documentation
  - `docs/adapters/embeddings.md` - Embedding adapter documentation
  - `docs/api.md` - API reference
  - `docs/contributing.md` - Contributing guide
- Created GitHub Actions workflows:
  - `.github/workflows/ci.yml` - Test and lint on push/PR
  - `.github/workflows/publish.yml` - Publish to PyPI on release (using Trusted Publishing)
  - `.github/workflows/docs.yml` - Deploy documentation to GitHub Pages
  - `.github/workflows/test-build.yml` - Test package build before release
- Pre-commit hooks: `.pre-commit-config.yaml` for code quality
- Release helper: `scripts/release.sh` for automated releases
- Markdown linting: `.markdownlint.yaml` configuration
- Fixed test issues:
  - Updated all mock paths in `test_openai_embeddings.py` from `llm_scraper` to `crossvector`
  - Added proper settings mocking to prevent API key errors
  - All 43 tests now passing
- Updated Pydantic settings:
  - Migrated from deprecated `class Config` to `SettingsConfigDict`
  - Removed deprecation warnings
- Updated README:
  - Changed Gemini status from "Placeholder" to "Production"
  - Updated roadmap to show Gemini as completed
  - Fixed all references from `VectorStoreEngine` to `VectorEngine`
- Copied `.env` file from `llm_scraper` project for testing with real cloud credentials
- Added `site/` to `.gitignore` for documentation builds
uv run pytest

# Run comprehensive integration test
uv run python scripts/tests/test_integration.py

# Or test individual databases
uv run python scripts/tests/test_astradb.py
uv run python scripts/tests/test_chroma_cloud.py
uv run python scripts/tests/test_milvus.py
uv run python scripts/tests/test_pgvector.py

# Build documentation
uv run mkdocs build

# Serve documentation locally
uv run mkdocs serve

- Push to GitHub:
  git add .
  git commit -m "Update GitHub org to thewebscraping, add docs and test scripts"
  git push origin main
- Create GitHub Repository:
  - Create repository at https://github.com/thewebscraping/crossvector
  - Enable GitHub Pages for documentation
  - Set up PyPI trusted publishing for releases
- Publish to PyPI:
  - Create a GitHub release
  - The publish workflow will automatically publish to PyPI
- Verify Documentation:
  - Check documentation at https://thewebscraping.github.io/crossvector/
  - Ensure all links are working