Sourcery is a schema-first extraction framework built on BlackGeorge runtime primitives.
Primary goal:
- extract typed entities/claims from unstructured text and documents,
- ground extractions to source spans,
- provide deterministic post-processing and reviewable output.
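To make "grounded to source spans" concrete, here is a minimal sketch. It is not Sourcery's actual contract: `GroundedExtraction` and `ground` are hypothetical names, and exact substring matching stands in for whatever alignment the pipeline really performs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedExtraction:
    """Hypothetical shape: an extracted value plus the source span it came from."""

    entity_type: str
    text: str
    char_start: int  # inclusive offset into the source text
    char_end: int  # exclusive offset into the source text


def ground(source: str, entity_type: str, text: str) -> GroundedExtraction | None:
    """Locate `text` in `source` by exact match; return None if it cannot be grounded."""
    start = source.find(text)
    if start == -1:
        return None
    return GroundedExtraction(entity_type, text, start, start + len(text))


source = "Ada Lovelace wrote the first program in 1843."
hit = ground(source, "person", "Ada Lovelace")
miss = ground(source, "person", "Grace Hopper")
```

Because the span is stored as offsets, a reviewer can always recover the evidence with `source[hit.char_start:hit.char_end]`, and an extraction that cannot be located is surfaced as `None` rather than silently kept.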
Core runtime model:
- Sourcery owns extraction domain logic (chunking, prompts, alignment, merge, reconciliation).
- BlackGeorge owns model execution, flow/workforce orchestration, events, pause/resume, and run storage.
- Python 3.12+
- `uv` for environment and dependency management
- Pydantic v2 for contracts
- BlackGeorge for orchestration/runtime
- pytest + ruff + mypy for quality gates
- MkDocs for docs
- `sourcery/contracts/`: data contracts and public typed models
- `sourcery/pipeline/`: chunking, prompt compilation, alignment, merge, validation
- `sourcery/runtime/`: engine and BlackGeorge integration
- `sourcery/ingest/`: source loaders (text/file/pdf/html/url) and VLM OCR interface
- `sourcery/io/`: JSONL, visualization, reviewer UI
- `sourcery/observability/`: trace/event collection
- `sourcery/benchmarks/`: Sourcery vs LangExtract benchmark runner
- `tests/`: pytest suite
- `docs/`: MkDocs pages
- Base install: `uv sync`
- Dev tooling: `uv sync --extra dev`
- Ingestion extras: `uv sync --extra ingest`
- Benchmark extras: `uv sync --extra benchmark`
- Docs tooling: `uv sync --extra docs`
- All common extras: `uv sync --extra dev --extra ingest --extra docs --extra benchmark`
- Run tests: `uv run --extra dev pytest -q`
- Lint: `uv run ruff check .`
- Format (if needed): `uv run ruff format .`
- Type check: `uv run mypy .`
- Serve docs: `uv run mkdocs serve`
- Build docs: `uv run mkdocs build`
- Run benchmark: `uv run sourcery-benchmark --text-types english,japanese,french,spanish --max-chars 4500 --max-passes 2 --sourcery-model deepseek/deepseek-chat`
- Keep all public/runtime code fully type-annotated.
- Keep `mypy` strict-clean (`[tool.mypy] strict = true`).
- Keep `ruff` clean; line length is 100.
- Prefer explicit, deterministic logic over implicit behavior.
- Use Pydantic contracts for cross-module boundaries.
- Keep black-box boundaries clear:
  - contracts: types only,
  - pipeline: deterministic extraction transforms,
  - runtime: orchestration/provider integration,
  - io/observability: output + telemetry.
- Any behavior change must include or update tests.
- Bug fixes should add a regression test when feasible.
- Keep all existing tests green before finishing.
- Prefer focused unit tests near changed modules plus one integration test when runtime behavior changes.
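As an illustration of the regression-test guidance above, here is a minimal pytest-style sketch. `merge_spans` is a hypothetical helper invented for this example, not a function from the Sourcery codebase; the point is the shape of the test, which pins the exact input that once misbehaved.

```python
def merge_spans(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Hypothetical helper: merge overlapping or touching (start, end) spans."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            # Keep the larger end so a nested span cannot shrink its parent.
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def test_merge_spans_keeps_outer_end_for_nested_span() -> None:
    # Regression case: a nested span must not shrink the enclosing span.
    assert merge_spans([(0, 10), (2, 5)]) == [(0, 10)]


def test_merge_spans_is_order_independent() -> None:
    # Deterministic output regardless of input order.
    assert merge_spans([(8, 12), (0, 3), (2, 9)]) == [(0, 12)]
```

Naming the test after the failure it guards against makes the intent visible in the pytest report if the bug ever returns.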
- If public API, runtime behavior, or config changes, update docs in `docs/`.
- If you add a new docs page, update `mkdocs.yml` navigation.
- Keep `README.md`, `USAGE.md`, and `CODE_EXAMPLES.md` consistent with code behavior.
- `RuntimeConfig.model` must be set to a valid provider/model route.
- Provider keys must come from environment variables; never hardcode secrets.
- Do not commit `.env` or API keys.
- Do not commit runtime state/databases under `.sourcery/` unless explicitly requested.
- Do not use destructive git commands (`reset --hard`, `checkout --`) unless explicitly asked.
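One way to follow the rule that provider keys come from the environment is sketched below. The helper name `load_provider_key`, the error class, and the variable name `EXAMPLE_PROVIDER_API_KEY` are all hypothetical; how Sourcery and BlackGeorge actually read keys is not shown in this document.

```python
import os


class MissingProviderKeyError(RuntimeError):
    """Raised when a required provider key is not present in the environment."""


def load_provider_key(env_var: str) -> str:
    """Read a provider key from the environment and fail loudly if it is absent."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Name the variable in the error, but never echo a key value.
        raise MissingProviderKeyError(f"Set {env_var} in the environment before running.")
    return key
```

A call such as `load_provider_key("EXAMPLE_PROVIDER_API_KEY")` keeps the secret out of source control, and failing early gives a clearer error than a rejected request from the provider later in the run.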
- `uv run ruff check .`
- `uv run mypy .`
- `uv run --extra dev pytest -q`
- Update docs if behavior/API changed
- Keep changes minimal and scoped to the task
- Conventional commits: `feat: ...`, `fix: ...`, `refactor: ...`, `docs: ...`, `test: ...`, `chore: ...`
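If it helps to check commit subjects locally, the prefixes listed above can be validated with a small script. This is a sketch under assumptions, not project tooling: `is_conventional` is a hypothetical helper, and it accepts only the six types named in this list, without scopes such as `feat(runtime): ...`, since the list does not show them.

```python
import re

# The six commit types named in the list above.
ALLOWED_TYPES = ("feat", "fix", "refactor", "docs", "test", "chore")

# "<type>: <non-empty description>", with exactly one space after the colon.
SUBJECT_RE = re.compile(rf"^({'|'.join(ALLOWED_TYPES)}): \S.*$")


def is_conventional(subject: str) -> bool:
    """Return True if the commit subject starts with an allowed type prefix."""
    return SUBJECT_RE.match(subject) is not None
```

For example, `is_conventional("fix: handle empty chunk")` returns `True`, while `is_conventional("Fix stuff")` returns `False`.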