You are helping build LineageGuard, a developer tool for the OpenMetadata "Back to the Metadata" hackathon (WeMakeDevs × OpenMetadata, April 17–26, 2026).
LineageGuard takes a simulated schema change (e.g., "drop column revenue_cents from
table raw_stripe_data"), queries a local OpenMetadata server to walk
downstream lineage, fetches governance signals (tier, owner, glossary
terms, data contracts) for each downstream entity, runs a deterministic
ranking algorithm to classify impact severity (CRITICAL / HIGH / WARNING
/ INFO), and emits structured JSON findings. The tool is exposed as both
a CLI and a Model Context Protocol (MCP) server so AI assistants like
Claude Desktop and Cursor can invoke it.
The engine is deterministic. The LLM is only a narrator.
- All severity classification, lineage traversal, and impact decisions happen in pure Python code — no LLM calls inside the engine.
- LLMs (Claude, etc.) only consume the structured JSON output and translate it into human-readable explanations.
- If you find yourself tempted to add "ask an LLM to decide…" anywhere in the engine, stop and refuse. This is the project's architectural thesis and must not be compromised.
- Language: Python 3.11
- HTTP client: `httpx` (NOT `requests`)
- Data validation: `pydantic` v2
- CLI framework: `click`
- MCP framework: official `mcp` Python SDK (stdio transport)
- Testing: `pytest`
- OpenMetadata: runs locally via Docker Compose, version 1.12+ stable
- Environment setup + OpenMetadata running in Docker
- Seed script that creates demo metadata idempotently
- OpenMetadata REST API client wrapper
- Lineage walker
- Governance signal fetcher
- Semantic ranker (4-level)
- Output formatter (JSON + Markdown)
- CLI tool
- MCP server
- Demo video
Do not skip steps. Do not merge steps. The user is a beginner and needs each step small and verifiable.
```
lineageguard/
├── README.md
├── GEMINI.md              # Read this file
├── .gitignore
├── .env.example
├── pyproject.toml
├── docker-compose.yml     # OpenMetadata deployment
├── seed.py                # idempotent demo seeder
├── src/lineageguard/
│   ├── __init__.py
│   ├── client.py          # OpenMetadata REST client
│   ├── models.py          # Pydantic schemas
│   ├── lineage.py         # downstream lineage walker
│   ├── governance.py      # tier / owner / glossary / contract fetcher
│   ├── ranker.py          # deterministic severity logic
│   ├── engine.py          # orchestration
│   ├── formatter.py       # JSON + Markdown output
│   ├── cli.py             # Click CLI
│   └── mcp_server.py      # MCP server entrypoint
├── tests/
│   └── test_ranker.py
└── examples/
    ├── sample_change_spec.json
    └── sample_output.json
```
- Beginner-friendly. The user does not write Python daily. Every function needs a clear docstring. Avoid clever one-liners, dense comprehensions, or "pythonic" tricks. Prefer explicit loops and named variables.
- Type hints on every function. Full signatures.
- Fail loud and clear. When something goes wrong, print a helpful error message. Never swallow exceptions silently.
- No premature abstraction. No base classes, no plugin systems, no decorators unless strictly needed.
- Print, don't log. For a demo tool, `print()` with clear prefixes (`[INFO]`, `[ERROR]`) is fine. Skip the `logging` module.
- One module, one responsibility. Don't mix lineage walking and ranking in the same file.
- All secrets via environment variables. Never hardcode PATs, API keys, or URLs. Use a `.env` file + `python-dotenv`.
- Local URL: `http://localhost:8585`
- API base: `http://localhost:8585/api/v1`
- Authentication: Personal Access Token (PAT) as Bearer token
- Environment variables (in `.env`):
  - `OPENMETADATA_URL=http://localhost:8585/api/v1`
  - `OPENMETADATA_TOKEN=<paste PAT here>`
The seed script must create exactly this graph:
```
raw_stripe_data (Table, Stripe service, no tier)
└─ lineage ─▶ stg_stripe_charges (Table, dbt/Snowflake)
   └─ lineage ─▶ fct_orders (Table, Tier 1, Finance owner,
      │            linked to glossary term "Net Revenue",
      │            data contract "Finance Core Metrics")
      ├─ lineage ─▶ Executive Revenue Dashboard
      │             (Dashboard, Metabase, Tier 1)
      └─ lineage ─▶ marketing_attribution_dashboard
                    (Dashboard, Tier 3, no governance)

Parallel branch (depth-2):
raw_stripe_data ─▶ dim_customers (Table, Tier 2, linked to glossary term "Customer")
```
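The lineage walker over this graph is a plain breadth-first traversal. A sketch using a hypothetical in-memory edge map that mirrors the seed graph (the real `lineage.py` would fetch edges from the OpenMetadata lineage API instead):

```python
from collections import deque

# Hypothetical in-memory lineage edges mirroring the seed graph;
# in the real tool these come from the OpenMetadata REST API.
EDGES: dict[str, list[str]] = {
    "raw_stripe_data": ["stg_stripe_charges", "dim_customers"],
    "stg_stripe_charges": ["fct_orders"],
    "fct_orders": [
        "executive_revenue_dashboard",
        "marketing_attribution_dashboard",
    ],
}

def walk_downstream(start: str) -> list[tuple[str, int]]:
    """Breadth-first walk of downstream lineage.

    Returns (entity_name, depth) pairs. Entities already visited
    are skipped, so a cycle in the graph cannot loop forever.
    """
    results: list[tuple[str, int]] = []
    visited: set[str] = {start}
    queue: deque[tuple[str, int]] = deque([(start, 0)])
    while queue:
        current, depth = queue.popleft()
        for child in EDGES.get(current, []):
            if child in visited:
                continue
            visited.add(child)
            results.append((child, depth + 1))
            queue.append((child, depth + 1))
    return results
```

Recording the depth alongside each entity is what lets the ranker treat near-downstream assets differently from far-downstream ones.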
Running the canonical demo (drop_column revenue_cents on raw_stripe_data) must produce at least one CRITICAL finding (fct_orders and the Executive Revenue Dashboard), one HIGH finding, and one INFO finding. This variety is essential for the demo to land.
- Engine is deterministic. LLM is narrator only.
- Seed script is idempotent (safe to re-run any number of times).
- All API calls have timeouts (10s default).
- All Pydantic models have strict typing.
- Every feature ships with a way to verify it works from the terminal.
- No feature creep. If it's not in the build order above, it's out of scope.
- Generate the complete file contents, not fragments.
- Include imports at the top.
- Add a `__main__` block to files that can be run standalone, so the user can test each module in isolation.
- After generating code, tell the user exactly what command to run to verify it works.
Error messages must:
- Name the function that failed
- Include the value that caused the problem
- Suggest the likely fix
Example: [ERROR] fetch_entity(): Entity 'raw_stripe_data' not found at http://localhost:8585/api/v1/tables/name/raw_stripe_data. Did you run seed.py?
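One way to keep that format consistent is a tiny helper; `format_error` and its parameters are an assumption for illustration, not a required function:

```python
def format_error(function_name: str, detail: str, fix_hint: str) -> str:
    """Compose an error line that names the failing function, includes
    the offending value (inside detail), and suggests the likely fix.
    """
    return f"[ERROR] {function_name}(): {detail} {fix_hint}"

# Reproduces the example message above:
message = format_error(
    "fetch_entity",
    "Entity 'raw_stripe_data' not found at "
    "http://localhost:8585/api/v1/tables/name/raw_stripe_data.",
    "Did you run seed.py?",
)
```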