| 1 | +# ADR-001: Source Adapters |
| 2 | + |
| 3 | +- **Status:** Accepted |
| 4 | +- **Date:** 2026-05-29 |
| 5 | +- **Deciders:** @ayhammouda |
| 6 | +- **Roadmap refs:** principles 2.1, 2.2, 2.7 |
| 7 | + |
| 8 | +## Context and Problem Statement |
| 9 | + |
| 10 | +`python-docs-mcp-server` needs documentation answers that are precise, |
| 11 | +version-aware, and trustworthy inside MCP clients. The project therefore cannot |
| 12 | +treat "source" as an arbitrary search result or scraped mirror. The first layer |
| 13 | +of the architecture is a source-connector layer that accepts a version or |
| 14 | +package identifier, reaches only canonical upstream sources, and hands stable |
| 15 | +artifacts to ingestion. |
| 16 | + |
| 17 | +Two source adapters exist today: |
| 18 | + |
| 19 | +- CPython documentation source: `build-index` uses pinned CPython documentation |
| 20 | + build targets from |
| 21 | + [`src/mcp_server_python_docs/ingestion/cpython_versions.py`](../../src/mcp_server_python_docs/ingestion/cpython_versions.py), |
| 22 | + clones `python/cpython` at the configured tag, installs the configured Sphinx |
| 23 | + pin in a dedicated build virtual environment, and runs `sphinx-build -b json` |
| 24 | + before ingesting generated JSON files through |
| 25 | + [`src/mcp_server_python_docs/ingestion/sphinx_json.py`](../../src/mcp_server_python_docs/ingestion/sphinx_json.py). |
| 26 | + Symbol inventory ingestion uses `objects.inv` through |
| 27 | + [`src/mcp_server_python_docs/ingestion/inventory.py`](../../src/mcp_server_python_docs/ingestion/inventory.py). |
| 28 | +- PyPI metadata source: |
| 29 | + [`src/mcp_server_python_docs/services/package_docs.py`](../../src/mcp_server_python_docs/services/package_docs.py) |
| 30 | + backs `lookup_package_docs` with `GET /pypi/<project>/json` from the official |
| 31 | + PyPI JSON API. It returns package-declared PyPI, documentation, homepage, |
| 32 | + source, and repository URLs from controlled metadata fields and does not crawl |
| 33 | + pages or perform generic web search. |
| 34 | + |
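The CPython adapter steps described above (clone at a pinned tag, install a pinned Sphinx into a dedicated virtual environment, run `sphinx-build -b json`) can be sketched as a command plan. This is a minimal illustration, not the project's actual `build-index` code: `DocsBuildPin`, `plan_build_commands`, the directory names, and the example pin values are all hypothetical, and the real pipeline may install additional build requirements.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DocsBuildPin:
    """Hypothetical shape of one pinned build target (not the project's real type)."""

    git_tag: str         # CPython tag to clone, e.g. "v3.13.0"
    sphinx_version: str  # exact Sphinx version installed into the build venv


def plan_build_commands(pin: DocsBuildPin, workdir: Path) -> list[list[str]]:
    """Return the commands for one pinned docs build without running them."""
    checkout = workdir / "cpython"
    venv = workdir / "build-venv"
    venv_bin = venv / "bin"  # POSIX layout; Windows venvs use "Scripts" instead
    return [
        # Shallow-clone exactly the pinned tag from the canonical upstream.
        ["git", "clone", "--depth", "1", "--branch", pin.git_tag,
         "https://github.com/python/cpython.git", str(checkout)],
        # Dedicated virtual environment so the Sphinx pin cannot leak elsewhere.
        ["python3", "-m", "venv", str(venv)],
        [str(venv_bin / "python"), "-m", "pip", "install",
         f"sphinx=={pin.sphinx_version}"],
        # JSON builder output is what ingestion consumes.
        [str(venv_bin / "sphinx-build"), "-b", "json",
         str(checkout / "Doc"), str(workdir / "json-out")],
    ]
```

Returning the plan as data rather than executing it keeps the pin visible in one place and makes the build easy to assert on in tests; a caller could pass each entry to `subprocess.run(cmd, check=True)`.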
| 35 | +This ADR records the contract for those adapters so that adapters for later |
| 36 | +documentation ecosystems can reuse the same layer boundary without weakening the trust model. |
| 37 | + |
| 38 | +## Decision Drivers |
| 39 | + |
| 40 | +- Principle 2.1: canonical source only. CPython comes from pinned upstream tags; |
| 41 | + PyPI package links come from PyPI project metadata. Scraped mirrors and |
| 42 | + third-party indexers are outside the contract. |
| 43 | +- Principle 2.2: offline-first runtime. MCP docs queries should read the local |
| 44 | + index and cache, not reach remote documentation services at query time. |
| 45 | +- Principle 2.7: layered design with stable contracts. Source connectors must |
| 46 | + have explicit inputs, outputs, and invariants so ingestion and downstream |
| 47 | + retrieval layers do not depend on source-specific behavior. |
| 48 | +- The contract must describe current behavior only. Future adapters, such as |
| 49 | + those for other language ecosystems, should clone the contract rather than be |
| 50 | + documented as existing features. |
| 51 | + |
| 52 | +## Considered Options |
| 53 | + |
| 54 | +1. Keep source behavior implicit in ingestion and service code. |
| 55 | + - Rejected because future work would have to infer the trust boundary from |
| 56 | + implementation details, increasing the chance of accidental mirror, |
| 57 | + indexer, or runtime-network drift. |
| 58 | +2. Allow generic web or third-party docs providers as source adapters. |
| 59 | + - Rejected because this conflicts with principle 2.1 and would make results |
| 60 | + less reproducible and less auditable. |
| 61 | +3. Document a narrow source-connector contract for the adapters that exist |
| 62 | + today. |
| 63 | + - Accepted because it matches the current code and gives future adapters a |
| 64 | + stable layer boundary to copy. |
| 65 | + |
| 66 | +## Decision Outcome |
| 67 | +<!-- Canonical source only; pinned, reproducible; PyPI metadata is the one |
| 68 | + controlled network lookup and is not a query-time call. --> |
| 69 | + |
| 70 | +The source-connector layer is limited to canonical upstream sources. CPython |
| 71 | +documentation builds are pinned by version-specific CPython tags and Sphinx |
| 72 | +pins, then converted into canonical ingestion artifacts by the build pipeline. |
| 73 | +PyPI package documentation discovery is limited to the official PyPI JSON API |
| 74 | +and allowlisted project metadata fields. |
| 75 | + |
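As a rough sketch of what "allowlisted project metadata fields" could look like in code, the function below reads only the `info` object of a PyPI JSON API response and keeps URLs whose labels appear in an allowlist. The label set, output keys, and function names are assumptions for illustration; the actual allowlist lives in `package_docs.py` and may differ. Name normalization follows the PEP 503 rule.

```python
import re

# Hypothetical allowlist: lowercase `project_urls` labels mapped to output keys.
ALLOWED_LABELS = {
    "documentation": "documentation",
    "docs": "documentation",
    "homepage": "homepage",
    "source": "source",
    "source code": "source",
    "repository": "repository",
}


def normalize_project_name(name: str) -> str:
    """Normalize a package name as PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()


def metadata_url(name: str) -> str:
    """URL of the official PyPI JSON API document for a project."""
    return f"https://pypi.org/pypi/{normalize_project_name(name)}/json"


def extract_declared_urls(payload: dict) -> dict[str, str]:
    """Keep only package-declared URLs from allowlisted metadata fields."""
    info = payload.get("info") or {}
    found: dict[str, str] = {}
    if info.get("package_url"):
        found["pypi"] = info["package_url"]
    if info.get("home_page"):
        found["homepage"] = info["home_page"]
    for label, url in (info.get("project_urls") or {}).items():
        key = ALLOWED_LABELS.get(label.strip().lower())
        # Unknown labels are dropped rather than guessed at; first match wins.
        if key and isinstance(url, str):
            found.setdefault(key, url)
    return found
```

Because the function only inspects fields the package itself declared, every returned URL stays traceable to PyPI metadata, and nothing here follows or fetches the linked pages.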
| 76 | +`lookup_package_docs` is the documented exception to the offline-first rule: it |
| 77 | +performs a controlled PyPI metadata lookup when the package lookup runs. That |
| 78 | +call is made only at package-lookup time; docs queries against the local stdlib |
| 79 | +documentation index do not trigger it, and it is not a general-purpose web fetch. |
| 80 | + |
| 81 | +Future source adapters should clone this contract: accept a stable identifier, |
| 82 | +retrieve canonical upstream artifacts, hand those artifacts to ingestion, and |
| 83 | +avoid third-party indexers or scraped mirrors. |
| 84 | + |
| 85 | +### Consequences |
| 86 | + |
| 87 | +**Positive:** The source boundary is auditable, reproducible, and easy to test |
| 88 | +against roadmap principles. CPython docs builds can be rebuilt from pinned |
| 89 | +upstream tags, and PyPI package URLs are traceable to package-declared metadata. |
| 90 | +Downstream ingestion, storage, retrieval, budget, serializer, cache, and |
| 91 | +transport layers can rely on source artifacts without knowing source-specific |
| 92 | +network details. |
| 93 | + |
| 94 | +**Negative / risks:** CPython docs builds depend on GitHub availability (for the |
| 95 | +`python/cpython` clone) and on each pinned docs tree building with the configured Sphinx pin. |
| 96 | +PyPI metadata quality depends on what each package declares, so results may be |
| 97 | +missing, stale, or incomplete. The `lookup_package_docs` exception must remain |
| 98 | +narrow; expanding it into page crawling or arbitrary web search would violate |
| 99 | +the contract. |
| 100 | + |
| 101 | +## Layer Contract (principle 2.7) |
| 102 | + |
| 103 | +- **Inputs:** A stable source identifier. For CPython documentation, the input |
| 104 | + is a supported Python `X.Y` version resolved through |
| 105 | + `CPYTHON_DOCS_BUILD_CONFIG`. For PyPI metadata, the input is a package name |
| 106 | + normalized into a PyPI project identifier. |
| 107 | +- **Outputs:** Canonical artifacts handed to ingestion or presentation. CPython |
| 108 | + outputs are `objects.inv` symbol data and Sphinx JSON documentation pages that |
| 109 | + ingestion stores in the local index. PyPI outputs are package-declared project, |
| 110 | + documentation, homepage, source, and repository URLs plus the metadata source |
| 111 | + URL returned by `lookup_package_docs`. |
| 112 | +- **Invariants:** Source adapters use canonical upstreams only; CPython content |
| 113 | + is pinned and reproducible by tag and Sphinx pin; docs queries use local |
| 114 | + indexed artifacts and do not call remote documentation services at query time; |
| 115 | + PyPI metadata lookup is the sole documented network exception; adapters do not |
| 116 | + use scraped mirrors, third-party indexers, generic web search, or silent |
| 117 | + fallback sources. |
| 118 | + |
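One way the "canonical upstreams only, no silent fallback" invariant might be enforced is a guard that every outbound adapter URL passes through. The sketch below is an assumption, not code from the repository: `CANONICAL_HOSTS`, `NonCanonicalSourceError`, and `require_canonical` are hypothetical names, and the host set only reflects the upstreams named in this ADR, so a real adapter may need more entries.

```python
from urllib.parse import urlsplit

# Hypothetical host allowlist derived from the upstreams named in this ADR.
CANONICAL_HOSTS = frozenset({"github.com", "pypi.org"})


class NonCanonicalSourceError(ValueError):
    """Raised instead of falling back to a mirror or third-party indexer."""


def require_canonical(url: str) -> str:
    """Return the URL unchanged if it targets a canonical upstream over HTTPS."""
    parts = urlsplit(url)
    if parts.scheme != "https" or parts.hostname not in CANONICAL_HOSTS:
        raise NonCanonicalSourceError(f"non-canonical source rejected: {url}")
    return url
```

Failing loudly here, rather than trying an alternative host, is what keeps a mirror or indexer from entering the pipeline unnoticed.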
| 119 | +## Links |
| 120 | + |
| 121 | +- STRATEGIC-ROADMAP-2026-05-29.md §2.1, §2.2, §2.7 |
| 122 | +- [`src/mcp_server_python_docs/ingestion/cpython_versions.py`](../../src/mcp_server_python_docs/ingestion/cpython_versions.py) |
| 123 | +- [`src/mcp_server_python_docs/__main__.py`](../../src/mcp_server_python_docs/__main__.py) |
| 124 | +- [`src/mcp_server_python_docs/ingestion/sphinx_json.py`](../../src/mcp_server_python_docs/ingestion/sphinx_json.py) |
| 125 | +- [`src/mcp_server_python_docs/ingestion/inventory.py`](../../src/mcp_server_python_docs/ingestion/inventory.py) |
| 126 | +- [`src/mcp_server_python_docs/services/package_docs.py`](../../src/mcp_server_python_docs/services/package_docs.py) |
| 127 | +- [`README.md`](../../README.md) "Why not Context7 or generic docs retrieval?" |
| 128 | + and "PyPI package docs lookup" |