Commit b3fe28e

feat: expand documentation and architecture details for NL2SQL
- Updated mkdocs.yml to enhance navigation, adding new sections for Indexing and Extensions, and reorganizing existing content for clarity.
- Introduced a new glossary.md file to define core concepts and terminology used throughout the documentation.
- Enhanced getting-started.md with instructions for indexing datasource schemas before query execution.
- Added detailed architecture documents for pipeline, indexing, and various nodes, improving understanding of system components and their interactions.
- Included multiple ADRs to capture architectural decisions related to chunking strategy, schema store design, adapter abstraction, deterministic planning, and artifact storage.
1 parent 275157e commit b3fe28e

47 files changed

+3057
-268
lines changed

docs/adapters/architecture.md

Lines changed: 30 additions & 8 deletions
@@ -1,6 +1,6 @@
11
# Plugin / Adapter Architecture
22

3-
Adapters are discovered via Python entry points (`nl2sql.adapters`) and registered in `DatasourceRegistry`. All adapters implement the `DatasourceAdapterProtocol` and return a standardized `ResultFrame`.
3+
Adapters integrate NL2SQL with external datasources. Each adapter implements a **protocol contract**, is discovered via **Python entry points**, and is registered in the `DatasourceRegistry`.
44

55
## Discovery and registration
66

@@ -13,7 +13,7 @@ flowchart TD
1313
AdapterClass --> AdapterInstance[DatasourceAdapterProtocol instance]
1414
```
1515

16-
## Core contracts
16+
## Core adapter contract
1717

1818
```mermaid
1919
classDiagram
@@ -27,9 +27,7 @@ classDiagram
2727
class AdapterRequest {
2828
+plan_type
2929
+payload
30-
+parameters
3130
+limits
32-
+trace_id
3331
}
3432
class ResultFrame {
3533
+success
@@ -40,13 +38,37 @@ classDiagram
4038
}
4139
```
4240

43-
## Executor integration
41+
## Capability-driven routing
4442

45-
Execution nodes resolve the executor via `ExecutorRegistry`, which maps datasource capabilities to executor implementations (e.g., `SqlExecutorService` for SQL).
43+
Adapters expose capabilities (e.g., `supports_sql`, `supports_schema_introspection`). These capabilities drive:
44+
45+
- **Subgraph selection** (`resolve_subgraph()` in routing).
46+
- **Executor selection** (`ExecutorRegistry.get_executor()`).
47+
48+
```mermaid
49+
flowchart TD
50+
Adapter[DatasourceAdapterProtocol] --> Caps["capabilities()"]
51+
Caps --> Exec[ExecutorRegistry]
52+
Caps --> Subgraph["resolve_subgraph()"]
53+
Exec --> Service[Executor Service]
54+
Subgraph --> Graph[Subgraph Selection]
55+
```
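
The capability-to-executor mapping described above can be sketched as a subset match. This is a minimal illustration, not the project's `ExecutorRegistry`: the class name `SketchExecutorRegistry`, its `register()` method, and the use of plain strings as executor names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set, Tuple


@dataclass
class SketchExecutorRegistry:
    """Hypothetical registry: required capability set -> executor name."""

    _entries: List[Tuple[FrozenSet[str], str]] = field(default_factory=list)

    def register(self, required: Set[str], executor_name: str) -> None:
        # Entries are checked in registration order.
        self._entries.append((frozenset(required), executor_name))

    def get_executor(self, capabilities: Set[str]) -> str:
        # Pick the first executor whose requirements are a subset of what
        # the adapter declares; raise immediately if nothing matches.
        for required, name in self._entries:
            if required <= capabilities:
                return name
        raise LookupError(f"no executor for capabilities {sorted(capabilities)}")
```

An adapter that declares both `supports_sql` and `supports_schema_introspection` would match an executor registered for `{"supports_sql"}`, while an adapter with no matching capability set raises instead of silently falling through.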
56+
57+
## Multi-datasource routing
58+
59+
The control graph can resolve multiple datasources for a single user query. `DecomposerNode` produces sub-queries scoped to individual datasources. Each sub-query is then routed to a subgraph that matches its adapter capabilities.
60+
61+
## Extensibility model
62+
63+
To add a new adapter:
64+
65+
1. Implement `DatasourceAdapterProtocol` (or extend a base adapter).
66+
2. Publish the adapter class as an `nl2sql.adapters` entry point.
67+
3. Configure the datasource in `configs/datasources.yaml`.
4668

4769
## Source references
4870

49-
- Adapter protocol and contracts: `packages/adapter-sdk/src/nl2sql_adapter_sdk/protocols.py`, `packages/adapter-sdk/src/nl2sql_adapter_sdk/contracts.py`
71+
- Adapter protocol: `packages/adapter-sdk/src/nl2sql_adapter_sdk/protocols.py`
5072
- Adapter discovery: `packages/core/src/nl2sql/datasources/discovery.py`
5173
- Datasource registry: `packages/core/src/nl2sql/datasources/registry.py`
52-
- Executor registry: `packages/core/src/nl2sql/execution/executor/registry.py`
74+
- Example adapter: `packages/adapter-sqlalchemy/src/nl2sql_sqlalchemy_adapter/adapter.py`
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
1+
# ADR-003: Schema Chunking Strategy
2+
3+
## Status
4+
5+
Accepted (implemented in `SchemaChunkBuilder` and `VectorStore`).
6+
7+
## Context
8+
9+
Full-schema injection into LLM prompts is brittle and expensive. Retrieval needs to be **semantically structured** so that:
10+
11+
- Datasource routing is reliable.
12+
- Schema grounding is precise.
13+
- Planning context is scoped to relevant tables and columns.
14+
15+
## Decision
16+
17+
Use **typed schema chunks** with staged retrieval:
18+
19+
- `schema.datasource` for datasource routing and grounding.
20+
- `schema.table` for table-level context and primary keys.
21+
- `schema.column` for column semantics and statistics.
22+
- `schema.relationship` for explicit join hints.
23+
24+
Retrieval is staged in `SchemaRetrieverNode`:
25+
26+
1. `retrieve_schema_context()` (tables/metrics)
27+
2. fallback to `retrieve_column_candidates()` if no tables found
28+
3. `retrieve_planning_context()` for columns/relationships of selected tables
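
The three stages above can be sketched as a single function with a fallback. This is an illustration only: the retrieval callables are passed in as parameters, and their signatures (a query string in, a list of table names out, and so on) are assumptions rather than the actual `VectorStore` API.

```python
from typing import Callable, Dict, List


def staged_retrieve(
    query: str,
    retrieve_schema_context: Callable[[str], List[str]],
    retrieve_column_candidates: Callable[[str], List[str]],
    retrieve_planning_context: Callable[[List[str]], Dict[str, List[str]]],
) -> Dict[str, List[str]]:
    """Return planning context keyed by table, or {} if nothing matched."""
    # Stage 1: table-level retrieval.
    tables = retrieve_schema_context(query)
    # Stage 2: fall back to column hits (assumed here to be reported as
    # the names of the tables that own the matching columns).
    if not tables:
        tables = retrieve_column_candidates(query)
    if not tables:
        return {}
    # Stage 3: expand the selected tables into columns and relationships.
    return retrieve_planning_context(tables)
```

Keeping the fallback explicit means the column-level search only runs when the table-level search returns nothing, so the common path stays cheap.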
29+
30+
## Consequences
31+
32+
- Reduces LLM context to schema slices relevant to the query.
33+
- Preserves authoritative schema by resolving final context from `SchemaStore`.
34+
- Enables deterministic and explainable retrieval behavior.
35+
36+
## Source references
37+
38+
- Chunk models: `packages/core/src/nl2sql/indexing/models.py`
39+
- Chunk builder: `packages/core/src/nl2sql/indexing/chunk_builder.py`
40+
- Retrieval: `packages/core/src/nl2sql/indexing/vector_store.py`
41+
- Schema retriever: `packages/core/src/nl2sql/pipeline/nodes/schema_retriever/node.py`
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
# ADR-004: Schema Store Design and Fingerprinting
2+
3+
## Status
4+
5+
Accepted (implemented in `SqliteSchemaStore` and `InMemorySchemaStore`).
6+
7+
## Context
8+
9+
The system needs an authoritative, versioned view of each datasource schema. Vector indexes may drift or be stale, so planning must reference a canonical schema snapshot.
10+
11+
## Decision
12+
13+
Store schema snapshots with **deterministic fingerprints**:
14+
15+
- `SchemaContract` content is hashed to produce a stable fingerprint.
16+
- Snapshots are versioned using timestamp + fingerprint prefix.
17+
- Older versions are evicted beyond a configurable maximum.
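
A minimal sketch of the fingerprinting and versioning scheme, under stated assumptions: the schema is modelled as a plain dictionary, SHA-256 over canonical JSON is one plausible hashing choice (the ADR does not name the algorithm), and the version format and function names are hypothetical.

```python
import hashlib
import json
from datetime import datetime
from typing import Any, Dict, List


def fingerprint(schema: Dict[str, Any]) -> str:
    # Canonical JSON (sorted keys, fixed separators) yields the same bytes
    # for the same content regardless of dictionary insertion order.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def version_id(schema: Dict[str, Any], now: datetime) -> str:
    # Timestamp first, so version ids sort chronologically as strings.
    return f"{now.strftime('%Y%m%dT%H%M%SZ')}_{fingerprint(schema)[:8]}"


def evict_oldest(versions: List[str], max_versions: int) -> List[str]:
    # Keep only the newest max_versions entries.
    if max_versions <= 0:
        return []
    return sorted(versions)[-max_versions:]
```

Because the fingerprint depends only on content, re-indexing an unchanged schema produces the same fingerprint, which is what allows snapshots to be deduplicated. Note that this sketch does not normalize the order of list values, so two schemas that list the same columns in a different order would hash differently.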
18+
19+
Persistent storage is provided by a SQLite-backed schema store, with an in-memory alternative for testing.
20+
21+
## Consequences
22+
23+
- Schema versions are stable and deduplicated.
24+
- Retrieval uses authoritative snapshots even if vector chunks drift.
25+
- The system can enforce version mismatch policies.
26+
27+
## Source references
28+
29+
- Fingerprinting: `packages/core/src/nl2sql/schema/protocol.py`
30+
- SQLite store: `packages/core/src/nl2sql/schema/sqlite_store.py`
31+
- In-memory store: `packages/core/src/nl2sql/schema/in_memory_store.py`
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
1+
# ADR-005: Adapter Abstraction and Capability Routing
2+
3+
## Status
4+
5+
Accepted (implemented via `DatasourceAdapterProtocol`, registries, and routing).
6+
7+
## Context
8+
9+
NL2SQL must support heterogeneous datasources (SQL, REST, GraphQL, etc.) while keeping orchestration stable and deterministic.
10+
11+
## Decision
12+
13+
Adopt a **capability-driven adapter abstraction**:
14+
15+
- Adapters implement `DatasourceAdapterProtocol`.
16+
- Capabilities are declared via `capabilities()`.
17+
- Routing and execution select services/subgraphs based on capability subsets.
18+
19+
Adapters are discovered via Python entry points and registered at runtime based on configuration.
20+
21+
## Consequences
22+
23+
- New datasources can be integrated without changing core orchestration.
24+
- Subgraphs and executors remain decoupled and capability-focused.
25+
- Capability mismatches fail fast with clear errors.
26+
27+
## Source references
28+
29+
- Adapter protocol: `packages/adapter-sdk/src/nl2sql_adapter_sdk/protocols.py`
30+
- Adapter discovery: `packages/core/src/nl2sql/datasources/discovery.py`
31+
- Datasource registry: `packages/core/src/nl2sql/datasources/registry.py`
32+
- Routing: `packages/core/src/nl2sql/pipeline/routes.py`
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
# ADR-006: Deterministic Planning and Stable IDs
2+
3+
## Status
4+
5+
Accepted (implemented in decomposer and DAG models).
6+
7+
## Context
8+
9+
Enterprise workflows require repeatable orchestration to enable reliable caching, debugging, and audit trails. Non-deterministic planning introduces unstable IDs and inconsistent execution paths.
10+
11+
## Decision
12+
13+
Use **stable hashes and deterministic layering**:
14+
15+
- `DecomposerNode` generates stable sub-query and post-op IDs by hashing content.
16+
- `ExecutionDAG._layered_toposort()` deterministically computes execution layers.
17+
- Aggregation processes layers in deterministic order.
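
The two mechanisms above, content-hashed IDs and deterministic layering, can be sketched as follows. This is an illustrative version, not the code in `DecomposerNode` or `ExecutionDAG._layered_toposort()`: the function names, the ID format, and the dependency-map input shape are assumptions.

```python
import hashlib
import json
from typing import Dict, List, Set


def stable_id(prefix: str, content: Dict[str, object]) -> str:
    # Hash canonical JSON so equal content always yields the same id.
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}_{digest[:12]}"


def layered_toposort(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group nodes into layers; deps maps a node to the nodes it depends on."""
    remaining = {node: set(d) for node, d in deps.items()}
    # Nodes that appear only as dependencies have no dependencies themselves.
    for d in deps.values():
        for node in d:
            remaining.setdefault(node, set())
    layers: List[List[str]] = []
    while remaining:
        # Sorting removes any dependence on dict or set iteration order.
        ready = sorted(n for n, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cycle detected in execution DAG")
        layers.append(ready)
        for node in ready:
            del remaining[node]
        for d in remaining.values():
            d.difference_update(ready)
    return layers
```

Nodes within one layer have no dependencies on each other, so they can run in parallel while the layer order itself stays fixed across runs.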
18+
19+
## Consequences
20+
21+
- Artifact keys and execution node IDs are stable across runs.
22+
- Deterministic routing and aggregation simplify debugging and auditing.
23+
- Planning remains reproducible even when subgraphs run in parallel.
24+
25+
## Source references
26+
27+
- Decomposer: `packages/core/src/nl2sql/pipeline/nodes/decomposer/node.py`
28+
- Execution DAG: `packages/core/src/nl2sql/pipeline/nodes/global_planner/schemas.py`
29+
- Aggregation: `packages/core/src/nl2sql/aggregation/aggregator.py`
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
# ADR-007: Artifact Storage for Execution Results
2+
3+
## Status
4+
5+
Accepted (implemented in `ArtifactStore` and executor services).
6+
7+
## Context
8+
9+
Query execution results need to be persisted for aggregation and downstream usage. Holding raw results in memory would be expensive and non-durable for multi-step DAGs.
10+
11+
## Decision
12+
13+
Persist execution results as Parquet artifacts:
14+
15+
- Adapters return `ResultFrame` objects.
16+
- `SqlExecutorService` writes results to an `ArtifactStore`.
17+
- Aggregation reads artifacts and applies combine/post operations.
18+
19+
Backends are pluggable (`local`, `s3`, `adls`).
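
The pluggable-backend idea can be sketched with a small interface and a backend lookup table. Everything here is hypothetical: `ArtifactStoreLike`, `LocalStoreSketch`, `BACKENDS`, and `make_store` are illustrative names, the `put`/`get` methods are an assumed interface, and the store handles opaque bytes so that Parquet serialization stays out of scope. Only the `local` backend name comes from the ADR.

```python
from pathlib import Path
from typing import Callable, Dict, Protocol


class ArtifactStoreLike(Protocol):
    """Assumed minimal interface; the real ArtifactStore may differ."""

    def put(self, key: str, data: bytes) -> str: ...

    def get(self, uri: str) -> bytes: ...


class LocalStoreSketch:
    """Writes artifacts as files under a root directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, data: bytes) -> str:
        path = self.root / key
        path.write_bytes(data)
        return str(path)  # the returned URI is what later stages read from

    def get(self, uri: str) -> bytes:
        return Path(uri).read_bytes()


# Backend name -> factory; "s3" and "adls" would register here as well.
BACKENDS: Dict[str, Callable[[str], ArtifactStoreLike]] = {"local": LocalStoreSketch}


def make_store(backend: str, root: str) -> ArtifactStoreLike:
    factory = BACKENDS.get(backend)
    if factory is None:
        raise ValueError(f"unknown artifact backend: {backend}")
    return factory(root)
```

Because callers depend only on the interface, swapping `local` for another backend is a configuration change rather than a change to executor logic.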
20+
21+
## Consequences
22+
23+
- Results are durable across pipeline stages.
24+
- Aggregation operates on persisted artifacts, reducing memory pressure.
25+
- Backends can be swapped without changing executor logic.
26+
27+
## Source references
28+
29+
- Artifact store base: `packages/core/src/nl2sql/execution/artifacts/base.py`
30+
- Local store: `packages/core/src/nl2sql/execution/artifacts/local_store.py`
31+
- Executor service: `packages/core/src/nl2sql/execution/executor/sql_executor.py`

docs/adr/index.md

Lines changed: 5 additions & 0 deletions
@@ -4,3 +4,8 @@ This section captures architectural decisions that are reflected in the current
44

55
- `adr-001-sandboxed-execution.md`
66
- `adr-002-circuit-breakers.md`
7+
- `adr-003-chunking-strategy.md`
8+
- `adr-004-schema-store-design.md`
9+
- `adr-005-adapter-abstraction.md`
10+
- `adr-006-deterministic-planning.md`
11+
- `adr-007-artifact-storage.md`

docs/agents/architecture.md

Lines changed: 0 additions & 38 deletions
This file was deleted.

docs/agents/nodes.md

Lines changed: 21 additions & 0 deletions
@@ -153,3 +153,24 @@ flowchart LR
153153
- Adds a `PLAN_FEEDBACK` warning to drive retry logic.
154154
- **Errors**: `MISSING_LLM`, `REFINER_FAILED`
155155
- **Source**: `packages/core/src/nl2sql/pipeline/nodes/refiner/node.py`
156+
157+
## Shared state model (subgraph)
158+
159+
`SubgraphExecutionState` carries:
160+
161+
- `sub_query`, `user_context`, `relevant_tables`
162+
- `ast_planner_response`, `logical_validator_response`, `generator_response`, `executor_response`
163+
- `retry_count`, `errors`, `reasoning`, `warnings`
164+
165+
The subgraph state is merged back into `GraphState` via `wrap_subgraph()`, which extracts artifacts and diagnostics into `SubgraphOutput`.
166+
167+
## Retry and failure semantics
168+
169+
- Planner/validator failures trigger the `retry_handler` path if errors are retryable and `retry_count < sql_agent_max_retries`.
170+
- Critical failures (e.g., RBAC violations, missing plan) are non-retryable.
171+
- Physical validation exists but is currently not wired in the default subgraph.
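
The retry rule above can be expressed as a small predicate. This is a sketch under assumptions: `ErrorSketch` and `should_retry` are hypothetical names, and treating a single non-retryable error as blocking the whole retry is one reading of the rule, not confirmed behavior. The `max_retries` argument stands in for `sql_agent_max_retries`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ErrorSketch:
    """Hypothetical error record; the real error model may carry more fields."""

    code: str
    retryable: bool


def should_retry(errors: List[ErrorSketch], retry_count: int, max_retries: int) -> bool:
    # Nothing failed, so there is nothing to retry.
    if not errors:
        return False
    # The retry budget is exhausted.
    if retry_count >= max_retries:
        return False
    # Assumption: any non-retryable (critical) error blocks the retry path.
    return all(e.retryable for e in errors)
```

For example, a retryable planner failure on the first attempt returns `True`, while the same failure alongside an RBAC violation returns `False`.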
172+
173+
## Deterministic behavior notes
174+
175+
- Sub-query IDs are deterministic hashes, ensuring stable artifact keys.
176+
- `ExecutionDAG` layers are deterministic, so subgraph invocation order is stable.
