Commit eb63802

docs: Add detailed documentation for NL2SQL pipeline nodes and update mkdocs configuration to include them.
1 parent 5a1a7c5 commit eb63802

19 files changed

Lines changed: 222 additions & 437 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -20,3 +20,5 @@ data/*.db
2020
chroma_db
2121

2222
logs
23+
24+
site

docs/nodes/aggregator_node.md

Lines changed: 18 additions & 24 deletions
@@ -2,46 +2,40 @@
22

33
## Purpose
44

5-
The `AggregatorNode` is responsible for consolidating results from multiple parallel execution branches (triggered by the `DecomposerNode`) into a single, coherent response. It handles data merging, de-duplication, and formatting (e.g., combining two partial tables into one).
5+
The `AggregatorNode` combines results from the execution phase and prepares the final response. It implements a "Fast Path" for direct data streaming and a "Slow Path" for LLM-based summarization or answer synthesis.
66

7-
## Components
7+
## Class Reference
88

9-
- **`LLM`**: Used to synthesize the final answer and decide the best presentation format.
10-
- **`AggregatedResponse`**: Structured output schema.
9+
- **Class**: `AggregatorNode`
10+
- **Path**: `packages/core/src/nl2sql/pipeline/nodes/aggregator/node.py`
1111

1212
## Inputs
1313

1414
The node reads the following fields from `GraphState`:
1515

16-
- `state.intermediate_results`: A list of results collected from all parallel branches (each containing execution data or errors).
17-
- `state.user_query`: The original global query.
18-
- `state.errors`: List of errors encountered in the branches (to report partial failures).
16+
- `state.user_query` (str): The user's question.
17+
- `state.intermediate_results` (List): Results from the executor(s).
18+
- `state.output_mode` (str): "data" (Fast Path) or "summary"/"verbose" (Slow Path).
19+
- `state.errors` (List[PipelineError]): Any errors to include in the summary.
1920

2021
## Outputs
2122

2223
The node updates the following fields in `GraphState`:
2324

24-
- `state.final_answer`: A markdown-formatted string containing the summary and combined data.
25-
- `state.reasoning`: Log entry describing the chosen format.
26-
- `state.errors`: Appends `PipelineError` if aggregation fails.
25+
- `state.final_answer` (Any): The final text entry or data payload.
26+
- `state.reasoning` (List[Dict]): Log of which path was taken.
2727

2828
## Logic Flow
2929

3030
1. **Fast Path Check**:
31-
- Checks if `state.response_type` is `TABULAR` or `KPI`.
32-
- If true, and there is a single successful result, returns `final_answer=None`.
33-
- This signals the Presentation Layer (CLI) to display the raw `ExecutionModel` directly.
34-
2. **Slow Path (LLM)**:
35-
- If `state.response_type` is `SUMMARY` or multiple results exist.
36-
- Formats all `intermediate_results` into a single text block.
37-
- Invokes the LLM to synthesize an answer (`AggregatedResponse`).
38-
3. **Formatting**:
39-
- Constructs a markdown string combining the summary and content.
31+
- If there is exactly one result, no errors, and `output_mode` is "data":
32+
- Returns the raw data directly.
33+
2. **Slow Path (LLM Aggregation)**:
34+
- Formats all `intermediate_results` (and errors) into a string.
35+
- Prompts the LLM to synthesize an answer to the `user_query` using the provided data.
36+
- Formats the LLM output (Table/List/Text).
37+
- Returns the generated summary.
4038

4139
## Error Handling
4240

43-
- **`AGGREGATOR_FAILED`**: If the LLM output is malformed or processing fails.
44-
45-
## Dependencies
46-
47-
- `nl2sql.nodes.aggregator.schemas.AggregatedResponse`
41+
- **`AGGREGATOR_FAILED`**: If the LLM summarization fails.
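
The Fast Path / Slow Path split described in the updated `aggregator_node.md` can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's implementation: the `GraphState` dataclass, the `aggregate` function, the shape of the `reasoning` entries, and the `fake_llm` placeholder are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class GraphState:
    """Hypothetical stand-in for the pipeline's GraphState."""
    user_query: str
    intermediate_results: List[Any] = field(default_factory=list)
    output_mode: str = "data"
    errors: List[str] = field(default_factory=list)


def aggregate(state: GraphState, summarize: Callable[[str], str]) -> Dict[str, Any]:
    """Return state updates: raw data on the fast path, a summary otherwise."""
    fast_path = (
        len(state.intermediate_results) == 1
        and not state.errors
        and state.output_mode == "data"
    )
    if fast_path:
        # Fast path: hand the single result back untouched.
        return {
            "final_answer": state.intermediate_results[0],
            "reasoning": [{"node": "aggregator", "content": "fast path"}],
        }

    # Slow path: flatten results and errors into one prompt string.
    lines = [f"Question: {state.user_query}"]
    lines += [f"Result {i}: {r}" for i, r in enumerate(state.intermediate_results, 1)]
    lines += [f"Error: {e}" for e in state.errors]
    return {
        "final_answer": summarize("\n".join(lines)),
        "reasoning": [{"node": "aggregator", "content": "slow path"}],
    }


def fake_llm(prompt: str) -> str:
    """Placeholder for the real LLM call (assumption); reports the prompt size."""
    return f"summary of {len(prompt.splitlines())} lines"


fast = aggregate(GraphState("How many users?", [{"rows": [{"n": 3}]}]), fake_llm)
slow = aggregate(GraphState("How many users?", [{"n": 3}, {"n": 4}], "summary"), fake_llm)
```

Passing the summarizer in as a callable keeps the path decision testable without an LLM; the real node presumably holds its LLM client internally.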

docs/nodes/decomposer_node.md

Lines changed: 24 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -2,47 +2,45 @@
22

33
## Purpose
44

5-
The `DecomposerNode` acts as the **Router** and **Orchestrator** of the pipeline. It parses the canonicalized user query and breaks it down into independent sub-queries, each targeted at a specific datasource. This is crucial for handling multi-datasource requests or complex analytical questions.
5+
The `DecomposerNode` acts as the entry point and router for the pipeline. It is responsible for analyzing the user's query to determine which datasource(s) should handle the request. For complex requests, it can break the query down into sub-queries (though simple routing is the primary function). It also checks user authorization before proceeding.
66

7-
## Components
7+
## Class Reference
88

9-
- **`LLM`**: Used to perform the decomposition and reasoning.
10-
- **`OrchestratorVectorStore`**: Provides relevant schema context for the LLM to make informed routing decisions.
11-
- **`DatasourceRegistry`**: Provides metadata (descriptions) about available data sources.
9+
- **Class**: `DecomposerNode`
10+
- **Path**: `packages/core/src/nl2sql/pipeline/nodes/decomposer/node.py`
1211

1312
## Inputs
1413

1514
The node reads the following fields from `GraphState`:
1615

17-
- `state.semantic_analysis`: The **enriched** query context containing canonical query and synonyms (from SemanticAnalysisNode).
18-
- `state.selected_datasource_id`: (Optional) If set, the node acts in "Pass-through" mode.
16+
- `state.user_query` (str): The initial user question.
17+
- `state.user_context` (Dict): User session data, specifically `allowed_datasources` for authorization.
18+
- `state.semantic_analysis` (SemanticAnalysisResponse): Used to expand the query with keywords/synonyms for better vector retrieval.
1919

2020
## Outputs
2121

2222
The node updates the following fields in `GraphState`:
2323

24-
- `state.sub_queries`: A list of `SubQuery` objects, each containing:
25-
- `datasource_id`: Target database.
26-
- `query`: The specific question for that database.
27-
- `candidate_tables`: (Optional) Pre-identified tables.
28-
- `state.reasoning`: Log entry explaining the decomposition logic.
29-
- `state.errors`: Appends `PipelineError` if orchestration fails.
24+
- `state.sub_queries` (List[SubQuery]): A list of routed queries. Each `SubQuery` contains:
25+
- `question`: The specific question for the datasource.
26+
- `datasource_id`: The ID of the chosen datasource.
27+
- `state.confidence` (float): The confidence score of the routing decision.
28+
- `state.reasoning` (List[Dict]): Explanation of why a specific datasource was selected.
29+
- `state.errors` (List[PipelineError]): `SECURITY_VIOLATION` if the user lacks access.
3030

3131
## Logic Flow
3232

33-
1. **Direct Execution Check**:
34-
- If `state.selected_datasource_id` is already present, it creates a single `SubQuery` targeting that datasource.
35-
2. **Context Retrieval**:
36-
- Uses `state.user_query` + `state.enriched_terms` to query the `VectorStore`.
37-
3. **LLM Decomposition**:
38-
- Prompts the LLM with the query, available datasources, and retrieved schema context.
39-
- The LLM generates a plan (`DecomposerResponse`) consisting of one or more sub-queries.
40-
4. **State Update**: The resulting `sub_queries` are stored in the state, which triggers parallel execution branches.
33+
1. **Authorization Check**: Verifies that `state.user_context` lists at least one allowed datasource. If not, returns `SECURITY_VIOLATION`.
34+
2. **Query Expansion**: If `state.semantic_analysis` is present, it augments the query with keywords and synonyms to improve retrieval recall.
35+
3. **Context Retrieval**:
36+
- Queries the `OrchestratorVectorStore` using the expanded query.
37+
- Retrieves relevant table schemas and datasource descriptions.
38+
4. **LLM Routing**:
39+
- Uses the LLM to analyze the retrieved context and the user query.
40+
- Decides which datasource is best suited to answer the question.
41+
5. **Output Generation**: Returns the routing decision (datasource selection) and confidence score.
4142

4243
## Error Handling
4344

44-
- **`ORCHESTRATOR_CRASH`**: Critical failure in the decomposition process (e.g., LLM error, context retrieval failure).
45-
46-
## Dependencies
47-
48-
- `nl2sql.nodes.decomposer.schemas.DecomposerResponse`
45+
- **`SECURITY_VIOLATION`**: Critical error if the user has no allowed datasources.
46+
- **Retrieval Warnings**: Logs warnings if no relevant documents are found in the vector store.
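
The first two steps of the updated `decomposer_node.md` logic flow (authorization check, then query expansion) might look roughly like this. It is a sketch only: the function names, the error dictionary shape, and the de-duplication rule are assumptions made for illustration; the `allowed_datasources` key is the only detail taken from the documentation above.

```python
from typing import Dict, List, Optional


def check_authorization(user_context: Dict) -> Optional[Dict]:
    """Return a SECURITY_VIOLATION error if no datasource is allowed, else None."""
    if not user_context.get("allowed_datasources"):
        # Error shape is hypothetical; the real PipelineError may differ.
        return {"error_code": "SECURITY_VIOLATION", "severity": "critical"}
    return None


def expand_query(user_query: str, keywords: List[str], synonyms: List[str]) -> str:
    """Append keywords and synonyms not already in the query, for retrieval."""
    lowered = user_query.lower()
    seen = set()
    extras = []
    for term in keywords + synonyms:
        key = term.lower()
        if key not in seen and key not in lowered:
            seen.add(key)
            extras.append(term)
    return " ".join([user_query] + extras)


denied = check_authorization({"allowed_datasources": []})
allowed = check_authorization({"allowed_datasources": ["sales_db"]})
expanded = expand_query(
    "total revenue by region",
    keywords=["revenue", "region"],
    synonyms=["sales", "territory"],
)
```

The expanded string would then be used as the search text for the vector store, so that synonyms the user did not type can still match schema descriptions.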

docs/nodes/direct_sql_node.md

Lines changed: 0 additions & 40 deletions
This file was deleted.

docs/nodes/executor_node.md

Lines changed: 21 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -2,55 +2,43 @@
22

33
## Purpose
44

5-
The `ExecutorNode` is responsible for waiting for a SQL query (draft) and executing it against the actual database engine. It acts as the final "Effector" in the pipeline. It strictly enforces security protocols to prevent mutation or data loss.
5+
The `ExecutorNode` is responsible for executing the generated SQL query against the target datasource. It handles connection management via the `DatasourceRegistry` adapters, safeguards against massive result sets, and formats the output.
66

7-
## Components
7+
## Class Reference
88

9-
- **`DatasourceRegistry`**: To obtain the database engine/connection.
10-
- **`enforce_read_only`**: Security utility to scan for forbidden SQL keywords (INSERT, UPDATE, DROP, etc.).
11-
- **`engine_factory.run_read_query`**: Helper to execute the query.
9+
- **Class**: `ExecutorNode`
10+
- **Path**: `packages/core/src/nl2sql/pipeline/nodes/executor/node.py`
1211

1312
## Inputs
1413

1514
The node reads the following fields from `GraphState`:
1615

17-
- `state.sql_draft`: The SQL query string to execute.
18-
- `state.datasource_id`: ID of the target datasource.
16+
- `state.sql_draft` (str): The SQL query to execute.
17+
- `state.selected_datasource_id` (str): The target database ID.
1918

2019
## Outputs
2120

2221
The node updates the following fields in `GraphState`:
2322

24-
- `state.execution`: A structured `ExecutionModel` containing:
25-
- `row_count`: Number of rows returned.
26-
- `rows`: List of dictionaries representing the result set.
27-
- `columns`: List of column names.
28-
- `error`: String description of any database error.
29-
- `state.reasoning`: Log entry summarizing the execution stats.
30-
- `state.errors`: Appends `PipelineError` if security check fails or DB throws an error.
23+
- `state.execution` (`ExecutionModel`): The result of the query.
24+
- `columns` (List[str]): Column names.
25+
- `rows` (List[Dict]): The data returned.
26+
- `row_count` (int): Number of rows.
27+
- `state.errors` (List[PipelineError]): Errors during execution.
3128

3229
## Logic Flow
3330

34-
1. **Validation**: Checks if `sql_draft` and `datasource_id` are present.
35-
2. **Datasource Resolution**: Identifies the primary datasource if a list was provided.
36-
3. **Security Check**:
37-
- Detects the dialect based on the profile.
38-
- Calls `enforce_read_only` to validate the SQL.
39-
- If violation is found, returns `SECURITY_VIOLATION` critical error.
31+
1. **Validation**: Ensures `sql_draft` and `selected_datasource_id` are present.
32+
2. **Adapter Retrieval**: Fetches the correct adapter (e.g., PostgresAdapter) from the registry.
33+
3. **Cost Estimation (Safeguard)**:
34+
- If supported by the adapter, estimates the query cost.
35+
- If the estimated row count exceeds `SAFEGUARD_ROW_LIMIT` (10,000), aborts execution and raises `SAFEGUARD_VIOLATION`.
4036
4. **Execution**:
41-
- Uses SQLAlchemy engine to run the query.
42-
- Fetches all results and maps them to a list of dictionaries.
43-
- Captures metadata (column names).
44-
5. **Result Packaging**: Wraps results or exceptions into the `ExecutionModel`.
37+
- Runs `adapter.execute(sql)`.
38+
- Captures the result set.
39+
5. **Formatting**: Converts the results into the standard `ExecutionModel`.
4540

4641
## Error Handling
4742

48-
- **`MISSING_SQL`**: If generator failed to produce output.
49-
- **`SECURITY_VIOLATION`**: If DML/DDL keywords are detected.
50-
- **`DB_EXECUTION_ERROR`**: Runtime errors from the database (e.g., syntax error, invalid table).
51-
- **`EXECUTOR_CRASH`**: Unhandled python exceptions.
52-
53-
## Dependencies
54-
55-
- `nl2sql.security`
56-
- `nl2sql.engine_factory`
43+
- **`SAFEGUARD_VIOLATION`**: If the query is predicted to return too many rows.
44+
- **`DB_EXECUTION_ERROR`**: If the database raises an exception (e.g., timeout, syntax error not caught by validator).
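
The cost-estimation safeguard in the updated `executor_node.md` can be illustrated with a small sketch. The `FakeAdapter` class, its `estimate_rows` method, and the dictionary shapes are hypothetical; only `SAFEGUARD_ROW_LIMIT` (10,000), `adapter.execute(sql)`, the two error codes, and the `columns`/`rows`/`row_count` fields come from the documentation above.

```python
from typing import Any, Dict, List, Optional

SAFEGUARD_ROW_LIMIT = 10_000  # limit stated in the executor documentation


class FakeAdapter:
    """Hypothetical adapter; the real adapter interface may differ."""

    def __init__(self, estimated_rows: Optional[int], rows: List[Dict[str, Any]]):
        self._estimated_rows = estimated_rows
        self._rows = rows

    def estimate_rows(self, sql: str) -> Optional[int]:
        # None models an adapter that does not support cost estimation.
        return self._estimated_rows

    def execute(self, sql: str) -> List[Dict[str, Any]]:
        return self._rows


def run_query(adapter: FakeAdapter, sql: str) -> Dict[str, Any]:
    """Abort on oversized estimates, otherwise execute and package the rows."""
    estimate = adapter.estimate_rows(sql)
    if estimate is not None and estimate > SAFEGUARD_ROW_LIMIT:
        return {"errors": [{"error_code": "SAFEGUARD_VIOLATION",
                            "estimated_rows": estimate}]}
    try:
        rows = adapter.execute(sql)
    except Exception as exc:  # database-level failure
        return {"errors": [{"error_code": "DB_EXECUTION_ERROR",
                            "message": str(exc)}]}
    columns = list(rows[0].keys()) if rows else []
    return {"execution": {"columns": columns, "rows": rows,
                          "row_count": len(rows)}}


ok = run_query(FakeAdapter(2, [{"id": 1}, {"id": 2}]), "SELECT id FROM users")
blocked = run_query(FakeAdapter(50_000, []), "SELECT * FROM events")
unsupported = run_query(FakeAdapter(None, [{"id": 1}]), "SELECT id FROM users")
```

Treating a missing estimate as "proceed" mirrors the "if supported by the adapter" wording; a stricter deployment could instead refuse to run unestimated queries.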

docs/nodes/generator_node.md

Lines changed: 21 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -2,52 +2,42 @@
22

33
## Purpose
44

5-
The `GeneratorNode` is responsible for converting the abstract query plan (generated by the `PlannerNode`) into a concrete, syntactically correct SQL query. It uses `sqlglot` to handle dialect differences (e.g., PostgreSQL vs T-SQL) and enforces system-wide guardrails like row limits.
5+
The `GeneratorNode` is the compiler of the pipeline. It takes the abstract execution plan (`PlanModel`) produced by the Planner and generates a valid, dialect-specific SQL string. It uses `sqlglot` to transpile the internal AST into the target SQL dialect (e.g., PostgreSQL, T-SQL, MySQL), enforcing syntactic correctness.
66

7-
## Components
7+
## Class Reference
88

9-
- **`sqlglot`**: A powerful SQL parser and transpiler library used to construct the query AST programmatically.
10-
- **`DatasourceRegistry`**: Used to determine the profile and specific SQL dialect of the target database.
9+
- **Class**: `GeneratorNode`
10+
- **Path**: `packages/core/src/nl2sql/pipeline/nodes/generator/node.py`
1111

1212
## Inputs
1313

1414
The node reads the following fields from `GraphState`:
1515

16-
- `state.plan`: The structured dictionary representing the logical query plan (SELECT, FROM, JOINs, WHERE, etc.).
17-
- `state.datasource_id`: ID of the target datasource (used to resolve dialect).
16+
- `state.plan` (`PlanModel`): The logical plan to compile.
17+
- `state.selected_datasource_id` (str): The ID of the target database, used to determine the SQL dialect.
1818

1919
## Outputs
2020

2121
The node updates the following fields in `GraphState`:
2222

23-
- `state.sql_draft`: The generated SQL query string.
24-
- `state.reasoning`: Log entry showing the generated SQL and rationale.
25-
- `state.errors`: Appends `PipelineError` if generation fails.
23+
- `state.sql_draft` (str): The generated SQL query string.
24+
- `state.reasoning` (List[Dict]): Logs the generated SQL.
25+
- `state.errors` (List[PipelineError]): `SQL_GEN_FAILED` if compilation errors occur.
2626

2727
## Logic Flow
2828

29-
1. **Preparation**:
30-
- Validates presence of `datasource_id` and `plan`.
31-
- Retrieves the correct dialect capability (e.g., "postgres", "tsql") from the registry.
32-
- Determines the row limit (minimum of system limit or plan limit).
33-
34-
2. **Visitor Compilation**:
35-
- Instantiates a `SqlVisitor` class to traverse the Recursive AST.
36-
- **Recursion**: Calls `visit(expr)` for every node in the tree.
37-
- **Dispatch**: Routes checks to `visit_binary`, `visit_literal`, `visit_func`, etc.
38-
- **Strict Ordering**: Sorts all lists (`tables`, `select_items`) by `ordinal` before visiting.
39-
- **Compilation**: Returns pure `sqlglot` expression objects (no string parsing).
40-
41-
3. **Transpilation**:
42-
- Calls `query.sql(dialect=target_dialect)` to generate the final string matching the target database's syntax.
29+
1. **Validation**: Checks if a plan and a datasource ID are present in the state.
30+
2. **Profile Lookup**: Fetches the `dialect` (e.g., "postgres", "tsql") and default `row_limit` from the datasource registry.
31+
3. **AST Transformation (`SqlVisitor`)**:
32+
- The node uses a `SqlVisitor` class to traverse the `PlanModel` (Expr tree).
33+
- It builds a corresponding `sqlglot` Expression tree.
34+
- This visitor handles literals, columns, functions, binary/unary operations, and case statements.
35+
4. **SQL Synthesis**:
36+
- Constructs the top-level `SELECT` statement using `sqlglot` builders.
37+
- Applies transformations for `SELECT`, `FROM` (Tables), `JOIN`, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, and `LIMIT`.
38+
- Handles dialect-specific nuances (e.g., quoting identifiers, function names) through `sqlglot`'s dialect-aware SQL generation, invoked via `.sql(dialect=...)`.
39+
5. **Output**: Returns the final SQL string.
4340

4441
## Error Handling
4542

46-
- **`MISSING_DATASOURCE_ID`**: If router failed to set a datasource.
47-
- **`MISSING_PLAN`**: If planner failed to produce a plan.
48-
- **`SQL_GEN_FAILED`**: If the plan contains invalid structure or references that `sqlglot` cannot parse.
49-
50-
## Dependencies
51-
52-
- `sqlglot` library
53-
- `nl2sql.capabilities`
43+
- **`SQL_GEN_FAILED`**: Raised if the visitor encounters unknown expression types or if `sqlglot` fails to generate the string.
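
The visitor-style traversal described in the updated `generator_node.md` can be illustrated with a standard-library stand-in. Note the substitution: the documented `SqlVisitor` builds `sqlglot` expression objects, whereas this sketch emits strings directly so that it runs without third-party packages. The node classes (`Column`, `Literal`, `Binary`), the `QUOTES` table, and the use of `ValueError` to signal `SQL_GEN_FAILED` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Column:
    name: str


@dataclass
class Literal:
    value: Union[int, float, str]


@dataclass
class Binary:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Column, Literal, Binary]

# Identifier quoting per dialect (a tiny subset of what sqlglot handles).
QUOTES = {"postgres": ('"', '"'), "tsql": ("[", "]"), "mysql": ("`", "`")}


class SqlVisitor:
    """Hypothetical string-emitting visitor; not the project's class."""

    def __init__(self, dialect: str):
        if dialect not in QUOTES:
            raise ValueError(f"SQL_GEN_FAILED: unknown dialect {dialect}")
        self.open, self.close = QUOTES[dialect]

    def visit(self, expr) -> str:
        # Dispatch on the node's class name, e.g. Binary -> visit_binary.
        method = getattr(self, f"visit_{type(expr).__name__.lower()}", None)
        if method is None:
            raise ValueError(
                f"SQL_GEN_FAILED: unknown expression type {type(expr).__name__}"
            )
        return method(expr)

    def visit_column(self, expr: Column) -> str:
        return f"{self.open}{expr.name}{self.close}"

    def visit_literal(self, expr: Literal) -> str:
        if isinstance(expr.value, str):
            # Escape single quotes by doubling them.
            return "'" + expr.value.replace("'", "''") + "'"
        return str(expr.value)

    def visit_binary(self, expr: Binary) -> str:
        return f"({self.visit(expr.left)} {expr.op} {self.visit(expr.right)})"


predicate = Binary(">", Column("amount"), Literal(100))
pg_sql = SqlVisitor("postgres").visit(predicate)
tsql_sql = SqlVisitor("tsql").visit(predicate)
```

Building expression objects rather than strings, as the documented node does, leaves quoting and function-name differences to `sqlglot` instead of a hand-maintained table like `QUOTES`.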
