# Contributing to @questdb/sql-parser

## Setup

```bash
yarn              # Install dependencies
yarn build        # Compile TypeScript (tsup + tsc)
yarn test         # Run all tests (6,100+ tests)
yarn test:watch   # Run tests in watch mode
yarn typecheck    # Type-check without emitting
yarn lint         # Run ESLint
yarn lint:fix     # Auto-fix lint issues
yarn generate:cst # Regenerate CST type definitions from parser grammar
yarn clean        # Remove dist/ and coverage/
```

## Pipeline Overview

Every SQL string flows through this pipeline:

```
SQL String ──> Lexer (tokens.ts/lexer.ts) ──> Token[]
                                                │
Token[] ──────> Parser (parser.ts) ───────> CST (Concrete Syntax Tree)
                                                │
CST ──────────> Visitor (visitor.ts) ──────> AST (typed, clean)
                                                │
AST ──────────> toSql (toSql.ts) ──────────> SQL String (round-trip)
```

The **CST** is Chevrotain's lossless tree that preserves every token. The **visitor** transforms it into a clean, typed **AST** that is easy to work with. `toSql()` converts any AST node back to valid SQL.
| 32 | + |
| 33 | +For **autocomplete**, the flow is: |
| 34 | + |
| 35 | +``` |
| 36 | +SQL + cursor offset ──> content-assist.ts ──> parser.computeContentAssist() |
| 37 | + │ |
| 38 | + nextTokenTypes + tablesInScope + cteColumns |
| 39 | + │ |
| 40 | + suggestion-builder.ts ──> Suggestion[] (filtered, prioritized) |
| 41 | +``` |
| 42 | + |
| 43 | +## How Tokens Work |
| 44 | + |
| 45 | +Grammar arrays (`src/grammar/keywords.ts`, `dataTypes.ts`, `constants.ts`) are the source of truth. `src/parser/tokens.ts` auto-generates Chevrotain tokens from them: |
| 46 | + |
| 47 | +1. Each keyword string is converted to a PascalCase token name (`"select"` → `Select`, `"data_page_size"` → `DataPageSize`) |
| 48 | +2. Each token gets a case-insensitive regex pattern with word boundary (e.g., `/select\b/i`) |
| 49 | +3. Non-reserved keywords are assigned to the `IdentifierKeyword` category, which lets the parser accept them as table/column names via a single `CONSUME(IdentifierKeyword)` rule |
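
The name and pattern derivation in steps 1–2 can be sketched as follows (an illustrative re-implementation; the real logic lives in `src/parser/tokens.ts` and may differ in detail):

```typescript
// Illustrative sketch of the keyword-to-token derivation described above.
function toPascalCase(keyword: string): string {
  return keyword
    .split("_")
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join("")
}

function toTokenPattern(keyword: string): RegExp {
  // Case-insensitive match with a trailing word boundary, e.g. /select\b/i.
  return new RegExp(`${keyword}\\b`, "i")
}

toPascalCase("select")          // "Select"
toPascalCase("data_page_size")  // "DataPageSize"
toTokenPattern("select").test("SELECT * FROM t") // true
```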

The `IDENTIFIER_KEYWORD_NAMES` set in `tokens.ts` controls which keywords are non-reserved. Reserved keywords (SELECT, FROM, WHERE, JOIN, etc.) are **not** in this set and cannot be used as unquoted identifiers.

## Workflow: Adding a New Keyword

Example: adding a hypothetical `RETENTION` keyword.

**1. Add to grammar** — `src/grammar/keywords.ts`:
```typescript
export const keywords: string[] = [
  // ...existing keywords in alphabetical order...
  "retention",
  // ...
]
```
This auto-generates a `Retention` token in `tokens.ts`.

**2. If non-reserved, mark it** — `src/parser/tokens.ts`:

If the keyword can be used as an identifier (table/column name), add it to `IDENTIFIER_KEYWORD_NAMES`:
```typescript
export const IDENTIFIER_KEYWORD_NAMES = new Set([
  // ...
  "Retention",
])
```

Skip this step if the keyword is reserved (i.e., allowing it as an unquoted identifier would make the grammar ambiguous).

**3. Use in parser grammar** — `src/parser/parser.ts`:

Reference the token in a grammar rule:
```typescript
private retentionClause = this.RULE("retentionClause", () => {
  this.CONSUME(Retention)
  this.CONSUME(NumberLiteral)
  this.SUBRULE(this.partitionPeriod) // DAY, MONTH, etc.
})
```

Make sure to import the token from `lexer.ts` at the top of `parser.ts`. The token is available by its PascalCase name.

**4. Regenerate CST types**:
```bash
yarn generate:cst
```
This reads the parser's grammar rules and regenerates `src/parser/cst-types.d.ts`. The new rule's CST children type will appear automatically (e.g., `RetentionClauseCstChildren`).
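
For a rule like `retentionClause`, the generated children type would look roughly like the sketch below. The stand-in `IToken`/`CstNode` types are simplified versions of Chevrotain's, included only to keep the sketch self-contained; the exact generated output may differ (e.g., optional vs. required arrays):

```typescript
// Simplified stand-ins for chevrotain's IToken and CstNode.
type IToken = { image: string }
type CstNode = { name: string; children: Record<string, unknown> }

// Roughly what `yarn generate:cst` would emit for the retentionClause rule.
interface RetentionClauseCstChildren {
  Retention?: IToken[]
  NumberLiteral?: IToken[]
  partitionPeriod?: CstNode[]
}

// Example of a CST children object for "RETENTION 30 DAY":
const children: RetentionClauseCstChildren = {
  Retention: [{ image: "RETENTION" }],
  NumberLiteral: [{ image: "30" }],
  partitionPeriod: [{ name: "partitionPeriod", children: {} }],
}
```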

**5. Add visitor method** — `src/parser/visitor.ts`:

Import the new CST type from `cst-types.d.ts`, then add a visitor method:
```typescript
retentionClause(ctx: RetentionClauseCstChildren): AST.RetentionClause {
  return {
    type: "retentionClause",
    value: parseInt(ctx.NumberLiteral[0].image, 10),
    unit: this.visit(ctx.partitionPeriod[0]),
  }
}
```

**6. Add AST type** — `src/parser/ast.ts`:
```typescript
export interface RetentionClause extends AstNode {
  type: "retentionClause"
  value: number
  unit: string
}
```

**7. Add toSql serialization** — `src/parser/toSql.ts`:
```typescript
function retentionClauseToSql(clause: AST.RetentionClause): string {
  return `RETENTION ${clause.value} ${clause.unit}`
}
```
Wire it into the parent statement's toSql function.

**8. Add tests** — `tests/parser.test.ts`:
```typescript
it("should parse RETENTION clause", () => {
  const result = parseToAst("CREATE TABLE t (x INT) RETENTION 30 DAY")
  expect(result.errors).toHaveLength(0)
  // assert AST structure...
})

it("should round-trip RETENTION clause", () => {
  const sql = "CREATE TABLE t (x INT) RETENTION 30 DAY"
  const result = parseToAst(sql)
  const roundtrip = toSql(result.ast[0])
  const result2 = parseToAst(roundtrip)
  expect(result2.errors).toHaveLength(0)
})
```

**9. Run tests**:
```bash
yarn test
```

## Workflow: Adding a New Statement Type

Same as adding a keyword, but the scope is larger:

1. **Grammar**: add all tokens to `src/grammar/keywords.ts` (and to `IDENTIFIER_KEYWORD_NAMES` in `src/parser/tokens.ts` if non-reserved)
2. **Parser**: add a new top-level rule in `parser.ts` and register it in the `statement` rule's alternatives
3. **CST types**: `yarn generate:cst`
4. **AST**: add the statement interface to `ast.ts` and add it to the `Statement` union type
5. **Visitor**: add a visitor method in `visitor.ts`
6. **toSql**: add a serializer in `toSql.ts` and add the case to the `statementToSql` switch
7. **Tests**: parse tests, AST structure assertions, and round-trip tests

## Workflow: Modifying Autocomplete Behavior

Autocomplete has four layers:

1. **`content-assist.ts`** — determines what the parser expects at the cursor position. Extracts tables in scope (FROM/JOIN clauses), CTE definitions, and qualified references (e.g., `t1.`). You rarely need to modify this unless you're changing how scope is detected.

2. **`token-classification.ts`** — classifies tokens into categories: `SKIP_TOKENS` (never suggested), `EXPRESSION_OPERATORS` (lower priority), `IDENTIFIER_KEYWORD_TOKENS` (trigger schema suggestions). When adding a new token, decide which category it belongs to.

3. **`suggestion-builder.ts`** — converts parser token types + schema into `Suggestion[]`. Controls priority (columns > keywords > functions > tables), handles qualified references, and manages deduplication.

4. **`provider.ts`** — orchestrates the above and adds context detection: after FROM → suggest tables, after SELECT → suggest columns, after `*` → suppress columns (alias position), etc. The `getIdentifierSuggestionScope()` function is the main context switcher.
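
The priority ordering in layer 3 can be illustrated with a small sketch. The names here (`Suggestion`, `sortByPriority`, the numeric weights) are assumptions for illustration, not the actual `suggestion-builder.ts` API:

```typescript
type SuggestionKind = "column" | "keyword" | "function" | "table"

interface Suggestion {
  label: string
  kind: SuggestionKind
}

// columns > keywords > functions > tables, as described above
// (lower number = higher priority).
const PRIORITY: Record<SuggestionKind, number> = {
  column: 0,
  keyword: 1,
  function: 2,
  table: 3,
}

function sortByPriority(suggestions: Suggestion[]): Suggestion[] {
  return [...suggestions].sort((a, b) => PRIORITY[a.kind] - PRIORITY[b.kind])
}

sortByPriority([
  { label: "trades", kind: "table" },
  { label: "price", kind: "column" },
  { label: "WHERE", kind: "keyword" },
])
// → price (column), WHERE (keyword), trades (table)
```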

## Key Concepts

**Reserved vs. non-reserved keywords**: QuestDB has ~60 reserved keywords. Everything else (data types, time units, config keys like `maxUncommittedRows`) is non-reserved and can be used as an unquoted identifier. The `IdentifierKeyword` token category in Chevrotain handles this — the parser's `identifier` rule accepts any `IdentifierKeyword` token.
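
A minimal sketch of the distinction (the reserved list here is a tiny illustrative subset, not QuestDB's actual ~60 keywords; the real source of truth is the `IDENTIFIER_KEYWORD_NAMES` set in `src/parser/tokens.ts`):

```typescript
// Tiny illustrative subset of reserved keywords.
const RESERVED = new Set(["select", "from", "where", "join"])

function canBeUnquotedIdentifier(word: string): boolean {
  return !RESERVED.has(word.toLowerCase())
}

canBeUnquotedIdentifier("maxUncommittedRows") // true — non-reserved config key
canBeUnquotedIdentifier("SELECT")             // false — reserved
```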

**CST vs. AST**: The CST preserves every token (including keywords, punctuation, whitespace position). The AST is a clean semantic representation. The visitor decides what to keep. For example, the CST has separate `Select`, `Star`, `From` tokens; the AST just has `{ type: "select", columns: [{ type: "star" }], from: [...] }`.

**Round-trip correctness**: `toSql(parseToAst(sql).ast[0])` must produce SQL that parses to an equivalent AST. This is verified against 1,726 real queries in `docs-roundtrip.test.ts`. When adding new features, always test round-trip.

**Error recovery**: The parser uses Chevrotain's error recovery, resynchronizing on semicolons. When a statement fails to parse, it skips to the next semicolon and continues. The visitor handles incomplete CST nodes with try-catch. This means `parseToAst()` can return both `ast` (partial) and `errors` simultaneously.
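
The recovery contract can be approximated with a toy loop (this illustrates the observable behavior — partial `ast` plus `errors` — not Chevrotain's actual recovery mechanism; all names here are invented for the sketch):

```typescript
interface ParseResult {
  ast: string[]    // stand-in for successfully parsed statements
  errors: string[]
}

// Toy model: try each semicolon-delimited statement; on failure, record an
// error and continue with the next one, so ast and errors can both be
// non-empty for a single input.
function parseAll(sql: string, parseOne: (stmt: string) => boolean): ParseResult {
  const result: ParseResult = { ast: [], errors: [] }
  for (const raw of sql.split(";")) {
    const stmt = raw.trim()
    if (!stmt) continue
    if (parseOne(stmt)) result.ast.push(stmt)
    else result.errors.push(`parse error near: ${stmt}`)
  }
  return result
}

// A fake single-statement parser that rejects anything not starting with SELECT:
const result = parseAll("SELECT 1; BOGUS; SELECT 2", (s) => /^select/i.test(s))
// result.ast → ["SELECT 1", "SELECT 2"]; result.errors → one entry for BOGUS
```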