Commit 55616b2

Author: zhaoge
feat: optimize collecting entity when match empty column in entityCollecting context
1 parent cd3626c commit 55616b2

34 files changed

Lines changed: 33564 additions & 28838 deletions

AGENTS.md

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
# AGENTS.md

## Project Overview

**dt-sql-parser** is a SQL parser library built on [ANTLR4](https://github.com/antlr/antlr4) for the **big data** domain. It provides:

- SQL syntax validation
- AST traversal (Visitor / Listener patterns)
- Code completion (based on [antlr4-c3](https://github.com/mike-lischke/antlr4-c3))
- Entity extraction (tables, columns, etc.)
- SQL statement splitting

**Supported SQL dialects**: MySQL, Flink, Spark, Hive, PostgreSQL, Trino, Impala.

## Tech Stack

| Category          | Details                                          |
| ----------------- | ------------------------------------------------ |
| Language          | TypeScript (target ES6, module ESNext)           |
| Runtime           | Node.js >= 18                                    |
| Package manager   | pnpm 9.7.0                                       |
| Build tool        | `tsc` (TypeScript compiler)                      |
| Test framework    | Jest (with the `@swc/jest` transformer)          |
| Parser generation | ANTLR4, via `antlr4ng` + `antlr4ng-cli`          |
| Code completion   | `antlr4-c3`                                      |
| Code formatting   | Prettier, `antlr-format-cli` (for `.g4` files)   |
| Git hooks         | Husky + lint-staged + commitlint                 |

## Repository Structure

```
dt-sql-parser/
├── src/
│   ├── grammar/                # ANTLR4 .g4 grammar files (one subdirectory per dialect)
│   ├── lib/                    # Lexer/Parser/Listener/Visitor generated from the .g4 files
│   ├── parser/                 # SQL parser class implementations
│   │   ├── common/             # Base class (BasicSQL), utilities, shared types
│   │   ├── mysql/              # MySQL-specific parser, entity collector, etc.
│   │   ├── flink/
│   │   ├── spark/
│   │   ├── hive/
│   │   ├── postgresql/
│   │   ├── trino/
│   │   └── impala/
│   ├── locale/                 # Internationalization resources
│   └── index.ts                # Public API exports
├── test/                       # Unit tests (structure mirrors src/)
│   ├── parser/                 # Tests organized by dialect
│   │   ├── mysql/
│   │   │   ├── syntax/         # Grammar rule tests
│   │   │   ├── suggestion/     # Code completion tests
│   │   │   └── contextCollect/ # Entity collection tests
│   │   └── ...
│   └── common/                 # Shared test utilities
├── benchmark/                  # Performance benchmarks
├── scripts/                    # Build/release utility scripts
├── gen/                        # Generated artifacts
├── dist/                       # Compiled output (npm package)
├── .husky/                     # Git hook configuration
├── package.json
├── tsconfig.json
├── jest.config.js
└── CONTRIBUTING.md             # Contribution guide (steps for adding a dialect)
```

## Core Development Commands

```bash
pnpm install              # Install dependencies
pnpm antlr4               # Generate TS from all .g4 files
pnpm antlr4 --lang mysql  # Generate for a single dialect
pnpm build                # Compile TypeScript (rm -rf dist && tsc)
pnpm test                 # Run Jest unit tests
pnpm benchmark            # Run performance benchmarks
pnpm check-types          # Type-check src/ and test/
pnpm format               # Format all files with Prettier
pnpm format-g4            # Format .g4 grammar files
pnpm prettier-check       # Check formatting without modifying files
```

## Architecture

### Parser Class Hierarchy

```
BasicSQL (src/parser/common/)
├── MySQL
├── FlinkSQL
├── SparkSQL
├── HiveSQL
├── PostgreSQL
├── TrinoSQL
└── ImpalaSQL
```

Each dialect's parser class (for example `MySQL` in `src/parser/mysql/index.ts`) extends `BasicSQL` and implements the following:

- `createLexerFromCharStream()`: creates the ANTLR4 lexer
- `createParserFromTokenStream()`: creates the ANTLR4 parser
- `splitListener` getter: returns the `SQLSplitListener` used for statement splitting
- `createEntityCollector()`: returns the `SQLEntityCollector` used for context/entity extraction
- `processCandidates()` / `preferredRules()`: code completion logic (antlr4-c3)
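
The division of labor described above (shared logic in the base class, dialect-specific hooks in subclasses) can be sketched with a toy example. Everything below is a hypothetical stand-in: `BasicSQLSketch`, `ToyDialect`, `tokenize` and `isStatementTerminator` are not part of dt-sql-parser, and the real hooks return antlr4ng lexers, parsers and listeners rather than plain strings.

```typescript
// Hypothetical stand-in for the base class: shared logic only calls the hooks.
abstract class BasicSQLSketch {
    // Plays the role of createLexerFromCharStream(): turn text into tokens.
    protected abstract tokenize(sql: string): string[];

    // Plays the role of the splitListener getter: decide where a statement ends.
    protected abstract isStatementTerminator(token: string): boolean;

    // Shared statement-splitting logic, written once for all dialects.
    splitStatements(sql: string): string[] {
        const statements: string[] = [];
        let current: string[] = [];
        for (const token of this.tokenize(sql)) {
            if (this.isStatementTerminator(token)) {
                if (current.length > 0) statements.push(current.join(' '));
                current = [];
            } else {
                current.push(token);
            }
        }
        if (current.length > 0) statements.push(current.join(' '));
        return statements;
    }
}

// Hypothetical dialect: only the hooks are implemented here.
class ToyDialect extends BasicSQLSketch {
    protected tokenize(sql: string): string[] {
        // A semicolon is its own token; everything else splits on whitespace.
        return sql.match(/;|[^\s;]+/g) ?? [];
    }

    protected isStatementTerminator(token: string): boolean {
        return token === ';';
    }
}
```

The point of the pattern is that a new dialect supplies only the hooks, while splitting, validation and collection stay in one place.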

### Per-Dialect Module Structure

Each `src/parser/<dialect>/` directory contains:

| File                                   | Purpose                                                        |
| -------------------------------------- | -------------------------------------------------------------- |
| `index.ts`                             | Main parser class, extends `BasicSQL`                          |
| `<dialect>EntityCollector.ts`          | Extracts entities (tables, columns, functions) from the AST    |
| `<dialect>SplitListener.ts`            | Splits multi-statement SQL via semicolons/AST                  |
| `<dialect>ErrorListener.ts`            | Custom syntax error handling                                   |
| `<dialect>SemanticContextCollector.ts` | Collects semantic context (such as `isStatementBeginning`)     |

### Grammar Files → Code Generation Flow

1. `.g4` files live in `src/grammar/<dialect>/`
2. Running `pnpm antlr4 [--lang <dialect>]` generates:
   - `src/lib/<dialect>/<Dialect>Lexer.ts`
   - `src/lib/<dialect>/<Dialect>Parser.ts`
   - `src/lib/<dialect>/<Dialect>ParserListener.ts`
   - `src/lib/<dialect>/<Dialect>ParserVisitor.ts`
3. The parser class in `src/parser/<dialect>/` consumes these generated files
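
As a small illustration of the naming pattern in step 2, the helper below lists the files one would expect after generation. It is a sketch only: `generatedFiles` is a hypothetical helper, not a script in this repository, and the `MySql` prefix is inferred from `MySqlParser.g4` and `MySqlParserListener` mentioned elsewhere in this commit.

```typescript
// Hypothetical helper: expands the <dialect>/<Dialect> placeholders from the list above.
function generatedFiles(dialectDir: string, grammarPrefix: string): string[] {
    const base = `src/lib/${dialectDir}/${grammarPrefix}`;
    return [
        `${base}Lexer.ts`,
        `${base}Parser.ts`,
        `${base}ParserListener.ts`,
        `${base}ParserVisitor.ts`,
    ];
}

// Example: the MySQL dialect directory with the assumed "MySql" grammar prefix.
const mysqlFiles = generatedFiles('mysql', 'MySql');
```

A check like this could be used to confirm that `src/lib/` is in sync after editing a grammar, since those files must never be edited by hand.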

### Grammar Rule Conventions

- The root rule must be named `program`
- Parsing multiple SQL statements is supported
- Keyword rules are prefixed with `KW_` (for example `KW_SELECT`)
- Case-insensitive dialects enable the case-insensitive option

## Public API (from `src/index.ts`)

**Parser classes**: `MySQL`, `FlinkSQL`, `SparkSQL`, `HiveSQL`, `PostgreSQL`, `TrinoSQL`, `ImpalaSQL`

**Listener/Visitor types**: `MySqlParserListener`, `MySqlParserVisitor`, etc. (one pair per dialect)

**Enums**: `EntityContextType`, `StmtContextType`

**Types**: `CaretPosition`, `Suggestions`, `SyntaxSuggestion`, `WordRange`, `TextSlice`, `SyntaxError`, `ParseError`, `ErrorListener`, `StmtContext`, `EntityContext`, `CommonEntityContext`, `ColumnEntityContext`, `FuncEntityContext`

## Testing Conventions

- Test files mirror `src/` and live in `test/parser/<dialect>/`
- Subdirectories: `syntax/` (grammar), `suggestion/` (completion), `contextCollect/` (entity collection)
- Jest is used with `@swc/jest` for fast compilation
- Custom matchers are defined in `test/matchers.ts`
- Run with: `pnpm test`

## Adding a New SQL Dialect (Outline)

1. Add `.g4` files under `src/grammar/<name>/` (PascalCase names, root rule = `program`, keywords = `KW_*`)
2. Run `pnpm antlr4 --lang <name>` → generates `src/lib/<name>/`
3. Create `src/parser/<name>/index.ts` extending `BasicSQL`
4. Add tests under `test/parser/<name>/` (lexer, visitor, listener, validate)
5. Implement `SQLSplitListener` → add the `splitListener` getter
6. Implement code completion → `processCandidates` + `preferredRules`, with tests under `suggestion/`
7. Implement `SQLEntityCollector` + `createEntityCollector()`, with tests under `contextCollect/`
8. Export the new class from `src/parser/index.ts` and `src/index.ts`
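
## Notes for AI Agents

- After modifying a `.g4` file you **must** run `pnpm antlr4` so the generated files in `src/lib/` stay in sync
- **Do not** hand-edit files in `src/lib/`; they are generated from the `.g4` files
- Grammar files follow ANTLR4 conventions; keyword rules must carry the `KW_` prefix
- The project uses `antlr4ng` (not the Java antlr4 runtime) for the TypeScript target
- Code completion depends on `antlr4-c3`; understand that library before changing completion logic
- The entity collector (`SQLEntityCollector`) is key to rich code completion; you need to understand scope depth and the `isAccessible` logic
- Position/range conventions: line numbers start at 1, column numbers start at 1, indexes start at 0
- Prettier formatting is enforced at commit time via husky + lint-staged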
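
The position convention above (1-based line and column, 0-based index) is easy to get wrong by one, so here is a minimal sketch of the conversion. The `CaretPositionSketch` shape with `lineNumber` and `column` fields and the `caretToOffset` helper are assumptions made for illustration; they are not taken from this commit.

```typescript
// Assumed shape: both fields are 1-based, as stated in the convention above.
interface CaretPositionSketch {
    lineNumber: number;
    column: number;
}

// Converts a 1-based (line, column) caret into a 0-based character index.
function caretToOffset(text: string, caret: CaretPositionSketch): number {
    const lines = text.split('\n');
    if (caret.lineNumber < 1 || caret.lineNumber > lines.length) {
        throw new RangeError(`lineNumber ${caret.lineNumber} is out of range`);
    }
    const line = lines[caret.lineNumber - 1] ?? '';
    // A caret may sit just past the last character of a line.
    if (caret.column < 1 || caret.column > line.length + 1) {
        throw new RangeError(`column ${caret.column} is out of range`);
    }
    let offset = 0;
    for (let i = 0; i < caret.lineNumber - 1; i++) {
        // +1 accounts for the newline that split() removed.
        offset += (lines[i] ?? '').length + 1;
    }
    return offset + (caret.column - 1);
}
```

For the text `"SELECT a\nFROM t"`, the caret at line 2, column 6 maps to index 14, which is the character `t`.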

package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
     "name": "dt-sql-parser",
-    "version": "4.5.0-beta.0",
+    "version": "4.5.0-beta.1",
     "authors": "DTStack Corporation",
     "description": "SQL Parsers for BigData, built with antlr4",
     "keywords": [

src/grammar/flink/FlinkSqlParser.g4

Lines changed: 1 addition & 0 deletions
@@ -509,6 +509,7 @@ columnProjectItem
     | selectLiteralColumnName (columnAlias | KW_AS? expression)?
     | tableAllColumns columnAlias?
     | selectExpressionColumnName (columnAlias | KW_AS? columnName)?
+    | {this.shouldMatchEmpty()}? emptyColumn
     ;

 selectWindowItemColumnName

src/grammar/hive/HiveSqlParser.g4

Lines changed: 1 addition & 0 deletions
@@ -1536,6 +1536,7 @@ selectItem
             | KW_AS LPAREN alias=id_ (COMMA alias=id_)* RPAREN
         )?
     )
+    | {this.shouldMatchEmpty()}? emptyColumn
     ;

 selectLiteralColumnName

src/grammar/impala/ImpalaSqlParser.g4

Lines changed: 1 addition & 0 deletions
@@ -823,6 +823,7 @@ selectItem
     : selectLiteralColumnName columnAlias?
     | selectExpressionColumnName columnAlias?
     | tableAllColumns
+    | {this.shouldMatchEmpty()}? emptyColumn
     ;

 columnAlias

src/grammar/mysql/MySqlParser.g4

Lines changed: 6 additions & 11 deletions
@@ -1205,9 +1205,10 @@ selectElements
     ;

 selectElement
-    : tableAllColumns
-    | selectLiteralColumnName (KW_AS? alias=uid)?
-    | selectExpressionColumnName (KW_AS? alias=uid)?
+    : tableAllColumns                                # selectElement_star
+    | selectLiteralColumnName (KW_AS? alias=uid)?    # selectElement_label
+    | selectExpressionColumnName (KW_AS? alias=uid)? # selectElement_expr
+    | uid DOT {this.shouldMatchEmpty()}? emptyColumn # selectElement_dot_empty
     ;

 tableAllColumns
@@ -2424,7 +2425,7 @@ emptyColumn
     ;

 columnName
-    : uid (dottedIdAllowEmpty dottedIdAllowEmpty?)?
+    : uid (dottedId dottedId?)?
     | .? dottedId dottedId?
     | {this.shouldMatchEmpty()}? emptyColumn
     ;
@@ -2436,7 +2437,7 @@ columnNamePath

 columnNamePathAllowEmpty
     : {this.shouldMatchEmpty()}? emptyColumn
-    | uid (dottedIdAllowEmpty dottedIdAllowEmpty?)?
+    | uid (dottedId dottedId?)?
     ;

 tableSpaceNameCreate
@@ -2574,12 +2575,6 @@ dottedId
     | '.' uid
     ;

-dottedIdAllowEmpty
-    : DOT ID
-    | '.' uid
-    | {this.shouldMatchEmpty()}? DOT emptyColumn
-    ;
-
 decimalLiteral
     : DECIMAL_LITERAL
     | ZERO_DECIMAL

src/grammar/postgresql/PostgreSqlParser.g4

Lines changed: 11 additions & 3 deletions
@@ -2615,7 +2615,8 @@ when_clause
     ;

 indirectionEl
-    : DOT (colLabel | STAR)
+    : DOT indirectionLabel
+    | DOT STAR
     | OPEN_BRACKET (expression | expression? COLON expression?) CLOSE_BRACKET
     ;

@@ -2634,6 +2635,8 @@ targetList
 targetEl
     : tableAllColumns                                                                      # target_star
     | (selectLiteralColumnName | selectExpressionColumnName) (KW_AS? alias=identifier |) # target_label
+    | colId DOT {this.entityCollecting}? emptyColumn                                      # target_dot_empty
+    | {this.entityCollecting}? emptyColumn                                                # target_empty
     ;

 tableAllColumns
@@ -2722,18 +2725,17 @@ procedureNameCreate
     | colId indirection
     ;

+// Empty column rule for entity collection
 emptyColumn
     :
     ;

 columnName
     : colId optIndirection
-    | {this.shouldMatchEmpty()}? (colId DOT emptyColumn | emptyColumn)
     ;

 columnNamePath
     : colId optIndirection
-    | {this.shouldMatchEmpty()}? (colId DOT emptyColumn | emptyColumn)
     ;

 columnNameCreate
@@ -2800,6 +2802,12 @@ colLabel
     | reservedKeyword
     ;

+indirectionLabel
+    : identifier
+    | colNameKeyword
+    | typeFuncNameKeyword
+    ;
+
 identifier
     : Identifier (KW_UESCAPE anysconst)?
     | stringConst

src/grammar/spark/SparkSqlParser.g4

Lines changed: 1 addition & 0 deletions
@@ -850,6 +850,7 @@ namedExpression
     : (tableAllColumns | selectLiteralColumnName | selectExpressionColumnName) (
         KW_AS? (alias=errorCapturingIdentifier | identifierList)
     )?
+    | {this.shouldMatchEmpty()}? emptyColumn
     ;

 namedExpressionSeq

src/lib/SQLParserBase.ts

Lines changed: 46 additions & 5 deletions
@@ -10,9 +10,50 @@ export abstract class SQLParserBase<T = antlr.ParserRuleContext> extends antlr.P

     public entityCollecting = false;

-    public shouldMatchEmpty () {
-        return this.entityCollecting
-            && (this.tokenStream.LT(-1)?.tokenIndex ?? Infinity) <= this.caretTokenIndex
-            && (this.tokenStream.LT(1)?.tokenIndex ?? -Infinity) >= this.caretTokenIndex
+    /**
+     * Semantic predicate to determine whether to match empty column.
+     *
+     * Key design:
+     * 1. Only match empty column in entityCollecting mode
+     * 2. Check if caret position is at the empty column position
+     * 3. In validate mode (entityCollecting=false), this predicate returns false
+     *    and reports an error to ensure incomplete SQL is caught
+     *
+     * IMPORTANT: This predicate should be used carefully to avoid affecting
+     * prediction in non-entity-collecting contexts.
+     */
+    public shouldMatchEmpty (ruleName?: string) {
+        // Only match in entityCollecting mode or when caret position is specified (suggestion mode)
+        if (this.entityCollecting || this.caretTokenIndex >= 0) {
+            // If no caret position specified, match all empty columns
+            if (this.caretTokenIndex < 0) {
+                return true;
+            }
+
+            // Check if caret is at the position where empty column would be
+            const prevTokenIndex = this.tokenStream.LT(-1)?.tokenIndex;
+            const nextTokenIndex = this.tokenStream.LT(1)?.tokenIndex;
+
+            // Match if caret is between previous and next token
+            if (prevTokenIndex !== undefined && nextTokenIndex !== undefined) {
+                return prevTokenIndex <= this.caretTokenIndex && nextTokenIndex >= this.caretTokenIndex;
+            }
+
+            // If only previous token exists, match if caret is after it
+            if (prevTokenIndex !== undefined) {
+                return prevTokenIndex <= this.caretTokenIndex;
+            }
+
+            // If only next token exists, match if caret is before it
+            if (nextTokenIndex !== undefined) {
+                return nextTokenIndex >= this.caretTokenIndex;
+            }
+
+            return false;
+        }
+
+        // In pure validate mode, don't match empty columns
+        // This allows ANTLR to report errors naturally
+        return false;
     }
-}
+}
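
To make the branches of the new `shouldMatchEmpty` easier to reason about, the decision logic can be restated as a pure function over plain numbers. This is a sketch under assumptions, not code from the commit: `EmptyMatchInput` and `shouldMatchEmptySketch` are hypothetical names, and a negative `caretTokenIndex` is taken to mean "no caret given", following the `< 0` check in the diff. Note that the doc comment says empty columns match only in `entityCollecting` mode, while the code also enters the branch whenever a caret index is set; the sketch follows the code.

```typescript
// Hypothetical input record: token indexes are passed in instead of read from a token stream.
interface EmptyMatchInput {
    entityCollecting: boolean;
    caretTokenIndex: number; // negative is assumed to mean "no caret given"
    prevTokenIndex?: number; // tokenIndex of LT(-1), if that token exists
    nextTokenIndex?: number; // tokenIndex of LT(1), if that token exists
}

function shouldMatchEmptySketch(input: EmptyMatchInput): boolean {
    const { entityCollecting, caretTokenIndex, prevTokenIndex, nextTokenIndex } = input;

    // Pure validate mode: never match, so the parser reports the error itself.
    if (!entityCollecting && caretTokenIndex < 0) return false;

    // Entity collecting without a caret: every empty column may match.
    if (caretTokenIndex < 0) return true;

    // Both neighbours exist: the caret must lie between them (inclusive).
    if (prevTokenIndex !== undefined && nextTokenIndex !== undefined) {
        return prevTokenIndex <= caretTokenIndex && nextTokenIndex >= caretTokenIndex;
    }

    // Only a previous token: the caret must be at or after it.
    if (prevTokenIndex !== undefined) return prevTokenIndex <= caretTokenIndex;

    // Only a next token: the caret must be at or before it.
    if (nextTokenIndex !== undefined) return nextTokenIndex >= caretTokenIndex;

    return false;
}
```

Compared with the removed one-liner, which substituted `Infinity` and `-Infinity` for missing neighbours and therefore always returned false when either token was absent, the new code can still match at the start or end of the token stream.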

src/lib/flink/FlinkSqlParser.interp

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.
