- Setup
- Installing dependencies
- Project Layout
- Lexing
- Logos Introduction
- Tokens
- Span Information
- Error Handling
- AST
- Expr & ExprKind
- Span Information
- Parsing
- LALRPOP Introduction
- Grammar File Overview
- Using Our Lexer
- Error Handling
- Lexing Errors
- Parsing Errors
- ExprResult
- Error Recovery
- Collect All The Errors
- AST Error Nodes
- Testing
- Specification Tests
- Valid Syntax Tests
- Expected Tokens
- Expected AST
- Invalid Syntax Tests
- Expected Tokens
- Expected Error
- Expected Diagnostic
This is where the abstract syntax tree structs and enums are defined. These are used in the grammar file and in the parser.
In this starter:
- Expr: Represents an expression, with an ExprKind and a Range<usize> span.
- ExprKind: The type/kind of expression: number, infix operation.
- ExprLiteral: A literal expression value, like a number.
- ExprInfixOp: An infix operation expression, like addition or subtraction.
This is where ExprError and ExprResult<T> are defined. This is also where the diagnostics logic lives.
This is where the language's grammar is defined. It uses LALRPOP, a Rust-like DSL for defining grammars, to produce a parsing function that's used by the parser. It's configured to read in the tokens produced by the lexer.
This is where the lexical tokens are defined. Logos is used here, so a lexer is generated automatically.
This is where the parse function lives. It wraps the parsing function produced from the grammar file and returns an ExprResult.
This is where the Span and Spanned<T> utility types are defined.
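Their exact definitions aren't reproduced in this guide, but judging by how they're used below (e.g. (ExprError, Span) error tuples), they're presumably along these lines. A sketch, not the actual source:

```rust
// Sketch inferred from how Span and Spanned<T> are used elsewhere in
// this guide; the project's real definitions may differ.
pub type Span = std::ops::Range<usize>;

/// A value paired with its span in the original source code.
pub type Spanned<T> = (T, Span);
```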
Lexing breaks text into tokens, e.g. 1 + 2 might become something like Number(1), Add, Number(2), with each of those being a token.
Lexing is handled by the Rust crate Logos. This project's tokens are defined in ./src/lexer.rs.
```rust
#[derive(Logos, Debug, Clone, PartialEq)]
#[logos(error = Spanned<LexicalError>)]
#[logos(skip r"[ \t\n\f]+")]
pub enum Token {
#[token("(")]
LParan,
#[token(")")]
RParan,
#[token("+")]
Plus,
#[token("-")]
Minus,
#[token("/")]
Slash,
#[token("*")]
Astrisk,
#[regex(r#"[0-9]+(\.[0-9]+)?"#, lex_number)]
Number(f64),
}
```

The lex function takes in a string source and returns a Vec of token results: Result<(usize, Token, usize), (ExprError, Span)>.
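Note the lex_number callback on the Number variant: Logos hands it the matched text so it can produce the variant's f64 payload. A minimal sketch of such a callback, assuming it simply parses the matched slice (the actual implementation in ./src/lexer.rs may differ, e.g. in how it reports errors):

```rust
use logos::Lexer;

// Parse the matched slice into the Number variant's f64 payload.
// Returning None makes Logos emit an error for this token.
fn lex_number(lexer: &mut Lexer<Token>) -> Option<f64> {
    lexer.slice().parse().ok()
}
```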
The lex function itself is a thin wrapper around the Lexer that Logos produces:
```rust
use language_project_template::prelude::*;
let source = "1 + 2";
type TokenWithSpan = (usize, Token, usize);
type ErrorWithSpan = (ExprError, Span);
let tokens: Vec<Result<TokenWithSpan, ErrorWithSpan>> = lex(&source);
```

A successfully lexed token might be Ok((0, Token::Number(1.0), 1)), where 0 and 1 are the start and end of the token's span respectively.
A token that can't be lexed might be Err((ExprError::LexError(LexicalError::InvalidToken), 0..5)), where 0..5 is the span of the bad token.
The result type of the lexer is defined in ./src/lexer.rs:
```rust
impl Iterator for Lexer<'_> {
type Item = Result<(usize, Token, usize), ExprErrorS>;
}
```

This project ignores whitespace by default when lexing:
```rust
#[logos(skip r"[ \t\n\f]+")]
```

Removing this line in ./src/lexer.rs and adding a token (or tokens) for whitespace is also possible; it depends on the needs of your language/parsing project.
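If you went that route, it could look something like this. A hypothetical variant, not part of the starter:

```rust
#[derive(Logos, Debug, Clone, PartialEq)]
pub enum Token {
    // Hypothetical: lex whitespace as a real token instead of skipping
    // it, e.g. for a formatter or a whitespace-sensitive grammar.
    #[regex(r"[ \t\n\f]+")]
    Whitespace,
    // ...the other tokens from above...
}
```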
The AST is a simplified representation of the original source code. It's made up of a tree of nodes, each node having a "type", e.g. statement, expression, module item. Which node types you choose, and how simplified (or abstracted) the tree is, will depend on the needs of your language project. Things like comments and whitespace are absent from the AST.
The AST, at least in this project, is the first attempt to model your language's grammar as something usable in code.
ASTs can be used for code generation (IR, bytecode), static analysis, language servers, and more.
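For a taste of that last point, here's a toy evaluator written as a recursive walk over the Expr and ExprKind types defined below. A sketch only: it assumes a kind() accessor that returns a reference, and it skips the comparison and logic operators:

```rust
// A toy evaluator over the AST types shown below. Returns None for
// Error nodes and for operators this sketch doesn't handle.
fn eval(expr: &Expr) -> Option<f64> {
    match expr.kind() {
        ExprKind::Literal(lit) => match **lit {
            ExprLiteral::Number(n) => Some(n),
        },
        ExprKind::InfixOp(infix) => {
            let lt = eval(&infix.lt)?;
            let rt = eval(&infix.rt)?;
            match infix.op {
                OpInfix::Add => Some(lt + rt),
                OpInfix::Subtract => Some(lt - rt),
                OpInfix::Multiply => Some(lt * rt),
                OpInfix::Divide => Some(lt / rt),
                _ => None, // comparison/logic ops omitted in this sketch
            }
        }
        ExprKind::Error => None,
    }
}
```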
This project's AST is made up of a single root Expr expression node. There are no statements, since this project's syntax can only represent single expressions like (1 + (2 + 3)).
The Expr has a kind (ExprKind) and a span (Range<usize>) field.
Creates a new Expr with kind and span information.
Returns the ExprKind of the Expr.
Returns the Range<usize> span of the Expr in the original source code.
Returns the start position of the Expr's span in the original source code.
Returns the end position of the Expr's span in the original source code.
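Put together, and assuming the method names match the descriptions above (new, kind, span, start, end), usage looks roughly like this, using the ExprKind variants defined just below:

```rust
// Illustrative only; the constructor/accessor names are assumed from
// the descriptions above and may not match the actual API exactly.
let one = Expr::new(
    ExprKind::Literal(Box::new(ExprLiteral::Number(1.0))),
    0..1,
);
assert!(matches!(one.kind(), ExprKind::Literal(_)));
assert_eq!(one.span(), 0..1);
assert_eq!(one.start(), 0);
assert_eq!(one.end(), 1);
```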
Defines the "kind" of node the expression is. This project's syntax is simple, so there are only two expression kinds plus an error kind:
```rust
#[derive(Debug, PartialEq, Clone)]
pub enum ExprKind {
Literal(Box<ExprLiteral>),
InfixOp(Box<ExprInfixOp>),
Error,
}
#[derive(Clone, Debug, PartialEq)]
pub enum ExprLiteral {
Number(f64),
}
#[derive(Clone, Debug, PartialEq)]
pub struct ExprInfixOp {
pub lt: Box<Expr>,
pub op: OpInfix,
pub rt: Box<Expr>,
}
#[derive(Debug, Clone, Copy, Eq, PartialEq)]
pub enum OpInfix {
Add,
Subtract,
Multiply,
Divide,
Modulus,
Less,
LessEqual,
Greater,
GreaterEqual,
Equal,
NotEqual,
LogicAnd,
LogicOr,
}
```

Parsing maps lexical tokens into an abstract syntax tree (AST). This is the first stage where the rules of your grammar are exercised and enforced.
The parser is generated by LALRPOP from the grammar file.
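The grammar file isn't reproduced here, but to give a flavor of LALRPOP, a grammar wired up to an external (Logos) lexer looks roughly like the sketch below. Rule names and actions are illustrative and only the + operator is shown; the project's actual grammar file will differ:

```lalrpop
use crate::prelude::*;

grammar;

// Consume the tokens our lexer produces instead of using LALRPOP's
// built-in lexer. The quoted names are how grammar rules refer to
// each token.
extern {
    type Location = usize;
    type Error = ExprErrorS;

    enum Token {
        "(" => Token::LParan,
        ")" => Token::RParan,
        "+" => Token::Plus,
        "number" => Token::Number(<f64>),
    }
}

// @L/@R capture byte offsets so every node gets a span.
pub Expr: Expr = {
    <lo:@L> <lt:Expr> "+" <rt:Atom> <hi:@R> => Expr::new(
        ExprKind::InfixOp(Box::new(ExprInfixOp {
            lt: Box::new(lt),
            op: OpInfix::Add,
            rt: Box::new(rt),
        })),
        lo..hi,
    ),
    Atom,
};

Atom: Expr = {
    <lo:@L> <n:"number"> <hi:@R> => Expr::new(
        ExprKind::Literal(Box::new(ExprLiteral::Number(n))),
        lo..hi,
    ),
    "(" <Expr> ")",
};
```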
The parse function returns either the AST or a vector of parsing errors. It's a wrapper around the parser that LALRPOP generates.
```rust
use language_project_template::prelude::*;
let source = "1 + 2";
type ErrorWithSpan = (ExprError, Span);
let ast: Result<Expr, Vec<ErrorWithSpan>> = parse(&source);
```

Language projects tend to need a healthy mix of low-scoped (unit) and high-scoped (integration, e2e) tests to be effective. As the grammar and syntax grow and the project expands, with more syntax features needing to be mixed and matched together, more and more tests are needed to prevent breaking things.
Unit tests live with the implementation code, the same as most Rust projects.
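For example, a lexer unit test might look like this sketch (the exact imports and assertions depend on the module layout):

```rust
#[cfg(test)]
mod tests {
    use super::*;

    // Uses the token/span shapes documented in the lexing section.
    #[test]
    fn lexes_a_single_number() {
        let tokens = lex("1");
        assert_eq!(tokens, vec![Ok((0, Token::Number(1.0), 1))]);
    }
}
```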
Specification tests live in ./spec/ and are run by ./tests/spec_tests.rs. They are split between valid and invalid syntax tests.
The valid syntax tests consist of an input file NAME.expr along with expected results: NAME.expr.tokens for the lexer and NAME.expr.ast for the parser. These are great both for catching bugs and for helping identify where the problem is.
The input file:

```
1 + 2
```
The expected tokens:

```
[
Ok(
(
0,
Number(
1.0,
),
1,
),
),
Ok(
(
2,
Plus,
3,
),
),
Ok(
(
4,
Number(
2.0,
),
5,
),
),
]
```
The expected AST:

```
Ok(
Expr {
kind: InfixOp(
ExprInfixOp {
lt: Expr {
kind: Literal(
Number(
1.0,
),
),
span: 0..1,
},
op: Add,
rt: Expr {
kind: Literal(
Number(
2.0,
),
),
span: 4..5,
},
},
),
span: 0..5,
},
)
```
The invalid syntax tests consist of an input file NAME.expr along with expected lexer results NAME.expr.tokens and expected error cases NAME.expr.error & NAME.expr.diagnostics. These are great for testing integrations like the language server or the VS Code extension.
The input file:

```
(
```
The expected tokens:

```
[
Ok(
(
0,
LParan,
1,
),
),
]
```
The expected error:

```
Err(
[
(
SyntaxError(
UnrecognizedEOF {
expected: [
"\"(\"",
"number",
],
},
),
1..1,
),
],
)
```
The expected diagnostics:

```
[
Diagnostic {
severity: Error,
code: Some(
"syntax",
),
message: "unexpected end of file; expected: [\"\\\"(\\\"\", \"number\"]",
labels: [
Label {
style: Primary,
file_id: 0,
range: 1..1,
message: "",
},
],
notes: [],
},
]
```