- Setup
- Installing dependencies
- Project Layout
- Lexing
- Logos Introduction
- Tokens
- Span Information
- Error Handling
- AST
- Expr & ExprKind
- Span Information
- Parsing
- LALRPOP Introduction
- Grammar File Overview
- Using Our Lexer
- Error Handling
- Lexing Errors
- Parsing Errors
- ExprResult
- Error Recovery
- Collect All The Errors
- AST Error Nodes
- Testing
- Specification Tests
- Valid Syntax Tests
- Expected Tokens
- Expected AST
- Invalid Syntax Tests
- Expected Tokens
- Expected Error
- Expected Diagnostic
This is where the abstract syntax tree structs and enums are defined. These are used in the grammar file and in the parser.
In this starter:
- Expr: Represents an expression, with an ExprKind and a Range<usize> span.
- ExprKind: The type/kind of expression: number, infix operation.
- ExprLiteral: A literal expression value, like a number.
- ExprInfixOp: An infix operation expression, like addition or subtraction.
This is where ExprError and ExprResult<T> are defined. This is also where the diagnostics logic lives.
This is where the language's grammar is defined. It uses LALRPOP, a Rust-like DSL for defining grammars, to produce a parsing function that's used by the parser. It's configured to read in the tokens produced by the lexer.
This is where the lexical tokens are defined. Logos is used here, so a lexer is generated automatically.
This is where the parse function lives. It wraps the parsing function produced from the grammar file and returns an ExprResult.
This is where the Span and Spanned<T> utility types are defined.
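Their exact definitions aren't reproduced in this guide, but judging by how they're used below (e.g. (ExprError, Span) error tuples), they're presumably along these lines. A sketch, not the actual source:

```rust
// Sketch inferred from how Span and Spanned<T> are used elsewhere in
// this guide; the project's real definitions may differ.
pub type Span = std::ops::Range<usize>;

/// A value paired with its span in the original source code.
pub type Spanned<T> = (T, Span);
```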
Lexing breaks text into tokens, e.g. 1 + 2 might become something like Number(1), Add, Number(2), with each of those being a token.
Lexing is handled by the Rust crate Logos. This project's tokens are defined in ./src/lexer.rs.
```rust
#[derive(Logos, Debug, Clone, PartialEq)]
#[logos(error = Spanned<LexicalError>)]
#[logos(skip r"[ \t\n\f]+")]
pub enum Token {
#[token("(")]
LParan,
#[token(")")]
RParan,
#[token("+")]
Plus,
#[token("-")]
Minus,
#[token("/")]
Slash,
#[token("*")]
Astrisk,
#[regex(r#"[0-9]+(\.[0-9]+)?"#, lex_number)]
Number(f64),
}
```

The lex function takes in a string source and returns a Vec of token results: Result<(usize, Token, usize), (ExprError, Span)>.
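Note the lex_number callback on the Number variant: Logos hands it the matched text so it can produce the variant's f64 payload. A minimal sketch of such a callback, assuming it simply parses the matched slice (the actual implementation in ./src/lexer.rs may differ, e.g. in how it reports errors):

```rust
use logos::Lexer;

// Parse the matched slice into the Number variant's f64 payload.
// Returning None makes Logos emit an error for this token.
fn lex_number(lexer: &mut Lexer<Token>) -> Option<f64> {
    lexer.slice().parse().ok()
}
```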
The lex function itself is a thin wrapper around the Lexer that Logos produces:
```rust
use language_project_template::prelude::*;
let source = "1 + 2";
type TokenWithSpan = (usize, Token, usize);
type ErrorWithSpan = (ExprError, Span);
let tokens: Vec<Result<TokenWithSpan, ErrorWithSpan>> = lex(&source);
```

A successfully lexed token might be Ok((0, Token::Number(1.0), 1)), where 0 and 1 are the start and end of the token's span respectively.
A token that can't be lexed might be Err((ExprError::LexError(LexicalError::InvalidToken), 0..5)), where 0..5 is the span of the bad token.
The result type of the lexer is defined in ./src/lexer.rs:
```rust
impl Iterator for Lexer<'_> {
type Item = Result<(usize, Token, usize), ExprErrorS>;
}
```

This project ignores whitespace by default when lexing:
```rust
#[logos(skip r"[ \t\n\f]+")]
```

Removing this line in ./src/lexer.rs and adding a token (or tokens) for whitespace is also possible; it depends on the needs of your language/parsing project.
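If you went that route, it could look something like this. A hypothetical variant, not part of the starter:

```rust
#[derive(Logos, Debug, Clone, PartialEq)]
pub enum Token {
    // Hypothetical: lex whitespace as a real token instead of skipping
    // it, e.g. for a formatter or a whitespace-sensitive grammar.
    #[regex(r"[ \t\n\f]+")]
    Whitespace,
    // ...the other tokens from above...
}
```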
The AST is a simplified representation of the original source code. It's made up of a tree of nodes, each node having a "type", e.g. statement, expression, module item. Which node types you choose, and how simplified (or abstracted) the tree is, will depend on the needs of your language project. Things like comments and whitespace are absent from the AST.
The AST, at least in this project, is the first attempt to model your language's grammar as something usable in code.
ASTs can be used for code generation (IR, bytecode), static analysis, language servers, and more.
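For a taste of that last point, here's a toy evaluator written as a recursive walk over the Expr and ExprKind types defined below. A sketch only: it assumes a kind() accessor that returns a reference, and it skips the comparison and logic operators:

```rust
// A toy evaluator over the AST types shown below. Returns None for
// Error nodes and for operators this sketch doesn't handle.
fn eval(expr: &Expr) -> Option<f64> {
    match expr.kind() {
        ExprKind::Literal(lit) => match **lit {
            ExprLiteral::Number(n) => Some(n),
        },
        ExprKind::InfixOp(infix) => {
            let lt = eval(&infix.lt)?;
            let rt = eval(&infix.rt)?;
            match infix.op {
                OpInfix::Add => Some(lt + rt),
                OpInfix::Subtract => Some(lt - rt),
                OpInfix::Multiply => Some(lt * rt),
                OpInfix::Divide => Some(lt / rt),
                _ => None, // comparison/logic ops omitted in this sketch
            }
        }
        ExprKind::Error => None,
    }
}
```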
This project's AST is made up of a single root Expr expression node. There are no statements, since this project's syntax can only represent single expressions like (1 + (2 + 3)).
The Expr has a kind (ExprKind) and a span (Range<usize>) field.
Creates a new Expr with kind and span information.
Returns the ExprKind of the Expr.
Returns the Range<usize> span of the Expr in the original source code.
Returns the start position of the Expr's span in the original source code.
Returns the end position of the Expr's span in the original source code.
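Put together, and assuming the method names match the descriptions above (new, kind, span, start, end), usage looks roughly like this, using the ExprKind variants defined just below:

```rust
// Illustrative only; the constructor/accessor names are assumed from
// the descriptions above and may not match the actual API exactly.
let one = Expr::new(
    ExprKind::Literal(Box::new(ExprLiteral::Number(1.0))),
    0..1,
);
assert!(matches!(one.kind(), ExprKind::Literal(_)));
assert_eq!(one.span(), 0..1);
assert_eq!(one.start(), 0);
assert_eq!(one.end(), 1);
```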
Defines the "kind" of node the expression is. This project's syntax is simple, so there are only two expression kinds plus an error kind:
```rust
#[derive(Debug, PartialEq, Clone)]
pub enum ExprKind {
Literal(Box<ExprLiteral>),
InfixOp(Box<ExprInfixOp>),
Error,
}
#[derive(Clone, Debug, PartialEq)]
pub enum ExprLiteral {
Number(f64),
}
#[derive(Clone, Debug, PartialEq)]
pub struct ExprInfixOp {
pub lt: Box<Expr>,
pub op: OpInfix,
pub rt: Box<Expr>,
}
#[derive(Debug, Clone, Copy, Eq, PartialEq)]
pub enum OpInfix {
Add,
Subtract,
Multiply,
Divide,
Modulus,
Less,
LessEqual,
Greater,
GreaterEqual,
Equal,
NotEqual,
LogicAnd,
LogicOr,
}
```

Parsing maps lexical tokens into an abstract syntax tree (AST). This is the first stage where the rules of your grammar are exercised and enforced.
The parser is generated by LALRPOP from the grammar file.
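The grammar file isn't reproduced here, but to give a flavor of LALRPOP, a grammar wired up to an external (Logos) lexer looks roughly like the sketch below. Rule names and actions are illustrative and only the + operator is shown; the project's actual grammar file will differ:

```lalrpop
use crate::prelude::*;

grammar;

// Consume the tokens our lexer produces instead of using LALRPOP's
// built-in lexer. The quoted names are how grammar rules refer to
// each token.
extern {
    type Location = usize;
    type Error = ExprErrorS;

    enum Token {
        "(" => Token::LParan,
        ")" => Token::RParan,
        "+" => Token::Plus,
        "number" => Token::Number(<f64>),
    }
}

// @L/@R capture byte offsets so every node gets a span.
pub Expr: Expr = {
    <lo:@L> <lt:Expr> "+" <rt:Atom> <hi:@R> => Expr::new(
        ExprKind::InfixOp(Box::new(ExprInfixOp {
            lt: Box::new(lt),
            op: OpInfix::Add,
            rt: Box::new(rt),
        })),
        lo..hi,
    ),
    Atom,
};

Atom: Expr = {
    <lo:@L> <n:"number"> <hi:@R> => Expr::new(
        ExprKind::Literal(Box::new(ExprLiteral::Number(n))),
        lo..hi,
    ),
    "(" <Expr> ")",
};
```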
The parse function returns either the AST or a vector of parsing errors. It's a wrapper around the parser that LALRPOP generates.
```rust
use language_project_template::prelude::*;
let source = "1 + 2";
type ErrorWithSpan = (ExprError, Span);
let ast: Result<Expr, Vec<ErrorWithSpan>> = parse(&source);
```

Language projects tend to need a healthy mix of low-scoped (unit) and high-scoped (integration, e2e) tests to be effective. As the grammar and syntax grow and the project expands, with more syntax features needing to be mixed and matched together, more and more tests are needed to prevent breaking things.
Unit tests live with the implementation code, the same as most Rust projects.
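For example, a lexer unit test might look like this sketch (the exact imports and assertions depend on the module layout):

```rust
#[cfg(test)]
mod tests {
    use super::*;

    // Uses the token/span shapes documented in the lexing section.
    #[test]
    fn lexes_a_single_number() {
        let tokens = lex("1");
        assert_eq!(tokens, vec![Ok((0, Token::Number(1.0), 1))]);
    }
}
```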
Specification tests live in ./spec/ and are run by ./tests/spec_tests.rs. They are split between valid and invalid syntax tests.
The valid syntax tests consist of an input file NAME.expr along with expected results: NAME.expr.tokens for the lexer and NAME.expr.ast for the parser. These are great both for catching bugs and for helping identify where the problem is.
The input file:

```
1 + 2
```
The expected tokens:

```
[
Ok(
(
0,
Number(
1.0,
),
1,
),
),
Ok(
(
2,
Plus,
3,
),
),
Ok(
(
4,
Number(
2.0,
),
5,
),
),
]
```
The expected AST:

```
Ok(
Expr {
kind: InfixOp(
ExprInfixOp {
lt: Expr {
kind: Literal(
Number(
1.0,
),
),
span: 0..1,
},
op: Add,
rt: Expr {
kind: Literal(
Number(
2.0,
),
),
span: 4..5,
},
},
),
span: 0..5,
},
)
```
The invalid syntax tests consist of an input file NAME.expr along with expected lexer results NAME.expr.tokens and expected error cases NAME.expr.error & NAME.expr.diagnostics. These are great for testing integrations like the language server or the VS Code extension.
The input file:

```
(
```
The expected tokens:

```
[
Ok(
(
0,
LParan,
1,
),
),
]
```
The expected error:

```
Err(
[
(
SyntaxError(
UnrecognizedEOF {
expected: [
"\"(\"",
"number",
],
},
),
1..1,
),
],
)
```
The expected diagnostics:

```
[
Diagnostic {
severity: Error,
code: Some(
"syntax",
),
message: "unexpected end of file; expected: [\"\\\"(\\\"\", \"number\"]",
labels: [
Label {
style: Primary,
file_id: 0,
range: 1..1,
message: "",
},
],
notes: [],
},
]
```