
Getting Started

Overview

  • Setup
    • Installing dependencies
  • Project Layout
  • Lexing
    • Logos Introduction
    • Tokens
    • Span Information
    • Error Handling
  • AST
    • Expr & ExprKind
    • Span Information
  • Parsing
    • LALRPOP Introduction
    • Grammar File Overview
    • Using Our Lexer
  • Error Handling
    • Lexing Errors
    • Parsing Errors
    • ExprResult
    • Error Recovery
      • Collect All The Errors
      • AST Error Nodes
  • Testing
    • Specification Tests
      • Valid Syntax Tests
        • Expected Tokens
        • Expected AST
      • Invalid Syntax Tests
        • Expected Tokens
        • Expected Error
        • Expected Diagnostic

Setup

Installing Dependencies

Project Layout

AST

This is where the abstract syntax tree structs and enums are defined. These are used in the grammar file and in the parser.

In this starter:

  • Expr: Represents an expression with an ExprKind and a Range<usize> span.
  • ExprKind: The kind of expression, e.g. a literal or an infix operation.
  • ExprLiteral: A literal expression value, like a number.
  • ExprInfixOp: An infix operation expression, like addition or subtraction.

Errors & Diagnostics

This is where ExprError and ExprResult<T> are defined. This is also where the diagnostics logic lives.

Grammar

This is where the language's grammar is defined. It uses LALRPOP, a Rust-like DSL for defining grammars, to produce the parsing function used by the parser. It's configured to read the tokens produced by the lexer.

Lexer

This is where the lexical tokens are defined. Logos is used here, so a lexer is generated automatically.

Parser

This is where the parse function lives. It wraps the parsing function produced from the grammar file and returns an ExprResult.

Span Utilities

This is where the Span and Spanned<T> utility types are defined.

Lexing

Lexing breaks text into tokens, e.g. 1 + 2 might become something like Number(1), Add, Number(2), with each of those being a token.

Lexing is handled by the Rust crate Logos. This project's tokens are defined in ./src/lexer.rs.

#[derive(Logos, Debug, Clone, PartialEq)]
#[logos(error = Spanned<LexicalError>)]
#[logos(skip r"[ \t\n\f]+")]
pub enum Token {
    #[token("(")]
    LParan,

    #[token(")")]
    RParan,

    #[token("+")]
    Plus,

    #[token("-")]
    Minus,

    #[token("/")]
    Slash,

    #[token("*")]
    Astrisk,

    #[regex(r#"[0-9]+(\.[0-9]+)?"#, lex_number)]
    Number(f64),
}

Lexing Function

The lex function takes a string source and returns a Vec of token results: Result<(usize, Token, usize), (ExprError, Span)>.

This function is a wrapper around the Lexer that Logos produces.

use language_project_template::prelude::*;

let source = "1 + 2";

type TokenWithSpan = (usize, Token, usize);
type ErrorWithSpan = (ExprError, Span);

let tokens: Vec<Result<TokenWithSpan, ErrorWithSpan>> = lex(&source);

A successfully lexed token might be Ok((0, Token::Number(1.0), 1)) where 0 and 1 are the start and end of the token's span, respectively.

A token that can't be lexed might be Err((ExprError::LexError(LexicalError::InvalidToken), 0..5)) where 0..5 is the span of the bad token.

Type

The result type of the lexer is defined in ./src/lexer.rs:

impl Iterator for Lexer<'_> {
  type Item = Result<(usize, Token, usize), ExprErrorS>;
}

Whitespace

This project ignores whitespace by default when lexing.

#[logos(skip r"[ \t\n\f]+")]

Removing this line in ./src/lexer.rs and adding a token (or tokens) for whitespace is also possible; it depends on the needs of your language/parsing project.

Abstract Syntax Tree (AST)

The AST is a simplified representation of the original source code. It's made up of a tree of nodes, each with a "kind", e.g. statement, expression, module item. Which node kinds you choose and how simplified (or abstracted) the tree is will depend on the needs of your language project. Things like comments and whitespace are absent from the AST.

The AST, at least in this project, is the first attempt to model your language's grammar as something usable in code.

What's It Used For?

ASTs can be used for code generation (IR, bytecode), static analysis, language servers, and more.

Expr

This project's AST is made up of a single root Expr expression node. There are no statements, since this project's syntax can only represent single expressions like (1 + (2 + 3)).

The Expr has a kind (ExprKind) and a span (Range<usize>) field.

new(kind: ExprKind, span: Range<usize>) -> Expr

Creates a new Expr with kind and span information.

kind(&self) -> &ExprKind

Returns the ExprKind of the Expr.

span(&self) -> &Range<usize>

Returns the Range<usize> span of the Expr in the original source code.

start(&self) -> usize

Returns the Expr span start position in the original source code.

end(&self) -> usize

Returns the Expr span end position in the original source code.

ExprKind

Defines the "kind" of node the expression is. This project's syntax is simple so there's only two expression kinds plus an error kind.

#[derive(Debug, PartialEq, Clone)]
pub enum ExprKind {
    Literal(Box<ExprLiteral>),
    InfixOp(Box<ExprInfixOp>),
    Error,
}

#[derive(Clone, Debug, PartialEq)]
pub enum ExprLiteral {
    Number(f64),
}

#[derive(Clone, Debug, PartialEq)]
pub struct ExprInfixOp {
    pub lt: Box<Expr>,
    pub op: OpInfix,
    pub rt: Box<Expr>,
}

#[derive(Debug, Clone, Copy, Eq, PartialEq)]
pub enum OpInfix {
    Add,
    Subtract,
    Multiply,
    Divide,
    Modulus,
    Less,
    LessEqual,
    Greater,
    GreaterEqual,
    Equal,
    NotEqual,
    LogicAnd,
    LogicOr,
}

Parsing

Parsing maps lexical tokens into an abstract syntax tree (AST). This is the first stage where the rules of your grammar are exercised and enforced.

LALRPOP

The parser is generated by LALRPOP from the grammar file.

Parsing Function

The parse function returns either the AST or a vector of parsing errors. It's a wrapper around the parser that LALRPOP generates.

use language_project_template::prelude::*;

let source = "1 + 2";

type ErrorWithSpan = (ExprError, Span);

let ast: Result<Expr, Vec<ErrorWithSpan>> = parse(&source);

Testing

Language projects tend to need a healthy mix of low-scoped (unit) and high-scoped (integration, e2e) tests to be effective. As the grammar/syntax grows, the project expands, and more syntax features need to be mixed and matched, more and more tests are needed to prevent breaking things.

Unit Tests

Unit tests live with the implementation code, the same as most Rust projects.

Specification Tests

Specification tests live in ./spec/ and are run by ./tests/spec_tests.rs. They are split between valid and invalid syntax tests.

Valid Syntax Tests

The valid syntax tests consist of an input file NAME.expr along with expected results NAME.expr.tokens for the lexer and NAME.expr.ast for the parser. These are great for both catching bugs and helping identify where the problem is.

1 + 2
[
    Ok(
        (
            0,
            Number(
                1.0,
            ),
            1,
        ),
    ),
    Ok(
        (
            2,
            Plus,
            3,
        ),
    ),
    Ok(
        (
            4,
            Number(
                2.0,
            ),
            5,
        ),
    ),
]
Ok(
    Expr {
        kind: InfixOp(
            ExprInfixOp {
                lt: Expr {
                    kind: Literal(
                        Number(
                            1.0,
                        ),
                    ),
                    span: 0..1,
                },
                op: Add,
                rt: Expr {
                    kind: Literal(
                        Number(
                            2.0,
                        ),
                    ),
                    span: 4..5,
                },
            },
        ),
        span: 0..5,
    },
)

Invalid Syntax Tests

The invalid syntax tests consist of an input file NAME.expr along with expected lexer results NAME.expr.tokens and expected error cases NAME.expr.error & NAME.expr.diagnostics. These are great for testing integrations like the language server or the VS Code extension.

(
[
    Ok(
        (
            0,
            LParan,
            1,
        ),
    ),
]
Err(
    [
        (
            SyntaxError(
                UnrecognizedEOF {
                    expected: [
                        "\"(\"",
                        "number",
                    ],
                },
            ),
            1..1,
        ),
    ],
)
[
    Diagnostic {
        severity: Error,
        code: Some(
            "syntax",
        ),
        message: "unexpected end of file; expected: [\"\\\"(\\\"\", \"number\"]",
        labels: [
            Label {
                style: Primary,
                file_id: 0,
                range: 1..1,
                message: "",
            },
        ],
        notes: [],
    },
]