Crate pest [−] [src]

pest. Elegant, efficient grammars

pest is a PEG parser generator with simplicity and speed in mind.

Input & Parser

pest works mainly through two traits: Input & Parser. Input is used to remember the current position or index of the to-be-parsed input. It also knows how to match a &str or char range against the current position. If it did in fact match, it advances the input by incrementing the position.

let mut input = StringInput::new("asdasdf"); // Input mutates its position

assert_eq!(input.pos(), 0);                  // matching starts from 0, before 'a'

assert!(input.match_string("asd"));          // Input::match_string matches "asd"; returns true

assert_eq!(input.pos(), 3);                  // last match advances the parser by "asd".len()

assert!(input.match_range('a', 'z'));        // Input::match_range matches 'a'; returns true

assert_eq!(input.pos(), 4);                  // last match advances the parser by 1

Input is also supposed to return a &str slice of its input by calling Input::slice.

Parser gets constructed on top of an Input and delegates position access to Parser::pos and Parser::set_pos. Apart from this, Parser also gives access to its Token queue and expected rules to match when it fails.

grammar!

The grammar! macro processes every rule and generates a method on a Parser that returns whether the rule has matched its Input. grammar! can only be used inside of impl_rdp! right now until other parser algorithms are implemented.

Note: grammar! may require you to increase the recursion limit of your create with #![recursion_limit = "*"] where * is the new limit.

When impl_rdp! is run, it implements an enum called Rule that has a value for all non-silent rules, but also for any and eoi. These Rules are used within Tokens to specify the type of rule that matched. These Tokens are accesible from Parser::queue after parsing. Instead of having the shape of an AST, the Tokens come in a Vec in a predefined order that makes them easy to process.

impl_rdp! {
    grammar! {
        expression = _{ paren ~ expression? } // expression is silent so we'll only have parens
        paren      =  { ["("] ~ expression? ~ [")"] }
    }
}

let mut parser = Rdp::new(StringInput::new("(())()"));
                                        //  ^--^   - Token { paren, 0, 4 }; queue[0]
                                        //   ^^    - Token { paren, 1, 3 }; queue[1]
                                        //      ^^ - Token { paren, 4, 6 }; queue[2]

assert!(parser.expression());
assert!(parser.end());

let queue = vec![
    Token::new(Rule::paren, 0, 4),
    Token::new(Rule::paren, 1, 3),
    Token::new(Rule::paren, 4, 6)
];

assert_eq!(parser.queue(), &queue);

Rules are also used for error reporting through Parser::queue which is used when a Parser failed to parse and you want to see what Rules it expected at the last possible position.

impl_rdp! {
    grammar! {
        expression = _{ paren ~ expression? } // expression is silent so we'll only have parens
        paren      =  { ["("] ~ expression? ~ [")"] }
    }
}

let mut parser = Rdp::new(StringInput::new("(())()foo"));
                                        //        ^   - Parser should expect a paren at pos 6

assert!(parser.expression()); // the parser goes as deep as it can
assert!(!parser.end());       // end is not reached, so the whole Input was not matched

assert_eq!(parser.expected(), (vec![Rule::paren], 6));

Note: You can use the eoi rule instead of calling Parser::end manually.

Calculator example

This example will concentrate on parsing and solving simple airthmetic with parens, additions, subtractions, multiplications, and divisions.

Let's start with defining a rule that matches integers. We first need to match an optional "-".

impl_rdp! {
    grammar! {
        number = { ["-"]? }
    }
}

let mut parser = Rdp::new(StringInput::new("-"));

assert!(parser.number());
assert!(parser.end());

In order to match a number, we should deal with two cases:

number is 0
number is any other number not starting with 0

impl_rdp! {
    grammar! {
        number = { ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) }
               //  |    | |  |     | |          | |         ^ zero or more
               //  |    | |  |     | |          | ^ digit
               //  |    | |  |     | |          ^ followed by
               //  |    | |  |     | ^ non-zero digit
               //  |    | |  |     ^ or
               //  |    | |  ^ zero
               //  |    | ^ followed by
               //  |    ^ optional
               //  ^ minus
    }
}

let mut parser = Rdp::new(StringInput::new("-90"));

assert!(parser.number());
assert!(parser.end());

Now let's add operator rules.

impl_rdp! {
    grammar! {
        number = { ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) }
        plus   = { ["+"] }
        minus  = { ["-"] }
        times  = { ["*"] }
        slash  = { ["/"] }
    }
}

Because infix precedence is hard to implement in PEG and quite inefficient, pest comes with a rule that implements precedence climbing.

impl_rdp! {
    grammar! {
        expression = {
            { number }                         // primary rule is a number
            addition       = { plus  | minus } // precedence 0 is addition
            multiplication = { times | slash } // precedence 1 is multiplication
        }
        number = { ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) }
        plus   = { ["+"] }
        minus  = { ["-"] }
        times  = { ["*"] }
        slash  = { ["/"] }
    }
}

Before we go any further, let's see what parsing a number from an expression places on to the queue.

let mut parser = Rdp::new(StringInput::new("-90"));

assert!(parser.expression());
assert!(parser.end());

println!("{:?}", parser.queue()); // [Token { rule: expression, start: 0, end: 3 },
                                  //  Token { rule: number, start: 0, end: 3 }]

Since we're already parsing an expression and don't care about its length, we can make it silent.

impl_rdp! {
    grammar! {
        expression = _{ // the underscore tells pest that this rule is silent
            { number }
            addition       = { plus  | minus }
            multiplication = { times | slash }
        }
        number = { ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) }
        plus   = { ["+"] }
        minus  = { ["-"] }
        times  = { ["*"] }
        slash  = { ["/"] }
    }
}

let mut parser = Rdp::new(StringInput::new("-90"));

assert!(parser.expression());
assert!(parser.end());

let queue = vec![
    Token::new(Rule::number, 0, 3)
];

assert_eq!(parser.queue(), &queue);

Adding parens to the whole business is as easy as adding a paren rule to the primary rule of the precedence climber.

impl_rdp! {
    grammar! {
        expression = _{
            { ["("] ~ expression ~ [")"] | number }
            addition       = { plus  | minus }
            multiplication = { times | slash }
        }
        number = { ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) }
        plus   = { ["+"] }
        minus  = { ["-"] }
        times  = { ["*"] }
        slash  = { ["/"] }
    }
}

let mut parser = Rdp::new(StringInput::new("((-90))"));

assert!(parser.expression());
assert!(parser.end());

let queue = vec![
    Token::new(Rule::number, 2, 5)
];

assert_eq!(parser.queue(), &queue);

Before we get to the processing of the Tokens, let's also add white-space to the grammar.

impl_rdp! {
    grammar! {
        expression = _{
            { ["("] ~ expression ~ [")"] | number }
            addition       = { plus  | minus }
            multiplication = { times | slash }
        }
        number = { ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) }
        plus   = { ["+"] }
        minus  = { ["-"] }
        times  = { ["*"] }
        slash  = { ["/"] }

        whitespace = _{ [" "] }
    }
}

let mut parser = Rdp::new(StringInput::new("2 + 2"));

assert!(parser.expression());
assert!(parser.end());

But now trying to parse "9 9" will work and recognize it as a number.

let mut parser = Rdp::new(StringInput::new("9 9"));

assert!(parser.expression());
assert!(parser.end());

let queue = vec![
    Token::new(Rule::number, 0, 3)
];

assert_eq!(parser.queue(), &queue);

To solve this issue, we make number atomic, stopping any white-space matching inside of the rule.

impl_rdp! {
    grammar! {
        expression = _{
            { ["("] ~ expression ~ [")"] | number }
            addition       = { plus  | minus }
            multiplication = { times | slash }
        }
        number = @{ ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) }
        plus   =  { ["+"] }
        minus  =  { ["-"] }
        times  =  { ["*"] }
        slash  =  { ["/"] }

        whitespace = _{ [" "] }
    }
}

let mut parser = Rdp::new(StringInput::new("9 9"));

assert!(parser.expression());
assert!(!parser.end());

To process all these Tokens we'll use the process! macro. This macro defines matcher methods on the Parser that pattern match against a set of Tokens from the front of the queue, calling themselves recursively until everything matched and returned its result.

Let's start by defining the signature of the compute matcher that will do the computation. We need it to return an i32 in the end.

compute(&self) -> i32

Now all we need to do is to write the three cases of interest, namely number, addition, and multiplication. number is captured (its &str is being sliced from the Input) with the & pattern and then parsed to an i32.

(&number: number) => number.parse::<i32>().unwrap()

addition and multiplication are virtually identical:

match the addition/multiplication Token without using it
recursively process the left-hand-side by calling compute
use the sign Token without capturing its &str value
recursively process the right-hand-side by calling compute
match sign inside the block and return the appropriate result

(_: addition, left: compute(), sign, right: compute()) => {
    match sign.rule {
        Rule::plus  => left + right,
        Rule::minus => left - right,
        _ => unreachable!()
    }
},
(_: multiplication, left: compute(), sign, right: compute()) => {
    match sign.rule {
        Rule::times => left * right,
        Rule::slash => left / right,
        _ => unreachable!()
    }
}

The reason we're matching sign manually inside of the block is because using _: plus and _: minus will cause left: main() to be run twice in case the first rule fails. Caching the result in this case is non-trivial apart from the fact that duplicate complex patterns are not necessarily easier to read.

Now for the whole example:

impl_rdp! {
    grammar! {
        expression = _{
            { ["("] ~ expression ~ [")"] | number }
            addition       = { plus  | minus }
            multiplication = { times | slash }
        }
        number = @{ ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) }
        plus   =  { ["+"] }
        minus  =  { ["-"] }
        times  =  { ["*"] }
        slash  =  { ["/"] }

        whitespace = _{ [" "] }
    }

    process! {
        compute(&self) -> i32 {
            (&number: number) => number.parse::<i32>().unwrap(),
            (_: addition, left: compute(), sign, right: compute()) => {
                match sign.rule {
                    Rule::plus  => left + right,
                    Rule::minus => left - right,
                    _ => unreachable!()
                }
            },
            (_: multiplication, left: compute(), sign, right: compute()) => {
                match sign.rule {
                    Rule::times => left * right,
                    Rule::slash => left / right,
                    _ => unreachable!()
                }
            }
        }
    }
}

let mut parser = Rdp::new(StringInput::new("(3 + (9 + 3 * 4 + (3 + 1) / 2 - 4)) * 2"));

assert!(parser.expression());
assert_eq!(parser.compute(), 44);

Modules

prelude

A mod that contains pest::Input, pest::Parser, pest::StringInput, and pest::Token.

Macros

grammar	A `macro` that defines each rule as a method on a `Parser` which parses from the current position. Rules are always defined between braces, with an optional symbol marking the type of rule defined.
impl_rdp	A `macro` useful for implementing the `Parser` `trait` as a recursive descent parser. It only accepts `grammar!` and `process!` calls that get implemented on `self`.
process	A `macro` for pattern-matching queued `Token`s generated by a `Parser`. It generates a method `process` on `&self` that processes the whole queue of `Token`s, reducing it to one single result.

Structs

StringInput	A `struct` useful for matching in-memory `String`s.
Token	A `struct` representing tokens generated by a parser.

Traits

Input	A `trait` that defines an input for a `Parser`.
Parser	A `trait` that defines a parser.