Crate pest [−] [src]
pest. Elegant, efficient grammars
pest is a PEG parser generator with simplicity and speed in mind.
Input & Parser
pest works mainly through two traits: Input & Parser. Input is used to remember the
current position or index of the to-be-parsed input. It also knows how to match a &str or
char range against the current position. If it did in fact match, it advances the input by
incrementing the position.
let mut input = StringInput::new("asdasdf"); // Input mutates its position assert_eq!(input.pos(), 0); // matching starts from 0, before 'a' assert!(input.match_string("asd")); // Input::match_string matches "asd"; returns true assert_eq!(input.pos(), 3); // last match advances the parser by "asd".len() assert!(input.match_range('a', 'z')); // Input::match_range matches 'a'; returns true assert_eq!(input.pos(), 4); // last match advances the parser by 1
Input is also supposed to return a &str slice of its input by calling
Input::slice.
Parser gets constructed on top of an Input and delegates position access to
Parser::pos and
Parser::set_pos. Apart from this, Parser also gives access
to its Token queue and expected rules to match when it fails.
grammar!
The grammar! macro processes every rule and generates a method on a
Parser that returns whether the rule has matched its Input. grammar! can only be used
inside of impl_rdp! right now until other parser algorithms are
implemented.
Note: grammar! may require you to increase the recursion limit of your create with
#![recursion_limit = "*"] where * is the new limit.
When impl_rdp! is run, it implements an enum called Rule that has a value for all
non-silent rules, but also for
any and eoi. These Rules are used within Tokens to specify the type
of rule that matched. These Tokens are accesible from
Parser::queue after parsing. Instead of having the shape of
an AST, the Tokens come in a Vec in a predefined order that makes them easy to process.
impl_rdp! { grammar! { expression = _{ paren ~ expression? } // expression is silent so we'll only have parens paren = { ["("] ~ expression? ~ [")"] } } } let mut parser = Rdp::new(StringInput::new("(())()")); // ^--^ - Token { paren, 0, 4 }; queue[0] // ^^ - Token { paren, 1, 3 }; queue[1] // ^^ - Token { paren, 4, 6 }; queue[2] assert!(parser.expression()); assert!(parser.end()); let queue = vec![ Token::new(Rule::paren, 0, 4), Token::new(Rule::paren, 1, 3), Token::new(Rule::paren, 4, 6) ]; assert_eq!(parser.queue(), &queue);
Rules are also used for error reporting through
Parser::queue which is used when a Parser failed to parse
and you want to see what Rules it expected at the last possible position.
impl_rdp! { grammar! { expression = _{ paren ~ expression? } // expression is silent so we'll only have parens paren = { ["("] ~ expression? ~ [")"] } } } let mut parser = Rdp::new(StringInput::new("(())()foo")); // ^ - Parser should expect a paren at pos 6 assert!(parser.expression()); // the parser goes as deep as it can assert!(!parser.end()); // end is not reached, so the whole Input was not matched assert_eq!(parser.expected(), (vec![Rule::paren], 6));
Note: You can use the eoi rule instead of calling
Parser::end manually.
Calculator example
This example will concentrate on parsing and solving simple airthmetic with parens, additions, subtractions, multiplications, and divisions.
Let's start with defining a rule that matches integers. We first need to match an optional
"-".
impl_rdp! { grammar! { number = { ["-"]? } } } let mut parser = Rdp::new(StringInput::new("-")); assert!(parser.number()); assert!(parser.end());
In order to match a number, we should deal with two cases:
- number is
0 - number is any other number not starting with
0
impl_rdp! { grammar! { number = { ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) } // | | | | | | | | ^ zero or more // | | | | | | | ^ digit // | | | | | | ^ followed by // | | | | | ^ non-zero digit // | | | | ^ or // | | | ^ zero // | | ^ followed by // | ^ optional // ^ minus } } let mut parser = Rdp::new(StringInput::new("-90")); assert!(parser.number()); assert!(parser.end());
Now let's add operator rules.
impl_rdp! { grammar! { number = { ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) } plus = { ["+"] } minus = { ["-"] } times = { ["*"] } slash = { ["/"] } } }
Because infix precedence is hard to implement in PEG and quite inefficient, pest comes with a rule that implements precedence climbing.
impl_rdp! { grammar! { expression = { { number } // primary rule is a number addition = { plus | minus } // precedence 0 is addition multiplication = { times | slash } // precedence 1 is multiplication } number = { ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) } plus = { ["+"] } minus = { ["-"] } times = { ["*"] } slash = { ["/"] } } }
Before we go any further, let's see what parsing a number from an expression places on to
the queue.
let mut parser = Rdp::new(StringInput::new("-90")); assert!(parser.expression()); assert!(parser.end()); println!("{:?}", parser.queue()); // [Token { rule: expression, start: 0, end: 3 }, // Token { rule: number, start: 0, end: 3 }]
Since we're already parsing an expression and don't care about its length, we can make it
silent.
impl_rdp! { grammar! { expression = _{ // the underscore tells pest that this rule is silent { number } addition = { plus | minus } multiplication = { times | slash } } number = { ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) } plus = { ["+"] } minus = { ["-"] } times = { ["*"] } slash = { ["/"] } } } let mut parser = Rdp::new(StringInput::new("-90")); assert!(parser.expression()); assert!(parser.end()); let queue = vec![ Token::new(Rule::number, 0, 3) ]; assert_eq!(parser.queue(), &queue);
Adding parens to the whole business is as easy as adding a paren rule to the primary rule of the precedence climber.
impl_rdp! { grammar! { expression = _{ { ["("] ~ expression ~ [")"] | number } addition = { plus | minus } multiplication = { times | slash } } number = { ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) } plus = { ["+"] } minus = { ["-"] } times = { ["*"] } slash = { ["/"] } } } let mut parser = Rdp::new(StringInput::new("((-90))")); assert!(parser.expression()); assert!(parser.end()); let queue = vec![ Token::new(Rule::number, 2, 5) ]; assert_eq!(parser.queue(), &queue);
Before we get to the processing of the Tokens, let's also add white-space to the grammar.
impl_rdp! { grammar! { expression = _{ { ["("] ~ expression ~ [")"] | number } addition = { plus | minus } multiplication = { times | slash } } number = { ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) } plus = { ["+"] } minus = { ["-"] } times = { ["*"] } slash = { ["/"] } whitespace = _{ [" "] } } } let mut parser = Rdp::new(StringInput::new("2 + 2")); assert!(parser.expression()); assert!(parser.end());
But now trying to parse "9 9" will work and recognize it as a number.
let mut parser = Rdp::new(StringInput::new("9 9")); assert!(parser.expression()); assert!(parser.end()); let queue = vec![ Token::new(Rule::number, 0, 3) ]; assert_eq!(parser.queue(), &queue);
To solve this issue, we make number atomic, stopping any
white-space matching inside of the rule.
impl_rdp! { grammar! { expression = _{ { ["("] ~ expression ~ [")"] | number } addition = { plus | minus } multiplication = { times | slash } } number = @{ ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) } plus = { ["+"] } minus = { ["-"] } times = { ["*"] } slash = { ["/"] } whitespace = _{ [" "] } } } let mut parser = Rdp::new(StringInput::new("9 9")); assert!(parser.expression()); assert!(!parser.end());
To process all these Tokens we'll use the process! macro. This macro
defines matcher methods on the Parser that pattern match against a set of Tokens from the
front of the queue, calling themselves recursively until everything matched and returned its
result.
Let's start by defining the signature of the compute matcher that will do the computation. We
need it to return an i32 in the end.
compute(&self) -> i32
Now all we need to do is to write the three cases of interest, namely number, addition, and
multiplication. number is captured (its &str is being sliced from the Input) with the
& pattern and then parsed to an i32.
(&number: number) => number.parse::<i32>().unwrap()
addition and multiplication are virtually identical:
- match the
addition/multiplicationTokenwithout using it - recursively process the left-hand-side by calling
compute - use the
signTokenwithout capturing its&strvalue - recursively process the right-hand-side by calling
compute - match
signinside the block and return the appropriate result
(_: addition, left: compute(), sign, right: compute()) => { match sign.rule { Rule::plus => left + right, Rule::minus => left - right, _ => unreachable!() } }, (_: multiplication, left: compute(), sign, right: compute()) => { match sign.rule { Rule::times => left * right, Rule::slash => left / right, _ => unreachable!() } }
The reason we're matching sign manually inside of the block is because using _: plus and
_: minus will cause left: main() to be run twice in case the first rule fails. Caching the
result in this case is non-trivial apart from the fact that duplicate complex patterns are not
necessarily easier to read.
Now for the whole example:
impl_rdp! { grammar! { expression = _{ { ["("] ~ expression ~ [")"] | number } addition = { plus | minus } multiplication = { times | slash } } number = @{ ["-"]? ~ (["0"] | ['1'..'9'] ~ ['0'..'9']*) } plus = { ["+"] } minus = { ["-"] } times = { ["*"] } slash = { ["/"] } whitespace = _{ [" "] } } process! { compute(&self) -> i32 { (&number: number) => number.parse::<i32>().unwrap(), (_: addition, left: compute(), sign, right: compute()) => { match sign.rule { Rule::plus => left + right, Rule::minus => left - right, _ => unreachable!() } }, (_: multiplication, left: compute(), sign, right: compute()) => { match sign.rule { Rule::times => left * right, Rule::slash => left / right, _ => unreachable!() } } } } } let mut parser = Rdp::new(StringInput::new("(3 + (9 + 3 * 4 + (3 + 1) / 2 - 4)) * 2")); assert!(parser.expression()); assert_eq!(parser.compute(), 44);
Modules
| prelude |
A |
Macros
| grammar |
A |
| impl_rdp |
A |
| process |
A |
Structs
| StringInput |
A |
| Token |
A |
Traits
| Input |
A |
| Parser |
A |