First of all, let’s define what a parser is:
parse - 3. Computers. to analyze (a string of characters) in order to associate groups of characters with the syntactic units of the underlying grammar. (Dictionary.com)
And naturally a parser is something that performs a parse operation.
A combinator is the “…use of a [Higher Order Function] as an infix operator in a function-definition…” (Parser Combinator - Wikipedia)
Practically speaking, the technique means that we write small parser functions that do one little thing, such as detect only digits, and return the result and the remainder of the input.
We can then write combinator functions that take two parsers. The input is applied to the first parser, and the output of the first is (potentially) used in the second.
Given the two parsers
and 1
digit
and the two combinators 1
letter
(for sequence) and 1
<+>
(for alternative), this means that we can write a parser for 1
<|>
which is defined as a letter followed by a letter or a digit as:1
identity
1
identity = letter <+> (letter <|> digit)
Given the input
the general flow is (the specific implementation is slightly different):1
a2 the rest of the input
1
letter
returns the pair 1
('a', "2 the rest of the input")
1
<+>
extracts the second item and puts it through the next parser1
letter
is given 1
"2 the rest of the input"
which fails to detect a letter1
<|>
sees that 1
letter
has failed so passes the input onto 1
digit
1
digit
returns the pair 1
(2, "the rest of the input")
1
<+>
has a result for both sides, it bundles the two success up into a pair and pairs that with the remaining input giving 1
(('a',2), "the rest of the input")
If we were to try and write
using our previous technique, it would be far more cumbersome and much harder to read the intent of the code. As you can see, using combinators is a powerful and clean technique that makes the intent far more readable.1
identity
Okay, combinators are cool. So now that we know about this technique, let’s put it to use in our own compiler. We will be borrowing very heavily from Andersons paper Parsing with Haskell in the inital stages, starting from a blank slate and building up the ability to parse
s again. We will keep our 1
expression
data type.1
Expression
The first thing we need is a
type. As we saw above, it should return a pair, with the first item being the result, and the second being the unused portion of the input. However occasionally a parser may fail, and we need to know about that. So we will wrap our pair in a 1
Parser
. We won’t stipulate what type the result should be, as it may be a 1
Maybe
, a 1
Integer
etc. but the type of the parser will be the type of the success result.1
String
type Parser a = String -> Maybe (a, String)
Starting from the very basics again, we need a
that can parse a single character. We will need to start using type declarations on our parsers so that we can combine them all together.1
Parser
char :: Parser Char
char [] = Nothing
char (x:xs) = Just(x, xs)
Rather than define a digit parser that explicitly tests for digits, we will define it as:
digit :: Parser Char
digit = char <=> isDigit
We haven’t defined
yet, but this will be the first combinator we write. 1
<=>
is essentially a boolean test operator. This combinator will only return the result of the first Parser if, and only if, the result passes the boolean expression. You can think of it as the same as 1
<=>
.1
==
-- Given a parser and a predicate return the parser only if it satisfies the predicate.
infix 7 <=>
(<=>) :: Parser a -> (a -> Bool) -> Parser a
(parser <=> predicate) input =
case parser input of
Nothing -> Nothing
Just(a,rest) -> if (predicate a) then Just(a,rest) else Nothing
Because our parsers are a
we need to handle both the 1
Maybe
and 1
Nothing
cases. The 1
Just
case simply propogates the 1
Nothing
. The important part of this definition is in the 1
Nothing
case. It will return a 1
Just
containing the 1
Just
only if the value satifies the 1
parser
.1
predicate
Using this we can define a couple more parsers that will be useful a bit further down the track. These two should be quite clear in what they do:
space :: Parser Char
space = char <=> isSpace
letter :: Parser Char
letter = char <=> isAlpha
And one that will be quite useful for checking for the
or 1
+
in our 1
-
s:1
Expression
literal :: Char -> Parser Char
literal c = char <=> (==c)
This combinator will allow us to define two or more paths through a parse operation. An example utility parser should explain it clearly.
alphanum :: Parser Char
alphanum = digit <|> letter
I told you
would be useful. 1
letter
operates like an 1
<|>
, that is, the result of 1
or
ing two or more parsers together is the first one that returns a successful result.1
<|>
-- Combine two parsers using a 'or' type operation
infixl 3 <|>
(<|>) :: Parser a -> Parser a -> Parser a
(parserA <|> parserB) input =
case parserA input of
Nothing -> parserB input
result -> result
You might notice this time we used a left associative infix. This means that we can write
and it will logicaly return the first one that is a success.1
a <|> b <|> c
We also used a lower order of precedence - the higher the number the further down the order it sinks. This means that we can write expressions such as
which will equate to 1
a <=> b <|> c
.1
(a <=> b) <|> c
We now have enough parsers and combinators to be able to parse assignments of the form:
1
assignment ::= <letter>=<digit>
This will let us parse things such as
. Pretty basic. Let’s create an assignment type for this and a parser to recognise the form.1
a=1
data Assign = Assign Char Expression
deriving Show
assign :: Parser ((Char, Char), Char)
assign = letter <+> literal '=' <+> digit
Now that reads almost exactly like the definition of assignment. The combinator
is a left associative sequence operator. This will only apply the second parser if the first is succesful and will return the two successes combined in a pair. Multiple applications of 1
<+>
will create nested pairs.1
<+>
-- Combine two parser together pairing their results up in a tuple
infixl 6 <+>
(<+>) :: Parser a -> Parser b -> Parser (a, b)
(parserA <+> parserB) input =
case parserA input of
Nothing -> Nothing
Just (resultA, remainder) -> case parserB remainder of
Nothing -> Nothing
Just (resultB, cs) -> Just((resultA, resultB), cs)
We’ll also need to write a function that uses our new
parser to recognise a valid assignment and build an 1
assign
data type from the results.1
Assign
parse :: String -> Assign
parse s = Assign id expr
where (id, expr) = case assign s of
Nothing -> error "Invalid assignment"
Just (((a, _), b), _) -> (a, (Num b)) -- the underscores represent the '=' which we don't need to keep
Great, but that extra ‘=’ in the result of
which just gets thrown away all the time is annoying me. It makes more sense for 1
assign
to have the type 1
assign
so that it returns results in the form 1
Parser (Char, Char)
.1
(a,1)
parse :: String -> Assign
parse s = Assign id expr
where (id, expr) = case assign s of
Nothing -> error "Invalid assignment"
Just ((a, b), _) -> (a, Num b)
assign = letter <+-> literal '=' <+> digit
We’ve refactored the new result format into
, and the new combinator 1
parse
has been introduced into 1
<+->
. This combinator has the exact same semantics as 1
assign
only it discards the second result.1
<+>
-- Sequence operator that discards the second result
infixl 6 <+->
(<+-> ) :: Parser a -> Parser b -> Parser a
(parserA <+-> parserB) input =
case parserA input of
Nothing -> Nothing
Just (resultA, remainder) -> case parserB remainder of
Nothing -> Nothing
Just (_, cs) -> Just(resultA, cs)
And just for completeness, one that discards the first result and keeps only the second.
1
2
3
4
5
6
7
8
9
-- Sequence operator that discards the first result
infixl 6 <-+>
(<-+> ) :: Parser a -> Parser b -> Parser b
(parserA <-+> parserB) input =
case parserA input of
Nothing -> Nothing
Just (resultA, remainder) -> case parserB remainder of
Nothing -> Nothing
Just (resultB, cs) -> Just(resultB, cs)
We now have a the beginnings of a compiler that, while a bit longer than the previous attempt, is much simpler to read and definately far simpler to extend.
Let’s recap our current definition of an expression.
1
<expression> ::= <term>[<addOperation><term>]*
is defined as a single digit number and 1
term
is defined as either 1
addOperation
or 1
+
. Using our existing combinators and parsers we can define a new 1
-
.1
expression
Starting simple we will write an
parser that can only parse our single digits returning an 1
expression
using the 1
Expression
constructor.1
Num
expression :: Parser Expression
expression = digit >>> Num
We have a new combinator
that means “transformed by” or “applied to”. This combinator will apply the given function or transformation if the parser returns a non-empty result.1
>>>
-- Transform a parsers result
infixl 5 >>>
(>>>) :: Parser a -> (a -> b) -> Parser b
(parser >>> transformer) input =
case parser input of
Nothing -> Nothing
Just (resultA, remainder) -> Just ((transformer resultA), remainder)
We’ve chosen use a left associative infix operator and carefully chosen the weighting so that the result of combining sequences and alternatives will collectively be passed to the transformer.
Testing this in GHCi will show that instead of getting a basic
that 1
Just('1', "")
gives us we end up with 1
digit
.1
Just (Num '1',"")
This can be pushed down to a
function so that we can create a new 1
term
to handle binary addition or subtraction. The obvious solution is implied by the definition.1
expression
expression = term <+> addOp <+> term
addOp :: Parser Char
addOp = literal '+' <|> literal '-'
You might have notice that I left the type hint off
. This was intentional as the current definition won’t create a valid 1
expression
although it will create a valid result that looks like:1
Expression
1
2
*Main> expression "1+1"
Just (((Num '1','+'),Num '1'),"")
But we need to get that into something that looks like
. We can use our transform cobinator 1
Just ((Add (Num '1'), (Num '1')),"")
to pass the results to another function that will rip the previous result apart and build the format that we need.1
>>>
expression :: Parser Expression
expression = term <+> addOp <+> term >>> buildOp
buildOp ((termA, op), termB)
| op == '+' = Add termA termB
| op == '-' = Sub termA termB
| otherwise = expected "add operation"
In this definition
uses some guards to figure out which operation to build, and also has a call to 1
buildOp
. Through the wonders of partial and first class functions, we can actually rewrite 1
expected
to tell us which data constructor to use, which means 1
addOp
gets simplified and won’t have a chance to fail.1
buildOp
addOp :: Parser (Expression -> Expression -> Expression)
addOp = literal '+' >>> (\_ -> Add)
<|> literal '-' >>> (\_ -> Sub)
buildOp :: ((Expression, Expression -> Expression -> Expression), Expression) -> Expression
buildOp ((expressionA, op), expressionB) = op expressionA expressionB
Here you can see the careful selection of infix weighting on the combinators coming to fruition. In the definiton of
we use the alterantive and the transform combinators in a logical and clear to read manner.1
addOp
So that’s binary. We have the basic process so now let’s extend it to the n-ary version. We’ll take a different approach to the first attempt. What we want to do is find the first
and then pass that onto a subfunction that continuosly builds an 1
term
based on the parsing 1
Expression
.1
[<addOp><term>]
expression :: Parser Expression
expression = term +> expression'
expression' e = addOp <+> term >>> buildOp e +> expression'
<|> result e
This subfunction takes an
, parses out a 1
Expression
and another factor, passes the original 1
addOp
, the 1
Expression
and the new 1
addOp
into 1
term
and then sends this new 1
buildOp
through itself again in a recursive manner until it can no longer parse a 1
Expression
and 1
addOp
at which point it returns the 1
term
built so far.1
Expression
We’ve introduced a new combinator
which can be though of as a reduction. It will take everything on the left and apply the function on the right returning the raw result. This is similar to 1
+>
which applies the function and returns the wrapped result, whereas 1
>>>
will unwrap the result.1
+>
-- Extract a parsers result
infix 4 +>
(+>) :: Parser a -> (a -> Parser b) -> Parser b
(parser +> function) input =
case parser input of
Nothing -> Nothing
Just (a, cs) -> function a cs
Similarly
is a parser that simply returns what it is given wrapped up as a 1
result
type. This lets us do things such as define default alternatives, or in this case, return the accumulated result.1
Parser
result :: a -> Parser a
result a cs = Just(a,cs)
We also need to revisit
. This function now takes two parameters getting the new operation and factor as a pair. It’s a small shuffle so let’s quickly redefine it.1
buildOp
buildOp :: Expression -> ((Expression -> Expression -> Expression), Expression) -> Expression
buildOp expressionA (op, expressionB) = op expressionA expressionB
There are a couple of added benefits to defining
using the two parts. We solved the left associative issue because the operations are built up on the way throught the parsing instead of all at the end. Additionally we can now have single term expressions.1
expression
In the previous chapter we defined
using 1
assign
and had to use a data constructor directly in 1
digit
to get a valid 1
parse
. We can make some small changes to use any 1
Assign
on the right of 1
Expression
.1
Assign
parse :: String -> Assign
parse s = Assign id expr
where (id, expr) = case assign s of
Nothing -> error "Invalid assignment"
Just ((a, b), _) -> (a, b)
assign :: Parser (Char, Expression)
assign = letter <+-> literal '=' <+> expression
And now we can get some pretty good (and correct) results out of our parser.
1
2
3
4
*Main> parse "a=1"
Assign 'a' (Num '1')
*Main> parse "a=1+7-3"
Assign 'a' (Sub (Add (Num '1') (Num '7')) (Num '3'))