Parser Combinators

First of all, let’s define what a parser is:

parse - 3. Computers. to analyze (a string of characters) in order to associate groups of characters with the syntactic units of the underlying grammar. (Dictionary.com)

And naturally a parser is something that performs a parse operation.

A combinator is the “…use of a [Higher Order Function] as an infix operator in a function-definition…” (Parser Combinator - Wikipedia)

Practically speaking, the technique means that we write small parser functions that do one little thing, such as detect only digits, and return the result and the remainder of the input.

We can then write combinator functions that take two parsers. The input is applied to the first parser, and the output of the first is (potentially) used in the second.

Given the two parsers

digit

and

1
letter

and the two combinators

1
<+>

(for sequence) and

1
<|>

(for alternative), this means that we can write a parser for

1
identity

which is defined as a letter followed by a letter or a digit as:

identity = letter <+> (letter <|> digit)

Given the input

1	a2 the rest of the input

the general flow is (the specific implementation is slightly different):

1
letter

returns the pair

1
('a', "2 the rest of the input")

1 <+> extracts the second item and puts it through the next parser
1 letter is given 1 "2 the rest of the input" which fails to detect a letter
1 <|> sees that 1 letter has failed so passes the input onto 1 digit
1 digit returns the pair 1 (2, "the rest of the input")
Now that 1 <+> has a result for both sides, it bundles the two success up into a pair and pairs that with the remaining input giving 1 (('a',2), "the rest of the input")

If we were to try and write

identity

using our previous technique, it would be far more cumbersome and much harder to read the intent of the code. As you can see, using combinators is a powerful and clean technique that makes the intent far more readable.

Starting again

Okay, combinators are cool. So now that we know about this technique, let’s put it to use in our own compiler. We will be borrowing very heavily from Andersons paper Parsing with Haskell in the inital stages, starting from a blank slate and building up the ability to parse

1	expression

s again. We will keep our

1
Expression

data type.

The first thing we need is a

Parser

type. As we saw above, it should return a pair, with the first item being the result, and the second being the unused portion of the input. However occasionally a parser may fail, and we need to know about that. So we will wrap our pair in a

1
Maybe

. We won’t stipulate what type the result should be, as it may be a

1
Integer

, a

1
String

etc. but the type of the parser will be the type of the success result.

type Parser a = String -> Maybe (a, String)

Starting from the very basics again, we need a

Parser

that can parse a single character. We will need to start using type declarations on our parsers so that we can combine them all together.

char :: Parser Char
char [] = Nothing
char (x:xs) = Just(x, xs)

Our first combinator

Rather than define a digit parser that explicitly tests for digits, we will define it as:

digit :: Parser Char
digit = char <=> isDigit

We haven’t defined

<=>

yet, but this will be the first combinator we write.

1
<=>

is essentially a boolean test operator. This combinator will only return the result of the first Parser if, and only if, the result passes the boolean expression. You can think of it as the same as

1
==

-- Given a parser and a predicate return the parser only if it satisfies the predicate.
infix 7 <=> 
(<=>) :: Parser a -> (a -> Bool) -> Parser a 
(parser <=> predicate) input =
    case parser input of 
        Nothing       -> Nothing 
        Just(a,rest)  -> if (predicate a) then Just(a,rest) else Nothing

Because our parsers are a

Maybe

we need to handle both the

1
Nothing

and

1
Just

cases. The

1
Nothing

case simply propogates the

1
Nothing

. The important part of this definition is in the

1
Just

case. It will return a

1
Just

containing the

1
parser

only if the value satifies the

1
predicate

Using this we can define a couple more parsers that will be useful a bit further down the track. These two should be quite clear in what they do:

space :: Parser Char
space = char <=> isSpace    

letter :: Parser Char
letter = char <=> isAlpha

And one that will be quite useful for checking for the

1
-

in our

1
Expression

literal :: Char -> Parser Char
literal c = char <=> (==c)

Alternatives Combinator

This combinator will allow us to define two or more paths through a parse operation. An example utility parser should explain it clearly.

alphanum :: Parser Char
alphanum = digit <|> letter

I told you

letter

would be useful.

1
<|>

operates like an

1
or

, that is, the result of

1
<|>

ing two or more parsers together is the first one that returns a successful result.

-- Combine two parsers using a 'or' type operation        
infixl 3 <|>
(<|>) :: Parser a -> Parser a -> Parser a 
(parserA <|> parserB) input = 
	case parserA input of 
    	Nothing -> parserB input 
    	result -> result

You might notice this time we used a left associative infix. This means that we can write

1	a <\|> b <\|> c

and it will logicaly return the first one that is a success.

We also used a lower order of precedence - the higher the number the further down the order it sinks. This means that we can write expressions such as

1	a <=> b <\|> c

which will equate to

1
(a <=> b) <|> c

Basic assignment

We now have enough parsers and combinators to be able to parse assignments of the form:

1	assignment ::= <letter>=<digit>

This will let us parse things such as

a=1

. Pretty basic. Let’s create an assignment type for this and a parser to recognise the form.

data Assign = Assign Char Expression 
	deriving Show

assign :: Parser ((Char, Char), Char)
assign = letter <+> literal '=' <+> digit

Now that reads almost exactly like the definition of assignment. The combinator

<+>

is a left associative sequence operator. This will only apply the second parser if the first is succesful and will return the two successes combined in a pair. Multiple applications of

1
<+>

will create nested pairs.

-- Combine two parser together pairing their results up in a tuple
infixl 6 <+>
(<+>) :: Parser a -> Parser b -> Parser (a, b)
(parserA <+> parserB) input = 
	case parserA input of
	    Nothing -> Nothing
	    Just (resultA, remainder) -> case parserB remainder of
	        Nothing -> Nothing
        	Just (resultB, cs) -> Just((resultA, resultB), cs)

We’ll also need to write a function that uses our new

assign

parser to recognise a valid assignment and build an

1
Assign

data type from the results.

parse :: String -> Assign
parse s = Assign id expr
    where (id, expr) = case assign s of 
            Nothing -> error "Invalid assignment"
            Just (((a, _), b), _) -> (a, (Num b)) -- the underscores represent the '=' which we don't need to keep

Great, but that extra ‘=’ in the result of

assign

which just gets thrown away all the time is annoying me. It makes more sense for

1
assign

to have the type

1
Parser (Char, Char)

so that it returns results in the form

1
(a,1)

parse :: String -> Assign
parse s = Assign id expr
    where (id, expr) = case assign s of 
            Nothing -> error "Invalid assignment"
            Just ((a, b), _) -> (a, Num b)
                   
assign = letter <+-> literal '=' <+> digit

We’ve refactored the new result format into

parse

, and the new combinator

1
<+->

has been introduced into

1
assign

. This combinator has the exact same semantics as

1
<+>

only it discards the second result.

-- Sequence operator that discards the second result 
infixl 6 <+-> 
(<+-> ) :: Parser a -> Parser b -> Parser a
(parserA <+->  parserB) input = 
	case parserA input of
	    Nothing -> Nothing
	    Just (resultA, remainder) -> case parserB remainder of
	        Nothing -> Nothing
	        Just (_, cs) -> Just(resultA, cs)

And just for completeness, one that discards the first result and keeps only the second.

-- Sequence operator that discards the first result       
infixl 6 <-+> 
(<-+> ) :: Parser a -> Parser b -> Parser b
(parserA <-+>  parserB) input = 
	case parserA input of
	    Nothing -> Nothing
	    Just (resultA, remainder) -> case parserB remainder of
	        Nothing -> Nothing
	        Just (resultB, cs) -> Just(resultB, cs) 

We now have a the beginnings of a compiler that, while a bit longer than the previous attempt, is much simpler to read and definately far simpler to extend.

Combinatorial Expressions

Let’s recap our current definition of an expression.

1	<expression> ::= <term>[<addOperation><term>]*

term

is defined as a single digit number and

1
addOperation

is defined as either

1
+

1
-

. Using our existing combinators and parsers we can define a new

1
expression

Starting simple we will write an

1	expression

parser that can only parse our single digits returning an

1
Expression

using the

1
Num

constructor.

expression :: Parser Expression
expression = digit >>> Num

We have a new combinator

>>>

that means “transformed by” or “applied to”. This combinator will apply the given function or transformation if the parser returns a non-empty result.

-- Transform a parsers result
infixl 5 >>>
(>>>) :: Parser a -> (a -> b) -> Parser b
(parser >>> transformer) input = 
	case parser input of
		Nothing -> Nothing
		Just (resultA, remainder) -> Just ((transformer resultA), remainder)

We’ve chosen use a left associative infix operator and carefully chosen the weighting so that the result of combining sequences and alternatives will collectively be passed to the transformer.

Testing this in GHCi will show that instead of getting a basic

1	Just('1', "")

that

1
digit

gives us we end up with

1
Just (Num '1',"")

This can be pushed down to a

term

function so that we can create a new

1
expression

to handle binary addition or subtraction. The obvious solution is implied by the definition.

expression = term <+> addOp <+> term 

addOp :: Parser Char
addOp = literal '+' <|> literal '-'

You might have notice that I left the type hint off

1	expression

. This was intentional as the current definition won’t create a valid

1
Expression

although it will create a valid result that looks like:

*Main> expression "1+1"
Just (((Num '1','+'),Num '1'),"")

But we need to get that into something that looks like

1	Just ((Add (Num '1'), (Num '1')),"")

. We can use our transform cobinator

1
>>>

to pass the results to another function that will rip the previous result apart and build the format that we need.

expression :: Parser Expression
expression = term <+> addOp <+> term >>> buildOp

buildOp ((termA, op), termB)
	| op == '+' = Add termA termB
	| op == '-' = Sub termA termB
	| otherwise = expected "add operation"

In this definition

buildOp

uses some guards to figure out which operation to build, and also has a call to

1
expected

. Through the wonders of partial and first class functions, we can actually rewrite

1
addOp

to tell us which data constructor to use, which means

1
buildOp

gets simplified and won’t have a chance to fail.


addOp :: Parser (Expression -> Expression -> Expression)
addOp = literal '+' >>> (\_ -> Add)
    <|> literal '-' >>> (\_ -> Sub)

buildOp :: ((Expression, Expression -> Expression -> Expression), Expression) -> Expression
buildOp ((expressionA, op), expressionB) = op expressionA expressionB

Here you can see the careful selection of infix weighting on the combinators coming to fruition. In the definiton of

addOp

we use the alterantive and the transform combinators in a logical and clear to read manner.

Beyond binary

So that’s binary. We have the basic process so now let’s extend it to the n-ary version. We’ll take a different approach to the first attempt. What we want to do is find the first

term

and then pass that onto a subfunction that continuosly builds an

1
Expression

based on the parsing

1
[<addOp><term>]

expression :: Parser Expression
expression = term +> expression'
expression' e = addOp <+> term >>> buildOp e +> expression'
            <|> result e

This subfunction takes an

1	Expression

, parses out a

1
addOp

and another factor, passes the original

1
Expression

, the

1
addOp

and the new

1
term

into

1
buildOp

and then sends this new

1
Expression

through itself again in a recursive manner until it can no longer parse a

1
addOp

and

1
term

at which point it returns the

1
Expression

built so far.

We’ve introduced a new combinator

+>

which can be though of as a reduction. It will take everything on the left and apply the function on the right returning the raw result. This is similar to

1
>>>

which applies the function and returns the wrapped result, whereas

1
+>

will unwrap the result.

-- Extract a parsers result	
infix 4 +>
(+>) :: Parser a -> (a -> Parser b) -> Parser b
(parser +> function) input = 
	case parser input of
		Nothing -> Nothing
		Just (a, cs) -> function a cs

Similarly

result

is a parser that simply returns what it is given wrapped up as a

1
Parser

type. This lets us do things such as define default alternatives, or in this case, return the accumulated result.

result :: a -> Parser a
result a cs = Just(a,cs)

We also need to revisit

buildOp

. This function now takes two parameters getting the new operation and factor as a pair. It’s a small shuffle so let’s quickly redefine it.

buildOp :: Expression -> ((Expression -> Expression -> Expression), Expression) -> Expression
buildOp expressionA (op, expressionB) = op expressionA expressionB

There are a couple of added benefits to defining

1	expression

using the two parts. We solved the left associative issue because the operations are built up on the way throught the parsing instead of all at the end. Additionally we can now have single term expressions.

Assignable Expressions

In the previous chapter we defined

assign

using

1
digit

and had to use a data constructor directly in

1
parse

to get a valid

1
Assign

. We can make some small changes to use any

1
Expression

on the right of

1
Assign

parse :: String -> Assign
parse s = Assign id expr
    where (id, expr) = case assign s of 
            Nothing -> error "Invalid assignment"
            Just ((a, b), _) -> (a, b)

assign :: Parser (Char, Expression)
assign = letter <+-> literal '=' <+> expression

And now we can get some pretty good (and correct) results out of our parser.

*Main> parse "a=1"
Assign 'a' (Num '1')
*Main> parse "a=1+7-3"
Assign 'a' (Sub (Add (Num '1') (Num '7')) (Num '3'))