Parsing

Idris 2 comes with a lexing and a parsing library as part of the contrib package.

For this cookbook, we will write a very simple parser for a lambda calculus that accepts the following language:

let name = world in (\x.hello x) name

Once we have written the lambda calculus parser, we will also see how to take advantage of a powerful built-in expression parser in Idris 2 to write a small calculator that is smart enough to parse the following expression:

1 + 2 - 3 * 4 / 5

Lexer

The main lexer module is under Text.Lexer. This module contains toTokenMap, a function that converts a List (Lexer, k) into a TokenMap (Token k), where k is a token kind. It can be used for simple lexer-to-token mappings. The module also includes higher-level lexers for specifying quantity, as well as lexers for common programming primitives like alphas, intLit, lineComment and blockComment.
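
As a quick, standalone illustration (using a hypothetical MiniKind kind, not part of the lambda calculus example below), a token map built with toTokenMap and a few of the built-in lexers mentioned above might look like this, assuming import Text.Lexer is in scope:

data MiniKind = MKName | MKInt | MKComment | MKSpace

miniTokenMap : TokenMap (Token MiniKind)
miniTokenMap = toTokenMap
  [ (alphas, MKName)                                    -- one or more letters
  , (intLit, MKInt)                                     -- an integer literal
  , (lineComment (exact "--"), MKComment)               -- "-- ..." to the end of the line
  , (blockComment (exact "{-") (exact "-}"), MKComment) -- "{- ... -}"
  , (spaces, MKSpace)                                   -- whitespace, to be ignored later
  ]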

The Text.Lexer module also reexports Text.Lexer.Core, Text.Quantity and Text.Token.

Text.Lexer.Core provides the building blocks of the lexer, including a type called Recognise, which is the underlying data type of the lexer. The other important function this module provides is lex, which takes a token map and an input string and returns the recognised tokens, together with the line, column and remainder of any input it could not consume.
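
Here is a hedged sketch of calling lex, reusing the hypothetical miniTokenMap from the sketch above; an empty remainder means the whole input was lexed successfully:

lexMini : String -> Maybe (List (WithBounds (Token MiniKind)))
lexMini str =
  case lex miniTokenMap str of
    -- lex returns the tokens plus the line, column and remaining input
    (tokens, _, _, "") => Just tokens
    _ => Nothing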

Text.Quantity provides a data type Quantity which can be used with certain lexers to specify how many times something is expected to appear.
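
For example, Text.Lexer has a count combinator that takes a Quantity. A hedged sketch, assuming count from Text.Lexer and exactly and atLeast from Text.Quantity, of a lexer for a four-digit year followed by at least one letter:

-- matches e.g. "1999ad" but not "99ad"
yearEra : Lexer
yearEra = count (exactly 4) digit <+> count (atLeast 1) alpha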

Text.Token provides a data type Token that represents a parsed token: its kind and its text. This module also provides an important interface called TokenKind, which describes how to map each token kind to an Idris 2 type and how to convert the matched text of that kind into a value of that type.
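
A minimal, hypothetical sketch of the interface for a NumKind with two kinds: number tokens carry a Double, everything else carries ():

data NumKind = NKNumber | NKOther

Eq NumKind where
  (==) NKNumber NKNumber = True
  (==) NKOther NKOther = True
  (==) _ _ = False

TokenKind NumKind where
  -- the Idris 2 type each kind maps to
  TokType NKNumber = Double
  TokType _ = ()

  -- how to turn the matched text into a value of that type
  tokValue NKNumber s = cast s
  tokValue NKOther _ = ()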

Parser

The main parser module is under Text.Parser. This module contains different grammar combinators, the main one being match, which takes a token kind and produces a grammar whose result is the value assigned to that kind by the TokenKind interface. There are other grammar combinators as well, but for our example, we will only be using match.
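
A small sketch of match, reusing the hypothetical NumKind from the previous sketch (and assuming import Text.Parser): the result type of the grammar is whatever TokType assigns to the matched kind:

number : Grammar state (Token NumKind) True Double
number = match NKNumber    -- consumes one NKNumber token and yields its Double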

The Text.Parser module reexports Text.Parser.Core, Text.Quantity and Text.Token.

Text.Parser.Core provides the building blocks of the parser, including a type called Grammar, which is the underlying data type of the parser. The other important function this module provides is parse, which takes a Grammar and a list of tokens and returns either the parsing errors or the parsed result together with any unconsumed tokens.
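
A hedged sketch of running parse with the number grammar from the previous sketch; a successful parse yields the value plus whatever tokens were left over:

runNumber : List (WithBounds (Token NumKind)) -> Either String Double
runNumber toks =
  case parse number toks of
    Right (value, []) => Right value            -- everything was consumed
    Right _ => Left "unconsumed tokens remain"  -- parsed, but input was left over
    Left _ => Left "failed to parse"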

We covered Text.Quantity and Text.Token in the Lexer section so we’re not going to repeat what they do here.

Lambda Calculus Lexer & Parser

LambdaCalculus.idr
import Data.List
import Data.List1
import Text.Lexer
import Text.Parser

%default total

data Expr = App Expr Expr | Abs String Expr | Var String | Let String Expr Expr

Show Expr where
  showPrec d (App e1 e2) = showParens (d == App) (showPrec (User 0) e1 ++ " " ++ showPrec App e2)
  showPrec d (Abs v e) = showParens (d > Open) ("\\" ++ v ++ "." ++ show e)
  showPrec d (Var v) = v
  showPrec d (Let v e1 e2) = showParens (d > Open) ("let " ++ v ++ " = " ++ show e1 ++ " in " ++ show e2)

data LambdaTokenKind
  = LTLambda
  | LTIdentifier
  | LTDot
  | LTOParen
  | LTCParen
  | LTIgnore
  | LTLet
  | LTEqual
  | LTIn

Eq LambdaTokenKind where
  (==) LTLambda LTLambda = True
  (==) LTDot LTDot = True
  (==) LTIdentifier LTIdentifier = True
  (==) LTOParen LTOParen = True
  (==) LTCParen LTCParen = True
  (==) LTLet LTLet = True
  (==) LTEqual LTEqual = True
  (==) LTIn LTIn = True
  (==) _ _ = False

Show LambdaTokenKind where
  show LTLambda = "LTLambda"
  show LTDot = "LTDot"
  show LTIdentifier = "LTIdentifier"
  show LTOParen = "LTOParen"
  show LTCParen = "LTCParen"
  show LTIgnore = "LTIgnore"
  show LTLet = "LTLet"
  show LTEqual = "LTEqual"
  show LTIn = "LTIn"

LambdaToken : Type
LambdaToken = Token LambdaTokenKind

Show LambdaToken where
  show (Tok kind text) = "Tok kind: " ++ show kind ++ " text: " ++ text

TokenKind LambdaTokenKind where
  TokType LTIdentifier = String
  TokType _ = ()

  tokValue LTLambda _ = ()
  tokValue LTIdentifier s = s
  tokValue LTDot _ = ()
  tokValue LTOParen _ = ()
  tokValue LTCParen _ = ()
  tokValue LTIgnore _ = ()
  tokValue LTLet _ = ()
  tokValue LTEqual _ = ()
  tokValue LTIn _ = ()

ignored : WithBounds LambdaToken -> Bool
ignored (MkBounded (Tok LTIgnore _) _ _) = True
ignored _ = False

identifier : Lexer
identifier = alpha <+> many alphaNum

keywords : List (String, LambdaTokenKind)
keywords = [
  ("let", LTLet),
  ("in", LTIn)
]

lambdaTokenMap : TokenMap LambdaToken
lambdaTokenMap = toTokenMap [(spaces, LTIgnore)] ++
  [(identifier, \s =>
      case lookup s keywords of
        (Just kind) => Tok kind s
        Nothing => Tok LTIdentifier s
    )
  ] ++ toTokenMap [
    (exact "\\", LTLambda),
    (exact ".", LTDot),
    (exact "(", LTOParen),
    (exact ")", LTCParen),
    (exact "=", LTEqual)
  ]

lexLambda : String -> Maybe (List (WithBounds LambdaToken))
lexLambda str =
  case lex lambdaTokenMap str of
    (tokens, _, _, "") => Just tokens
    _ => Nothing

mutual
  expr : Grammar state LambdaToken True Expr
  expr = do
    t <- term
    app t <|> pure t

  term : Grammar state LambdaToken True Expr
  term = abs
    <|> var
    <|> paren
    <|> letE

  app : Expr -> Grammar state LambdaToken True Expr
  app e1 = do
    e2 <- term
    app1 $ App e1 e2

  app1 : Expr -> Grammar state LambdaToken False Expr
  app1 e = app e <|> pure e

  abs : Grammar state LambdaToken True Expr
  abs = do
    match LTLambda
    commit
    argument <- match LTIdentifier
    match LTDot
    e <- expr
    pure $ Abs argument e

  var : Grammar state LambdaToken True Expr
  var = map Var $ match LTIdentifier

  paren : Grammar state LambdaToken True Expr
  paren = do
    match LTOParen
    e <- expr
    match LTCParen
    pure e

  letE : Grammar state LambdaToken True Expr
  letE = do
    match LTLet
    commit
    argument <- match LTIdentifier
    match LTEqual
    e1 <- expr
    match LTIn
    e2 <- expr
    pure $ Let argument e1 e2

parseLambda : List (WithBounds LambdaToken) -> Either String Expr
parseLambda toks =
  case parse expr $ filter (not . ignored) toks of
    Right (l, []) => Right l
    Right e => Left "contains tokens that were not consumed"
    Left e => Left (show e)

parse : String -> Either String Expr
parse x =
  case lexLambda x of
    Just toks => parseLambda toks
    Nothing => Left "Failed to lex."

Testing out our parser gives us back the following output:

$ idris2 -p contrib LambdaCalculus.idr
Main> :exec printLn $ parse "let name = world in (\\x.hello x) name"
Right (let name = world in (\x.hello x) name)

Expression Parser

Idris 2 also comes with a very convenient expression parser that is aware of precedence and associativity in Text.Parser.Expression.

The main function, called buildExpressionParser, takes in an OperatorTable and a Grammar that parses the terms, and returns a grammar that parses whole expressions. The magic comes from the OperatorTable, since this table defines all the operators, the grammars for those operators, their precedence, and their associativity.

An OperatorTable is a list of lists of the Op type. The Op type allows you to specify Prefix, Postfix, and Infix operators along with their grammars. Infix also takes an associativity, Assoc, which can be AssocLeft for left-associative, AssocRight for right-associative, or AssocNone for non-associative operators.

The operator table we'll be using for the calculator is:

[
  [ Infix ((*) <$ match CTMultiply) AssocLeft
  , Infix ((/) <$ match CTDivide) AssocLeft
  ],
  [ Infix ((+) <$ match CTPlus) AssocLeft
  , Infix ((-) <$ match CTMinus) AssocLeft
  ]
]

This table defines 4 operators: multiplication, division, addition and subtraction. Multiplication and division appear in the first row because they have higher precedence than addition and subtraction, which appear in the second row. We also define them all as infix operators, each with its own grammar, and all left-associative via AssocLeft.
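
Prefix and Postfix rows work the same way. As a hypothetical sketch (not part of the calculator below), a higher-precedence row reusing the CTMinus token could add unary negation:

[
  [ Prefix (negate <$ match CTMinus) ],
  [ Infix ((*) <$ match CTMultiply) AssocLeft
  , Infix ((/) <$ match CTDivide) AssocLeft
  ],
  [ Infix ((+) <$ match CTPlus) AssocLeft
  , Infix ((-) <$ match CTMinus) AssocLeft
  ]
]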

Building a Calculator

Calculator.idr
import Data.List1
import Text.Lexer
import Text.Parser
import Text.Parser.Expression

%default total

data CalculatorTokenKind
  = CTNum
  | CTPlus
  | CTMinus
  | CTMultiply
  | CTDivide
  | CTOParen
  | CTCParen
  | CTIgnore

Eq CalculatorTokenKind where
  (==) CTNum CTNum = True
  (==) CTPlus CTPlus = True
  (==) CTMinus CTMinus = True
  (==) CTMultiply CTMultiply = True
  (==) CTDivide CTDivide = True
  (==) CTOParen CTOParen = True
  (==) CTCParen CTCParen = True
  (==) _ _ = False

Show CalculatorTokenKind where
  show CTNum = "CTNum"
  show CTPlus = "CTPlus"
  show CTMinus = "CTMinus"
  show CTMultiply = "CTMultiply"
  show CTDivide = "CTDivide"
  show CTOParen = "CTOParen"
  show CTCParen = "CTCParen"
  show CTIgnore = "CTIgnore"

CalculatorToken : Type
CalculatorToken = Token CalculatorTokenKind

Show CalculatorToken where
  show (Tok kind text) = "Tok kind: " ++ show kind ++ " text: " ++ text

TokenKind CalculatorTokenKind where
  TokType CTNum = Double
  TokType _ = ()

  tokValue CTNum s = cast s
  tokValue CTPlus _ = ()
  tokValue CTMinus _ = ()
  tokValue CTMultiply _ = ()
  tokValue CTDivide _ = ()
  tokValue CTOParen _ = ()
  tokValue CTCParen _ = ()
  tokValue CTIgnore _ = ()

ignored : WithBounds CalculatorToken -> Bool
ignored (MkBounded (Tok CTIgnore _) _ _) = True
ignored _ = False

number : Lexer
number = digits

calculatorTokenMap : TokenMap CalculatorToken
calculatorTokenMap = toTokenMap [
  (spaces, CTIgnore),
  (digits, CTNum),
  (exact "+", CTPlus),
  (exact "-", CTMinus),
  (exact "*", CTMultiply),
  (exact "/", CTDivide)
]

lexCalculator : String -> Maybe (List (WithBounds CalculatorToken))
lexCalculator str =
  case lex calculatorTokenMap str of
    (tokens, _, _, "") => Just tokens
    _ => Nothing

mutual
  term : Grammar state CalculatorToken True Double
  term = do
    num <- match CTNum
    pure num

  expr : Grammar state CalculatorToken True Double
  expr = buildExpressionParser [
    [ Infix ((*) <$ match CTMultiply) AssocLeft
    , Infix ((/) <$ match CTDivide) AssocLeft
    ],
    [ Infix ((+) <$ match CTPlus) AssocLeft
    , Infix ((-) <$ match CTMinus) AssocLeft
    ]
  ] term

parseCalculator : List (WithBounds CalculatorToken) -> Either String Double
parseCalculator toks =
  case parse expr $ filter (not . ignored) toks of
    Right (l, []) => Right l
    Right e => Left "contains tokens that were not consumed"
    Left e => Left (show e)

parse1 : String -> Either String Double
parse1 x =
  case lexCalculator x of
    Just toks => parseCalculator toks
    Nothing => Left "Failed to lex."

Testing out our calculator gives us back the following output:

$ idris2 -p contrib Calculator.idr
Main> :exec printLn $ parse1 "1 + 2 - 3 * 4 / 5"
Right 0.6000000000000001