Every few years I look at the latest parser combinators, parser generators, and compiler compilers. And every time they fall short in some huge, vital way (not always all of the following, but always at least one):
* Error handling always ends up being non-existent or of the quality of "begin, for, if, while, repeat, identifier, number, float expected" with no good way to override what happens
* Recovery is usually impossible
* Parser generator generates a full model that doesn't match what we need
* Working around the quirks of the input language ends up being trickier than hand writing (almost every language has some ambiguity)
* Slow: with ANTLR it's really easy to make it do a gigantic amount of lookahead in complex languages, which isn't even really needed
I always end up going back to a simple hand crafted parser which is easier to read and write.
> * Error handling always ends up being non-existent or of the quality of "begin, for, if, while, repeat, identifier, number, float expected" with no good way to override what happens
Probably true in many toy versions.
However, parser combinators are very well suited for overriding the default expectation error messages with something custom, for example using a labeling combinator like `<?>` with low precedence.
No, the parser then won't enumerate all the options, but it can give you a high-level expectation instead.
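For illustration, a tiny Python sketch of the labeling idea (all names here are invented; in Parsec the real operator is `<?>`): the combinator catches the low-level failure and re-raises it with one high-level expectation.

```python
class ParseError(Exception):
    def __init__(self, pos, expected):
        super().__init__(f"at {pos}: expected {expected}")
        self.pos, self.expected = pos, expected

def char(c):
    """Parser for one literal character; fails with a low-level expectation."""
    def p(s, i):
        if i < len(s) and s[i] == c:
            return c, i + 1
        raise ParseError(i, repr(c))
    return p

def seq(*parsers):
    """Run parsers in order, collecting their results."""
    def p(s, i):
        out = []
        for q in parsers:
            v, i = q(s, i)
            out.append(v)
        return out, i
    return p

def label(parser, name):
    """The `<?>` idea: swap low-level expectations for one high-level one."""
    def p(s, i):
        try:
            return parser(s, i)
        except ParseError as e:
            raise ParseError(e.pos, name) from None
    return p

let_kw = label(seq(char("l"), char("e"), char("t")), "keyword 'let'")
```

So instead of "expected 't'" you get "expected keyword 'let'", without touching the inner parsers.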
> * Recovery is usually impossible
There are a bunch of error correcting combinator libraries out there.
Also, when using monadic combinators you can do all kinds of retrospective work while parsing. Probably more expensive, but entirely possible. Some pseudo-Haskell example:
    parseHtmlBody :: Parser Html
    parseHtmlBody =
      do parseOpenTag "body"
         contents <- parseHtml
         tag <- tryParsingClosingTag "body"
         when (tag /= "body") $
           raiseWarning ("expected </body>, seems like something else: " <> tag)
         return contents
The new version of the Utrecht University parser combinator library provides online, error-correcting, annotation-free, applicative-style parser combinators. In addition, it provides a monadic and an idiomatic interface. Parsers analyse themselves to avoid commonly made errors.
So you get online parsing, which means you can extract partial results while input is still being consumed.
You get error correction, which accumulates errors and lets you get partial results ranked by the cumulative severity of the errors encountered.
You also get a choice between an applicative interface, which lets you build parsers that are mostly context-free (but see [2]), and a monadic interface, which allows context-dependent parsing, still online (in Ada-like languages, "function f ... begin end function g" can be reported as an error right at the "g" identifier).
The implementation above also automatically factors out common prefixes and does not do repeated scanning after, say, a rollback.
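To give a flavor of the error-correction idea (a toy Python sketch, not how the Utrecht library actually works): when an expected token is missing, record an error with its position and pretend the token was there, so parsing continues and a partial result still comes out.

```python
def expect(tok):
    """Match `tok`, or record a correction and continue without consuming."""
    def p(toks, i, errs):
        if i < len(toks) and toks[i] == tok:
            return tok, i + 1
        errs.append((i, f"inserted missing {tok!r}"))  # correct by insertion
        return tok, i
    return p

def run(parser, toks):
    """Run a parser from the start, returning its value plus all errors."""
    errs = []
    val, _ = parser(toks, 0, errs)
    return val, errs

def parened():
    """A parenthesized item: '(' item ')' with correction on both parens."""
    open_p, close_p = expect("("), expect(")")
    def p(toks, i, errs):
        _, i = open_p(toks, i, errs)
        if i < len(toks) and toks[i] not in ("(", ")"):
            v, i = toks[i], i + 1
        else:
            v = None
        _, i = close_p(toks, i, errs)
        return v, i
    return p
```

Running it on `["(", "x"]` yields the partial result `"x"` plus a recorded "inserted missing ')'" error, instead of a hard failure.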
Would that somehow make your thirst less parching?
PS
My experience with ANTLR is quite the opposite, at least in the case of VHDL (the preprocessors required for Verilog aren't even expressible in ANTLR). Either I had to do a lot of work on the ANTLR grammar to make it faster, losing a lot of conciseness, or it was slow as a snail, down to about 4K bytes of source code per second.
PPS
With parser combinators I can test just parts of the grammar in the REPL, and the speed is much more predictable, if not faster.
PPPS
I even encountered a superlinear optimization gain produced by the Glorious Haskell Compiler for one implementation of parser combinators: the parsing time for a grammar that should have quadratic parsing time (in the length of the input) had a leading power of about 1.3. In other words, instead of T=O(L^2) (which both ghci, interpreting the code without optimization, and C# gave) I got T=O(L^1.3) (ghc -O3).
I've run into some limitations, too. It doesn't really break the deal for me, because I typically only use them for smaller jobs: the type where it's often more common to use regular expressions, which are generally even worse in most of those categories.
My guess is that it comes from a difference in the framing of how things work. In ANTLR, you supply a grammar, and then the grammar gets compiled into a parser, all in one go. That maximizes the elbow room available for fine-tuning the overall product.
Parser combinators, OTOH, frame the problem as building larger parsers by composing smaller parsers. Each of those parsers is presumed to be fully formed. There's a beauty to that approach that I really like. But it does mean that the parser library is forever dealing with arrangements of trees, without much of an option to take a step back and look at the forest.
(Unless you're working in a homoiconic language like lisp, anyway.)
I'm not an expert, but from my reading, I believe PEGs typically have the best bang for their buck here. Have you read much on them? (Python 3.9 is actually moving to a PEG for their parser!)
Graham Hutton recently did a tutorial on functional parsing via the Computerphile YT channel [0].
He starts from scratch and explains the process of creating your own library for parsing.
The YT video description has a link to the full version of the library that he builds in the tutorial. It's the kind of elegant Haskell programming that I don't think I'll ever achieve. [1]
Parser combinators are just recursive descent. That's it. They're just recursive descent, an idea that's been around since the 1960s.
A parser combinator library is just fancy marketing speak for "recursive descent utility functions". All it is is a bag of commonly used patterns wrapped up in generic functions.
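To make "utility functions" concrete, a sketch in Python (names invented): a combinator is just a function that takes parse functions and returns a new parse function, i.e. a packaged-up recursive descent helper.

```python
def many(parser):
    """Zero-or-more repetition: the while-loop you'd write by hand, packaged."""
    def p(s, i):
        out = []
        while True:
            r = parser(s, i)
            if r is None:          # sub-parser failed: stop repeating
                return out, i
            v, i = r
            out.append(v)
    return p

def digit(s, i):
    """A hand-written 'leaf' parser: one decimal digit, or None on failure."""
    return (s[i], i + 1) if i < len(s) and s[i].isdigit() else None

# `number` is exactly the loop a hand-rolled parser would contain inline.
number = many(digit)
```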
Yes they are, but it's not about being novel. Parser combinators are an abstraction, and a damn good one: they are easily composable, reusable, and, most importantly, approachable even for somebody who is familiar with the functional approach but has never written a parser before. While I can appreciate a neat and tidy imperative recursive descent implementation, I would very much prefer to enjoy it from afar, not inside my own codebase.
And Pratt parsers are just precedence climbing, but I’ve rewritten unifdef’s precedence climbing expression evaluator in Pratt style and it has turned out much better, because of the way Pratt structures the technique so neatly. So the key feature of parser combinators is a neat and tidy way to structure a recursive descent parser, and the key feature of monadic parser combinators is to get extra neatness from Haskell’s categorical types.
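A minimal Pratt-style sketch (in Python, everything invented for illustration) of what that neat structure looks like: one loop plus a binding-power table, instead of one function per precedence level.

```python
BP = {"+": 10, "-": 10, "*": 20, "/": 20}  # left binding powers

def pratt(tokens):
    """Evaluate a flat token list like [1, '+', 2, '*', 3] Pratt-style."""
    pos = [0]
    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None
    def advance():
        t = tokens[pos[0]]
        pos[0] += 1
        return t
    def parse(min_bp=0):
        left = advance()               # "nud": here, always a literal number
        while (op := peek()) in BP and BP[op] > min_bp:
            advance()
            right = parse(BP[op])      # "led": recurse with the op's power
            left = {"+": left + right, "-": left - right,
                    "*": left * right, "/": left / right}[op]
        return left
    return parse()
```

The strict `>` in the loop condition gives left associativity; precedence lives entirely in the table rather than in the call structure.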
I wish more tutorials started with "parser combinators are a neat and tidy way to structure a recursive descent parser", because I 100% agree. Mostly, my disagreements are with how parser combinators are presented and taught, obscuring the core ideas with a focus on language features and trivialities.
To me, the monad/applicative stuff is a red herring. It's mostly used to simulate imperative sequencing; e.g. the Haskell code `Person <$> parseName <*> parseAddress` would be roughly `return Person { parseName(), parseAddress() };` in a C-like language. There are a few tricks, but it's not crucial to the parser combinator idea and doesn't help readability.
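A rough Python rendering of that point (`Person`, `field`, and the token format are all invented): the applicative pipeline is just "run the parsers in order, feed the results to a constructor".

```python
from collections import namedtuple

Person = namedtuple("Person", "name address")

def field(s, i):
    """Consume one comma-separated field."""
    j = s.find(",", i)
    j = len(s) if j == -1 else j
    return s[i:j], (j + 1 if j < len(s) else j)

def ap(ctor, *parsers):
    """The <$>/<*> pipeline as a plain function: run parsers in order,
    then feed their results to a constructor."""
    def p(s, i):
        vals = []
        for q in parsers:
            v, i = q(s, i)
            vals.append(v)
        return ctor(*vals), i
    return p

parse_person = ap(Person, field, field)
```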
So, one of the key things that I see is that the constructed abstract syntax tree must have the raw tokens within it so all other layers have full awareness.
I feel the criticism is valid because most parsers in production are written by hand, without tooling or fancy techniques.
> So, one of the key things that I see is that the constructed abstract syntax tree must have the raw tokens within it so all other layers have full awareness.
I don't see why you say that, parser combinators are just a structured way to compose recursive descent parsers, you can do everything with them that you can with any other hand-rolled parser. You don't even need to construct an AST at all, a mathematical expression "parser combinator" could just return a single number, the result of evaluating the expression.
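For instance, a sketch (in Python, with a hypothetical grammar) of exactly that: each rule returns the evaluated number directly, and no tree is ever built.

```python
def evaluate(s):
    """Evaluate expr := term (('+'|'-') term)* ;
                term := factor (('*'|'/') factor)* ;
                factor := digits | '(' expr ')'
    Each rule returns (value, next_position) -- never an AST node."""
    def expr(i):
        v, i = term(i)
        while i < len(s) and s[i] in "+-":
            op, (w, i) = s[i], term(i + 1)
            v = v + w if op == "+" else v - w
        return v, i
    def term(i):
        v, i = factor(i)
        while i < len(s) and s[i] in "*/":
            op, (w, i) = s[i], factor(i + 1)
            v = v * w if op == "*" else v / w
        return v, i
    def factor(i):
        if s[i] == "(":
            v, i = expr(i + 1)
            return v, i + 1           # skip ')'
        j = i
        while j < len(s) and s[j].isdigit():
            j += 1
        return int(s[i:j]), j
    return expr(0)[0]
```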
> most parsers in production are written by hand, without tooling or fancy techniques.
What? Do you have any data you'd like to share?
I'm given to understand that e.g. the C++ compilers usually have a hand-coded parser, but AFAIUI that's mostly due to the complexity of actually parsing C++ (and fitting that into anything other than just raw code).
A common theme is that the parser generator does not give you the tools to write high-quality error messages. Having used ANTLR and other tools, I believe this even more now that I'm trying to ship a real language.
You're talking about parser generators... the article is about parser combinators. Fully agreed that parser generators are often very limited and often require hacks to do anything non-trivial.
(IME MegaParsec basically solves most of the issues.)
Agreed, and it's disappointing that you got downvoted to oblivion for raising a valid point. There is an abundance of articles on parser combinators and grammars and other theoretical parsing concepts, and a dearth of articles (AFAICT) on actual practical applications or graceful error handling.
I suspect the difficulty of error handling or parsing a non-trivial language are exactly why few articles cover them. Crafting Interpreters¹ seems to be a notable exception here.
I don’t know much about Elm, but one of the few things I have heard of is this GitHub issue: https://github.com/elm/compiler/issues/1773. Am I putting too much emphasis on this particular picked cherry or does the compiler do this in general?