Every few years I look at the latest parser combinators, parser generators, and compiler compilers. And every time they fall short in some huge, vital way (not always all of the following, but always at least one):
* Error handling always ends up being non-existent or of the quality of "begin, for, if, while, repeat, identifier, number, float expected" with no good way to override what happens
* Recovery is usually impossible
* Parser generator generates a full model that doesn't match what we need
* Working around the quirks of the input language ends up being trickier than hand writing (almost every language has some ambiguity)
* Slow: with ANTLR it's really easy to make it do a gigantic amount of lookahead in complex languages, which isn't even really needed
I always end up going back to a simple hand crafted parser which is easier to read and write.
> * Error handling always ends up being non-existent or of the quality of "begin, for, if, while, repeat, identifier, number, float expected" with no good way to override what happens
Probably true in many toy versions.
However, parser combinators are very well suited for overriding the default expectation error messages with something custom, for example using a labeling combinator like `<?>` with low precedence.
No, the parser then won't enumerate all the options, but it can give you a high-level expectation instead.
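For illustration, a tiny Python sketch of the labeling idea (all names here are invented; in Parsec the real operator is `<?>`): the combinator catches the low-level failure and re-raises it with one high-level expectation.

```python
class ParseError(Exception):
    def __init__(self, pos, expected):
        super().__init__(f"at {pos}: expected {expected}")
        self.pos, self.expected = pos, expected

def char(c):
    """Parser for one literal character; fails with a low-level expectation."""
    def p(s, i):
        if i < len(s) and s[i] == c:
            return c, i + 1
        raise ParseError(i, repr(c))
    return p

def seq(*parsers):
    """Run parsers in order, collecting their results."""
    def p(s, i):
        out = []
        for q in parsers:
            v, i = q(s, i)
            out.append(v)
        return out, i
    return p

def label(parser, name):
    """The `<?>` idea: swap low-level expectations for one high-level one."""
    def p(s, i):
        try:
            return parser(s, i)
        except ParseError as e:
            raise ParseError(e.pos, name) from None
    return p

let_kw = label(seq(char("l"), char("e"), char("t")), "keyword 'let'")
```

So instead of "expected 't'" you get "expected keyword 'let'", without touching the inner parsers.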
> * Recovery is usually impossible
There are a bunch of error correcting combinator libraries out there.
Also, when using monadic combinators you can do all kinds of retrospective work while parsing. Probably more expensive, but entirely possible. Some pseudo-Haskell example:
    parseHtmlBody :: Parser Html
    parseHtmlBody =
      do parseOpenTag "body"
         contents <- parseHtml
         tag <- tryParsingClosingTag "body"
         when (tag /= "body") $
           raiseWarning ("expected </body>, seems like something else: " <> tag)
         return contents
The new version of the Utrecht University parser combinator library provides online, error-correcting, annotation-free, applicative-style parser combinators. In addition, it provides a monadic and an idiomatic interface. Parsers analyse themselves to avoid commonly made errors.
So you get online parsing, which means you can extract partial results while input is still being consumed.
You get error correction, which accumulates errors and lets you get partial results ranked by the cumulative severity of the errors encountered.
You also get a choice between an applicative interface, which lets you build parsers that are mostly context-free (but see [2]), and a monadic interface, which allows context-dependent parsing, still online (in Ada-like languages, "function f ... begin end function g" can be reported as an error right at the "g" identifier).
The implementation above also automatically factors out common prefixes and does not do repeated scanning after, say, a rollback.
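To give a flavor of the error-correction idea (a toy Python sketch, not how the Utrecht library actually works): when an expected token is missing, record an error with its position and pretend the token was there, so parsing continues and a partial result still comes out.

```python
def expect(tok):
    """Match `tok`, or record a correction and continue without consuming."""
    def p(toks, i, errs):
        if i < len(toks) and toks[i] == tok:
            return tok, i + 1
        errs.append((i, f"inserted missing {tok!r}"))  # correct by insertion
        return tok, i
    return p

def run(parser, toks):
    """Run a parser from the start, returning its value plus all errors."""
    errs = []
    val, _ = parser(toks, 0, errs)
    return val, errs

def parened():
    """A parenthesized item: '(' item ')' with correction on both parens."""
    open_p, close_p = expect("("), expect(")")
    def p(toks, i, errs):
        _, i = open_p(toks, i, errs)
        if i < len(toks) and toks[i] not in ("(", ")"):
            v, i = toks[i], i + 1
        else:
            v = None
        _, i = close_p(toks, i, errs)
        return v, i
    return p
```

Running it on `["(", "x"]` yields the partial result `"x"` plus a recorded "inserted missing ')'" error, instead of a hard failure.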
Would that somehow make your thirst less parching?
PS
My experience with ANTLR is quite the opposite, at least in the case of VHDL (the preprocessors required for Verilog aren't even expressible in ANTLR). Either I had to do a lot of work on the ANTLR grammar to make it faster, losing a lot of conciseness, or it was slow as a snail, down to about 4K bytes of source code per second.
PPS
With parser combinators I can test just parts of the grammar in the REPL, and the speed is much more predictable, if not faster.
PPPS
I even encountered a superlinear optimization gain produced by the Glorious Haskell Compiler for one implementation of parser combinators: the parsing time for a grammar that should have quadratic parsing time (in the length of the input) had a leading power of about 1.3. In other words, instead of T=O(L^2) (which both ghci, interpreting the code without optimization, and C# gave) I got T=O(L^1.3) (ghc -O3).
I've run into some limitations, too. It doesn't really break the deal for me, because I typically only use them for smaller jobs: the type where it's often more common to use regular expressions, which are generally even worse in most of those categories.
My guess is that it comes from a difference in the framing of how things work. In ANTLR, you supply a grammar, and then the grammar gets compiled into a parser, all in one go. That maximizes the elbow room available for fine-tuning the overall product.
Parser combinators, OTOH, frame the problem as building larger parsers by composing smaller parsers. Each of those parsers is presumed to be fully formed. There's a beauty to that approach that I really like. But it does mean that the parser library is forever dealing with arrangements of trees, without much of an option to take a step back and look at the forest.
(Unless you're working in a homoiconic language like lisp, anyway.)
I'm not an expert, but from my reading, I believe PEGs typically have the best bang for their buck here. Have you read much on them? (Python 3.9 is actually moving to a PEG for their parser!)
Graham Hutton recently did a tutorial on functional parsing via the Computerphile YT channel [0].
He starts from scratch and explains the process of creating your own library for parsing.
The YT video description has a link to the full version of the library that he builds in the tutorial. It's the kind of elegant Haskell programming that I don't think I'll ever achieve. [1]
Parser combinators are just recursive descent. That's it. They're just recursive descent, an idea that's been around since the 1960s.
A parser combinator library is just fancy marketing speak for "recursive descent utility functions". All it is is a bag of commonly used patterns wrapped up in generic functions.
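To make "utility functions" concrete, a sketch in Python (names invented): a combinator is just a function that takes parse functions and returns a new parse function, i.e. a packaged-up recursive descent helper.

```python
def many(parser):
    """Zero-or-more repetition: the while-loop you'd write by hand, packaged."""
    def p(s, i):
        out = []
        while True:
            r = parser(s, i)
            if r is None:          # sub-parser failed: stop repeating
                return out, i
            v, i = r
            out.append(v)
    return p

def digit(s, i):
    """A hand-written 'leaf' parser: one decimal digit, or None on failure."""
    return (s[i], i + 1) if i < len(s) and s[i].isdigit() else None

# `number` is exactly the loop a hand-rolled parser would contain inline.
number = many(digit)
```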
Yes they are, but it's not about being novel. Parser combinators are an abstraction, and a damn good one: they are easily composable, reusable, and, most importantly, approachable even for somebody who is familiar with the functional approach but has never written a parser before. While I can appreciate a neat and tidy imperative recursive descent implementation, I would very much prefer to enjoy it from afar, not inside my own codebase.
And Pratt parsers are just precedence climbing, but I’ve rewritten unifdef’s precedence climbing expression evaluator in Pratt style and it has turned out much better, because of the way Pratt structures the technique so neatly. So the key feature of parser combinators is a neat and tidy way to structure a recursive descent parser, and the key feature of monadic parser combinators is to get extra neatness from Haskell’s categorical types.
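A minimal Pratt-style sketch (in Python, everything invented for illustration) of what that neat structure looks like: one loop plus a binding-power table, instead of one function per precedence level.

```python
BP = {"+": 10, "-": 10, "*": 20, "/": 20}  # left binding powers

def pratt(tokens):
    """Evaluate a flat token list like [1, '+', 2, '*', 3] Pratt-style."""
    pos = [0]
    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None
    def advance():
        t = tokens[pos[0]]
        pos[0] += 1
        return t
    def parse(min_bp=0):
        left = advance()               # "nud": here, always a literal number
        while (op := peek()) in BP and BP[op] > min_bp:
            advance()
            right = parse(BP[op])      # "led": recurse with the op's power
            left = {"+": left + right, "-": left - right,
                    "*": left * right, "/": left / right}[op]
        return left
    return parse()
```

The strict `>` in the loop condition gives left associativity; precedence lives entirely in the table rather than in the call structure.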
I wish more tutorials started with "parser combinators are a neat and tidy way to structure a recursive descent parser", because I 100% agree. Mostly, my disagreements are with how parser combinators are presented and taught, obscuring the core ideas with a focus on language features and trivialities.
To me, the monad/applicative stuff is a red herring. It's mostly used to simulate imperative sequencing; e.g. the Haskell code `Person <$> parseName <*> parseAddress` would be roughly `return Person { parseName(), parseAddress() };` in a C-like language. There are a few tricks, but it's not crucial to the parser combinator idea and doesn't help readability.
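A rough Python rendering of that point (`Person`, `field`, and the token format are all invented): the applicative pipeline is just "run the parsers in order, feed the results to a constructor".

```python
from collections import namedtuple

Person = namedtuple("Person", "name address")

def field(s, i):
    """Consume one comma-separated field."""
    j = s.find(",", i)
    j = len(s) if j == -1 else j
    return s[i:j], (j + 1 if j < len(s) else j)

def ap(ctor, *parsers):
    """The <$>/<*> pipeline as a plain function: run parsers in order,
    then feed their results to a constructor."""
    def p(s, i):
        vals = []
        for q in parsers:
            v, i = q(s, i)
            vals.append(v)
        return ctor(*vals), i
    return p

parse_person = ap(Person, field, field)
```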
So, one of the key things that I see is that the constructed abstract syntax tree must have the raw tokens within it so all other layers have full awareness.
I feel the criticism is valid because most parsers in production are written by hand, without tooling or fancy techniques.
> So, one of the key things that I see is that the constructed abstract syntax tree must have the raw tokens within it so all other layers have full awareness.
I don't see why you say that, parser combinators are just a structured way to compose recursive descent parsers, you can do everything with them that you can with any other hand-rolled parser. You don't even need to construct an AST at all, a mathematical expression "parser combinator" could just return a single number, the result of evaluating the expression.
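For instance, a sketch (in Python, with a hypothetical grammar) of exactly that: each rule returns the evaluated number directly, and no tree is ever built.

```python
def evaluate(s):
    """Evaluate expr := term (('+'|'-') term)* ;
                term := factor (('*'|'/') factor)* ;
                factor := digits | '(' expr ')'
    Each rule returns (value, next_position) -- never an AST node."""
    def expr(i):
        v, i = term(i)
        while i < len(s) and s[i] in "+-":
            op, (w, i) = s[i], term(i + 1)
            v = v + w if op == "+" else v - w
        return v, i
    def term(i):
        v, i = factor(i)
        while i < len(s) and s[i] in "*/":
            op, (w, i) = s[i], factor(i + 1)
            v = v * w if op == "*" else v / w
        return v, i
    def factor(i):
        if s[i] == "(":
            v, i = expr(i + 1)
            return v, i + 1           # skip ')'
        j = i
        while j < len(s) and s[j].isdigit():
            j += 1
        return int(s[i:j]), j
    return expr(0)[0]
```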
> most parsers in production are written by hand, without tooling or fancy techniques.
What? Do you have any data you'd like to share?
I'm given to understand that e.g. the C++ compilers usually have a hand-coded parser, but AFAIUI that's mostly due to the complexity of actually parsing C++ (and fitting that into anything other than just raw code).
A common theme is that the parser generator does not give you the tools to write high-quality error messages. Having used ANTLR and other tools, I believe this even more now that I'm trying to ship a real language.
You're talking about parser generators... the article is about parser combinators. Fully agreed that parser generators are often very limited and often require hacks to do anything non-trivial.
(IME MegaParsec basically solves most of the issues.)
Agreed, and it's disappointing that you got downvoted to oblivion for raising a valid point. There is an abundance of articles on parser combinators and grammars and other theoretical parsing concepts, and a dearth of articles (AFAICT) on actual practical applications or graceful error handling.
I suspect the difficulty of error handling or parsing a non-trivial language are exactly why few articles cover them. Crafting Interpreters¹ seems to be a notable exception here.
I don’t know much about Elm, but one of the few things I have heard of is this GitHub issue: https://github.com/elm/compiler/issues/1773. Am I putting too much emphasis on this particular picked cherry or does the compiler do this in general?