Hacker News | Drup's comments

HM is not complex type inference. In fact, among all the approaches you cite, it leads to the simplest type system and the simplest implementation. Moreover, there are lots of courses, reference implementations, and reasonable extensions for a wide array of features (structural subtyping, qualified types, etc.). There are even type-system libraries that make it easy to implement (like Inferno).

When new programmers discover an ML-family language, they are often stunned by how lightweight it feels, because you don't have to annotate anything. If your types are not structurally too complicated and you want something really easy to use, HM is still the nicest experience.

Naturally, it's all a tradeoff, and if you want specific features (chiefly: borrows, higher-order types, overloading, or TypeScript-like features), you will need to abandon complete inference (and use something like bidirectional typechecking, most likely).


It's not complex in the sense that the rules are simple, but simple rules can still lead to complicated emergent behavior that is difficult for humans to understand, even if each of the 153 steps the typechecker took to arrive at the result was easy to understand individually.


It's not any different from having 153 steps in any other computational sense. Even limiting ourselves to elementary arithmetic, horrendous opaqueness arises with 153 operations spanning the whole set. Are we going to pretend that arithmetic is systemically problematic because of this? Any non-trivial formal construct is potentially dangerous.

If you're having trouble reasoning about how variables are unified, it's either because you never actually built a strong gut intuition for it, or it's because you're writing Very Bad Code with major structural issues that just so happen to live in the type system. In this case it's the latter. For an HM type system, 153 choice points for an expression is ludicrous unless you're doing heavy HKT/HOM metaprogramming. The type system, and more broadly unification, is a system to solve constraints. Explosive choice indicates a major logical fault, and most probably someone naively trying to use a structural type system like a nominal one and/or a bit too much unsound metaprogramming.

Thankfully, of course, you can simply specify the type and tell the compiler exactly what it should be using. But that's not really resolving the issue; the code still sucks at the end of the day.

Now higher order unification? That's an entirely different matter.


> If your types are not structurally too complicated

Load bearing hand waving.


Very proportional to the hand-waving in the claim it was responding to, that "in some cases" there might be a problem.


I'm going to be contrarian: Yes, you should learn about type systems if you want to design a programming language, and then decide, in full awareness, what you need. At the very least, it will give you a concrete idea of what safety means for a programming language.

It doesn't mean you have to use an advanced one, but your choice should be based on knowledge, not ignorance.

A lot of harm, including the billion-dollar mistake, has been done by badly designed type systems from the Java/C/C++ family.


Java also has covariant mutable arrays. I can't believe they created the whole language and didn't realize that covariant arrays are unsound. Or didn't they care?


They didn’t care about preventing all unsoundness at type check time. As long as JVM can detect it and throw an exception, it’s good enough for Java.


You seem well versed in that corner. Do you have a good (and reasonably complete) introduction/exploration of these memory-efficient data structures for computation?

I've been working on the memory representation of algebraic data types quite a bit, and I've always wondered if we could combine them with succinct data structures.


There's actually a whole website about it! I found it useful when I was doing deeper research into Elasticsearch: https://roaringbitmap.org


That remark is actually more interesting than you think. As groundbreaking as it was, algorithm W is far too slow for non-toy languages. All modern HM languages (that I know of) use some form of union-find trickery, like the one presented in the blog post (also present in the resolution-by-constraints approaches employed by Haskell and Scala).

So, in fact, it's actually never algorithm W in non-toy languages. ;)

Side note: this article is originally from 2013 and is considered a must-read by any would-be hackers trying to modify the OCaml typechecker (it's cited in the documentation).


In fact, those union-find trickeries come from the same paper that presented algorithm W, where they were named algorithm J. W was known from the start to be more useful for proofs than implementation:

> As it stands, W is hardly an efficient algorithm; substitutions are applied too often. It was formulated to aid the proof of soundness. We now present a simpler algorithm J which simulates W in a precise sense.

https://doi.org/10.1016/0022-0000(78)90014-4


Isn't the union-find trickery "just" standard almost-linear unification as discovered by Paterson-Wegman and Martelli-Montanari around 1976? Or is there more to it in the setting of Algorithm J?

I'm actually a bit surprised that it took so long to discover these formulations of unification. I wonder what Prolog systems were doing at the time, given the importance of efficient unification.


Paterson-Wegman and Martelli-Montanari are worst-case linear, but Algorithm J just uses the earlier almost-linear union-find approach: unification variables are represented as mutable pointers to either a) nothing, or b) another type (which may itself be the "parent" unification variable in the union-find structure). Unification is then just a recursive function over a pair of types that updates these pointers as appropriate, and often doesn't even bother with e.g. path compression.
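This scheme is small enough to sketch. Here is a minimal Python toy (hypothetical code, not any real typechecker's implementation) where unification variables are mutable cells pointing either to nothing or to another type; the occurs check is omitted for brevity:

```python
class Var:
    """A unification variable: `ref` is None (unbound) or another type."""
    def __init__(self):
        self.ref = None

# Non-variable types are (constructor, [arguments]) tuples, e.g.
# ("->", [t1, t2]) for a function type, ("int", []) for a base type.

def find(t):
    """Follow variable links to the representative type (no path compression)."""
    while isinstance(t, Var) and t.ref is not None:
        t = t.ref
    return t

def unify(a, b):
    a, b = find(a), find(b)
    if a is b:
        return
    if isinstance(a, Var):        # bind the variable to the other side
        a.ref = b
    elif isinstance(b, Var):
        b.ref = a
    else:
        # Two concrete types: constructors must match, then unify children.
        head_a, args_a = a
        head_b, args_b = b
        if head_a != head_b or len(args_a) != len(args_b):
            raise TypeError(f"cannot unify {head_a} with {head_b}")
        for x, y in zip(args_a, args_b):
            unify(x, y)

# Example: unify  'a -> int  with  bool -> 'b
alpha, beta = Var(), Var()
unify(("->", [alpha, ("int", [])]), ("->", [("bool", []), beta]))
print(find(alpha))  # ('bool', [])
print(find(beta))   # ('int', [])
```

The almost-linear bound comes from the union-find structure itself; a production implementation would also track levels for generalization, which this sketch ignores.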


The Wikipedia article claims that W is efficient, but only for a core language without highly desirable features like recursion, polymorphism, and subtyping.

https://en.m.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_...


OCaml without recursion would not be the same.


Actually, there is a rather pervasive notion of "soundness" in programming language theory. It has many flavors (because, as you hint, programming languages are quite varied). One of the simplest ways of stating it, often dubbed "progress and preservation", is: a well-typed program can execute (i.e., it doesn't crash) and returns a value of the same type.

This obviously covers type safety, but also spatial memory safety, and can be extended to temporal memory safety, and even some concurrency aspects.

This has been proven for fairly large subsets of programming languages and even some full languages (like SML). Unfortunately, it doesn't hold for most mainstream ones, because programming language theory is often ... under-used ... among language designers.


https://regex-generate.github.io/regenerate/ (I'm one of the authors) enumerates all the matching (and non-matching) strings, which incidentally answers the question, but doesn't terminate in the infinite case.


Giuseppe Castagna is a very well known PL researcher, and very seasoned when it comes to gradual types and "structural" types (the static version of duck-typing), which seems like a great fit for Elixir.


ocaml-re[1] uses a derivative-style construction to lazily build a DFA. The general idea is to use something similar to Owens et al.'s DFA construction, but doing it inline with some caching, to compile lazily (and building a Thompson-like automaton for group capture). In practice, it is fairly fast, although not as optimized as the Rust crate. :) Derivatives support several match semantics very easily (ocaml-re does longest, shortest, greedy, non-greedy, first). It indeed doesn't handle Unicode matching though (it's possible, there is a prototype implementation, but nobody took the time to push it through). Note that it's not difficult to (lazily or not) build an NFA using derivatives as well (with Antimirov's construction).

[1]: https://github.com/ocaml/ocaml-re/


Oh nice! Unicode is definitely something that's on my mind when thinking about derivatives and how to deal with them, but it sounds like ocaml-re is doing pretty well outside of Unicode. I would love to hook it up to my benchmark harness. (It isn't public yet... Hopefully soon. But it supports regexes in any language. So far I have Rust, C, C++, Python and Go. I hope to add .NET, Perl and Node at least. But this might be a cool addition too.)

If anyone wants to add this OCaml engine to the harness (or any other engine), please email me at jamslam@gmail.com and I'll give access to the repo. The only reason it isn't public yet is because I'm still working on the initial release and iterating. But it's close enough that other people could submit benchmark programs for other regex engines.


I don't think you should be worried about Unicode in particular. Although the derivation formula on paper is parameterized by a character, you don't have to compute the derivative for every character separately.

It's actually easy to compute classes of characters that have the same derivative (it's done in the linked "Regular-expression derivatives re-examined" paper, although their particular implementation favors simplicity over efficiency), and it's not even necessary when using Antimirov's partial derivatives.

Actually, the complexity of the derivation is independent of the size of the alphabet. You could even define derivation on an arbitrary semi-lattice, not necessarily a set of characters. (Or a boolean algebra if you care about negation/complementation).

The difficulty in handling Unicode has more to do with the efficiency of the automaton representation and manipulation than with turning the RE into an NFA or DFA.
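To make the alphabet-independence point concrete, here is a toy Python sketch (hypothetical code, not ocaml-re's) of Brzozowski derivatives. Note that the rules only ever ask a character set for membership (`c in S`), so `S` could be any set-like type; the size of the alphabet never enters the algorithm:

```python
# Regexes as tuples: ("empty",) matches nothing; ("eps",) the empty string;
# ("set", S) one character in set S; ("cat", r, s); ("alt", r, s); ("star", r).

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return False  # "empty" and "set"

def deriv(r, c):
    """Brzozowski derivative of r with respect to character c."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return ("empty",)
    if tag == "set":
        # The only place the character set is consulted: a membership test.
        return ("eps",) if c in r[1] else ("empty",)
    if tag == "cat":
        d = ("cat", deriv(r[1], c), r[2])
        return ("alt", d, deriv(r[2], c)) if nullable(r[1]) else d
    if tag == "alt":
        return ("alt", deriv(r[1], c), deriv(r[2], c))
    if tag == "star":
        return ("cat", deriv(r[1], c), r)

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# (ab)*
pat = ("star", ("cat", ("set", {"a"}), ("set", {"b"})))
print(matches(pat, "abab"))  # True
print(matches(pat, "aba"))   # False
```

A real engine would additionally simplify regexes to a canonical form and memoize derivatives to obtain a (lazy) DFA, but the derivation step itself stays independent of how character sets are represented.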


Does there exist a regex engine I can try that uses derivatives and supports large Unicode classes and purports to be usable for others? :-)

It has been a long time since I read the "Regular-expression derivatives re-examined" paper. Mostly the only thing I remember at this point is that I came away thinking that it would be difficult to adapt in practice to large Unicode classes. But I don't remember the details.

It is honestly very difficult for me to translate your comment here into an actionable implementation strategy. But that's probably just my inexperience with derivatives talking.


> Does there exist a regex engine I can try that uses derivatives and supports large Unicode classes and purports to be usable for others? :-)

I don't know any besides ocaml-re that Drup already linked, sorry :).

And sorry that my comment is hard to decipher. I think the core point is that the "character set" can be an abstract type from the point of view of the derivation algorithm. So it doesn't matter how they are represented, nor "how big" a character set is.

With Antimirov's derivative (which produces an NFA), there is no constraint on this type.

With Brzozowski's derivative, you need at least the ability to intersect two character sets. So the type should implement a trait with an intersection function (in Rust syntax, `trait Intersect { fn intersect(self, other: Self) -> Self; }`). That's necessary for any implementation generating a DFA anyway.

And if you also want to deal with complementation, then a second method `fn negate(self) -> Self` is necessary.


Thanks! You might be right. I'm probably at a point where I'd have to actually go out and try it to understand it better.

I do wonder if there is some room for derivatives in a meta regex engine (like RE2 or the regex crate). For example, if it let you build a DFA more quickly (in practice, not necessarily in theory), then you might be able to use it for a big subset of cases. It's tricky to make that case over the lazy DFA; however, a full DFA has more optimization opportunities. For example, identifying states with very few outgoing transitions and "accelerating" them by running memchr (or memchr2 or memchr3) on those outgoing transitions instead of continuing to walk the automaton. It's really hard to do that with a lazy DFA because you don't really compute entire states up front.


I think what you suggest is possible, and derivation might even be well suited for this application; however, I can't tell if it would be better than existing approaches. There's some chance it might be interesting in practice, since it seems this application of derivatives has not been studied much, but that's highly speculative.


Having good-quality, curated regex benchmarks would be quite useful! I hope you plan on having several features, and on including engines that only have partial support. That would make for very interesting comparisons.


It does. And more. The only thing you have to do is provide a short program that parses the description of the benchmark on stdin, and then output a list of samples that consist of the time it took to run a single iteration and the "result" of the benchmark for verification. The harness takes over from there. There's no need to have any Unicode support at all. I even have a program for benchmarking `memmem`, which is of course not a regex engine at all.


`ocaml-ctypes` currently supports "reverse bindings" (making OCaml functions available from C) out of the box and mostly takes care of the intersection you are talking about, so this already works quite well.

The only gain from emitting C code is portability to weird architectures that would be covered by C compilers but not the OCaml one, which is arguably a pretty niche use case.


I thought OCaml was dropping support for 32-bit ARM. I wouldn't call that a weird architecture to want to run on.


I would encourage you not to use DOIs for software. They were not made for this and have limitations that make them a poor fit.

Instead, use Software Heritage https://www.softwareheritage.org/ , which provides unique identifiers and actually understands repositories, versioning, history, etc. It also allows you to cite the software and even give proper, durable links to point to the code.


Why not just link to a specific commit hash? What more do they provide?

