SQLGlot is great. We've used it to extend our FOSS probabilistic data linking library[1] so that it is now capable of executing against a variety of SQL backends (Spark, Presto, DuckDB, SQLite), significantly widening our potential user base.
We implement the core statistical model in SQL, and then use SQLGlot to transpile to the target execution engine. One big motivation was to future-proof our work: we're no longer tied to Spark, so when the 'next big thing' (GPU-accelerated SQL for analytics?) comes along, it should be relatively straightforward to support it by writing another adaptor.
Working on this has highlighted some of the really tricky problems associated with translating between SQL engines, yet we haven't hit any major issues, so kudos to the author!
First, the SQL involves complex analytical queries on large datasets that need careful pipelining, caching and optimisation. I wasn't sure to what extent this was possible in SQLAlchemy.
Second, it was important that our implementation of the EM algorithm (an iterative numerical approach for maximising a likelihood function) was readable and understandable, and I felt that readers of the code were more likely to know SQL than SQLAlchemy. Certainly I was more comfortable expressing it in SQL than in another (i.e. SQLAlchemy's) API.
Third, our API allows the user to inject SQL to customise their data linking models and it felt more natural for this to be directly executed rather than go through an abstraction layer.
I'm not a SQLAlchemy expert, but my sense is that it's more appropriate for transactional/atomic SQL than for complex analytical queries.
Well, you're not wrong per se: SQLAlchemy is primarily an ORM. However, one of its components is a module purely for SQL generation (the Core expression language), which you could (not saying you should) use to build complex SQL queries. Obviously its main concern is querying and updating rows, but pretty much all common SQL constructs are supported (and if not, it is fairly easy to add your own).
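For a sense of what that looks like (a minimal sketch with made-up table and column names, not tied to the library being discussed), SQLAlchemy Core lets you compose a query as Python expressions and render it to SQL without touching the ORM:

```python
from sqlalchemy import Column, Integer, MetaData, Table, func, select

metadata = MetaData()
events = Table(
    "events",
    metadata,
    Column("user_id", Integer),
    Column("value", Integer),
)

# Compose a grouped aggregate as Python expressions rather than raw SQL.
stmt = (
    select(events.c.user_id, func.sum(events.c.value).label("total"))
    .group_by(events.c.user_id)
)

# str(stmt) renders generic SQL; compiling against a specific engine's
# dialect would produce backend-appropriate SQL instead.
print(stmt)
```

The dialect-aware compilation step is where the "pretty good support for quite a lot of databases" mentioned below comes in.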
I can't say that I've ever had to implement an EM algorithm that way, though, so I can't say how complicated it would be. Certainly it'll take more effort than writing it in a familiar SQL dialect. The main reason to do it would be that SQLAlchemy has pretty good support for quite a lot of databases.
[1] https://github.com/moj-analytical-services/splink/tree/splin...