Hacker News

ClickHouse is criminally underused.

It's common knowledge that 'Postgres is all you need' - but if you somehow reach the stage of 'Postgres isn't all I need, and I have hard proof', this should be the next tech you look at.

Also, clickhouse-local is rather amazing at CSV processing using SQL. Highly recommended for when you're fed up with Google Sheets or even Excel.
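For flavor, a minimal sketch of that workflow (the file name and columns here are hypothetical; CSVWithNames is ClickHouse's input format for CSVs with a header row):

```shell
# Aggregate a CSV straight from the command line, no server needed.
# sales.csv and its columns (customer, amount) are made up for this sketch;
# clickhouse-local infers the column types from the data.
clickhouse-local --query "
    SELECT customer, sum(amount) AS total
    FROM file('sales.csv', CSVWithNames)
    GROUP BY customer
    ORDER BY total DESC
    LIMIT 10
"
```

The same file() call reads Parquet or JSON too - you just change the format name.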



This is my take too. At one of my old jobs, we were early (very early) to the Hadoop and then Spark games. Maybe too early, because by the time Spark 2 made it all easy, we had already written a lot of mapreduce-streaming and then some RDD-based code. Towards the end of my tenure there, I was experimenting with alternate datastores, and ClickHouse was one I evaluated. It worked really, really well in my demos. But I couldn't get buy-in: management was a little wary of the Russian side of it (which they have now distanced/divorced from, I think?), and they no longer had the appetite for such a large undertaking (the org was going through some things). So instead a different team, blessed by the company owner, basically DIY'd a system to store .feather files on NVMe SSDs... anyway.

If I were still there, I'd be pushing a lot harder to finally throw away the legacy system (which has lost so many people it's basically ossified anyway) and just "rebase" it all onto ClickHouse and PySpark's Spark SQL. We would throw away so much shitty cruft, and a lot of the newer mapreduce and RDD code is portable enough that it could be plugged into RDD's pipe() method.

Anyway. At my current job, we just stood up a new product that, from day 1, was ingesting billions of rows of event data (~nothing for ClickHouse, to be clear, but obviously way too much for pg). And it's just chugging along. ClickHouse is definitely in my toolbox right after Postgres, as you state.


Agree. CH is a great technology to have some awareness of. I use it for "real things" (100B+ data points), but honestly it can really simplify little things as well.

I'd throw in one more to round it out, however. The three rings of power are Postgres, ClickHouse, and NATS. Postgres is the most powerful ring, though, and a lot of the time it's all you need.


Would you recommend ClickHouse over DuckDB? And why?


IMO the only reason not to use ClickHouse is when you have either a "small" amount of data or "small" servers (<100 GB of data, servers with <64 GB of RAM). Otherwise ClickHouse is the better choice, since it's a standalone DB that supports replication and in general has very robust cluster support, easily scaling to hundreds of nodes.

Typically, you discover the need for an OLAP DB when you reach that scale, so I'm personally not sure what the real use case for DuckDB is, to be completely honest.


There is another place where you should not use CH: a system with shared resources. CH loves to hog resources in spikes, and it has earned the right to. They even allude to this in the Keeper setup docs - if you put the nodes for the two systems on the same machine, CH will inevitably push Keeper off the bed and the two will come to a disagreement. You should not run it in a k8s pod for that reason, for example. But then again, you shouldn't have ANY storage of that capacity in a k8s pod anyway.


DuckDB probably performs better per core than ClickHouse does for most queries. So as long as your workload fits on a single machine (it likely does), it's often the most performant option.

Besides, it's so simple: just a single executable.
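To illustrate the single-executable point, a sketch with a hypothetical events.csv: the duckdb CLI can query a CSV by path directly, with no import or schema step.

```shell
# One binary, one command: DuckDB queries the file in place.
# events.csv is hypothetical; DuckDB infers the schema from the file itself.
duckdb -c "SELECT count(*) FROM 'events.csv'"
```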

Of course, if you're at a scale where you need a cluster, it's not an option anymore.


The good parts of DuckDB that you've mentioned, including the fact that it's a single executable, are modeled after ClickHouse.


Can you provide a reference for that belief? To me that's not true - they started out solving very different problems.


I didn't express myself well. What I meant to say was that DuckDB runs as a single process, which simplifies things.

ClickHouse typically runs several interacting processes (server, clients), and that already makes things more complicated (and more powerful!).

That's not to say one is good and the other bad, they're just quite different tools.


Note that every use case is different and YMMV.

https://www.vantage.sh/blog/clickhouse-local-vs-duckdb


Great link. Curious how it compares now that DuckDB is 1.0+.


Not to mention Polars, DataFusion, etc. The single-node OLAP space is really heating up.


ClickHouse scales from a local tool like DuckDB up to a database cluster that can back your reporting and other OLAP applications.


ClickHouse and Postgres are just different tools, though - OLTP vs. OLAP.


It’s fairly common in my experience for reports to initially be driven by a Postgres database until you hit data volumes Postgres cannot handle.



