pg_stat_statements is useful, but it isn't completely free, e.g. [1] (admittedly from pg12, which is three major versions ago).
Most people should probably leave it enabled, but if you have a database with a very defined and understood query pattern, then it might be worth running perf and checking if it is costing you.
Most modern processors are fast enough that it's cheaper to check than not to check. Some people might argue for lower-level optimizations like this, but in this case I don't think it's a big deal.
Bottom line: enabling this doesn't hurt. It might help, and it doesn't cost anything to check.
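For anyone who wants to measure rather than guess, enabling and querying the extension is straightforward. A sketch (column names assume pg13+, where the older total_time was split into total_exec_time):

```sql
-- postgresql.conf (loading the library requires a restart)
shared_preload_libraries = 'pg_stat_statements'

-- then, in the target database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- top 5 statements by cumulative execution time
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```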
Wondering about pgbouncer: a Postgres connection can have state attached, e.g. via SET. Does pgbouncer save and restore that state, with an associated perf cost, or does a pooled connection just receive whatever state the previous owner left behind, on the assumption that all state should be identical?
pgbouncer supports session and transaction pooling. With session pooling, a connection is associated with the same session throughout its use. With transaction pooling, each transaction on the connection may get a different session.
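The difference between the modes shows up in the config. A pgbouncer.ini sketch (the reset behavior described in the comments is my understanding of the defaults; worth verifying against your version's docs):

```ini
; pgbouncer.ini (sketch)
[pgbouncer]
; session: one server connection per client session, so SET state is safe
; transaction: server connections are handed out per transaction, so
;              session state (SET, prepared statements) can leak
pool_mode = transaction

; In session mode, this runs when a connection returns to the pool:
server_reset_query = DISCARD ALL
; In transaction mode the reset query is skipped by default (clients are
; expected to avoid session state); setting this to 1 forces it, at a cost.
server_reset_query_always = 0
```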
What does Crunchy cost? The page says to ask, which typically means you can't afford it. I wonder if it's feasible for a small company. Is it competitive with RDS?
The starting prices for hobby (i.e. development) instances [0] are about half of what you would pay for AWS RDS [1]; if you choose high availability, the prices come out equal. This is for Crunchy Bridge, which has a calculator on its website. As with Amazon RDS, though, prices increase dramatically once you move to a decent machine and larger storage. I'm using us-east-1 for the comparison.
Honestly, as always, I think these managed database services are way too expensive. If I really needed to choose one of them I would go with the AWS service, because you also have to factor in data egress costs in your cloud costs calculation which will vary a lot depending on your workload.
It's quite a bit more than RDS/cloud-provider managed. I don't see a reason to use them over a cloud provider's native service if you're already in that provider's Hotel California.
Product manager here from Crunchy. We aim to be quite in line with any of the major cloud providers; a number of customers have moved over from each of them and seen better price-to-performance and quality of support.
Just as a data point: we are planning to move to Crunchy and away from Heroku soon. We tested the two, and we have 4X improved performance on Crunchy Postgres vs Heroku Postgres.
4X. We were astounded. And so were the Heroku folks I talked to, who I'm guessing will want to try to bring similar performance boosts.
We haven’t done the switch quite yet because we’re waiting on Crunchy to bring out their Heroku Dataclips replacement, which is something we rely on a lot. But once it’s ready, we’re going to switch!
Great. Wonder if you offer a la carte consulting on modeling, performance, etc.? Wouldn't need it often, but rather occasional "ask an expert" help when hitting a bottleneck. This is a bit different than support.
We don't do this so much a la carte for non-customers, but we definitely have it available for any of our existing customers. It's a common request, and in general we stay very busy with it already. I often tell customers to begin with a support request; often in 15-30 minutes we can give some helpful insights without having to go for a bucket of consulting hours. It just depends on what and how much folks need.
One thing I really like about RDS is the query visualizer. It shows which queries ran when and for how long and what they spent their time waiting for, CPU or disk or locks or whatever. It’s really good. Are there any free tools that show a similar view?
edit: i think it’s called performance insights in RDS
Would be a whole lot better if it didn't prune their query logs to x (I think 1000?) characters.
We autogenerate many of our own queries which can have significant complexity (regularly over 10 joins, sometimes up to 30!) and our infrastructure isn't quite there yet to be able to recreate the exact query plan a customer saw on their own data without a lot of work. It could all be so much simpler, so if there is a setting to prevent this please tell me!
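If the truncation is happening on the Postgres side rather than inside Performance Insights itself, my understanding (worth verifying) is that it's governed by track_activity_query_size, which caps the query text visible in pg_stat_activity, the view Performance Insights samples:

```ini
# postgresql.conf / RDS parameter group (requires a restart)
# Default is 1024 bytes; query text longer than this is truncated.
track_activity_query_size = 32768
```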
Which is a great thing. If this were on YouTube, we'd have a 20-minute video with a long, unnecessary introduction and a sneaky one-minute ad in the middle.
>There seems to be a common lifecycle of indexes within applications. First you start off with almost none, maybe a few on primary keys. Then you start adding them, one by one, two by two, until you've got quite a few indexes for most any query you can run. Something is slow? Throw an index at it. What you end up with is some contention on overall throughput of your database, and well a lot of indexes that became a tangled ball of yarn over time.
I spent years doing index maintenance on a large corporate database. Creating and rebuilding indexes was a hassle. So I set about building a new kind of database engine where every column in a table organizes its data for fast access (e.g. a columnar store with no separate indexing structure).
I expected my system to be substantially faster than a Postgres table without indexes for a broad range of queries. What I didn't expect was for it to also be much faster for tables that were highly indexed in Postgres. I ran tons of queries on both systems and almost all of them were like this video:
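For readers unfamiliar with why a per-column layout can beat even an indexed row store on scan-style queries, here's a toy Python sketch of the general idea (nothing here reflects Didget's actual implementation): a predicate on one column only touches that column's contiguous array, instead of every field of every row.

```python
# Toy illustration: filtering on one column in a columnar layout scans
# only that column's array; a row store must visit whole records.

rows = [  # row-oriented: each record stored together
    {"id": i, "city": "Chicago" if i % 4 == 0 else "Other", "amount": i * 10}
    for i in range(1_000)
]

# column-oriented: one array per column
columns = {
    "id": [r["id"] for r in rows],
    "city": [r["city"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# Row store: scan whole records to evaluate the predicate.
row_hits = [r["id"] for r in rows if r["city"] == "Chicago"]

# Column store: scan only the 'city' array, then fetch matching ids.
col_hits = [columns["id"][i]
            for i, c in enumerate(columns["city"]) if c == "Chicago"]

assert row_hits == col_hits
print(len(col_hits))  # 250 rows match
```

Real columnar engines add compression and vectorized execution on top of this, which widens the gap further.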
You self-promote here often¹ (not judging, just an observation), and here are some thoughts that have occurred to me while reading your posts and comments.
• Your vision for Didget is blurry. Sometimes it's a database (or "general purpose data management system"), sometimes it's a file system. Products that are dessert toppings and floor waxes rarely find purchase. My advice: Decide which problem domain you're playing in, then which problem(s) within that domain this solves.
• The Postgres comparison seems naive. As the Postgres wiki notes, "PostgreSQL ships with a basic configuration tuned for wide compatibility rather than performance. Odds are good the default parameters are very undersized for your system." A comparison with a system-appropriate Postgres configuration would be more interesting.
• Promoting a new closed-source database seems anachronistic when there's a plethora of fine open-source options. If you're wondering why you're not getting many leads from HN, I'd imagine this is the main reason.
Whenever I mention my project's website in an HN comment, I am 'self-promoting' so you are correct. In my experience that is not something that is frowned upon on HN, since I see this kind of thing all the time (the article I commented on was promoting another product).
Didgets, as it exists today, is a technology rather than a 'product'. It consists of a set of highly optimized data objects that can be used to build complex systems like hierarchical file systems, relational database tables, logging frameworks, configuration managers, or content indexers. I have implemented enough functionality into the browser application to prove that it can do those things well, but none of them are fully implemented yet. I am trying to find the right product-market fit.
I have not yet open-sourced any of the code, but I probably will once I decide which open-source license is best for it. I am simply trying to introduce this technology to others to see if there is any interest in it.
Note: While I didn't explore every configuration option in Postgres (so I can't be confident it ran as fast as it possibly could), I did not just use the default configuration. I tried all the usual tricks to speed it up.
> It consists of a set of highly optimized data objects that can be used to build complex systems like hierarchical file systems, relational database tables, logging frameworks, configuration managers, or content indexers.
That sounds like a "general-purpose database", unless there's something that disqualifies Didget from being that. If it's a database, you're probably competing with something closer to the free and open-source SQLite than you are the free and open-source Postgres. This can be done (see rqlite), but in any case it seems like being free and open-source would be necessary (although not sufficient) for people to consider it.
> I tried all the usual tricks to speed it up.
I don't know what "usual tricks" means (maybe you have a link?), but I think it would add some credence to performance claims to specify what you're doing.
One potential hint is the Postgres queries are being executed as sequential scans, as you can see from the explain output. The first query only returns 200k rows (out of 7M), so even without any config tweaks it should be using an index if one is present. So, my guess is those queries are being made against unindexed columns. Also, I'm guessing this is vanilla Postgres, so it's possible OP hasn't tweaked its config for the machine they're using, and the default config pg ships with is designed for fairly resource-constrained machines by today's standards. If the data is indexed, it's possible the rows are all over the place, and clustering could help.
In addition, I wonder if random_page_cost has been lowered toward 1. On machines with SSD-only storage, I've noticed that lowering it makes the query planner much more predictable: it favors indexes, and sequential scans go almost unused.
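A quick way to test both guesses (the table and column names below are made up for illustration, not taken from the video):

```sql
-- See whether the planner picks a sequential scan or an index scan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM crimes WHERE primary_type = 'THEFT';

-- The default random_page_cost of 4.0 overstates the penalty of random
-- I/O on SSDs; values near 1 make index scans more attractive.
SET random_page_cost = 1.1;   -- session-level, for experimentation
-- or persist it cluster-wide:
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
```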
I would love it if even one of the database experts who are so sure the comparisons are somehow rigged would actually run their own benchmark to prove them wrong instead of just speculating.
Get any moderately sized table (millions of rows and a dozen or more columns) on your 'highly-tuned' Postgres database with all the proper indexes in place, load that same data into Didgets (it takes all of 10 minutes to download the software and start using it), and run whatever queries suit your fancy.
My post doesn't assume bad intent. As the new entrant in a space, the onus is on you to show your work. So, if you want people to run their own benchmark then share the data and configs you used.
I actually tried to find the dataset you used yesterday to run my own benchmark, but the only official "Chicago crime database" I could find had only 300k rows and 19 columns, so I assumed it wasn't the same data you mention in your video.
Even though I replied to your post, it was more a general observation than a reply to you specifically. My video has been viewed over a thousand times, and I get comments all the time saying it can't be real or that I must be handicapping the other databases (I also have a video comparing against SQLite) by not configuring them correctly.
I have yet to have a single person who claims to be a database expert try it out on their own favorite data set and tell me that they found Didgets to be slower than their preferred database engine.
The engine was created from scratch. It can do JOINs but hasn't yet implemented all the different kinds of joins.
It is a completely different architecture (originally designed as a file-system replacement where multiple tags could be attached to each and every file), and the database functionality was almost discovered by accident. The tags I invented to make finding files extremely fast looked a lot like a columnar store, so I tried building regular relational tables out of them. It surprised me when queries against my tables gave other databases (with decades of development behind them) a real run for their money.
A more generous interpretation would be the assumption that primary keys are (as they frequently are) synthetic data, and perhaps not the most commonly queried field aside from direct get-that-resource queries.
GP edited their comment after I responded and didn’t mention it. Completely changed the comment. Was very antagonistic to the author of the article before.
A DBMS still has to identify a particular row somehow. That identifier is the CTID in Postgres and the ROWID in Oracle; both change if the data is moved around.
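In Postgres you can watch this happen directly (`items` here is a hypothetical table):

```sql
-- ctid encodes (block number, tuple index): the row's current physical
-- location in the heap. It is stable only until the row is moved.
SELECT ctid, * FROM items LIMIT 3;

-- An UPDATE (or VACUUM FULL / CLUSTER) writes a new row version at a new
-- location, so the ctid changes; don't persist ctids as long-lived ids.
UPDATE items SET updated_at = now() WHERE id = 1;
SELECT ctid FROM items WHERE id = 1;
```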
[1] https://www.alibabacloud.com/blog/postgresql-v12-how-pg-stat...