Stack Overflow runs on 9 web servers, each with (iirc) 48 logical cores (2 x 12-core Xeons) and 64GB RAM. Those servers are shared by a few apps (Talent/Job, Ads, Chat, Stack Exchange/Overflow itself), but the main app uses, on average, ~5% CPU. Those machines handle roughly 5,000 requests/sec and were running .NET 5 as of September 2021 (when I moved on). That's backed by two very large SQL clusters, each consisting of a primary read/write server, a read-only secondary in the primary DC, and another secondary in the failover DC. Most traffic to a question page hits SQL directly - the cache hit ratio tends to be low, so caching those reads in Redis tends not to be useful. As somebody mentioned below, being just a single network hop away yields really low latency (~0.016ms in this case), which certainly helps with scaling on little hardware - typically only 10-20 concurrent requests would be running on any instance at any one time, because a request takes < 10ms end-to-end.
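The relationship between throughput, latency, and concurrency here can be sanity-checked with Little's Law (L = λW). A minimal sketch, using the figures from the comment above; the even per-server split is my assumption, and the steady-state result comes out a bit below the quoted 10-20, which presumably reflects bursty traffic and slower requests in the mix:

```python
# Back-of-the-envelope concurrency check via Little's Law: L = lambda * W.
# Figures are from the comment; an even split across servers is assumed.

total_rps = 5000           # requests/sec across the web tier
servers = 9                # web servers
latency_s = 0.010          # < 10 ms end-to-end per request

rps_per_server = total_rps / servers        # ~556 req/s per server
concurrent = rps_per_server * latency_s     # L = lambda * W, per server

print(f"~{concurrent:.1f} concurrent requests per server at steady state")
```

This is an average over time; momentary concurrency can sit well above the steady-state figure when requests arrive in bursts.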
Back in the full framework days we had to do a fair bit of optimisation to get great performance out of .NET, but as of .NET Core 3.1 the framework _just gets out of the way_ - most memory dumps and profiling since then clearly pinpoint problem areas in your own app rather than being muddied by framework shenanigans.
Source: I used to work on the Platform Engineering team at Stack Overflow :)
That's surprising to read. Is that because of the sheer volume of question pages? I don't think I've ever been on an SO page that couldn't have been served straight from cache.
Is it? Most people come to SO from Googling their random tech problems/questions. Not sure how much value there is in caching my random Rails questions, etc
I would expect SO usage to follow a distribution like Zipf's - most visits hit a small subset of common Q&As, and there's a ridiculously long tail of random questions getting a few visits each, where caching would do next to nothing. I'm fairly positive I've seen a post showing this was true at least for answer-score distributions.
Though I guess it's possible for a power-law distribution over page popularity to still be useless for caching: you could get such a distribution even when the vast majority of hits land on nearly-unique pages. With a long enough tail, only relatively few pages would be worth caching at all, yet most visits would still fall in the tail.
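The two comments above can be reconciled with a toy simulation (not Stack Overflow data): sample page views from a Zipf-like rank distribution and measure what fraction of traffic a "cache only the top N pages" policy would capture. The exponent `a` and the cache size are illustrative assumptions; the point is that the same family of distributions can make caching either very effective or nearly pointless depending on how heavy the tail is:

```python
import numpy as np

rng = np.random.default_rng(0)

def hit_ratio(a, top_n=10_000, n_views=2_000_000):
    """Fraction of simulated views landing on the top_n most popular
    pages, when each view's page rank is drawn from a Zipf(a)
    distribution (rank 1 = most popular page)."""
    ranks = rng.zipf(a=a, size=n_views)   # one sampled page rank per view
    return np.mean(ranks <= top_n)        # views served by the cached set

# A steep distribution concentrates traffic on few pages (cache wins);
# a shallow one (a close to 1) spreads it into the tail (cache loses).
for a in (2.0, 1.1):
    print(f"a={a}: top-10k pages capture {hit_ratio(a):.0%} of views")
```

With a steep exponent nearly all traffic lands on the cached set; as `a` approaches 1 the tail absorbs a large share of visits, which matches the low cache hit ratio described in the top comment.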