What we implement is much closer to this suggestion. We move the code out to a c...

What we implement is much closer to this suggestion. We move the code out to a couple machines from each of our different web frontend servers. After a minute we compare before and after across numerous metrics (load average, cpu, errors, page failures, etc). If the revision passes, we roll it out to 100% of the cluster and do the same monitoring for 5 minutes.

Originally we intended to put business metrics in these tests, but it turns out we regress on them via code changes rarely and it takes a human to figure out what went wrong. Instead we test business metrics (and lots of other stuff) via nagios, which gives us 1-5 minute sampling frequency, good enough for most of our issues.

I did not cover how you would iterate towards the ideal (instant deploy) and what concessions you might have to make. Our 6 minute deploy is actually quite inefficient, but it's not the bottleneck to our deploy system.

(If you're wondering what our bottleneck is, our automated tests take 9 to 12 minutes despite being spread across 40 machines... Selenium in Internet Explorer is slow.)