What we implement is much closer to this suggestion. We move the code out to a couple machines from each of our different web frontend servers. After a minute we compare before and after across numerous metrics (load average, cpu, errors, page failures, etc). If the revision passes, we roll it out to 100% of the cluster and do the same monitoring for 5 minutes.
Originally we intended to put business metrics in these tests, but it turns out we regress on them via code changes rarely and it takes a human to figure out what went wrong. Instead we test business metrics (and lots of other stuff) via nagios, which gives us 1-5 minute sampling frequency, good enough for most of our issues.
I did not cover how you would iterate towards the ideal (instant deploy) and what concessions you might have to make. Our 6 minute deploy is actually quite inefficient, but it's not the bottleneck to our deploy system.
(If you're wondering what our bottleneck is, our automated tests take 9 to 12 minutes despite being spread across 40 machines... Selenium in Internet Explorer is slow.)
Originally we intended to put business metrics in these tests, but it turns out we regress on them via code changes rarely and it takes a human to figure out what went wrong. Instead we test business metrics (and lots of other stuff) via nagios, which gives us 1-5 minute sampling frequency, good enough for most of our issues.
I did not cover how you would iterate towards the ideal (instant deploy) and what concessions you might have to make. Our 6 minute deploy is actually quite inefficient, but it's not the bottleneck to our deploy system.
(If you're wondering what our bottleneck is, our automated tests take 9 to 12 minutes despite being spread across 40 machines... Selenium in Internet Explorer is slow.)