I wonder if you could do some automated A/B system here, and instead of every co...

TimothyFitz · on Feb 9, 2009

What we implement is much closer to this suggestion. We move the code out to a couple machines from each of our different web frontend servers. After a minute we compare before and after across numerous metrics (load average, cpu, errors, page failures, etc). If the revision passes, we roll it out to 100% of the cluster and do the same monitoring for 5 minutes.
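
For illustration, the before/after comparison could be sketched roughly like this in Python. The metric names follow the list above, but the `Metrics` class, the `regressions` function, and the tolerance values are all hypothetical; this is a sketch of the idea, not the actual deploy code described here.

```python
from dataclasses import dataclass, fields


@dataclass
class Metrics:
    """One sample of cluster health (hypothetical field names)."""
    load_average: float
    cpu_percent: float
    errors_per_min: float
    page_failures_per_min: float


def regressions(before, after, rel_tol=0.20, abs_slack=0.5):
    """Return the names of metrics that got noticeably worse.

    A metric counts as regressed when the new value exceeds the old value
    by more than rel_tol (relative) plus abs_slack (absolute). Both
    thresholds are assumptions chosen for this example.
    """
    worse = []
    for field in fields(before):
        old = getattr(before, field.name)
        new = getattr(after, field.name)
        allowed = old * (1 + rel_tol) + abs_slack
        if new > allowed:
            worse.append(field.name)
    return worse


def canary_passes(before, after):
    """True when no metric regressed, i.e. the revision may roll out further."""
    return not regressions(before, after)


before = Metrics(load_average=1.0, cpu_percent=40.0,
                 errors_per_min=2.0, page_failures_per_min=0.0)
healthy = Metrics(load_average=1.1, cpu_percent=42.0,
                  errors_per_min=2.0, page_failures_per_min=0.0)
broken = Metrics(load_average=1.1, cpu_percent=42.0,
                 errors_per_min=9.0, page_failures_per_min=0.0)
```

Under these assumptions `canary_passes(before, healthy)` holds, while `regressions(before, broken)` flags `errors_per_min`, which is the point where a deploy would be rolled back instead of going to 100% of the cluster.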

Originally we intended to put business metrics in these tests, but it turns out we rarely regress on them via code changes, and it takes a human to figure out what went wrong. Instead we test business metrics (and lots of other stuff) via Nagios, which gives us a 1-5 minute sampling frequency, good enough for most of our issues.

I did not cover how you would iterate towards the ideal (instant deploy) and what concessions you might have to make. Our 6-minute deploy is actually quite inefficient, but it's not the bottleneck in our deploy system.

(If you're wondering what our bottleneck is, our automated tests take 9 to 12 minutes despite being spread across 40 machines... Selenium in Internet Explorer is slow.)

inerte · on Feb 8, 2009

What faults?

I ask because I swear A/B and multivariate tests have been around my head a lot lately, and when I finished reading the article, the first thing I thought was: Why not just deploy to 1% of the users and see if it works?
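
One common way to pick "1% of the users" is to hash each user ID into a stable bucket, so the same user always lands on the same side of the split. A minimal Python sketch follows; the `in_rollout` function, the salt string, and the user ID format are assumptions made up for this example.

```python
import hashlib


def in_rollout(user_id, percent, salt="release-candidate"):
    """Decide deterministically whether a user sees the new version.

    The user ID is hashed together with a per-release salt (hypothetical)
    and mapped onto 10,000 buckets, so percent can be as fine as 0.01.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Roughly 1% of a synthetic user population should be selected.
users = [f"user-{n}" for n in range(20_000)]
selected = [u for u in users if in_rollout(u, 1.0)]
```

Changing the salt for each release would reshuffle which users get the new code, so the same 1% are not always the guinea pigs.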

Then I thought about how hard it would be to manage multiple versions of the same software, especially the data, among different users. Certain features presented to the 1% might be incompatible with the other 99%. But that's a technical problem. Very hard to solve, but manageable. Then I imagined some kind of framework that would make communication between the different versions of the data floating around easier, with "how to transform" rules for converting data between version 1.1 and 1.2 in either direction.
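
To give a rough idea of what such a framework might look like, here is a small Python sketch: a registry of transform functions keyed by (from version, to version). Everything in it (the `transform` decorator, the `convert` function, the record layout with a `name` field that gets split in 1.2) is invented for illustration.

```python
# Hypothetical registry: (from_version, to_version) -> transform function.
TRANSFORMS = {}


def transform(src, dst):
    """Decorator that registers a function converting records from src to dst."""
    def register(fn):
        TRANSFORMS[(src, dst)] = fn
        return fn
    return register


@transform("1.1", "1.2")
def split_name(record):
    # Version 1.2 (in this made-up example) stores first and last name separately.
    first, _, last = record["name"].partition(" ")
    out = {k: v for k, v in record.items() if k != "name"}
    out.update(first_name=first, last_name=last, version="1.2")
    return out


@transform("1.2", "1.1")
def join_name(record):
    # Going back to 1.1 merges the two fields into a single name again.
    out = {k: v for k, v in record.items()
           if k not in ("first_name", "last_name")}
    full = f'{record["first_name"]} {record["last_name"]}'.strip()
    out.update(name=full, version="1.1")
    return out


def convert(record, target):
    """Return the record as the target version expects it."""
    src = record["version"]
    if src == target:
        return record
    try:
        fn = TRANSFORMS[(src, target)]
    except KeyError:
        raise ValueError(f"no transform from {src} to {target}") from None
    return fn(record)


old = {"version": "1.1", "id": 7, "name": "Ada Lovelace"}
new = convert(old, "1.2")
```

Here `convert(new, "1.1")` gives back the original record. Real data is rarely this symmetric: a transform that drops information cannot be reversed, which is a large part of why the problem is hard.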

Anyway, I am really curious, because it sounds like a good solution :)

DenisM · on Feb 9, 2009

As far as data transformation goes you have two options:

1. Something akin to ActiveMigration in the Ruby on Rails world. This allows going back and forth between different versions of your data's schema.

2. Use a more open data scheme, such as the one Google App Engine uses, where adding/removing properties on an object is not as disruptive as in SQL-based solutions.
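
The first option boils down to pairing every schema change with its inverse. As a minimal sketch of that idea in Python with the standard `sqlite3` module (this is not the Rails API; the `MIGRATIONS` list, the table names, and the `migrate` function are assumptions for this example):

```python
import sqlite3

# Each entry: (version, SQL to apply it, SQL to undo it). Hypothetical schema.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
        "DROP TABLE users"),
    (2, "CREATE TABLE avatars (user_id INTEGER, url TEXT)",
        "DROP TABLE avatars"),
]


def current_version(db):
    """Read the schema version stored in SQLite's user_version pragma."""
    return db.execute("PRAGMA user_version").fetchone()[0]


def migrate(db, target):
    """Move the schema up or down, one step at a time, until it is at target."""
    if not 0 <= target <= len(MIGRATIONS):
        raise ValueError(f"unknown schema version {target}")
    version = current_version(db)
    while version < target:
        number, up_sql, _ = MIGRATIONS[version]
        db.execute(up_sql)
        version = number
        db.execute(f"PRAGMA user_version = {version}")
    while version > target:
        number, _, down_sql = MIGRATIONS[version - 1]
        db.execute(down_sql)
        version = number - 1
        db.execute(f"PRAGMA user_version = {version}")
    db.commit()


def table_names(db):
    """Return the set of user tables currently in the database."""
    rows = db.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {row[0] for row in rows}


db = sqlite3.connect(":memory:")
migrate(db, 2)   # apply both migrations
```

Calling `migrate(db, 1)` afterwards drops the `avatars` table again. Note that a "down" step like `DROP TABLE` throws away whatever data was in the table, so going backwards on a real database is rarely this painless.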

TimothyFitz · on Feb 10, 2009

It turns out the system works fine with SQL-based alters. We do have to do real work to deploy expensive alters (apply them to standbys, fail over, repeat, or worse) but in general it's cheap to change schemas.

Unfortunately, rolling back schema changes takes a lot of manual work, so it's one of the few places where we put old-school process in place (a DBA who reviews all schema changes prior to deployment).

DenisM · on Feb 10, 2009

Did you look at ActiveMigration? Even though I have never used Ruby/RoR, I found it a very sensible approach to thinking about schema evolution.

TimothyFitz · on Feb 11, 2009

ActiveMigration really solves a different problem. Our problem is that adding indexes or altering popular tables is impossible to do on a live production database. To get those changes out we have to go through quite a bit of extra work. It's really a MySQL limitation, not a process problem.

DenisM · on Feb 9, 2009

This is an excellent idea, please continue to explore it.

As my token contribution: Google App Engine allows storing and accessing several versions of the app (accessed through different subdomains). Perhaps one could use DNS to trick different users into seeing different app versions? Not quite what you wanted, but down the right path.