A/B tests are tools for using data to learn about a mechanism of action. They play a role in a multi-step process for optimizing a metric. In this process:
- You formulate a hypothesis about what might cause a metric to go up or down
- You collect data in an experiment to measure some effect that validates or refutes the hypothesis
- After validating or refuting your hypothesis, you can ship the appropriate variation to realize the gains on the metric you measured in the experiment
Multi-armed bandits (MABs), on the other hand, are one-stop shops that seemingly roll all three of the above steps into a single package. So how should you think about when to take a measured, A/B testing approach versus an automated MAB approach?
Ephemeral effects
Long story short: MABs are most useful when there are likely ephemeral effects at play that would cause learnings from an A/B test to quickly become irrelevant over time.
For example, consider the following scenarios:
- Sales promotions that are only valid for a fixed period of time (e.g., Black Friday)
- A machine learning system operating in an environment subject to rapidly changing rules and regulations
In both cases, running an A/B test is potentially inefficient. By the time you finish collecting enough data to be confident that one variation is better than the others, your window for capitalizing on that learning by shipping that variation may be gone—because the sales promotion time window has passed, or rules and regulations have changed so much that the winning variation you identified is now no longer compliant.
In cases like these, it would be more efficient to have an MAB start exploiting leading variations sooner, giving them more traffic earlier and thereby taking advantage of the limited time window. This concept is known as regret minimization: the delicate balancing act of adjusting traffic allocations so that every variation gets enough data to tell which one is superior, while also pushing more traffic to the leading variations so you don’t miss your limited window of opportunity.
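To make that balancing act concrete, here is a minimal sketch of a Thompson sampling bandit for a binary metric like CTR. The variation names, traffic numbers, and "true" CTRs are purely illustrative, and real platforms typically add safeguards such as minimum traffic floors.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative "true" CTRs, unknown to the algorithm.
true_ctr = {"control": 0.10, "treatment": 0.11}

# Beta(1, 1) priors: track clicks (successes) and non-clicks (failures) per variation.
successes = {arm: 1 for arm in true_ctr}
failures = {arm: 1 for arm in true_ctr}

for _ in range(10_000):
    # Thompson sampling: draw a plausible CTR for each arm from its posterior
    # and serve the arm with the highest draw. Leading arms get more traffic,
    # but uncertain arms still get explored.
    draws = {arm: rng.beta(successes[arm], failures[arm]) for arm in true_ctr}
    arm = max(draws, key=draws.get)

    clicked = rng.random() < true_ctr[arm]  # simulate this visitor's click
    successes[arm] += clicked
    failures[arm] += not clicked

for arm in true_ctr:
    visitors = successes[arm] + failures[arm] - 2  # subtract the prior pseudo-counts
    observed = (successes[arm] - 1) / max(visitors, 1)
    print(f"{arm}: {visitors} visitors, observed CTR {observed:.3f}")
```

Running this, the treatment ends up receiving the bulk of the traffic while the control still gets enough visitors to keep the comparison honest, which is exactly the exploit-while-exploring behavior described above.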
Precision measurement
Using MABs is not recommended when you require a precise measurement of the effect that a variation has on a metric over the control. This is because MABs are susceptible to a phenomenon known as Simpson’s Paradox, a measurement bias that occurs when seasonal effects (changes in user behavior tied to time-based patterns) are combined with shifting traffic allocation.
A Simpson’s Paradox example
Suppose you run a multi-armed bandit for two days to optimize the CTR (click-through rate) on a webpage. (I use CTR to make the example easier to follow, but the same principle applies to non-binary metrics such as latency in milliseconds.)
There are two variants: control and treatment. Suppose also that you can wave your hands and know the “true” CTR of each variant from day to day, and that these rates change from Day 1 to Day 2, like so:
| | Day 1 | Day 2 |
|---|---|---|
| Control | 10% | 20% |
| Treatment | 11% | 22% |
Think of this as a seasonal effect going from weekday to weekend, e.g., Day 1 = Friday and Day 2 = Saturday. Shoppers arriving on Saturday might be more likely to click on a promotion than shoppers arriving during the work week. Crucially, note that the relative difference between treatment and control on both days (and thus overall) is constant: the treatment converts 10% better than the control.
If the traffic allocation is constant over these two days, as in an ordinary A/B test (e.g., 50/50 split), then you might see the following results:
| | Day 1 | Day 2 | Cumulative |
|---|---|---|---|
| Control | 10% (100 users) | 20% (100 users) | 30/200 = 15% |
| Treatment | 11% (100 users) | 22% (100 users) | 33/200 = 16.5% |
Note that, in this case, the cumulative relative difference in CTR is still 10%: (16.5 - 15.0) / 15.0 = 0.10, or a 10% relative lift.
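As a quick arithmetic check, the cumulative rates and relative lift under the even split follow directly from the per-day figures (user counts copied from the table above):

```python
# Constant 50/50 split: 100 users per variation per day (from the table above).
control_clicks = 0.10 * 100 + 0.20 * 100    # 30 clicks out of 200 users
treatment_clicks = 0.11 * 100 + 0.22 * 100  # 33 clicks out of 200 users

control_ctr = control_clicks / 200          # 0.150
treatment_ctr = treatment_clicks / 200      # 0.165

relative_lift = (treatment_ctr - control_ctr) / control_ctr
print(f"Relative lift: {relative_lift:.1%}")  # 10.0%, matching the true per-day lift
```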
Now, suppose you have a super-aggressive MAB that sees the treatment leading on Day 1 and then decides to shift 90% of the traffic to the treatment for Day 2:
| | Day 1 | Day 2 | Cumulative |
|---|---|---|---|
| Control | 10% (100 users) | 20% (20 users) | (10+4)/120 = 11.7% |
| Treatment | 11% (100 users) | 22% (180 users) | (11+39.6)/280 = 18.1% |
Now the relative difference between Treatment and Control is (18.1 - 11.7)/11.7 = 0.547 = 54.7%! Simply by directing more traffic to the better variation, the MAB has inflated the apparent gap between treatment and control by more than five times.
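The same arithmetic with the skewed Day 2 allocation shows the inflation directly (user counts copied from the table; the exact figure differs slightly from 54.7% only because the table values are rounded):

```python
# Day 2 traffic shifted 90/10 toward treatment: 20 vs. 180 of the 200 users.
control_clicks = 0.10 * 100 + 0.20 * 20     # 14 clicks out of 120 users
treatment_clicks = 0.11 * 100 + 0.22 * 180  # 50.6 clicks out of 280 users

control_ctr = control_clicks / 120          # ~0.117
treatment_ctr = treatment_clicks / 280      # ~0.181

relative_lift = (treatment_ctr - control_ctr) / control_ctr
print(f"Apparent relative lift: {relative_lift:.1%}")  # ~54.9%, versus a true lift of 10%
```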
To summarize, shifting traffic around while the underlying conversion rates are changing can cause MABs to produce skewed estimates of the gap between variations. In the scenarios where MABs shine, this is usually an acceptable trade-off because, as mentioned above, their chief aim is to exploit a limited window of opportunity rather than to provide accurate estimates for long-term learning.
Takeaways
A/B experiments and MABs are both useful tools for an optimization-oriented practitioner. However, MABs have a clear niche that gives them an advantage over experiments in certain situations.
A/B experiments remain the gold standard for measuring the effect of a variation over a baseline, offering simple, transparent, and accurate results. The results are immune to Simpson’s Paradox, and they encourage a scientific approach (create hypothesis > validate/refute with data > iterate)—which lays a solid foundation for a data-driven culture.
MABs are perfect for those times when a measured, scientific approach is too slow. Limited-time promotions or models facing rapidly changing environments are great scenarios for deploying an MAB, sitting back, and letting the algorithm optimize your primary metric.