This may sound familiar: you have been working on a new feature for weeks or months, and you excitedly release it to your users. Sadly, in the weeks following the release you notice a disturbing trend: a core metric that tracks user interest in your service is steadily decreasing!
What do you do? Do you roll back the feature update?
Since releasing the update, your product has changed through other deployments. Your users have changed. The world has changed. Can you be confident that the initial feature update is the root cause? Unless you release your feature update through an A/B test, it may be impossible to determine whether the update caused the metric change.
At their simplest, A/B tests randomize users into two groups.
During the A/B test one group sees the original variant and the other sees the new version (like in the A/B test by Azure Identity shown below).
External impacts (like additional product changes or world events) affect users in both groups, and the same aggregated metrics can be directly compared between the two variants. Because external factors are expected to affect the two randomly selected groups equally, the new variant can be isolated as the root cause of any significant difference in aggregated metric values between groups A and B.
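To make the mechanics concrete, here is a minimal sketch of one common way to do this kind of assignment: hash the user ID together with an experiment-specific salt to get a stable, effectively random 50/50 split. The function and salt names are illustrative, not ExP's actual implementation.

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str) -> str:
    """Deterministically assign a user to variant A or B.

    Hashing the user ID with an experiment-specific salt gives a stable,
    effectively random 50/50 split: the same user always lands in the same
    group, and assignment is independent of any external factor.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # map the hash to a bucket 0..99
    return "A" if bucket < 50 else "B"      # A = original variant, B = new variant

# Example: the assignment is stable across calls for the same user.
print(assign_variant("user-12345", "feature-update-2020-03"))
```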
For the earlier case of the declining metric, the feature update was released through an A/B test, so we can compare the two groups over the same time period. When we make the direct comparison, we see that the feature update was not responsible for the degradation: engagement among users seeing the old variant declined at the same rate as among users seeing the new variant, so the difference between the two groups is indistinguishable from zero. The A/B test data shows that rolling back the feature would not reverse the decline.
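As an illustration of the kind of comparison described above, the sketch below runs a Welch's t-test on a hypothetical per-user engagement metric for the two groups over the same time period. The data is simulated so that both groups sit at the same (degraded) level; the metric and numbers are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-user engagement values for the two groups over the same
# time period. In this scenario both groups have the same average engagement,
# even though that level is lower than before the external disruption.
engagement_a = rng.normal(loc=4.2, scale=1.5, size=50_000)  # A: old variant
engagement_b = rng.normal(loc=4.2, scale=1.5, size=50_000)  # B: new variant

# Welch's t-test on the difference in means between the two groups.
t_stat, p_value = stats.ttest_ind(engagement_b, engagement_a, equal_var=False)
delta = engagement_b.mean() - engagement_a.mean()

print(f"difference in means: {delta:+.4f}, p-value: {p_value:.3f}")
# A large p-value means the difference is statistically indistinguishable from
# zero: the decline appears in both groups, so the feature is not the cause.
```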
A/B Testing in the era of Covid-19
Scenarios like this one became fairly common at Microsoft around March of 2020. Indeed, the metric movement charted above is an example from a Microsoft product’s A/B test that started in early March. During this period, many people went through unprecedented upheaval in their daily routines as they lost jobs, had their hours reduced, or started working and studying from home. People’s lives changed immensely, and so their engagement with products and services changed too.
Although releasing and testing product features is a trivial concern compared to the much larger ones facing our communities and societies, in times of stress it is even more crucial to ensure that any changes to your products or services truly benefit users. A/B testing is uniquely suited to safeguard user experiences.[1]
In the remainder of this post we share guidance for using A/B testing in times of uncertainty. It is framed specifically in terms of A/B testing during Covid-19 disruptions, but much of this guidance applies more generally, since smaller (and therefore harder to detect) external influences on product usage are ever-present.
Is it safe to run A/B tests now?
Even if your team has not restricted deployments, it is still important to be cautious during this period of added stress on users, colleagues, and infrastructure. For the features that you do release, remember that A/B testing is the safe option: it is riskier to release a new feature WITHOUT A/B testing and progressive safe rollout, especially in periods of stress.
I am releasing features focused on ensuring performance and reliability. Should I still run A/B tests on these?
Even bug fixes can have unintended consequences. In Microsoft’s ExP team we have learned that it is always better to measure and understand the impact of a code change before fully deploying it.[2] Use A/B tests to safely roll out and monitor code changes even if you don’t expect them to impact end-user behavior.
I ran an A/B test: are the results valid now?
The values of some metrics will be notably different from their pre-Covid levels. For example, user time on MS-Teams is substantially higher than it was in early March, while Maps directions query usage has likely dropped. Can A/B tests really isolate the impact of the change? Yes: because the two groups in your A/B test were assigned randomly, external factors (like Covid-19) impact the two groups equivalently, and the only systematic difference between the groups is your code change. Proper A/B testing isolates the effect of the new variant, so you can be confident that a detected difference between the variants is related to your implementation.
I ran an A/B test: will the results be valid once “normal” life resumes?
We can’t know. User behavior always changes over time, so it is never certain whether the impact of a feature will persist 2, 3, or 12 months from now. A/B tests help us optimize features for current usage patterns, but rarely have people’s activities, work styles, and concerns changed so drastically. Don’t assume that usage will stabilize into the patterns you observe now, or into the patterns you observed pre-Covid. If you run an A/B test now and decide to fully deploy, consider re-running the A/B test against the old variant as circumstances evolve; re-running it will show whether the treatment effect has substantially changed. Consider this especially if:
- The results were unexpected or contradict earlier testing.
- The feature re-prioritizes certain result categories (e.g. promoting search results for local venues or travel likely has a very different impact now than it would normally).
- Models underlying the feature were trained on user data from this period (e.g., recommendation algorithms trained on recent data).
- User segments vary substantially in their response to the new feature. For example:
- New vs. experienced users. For many products, the current proportion of new users is different than usual. This may affect treatment effect measurements for features that depend on a user’s familiarity with the product, or ones specifically targeted at new users (e.g., a first-run call to action).
- Different geographic regions. Although every region of the globe is impacted by Covid-19, the level of impact and response is not homogeneous. Differences in treatment effect across regions could (but do not necessarily) indicate differences based on pandemic response.
In every case, think deeply and critically about your features, and whether the safeguard of re-running the A/B test is appropriate! (This advice is relevant outside of the Covid-19 crisis as well.)
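If you want to check how uniform the response to a feature is, one simple approach is to estimate the treatment effect separately within each user segment and compare. The sketch below uses entirely synthetic data and hypothetical column names; it is an illustration of the idea, not ExP's analysis pipeline.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical per-user results: assigned variant, a segment label, and the metric.
n = 20_000
df = pd.DataFrame({
    "variant": rng.choice(["A", "B"], size=n),
    "segment": rng.choice(["new_user", "experienced_user"], size=n, p=[0.4, 0.6]),
    "metric": rng.normal(loc=5.0, scale=2.0, size=n),
})
# Simulate a treatment effect that exists only for new users.
df.loc[(df.variant == "B") & (df.segment == "new_user"), "metric"] += 0.3

# Estimate the treatment effect separately within each segment.
for segment, grp in df.groupby("segment"):
    a = grp.loc[grp.variant == "A", "metric"]
    b = grp.loc[grp.variant == "B", "metric"]
    delta = b.mean() - a.mean()
    _, p = stats.ttest_ind(b, a, equal_var=False)
    print(f"{segment:>16}: effect = {delta:+.3f} (p = {p:.3f})")

# If the effect is concentrated in a segment whose share of traffic is unusual
# right now (e.g., new users), the overall result may not hold once usage shifts.
```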
Are there other benefits to A/B testing during this time?
Ensuring we don’t harm reliability, performance, and user experience is the most important function of A/B testing during this time. Even if the treatment effect varies over time, it is not advisable to ship features that harm current usage patterns; it is helpful to ship features that improve current user experience.
This period may also provide an opportunity to learn about users that are currently more active than usual: e.g. to learn whether an experience is more intuitive for new users. Increased usage may also increase the detection power of A/B tests, and increased system stress means you may catch performance degradation that normally impacts a smaller fraction of users.
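As a rough illustration of the point about detection power, the sketch below uses statsmodels' two-sample power calculator to show how a larger number of active users raises the probability of detecting the same small effect. The effect size and user counts are made up for illustration.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.01   # a small standardized effect (Cohen's d), illustrative
alpha = 0.05         # significance level

# Power to detect the same effect at a "normal" vs. an elevated usage level.
for users_per_group in (50_000, 150_000):
    power = analysis.power(effect_size=effect_size, nobs1=users_per_group,
                           alpha=alpha, ratio=1.0)
    print(f"{users_per_group:>7} users per group -> power = {power:.2f}")
```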
My team has a backlog of features we held off on releasing. What’s next?
- Have a plan for A/B testing each feature independently. Suppose your organization releases a number of feature changes simultaneously and you notice an improvement or regression in your metrics: it will be impossible to tease out the impact of each particular feature change. But if you set up each feature change with its own A/B test, you can deploy multiple feature changes at the same time and still be confident in your ability to measure the impact of each individual change (see the sketch after this list).
- Consider a staggered release schedule. If many teams release new features simultaneously, users could be confronted with “hundreds of papercuts” from many changes across your product. A staggered release schedule reduces the impact to end-users.
- Decide whether any of your feature releases should be delayed. If a feature was designed to enhance business or life as it existed before the Covid-19 crisis, consider waiting. It may be best for your customers if you ship only once circumstances more closely resemble those you designed for.
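As mentioned in the first item above, a simple way to keep concurrent feature releases measurable is to give each one its own A/B test with an independent randomization. The sketch below illustrates the idea with a hash-based assignment keyed on a per-experiment salt; the feature names and function are hypothetical, not a real ExP API.

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str) -> str:
    """Stable 50/50 split based on a hash of the user ID and an experiment salt."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 100 < 50 else "B"

# Each feature gets its own experiment with its own salt, so the splits are
# statistically independent: a user's group in one A/B test tells you nothing
# about their group in another, and each effect can be measured on its own.
features = ["new-onboarding-flow", "faster-sync-path", "refreshed-home-page"]
user = "user-12345"
assignments = {feature: assign_variant(user, experiment_salt=feature) for feature in features}
print(assignments)
```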
Clearly these are uncertain times. Make use of A/B testing to reduce the uncertainty around deploying new code.
– Jen Townsend, Senior Data Scientist, Microsoft Experimentation Platform
Learn more about A/B testing to safeguard user experience:
[1] T. Xia, S. Bhardwaj, P. Dmitriev, and A. Fabijan. 2019. Safe Velocity: A Practical Guide to Software Deployment at Scale using Controlled Rollout. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 11–20. DOI: https://doi.org/10.1109/ICSE-SEIP.2019.00010
[2] A. Fabijan, P. Dmitriev, H. Holmstrom Olsson, and J. Bosch. 2017. The Benefits of Controlled Experimentation at Scale. In Proceedings of the 2017 43rd Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 18–26. DOI: https://doi.org/10.1109/SEAA.2017.47
[3] R. Kohavi, D. Tang, and Y. Xu. 2020. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge: Cambridge University Press. DOI: https://doi.org/10.1017/9781108653985
[4] R. Kohavi and S. Thomke. 2017. The Surprising Power of Online Experiments: Getting the Most Out of A/B and Other Controlled Tests. In Harvard Business Review, September–October 2017, 74–82. https://hbr.org/2017/09/the-surprising-power-of-online-experiments