How to run better and more intuitive A/B tests using Bayesian statistics

Robbie Geoghegan
Published in TDS Archive · 7 min read · Jan 3, 2021

Why you should use Bayesian A/B Tests instead of traditional approaches

A/B testing is one of the most useful statistical techniques used in technology, marketing and research today. Its value is that A/B testing allows you to determine causation, while most analysis only uncovers correlation (recall the old adage that correlation does not imply causation). Despite the power and prevalence of A/B testing, the vast majority of tests follow a single methodology: t-tests drawn from the frequentist school of statistics. This note walks through an alternative approach using the Bayesian school of statistics. It returns more intuitive results than the traditional, frequentist approach, along with some useful additional insights.

The traditional, frequentist approach uses hypotheses as the framework for an A/B test. The null hypothesis is often the status quo, e.g. the mean of A is equal to the mean of B, and the alternative hypothesis tests whether there is a difference, e.g. the mean of A is larger than the mean of B. A significance level, e.g. 5%, is selected and the experiment can have one of two conclusions:

  1. We reject the null hypothesis and accept the alternative hypothesis with 95% confidence, e.g. mean of A is larger than mean of B, or
  2. We do not reject the null hypothesis with 95% confidence, i.e. we can make no conclusion about the difference in means between A and B.

This sort of language isn’t how we tend to speak in business and can be difficult to understand for people less familiar with A/B testing. In particular, the second conclusion doesn’t provide much insight; having spent time and money to run a test, you are only able to conclude that no conclusion is possible. (For more information about this approach, check out my previous post about implementing A/B tests in Python.)

The Bayesian approach instead focuses on probabilities. Testing the same example as above, with the null hypothesis being that the mean of A equals the mean of B, the Bayesian method calculates the estimated difference in means as well as the probability that one is larger than the other, rather than just whether the difference in means is 0. In my opinion, the Bayesian approach is superior to the frequentist approach because, instead of a binary reject/fail-to-reject decision, it attaches a specific probability to one mean being larger than the other. This makes for more useful recommendations. Two example conclusions (analogous to the frequentist conclusions) are:

  1. Mean A has a 99% probability of being larger than Mean B (this example would have rejected the null hypothesis)
  2. Mean A has a 65% probability of being larger than Mean B (this example would not have rejected the null hypothesis)

This sort of language gives a sense of how probable the conclusions are, so decision makers are empowered to choose their own risk tolerance. It also avoids situations where the null hypothesis cannot be rejected and no conclusion can be drawn.

Even more useful is that the Bayesian approach calculates the estimated difference between the means. Together, this means a possible conclusion from a Bayesian test is “Mean A is estimated to be 0.8 units larger than Mean B, and there is an 83% probability Mean A is larger than Mean B”. As a bonus, the Bayesian approach also enables comparisons between the variances of A and B and is inherently robust to outliers.
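Mechanically, both of these conclusions are simple to compute once you have draws from the posterior distributions: the estimated difference is the average of the draw-by-draw differences, and the probability is just the fraction of draws in which the mean of A exceeds the mean of B. A minimal base-R sketch, using simulated normal draws in place of real MCMC output:

```r
# Simulated posterior draws standing in for real MCMC output
set.seed(42)
mu_a <- rnorm(10000, mean = 5.2, sd = 0.5)  # posterior draws for the mean of A
mu_b <- rnorm(10000, mean = 5.0, sd = 0.5)  # posterior draws for the mean of B

# Estimated difference in means: average of the draw-by-draw differences
mean(mu_a - mu_b)

# Probability that Mean A is larger than Mean B: share of draws where it is
mean(mu_a > mu_b)
```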

The drawback of the Bayesian approach is that the mathematics underpinning it is more challenging. A good understanding of Bayesian statistics and Markov chain Monte Carlo sampling is helpful but not strictly necessary.

The following sections walk through an example of how to use the Bayesian approach for A/B testing, with code examples in R.

Overview of data for the A/B test

To demonstrate the Bayesian approach I’ll use data from a set of surveys I ran in early 2020. The surveys comprised 13 questions across 3 themes: respondents’ opinions on the measures implemented to combat Coronavirus (4 questions), respondents’ approval of the government response to Coronavirus (3 questions), and general household activity (5 questions). A full list of questions is included here. For this example we’ll focus on questions that had a numerical answer, such as “How many hours a day are you spending with your family members or roommates?”

The surveying was designed to comprise 6 similar but distinct survey versions. The purpose of running these slightly different surveys was to A/B test whether the differences between them caused statistically different results. The difference between each survey was either the ordering of questions or whether questions were phrased in a positive or negative way. An example of a positively vs negatively worded question is:

  • Positive: For how long do you expect the government advised social distancing to last beyond today?
  • Negative: For how long do you expect the government mandated social distancing to last beyond today?

The table below shows a summary of the different survey versions. 291 survey responses were recorded in total comprising 45–47 responses for each of the survey versions. This means the results of Survey 1 can be compared with Survey 3 and Survey 5 for differences in ordering and with Survey 2 for differences in wording.

Overview of Survey Versions

Bayesian Analysis

The following analysis is largely based on the 2012 research paper “Bayesian Estimation Supersedes the t Test” by Kruschke and the accompanying R package BEST. Code is available on my GitHub.

This Bayesian technique, as with any Bayesian estimation, draws on a set of prior beliefs that are updated with evidence from data to return a set of posterior distributions. The following analysis uses a t distribution and a Markov chain Monte Carlo algorithm per Kruschke (2012), with a noncommittal prior that has minimal impact on the posterior distributions. This is useful for this study because there was no baseline or set of prior beliefs to compare against. The methodology is also effective for managing outliers and required adjusting only one data point that was an error.

If the previous paragraph is a bit complex, don’t worry: you can still go through the steps below and get an easy-to-interpret output. To learn more, read the paper by Kruschke.

Step 1: Load packages and data

The first step is to install the required packages. We will use the BEST package, which depends on the JAGS package, so download and install JAGS first, then install BEST. Once that is done, load the packages.

Also load the data and set it up for analysis. We’re using survey_data_v2.csv, which can be found here.

Load packages and data
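The original code image is not reproduced here, but a minimal sketch of this step might look like the following (the column layout of survey_data_v2.csv is an assumption; adjust to match the actual file):

```r
# JAGS must be installed at the system level before BEST will work:
# https://mcmc-jags.sourceforge.io
install.packages("BEST")  # installs the BEST package (and rjags)

library(BEST)

# Load the survey responses; assumes one row per respondent, with a
# survey-version column and one numeric column per question
survey_data <- read.csv("survey_data_v2.csv")
str(survey_data)
```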

Step 2: Create the function for Bayesian analysis

Next we create a function that lets us choose which survey versions to compare and on which survey question to compare them. The function runs Markov chain Monte Carlo sampling to construct the posterior distribution for our test, i.e. the probability that one mean is larger than the other and the estimated difference in means.

Create the function for Bayesian analysis
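A sketch of such a function, wrapping BESTmcmc() from the BEST package (the column name survey_version and question columns like "q2" are assumptions about the data layout):

```r
# Run a Bayesian A/B test on one question across two survey versions.
# 'question' is the name of a numeric response column, e.g. "q2".
run_bayes_ab_test <- function(data, version_a, version_b, question) {
  group_a <- data[data$survey_version == version_a, question]
  group_b <- data[data$survey_version == version_b, question]

  # MCMC sampling of the posterior for Kruschke's two-group BEST model
  BESTmcmc(group_a, group_b)
}
```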

Step 3: Run test

Finally, select the two sets of data to compare. In this example we compare question 2 between survey versions 1 and 2. Change the function arguments to test different surveys and questions.

Run test
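A sketch of this step, assuming a wrapper function like the hypothetical run_bayes_ab_test() described in Step 2:

```r
# Compare question 2 between survey versions 1 and 2; the wrapper name
# and column names are assumptions
best_out <- run_bayes_ab_test(survey_data,
                              version_a = 1,
                              version_b = 2,
                              question  = "q2")

# Full panel of posterior plots, including the difference of means
plotAll(best_out)
```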

Step 4: Interpret output

Running the above code opens a plot window with the following output. It mainly shows histograms of 100,000 credible parameter-value combinations that are representative of the posterior distribution.

Output from Bayesian A/B Test

The most important output for A/B testing is the mid-right distribution, which shows the difference of means. For our example, it shows that on average Mean A is 0.214 units larger than Mean B and that Mean A has an 82.9% probability of being larger than Mean B. This result is the main conclusion of the A/B test. Note that a traditional t-test would simply have returned the result that we cannot reject the null hypothesis at the 95% confidence level.

The other panels show additional information useful for interpreting the data. The two upper-right plots (with y on the axis) show the actual distributions of the test data. The other figures show posterior distributions: the five histograms on the left-hand side show the posteriors of the individual model parameters, and the bottom-right plots show comparisons between groups A and B.
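The headline numbers can also be pulled directly from the MCMC object rather than read off the plot. A sketch, assuming best_out is the object returned by BESTmcmc() in Step 3:

```r
# Posterior draws of the two group means are stored as mu1 and mu2
diff_means <- best_out$mu1 - best_out$mu2

mean(diff_means)      # estimated difference of means
mean(diff_means > 0)  # probability that Mean A is larger than Mean B

# Text summary of all model parameters
summary(best_out)
```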

Key takeaways

The Bayesian approach to A/B testing has three key benefits over the traditional, frequentist approach:

  1. It returns a more intuitive set of results, e.g. Mean A has an 82.9% chance of being larger than Mean B.
  2. It includes the size of the difference between A and B, e.g. Mean A is estimated to be 0.214 units larger than Mean B.
  3. It is not constrained by results where the null hypothesis cannot be rejected.

These benefits combine for more useful and intuitive recommendations that empower decision makers to better understand test results and select their own level of risk.

Support me by buying my children’s book: mybook.to/atozofweb3

References

[1] Kruschke, John K. “Bayesian Estimation Supersedes the t Test.” Journal of Experimental Psychology, vol. 142, no. 3, 2012, pp. 573–603, accessed 03 January 2021, <https://cran.r-project.org/web/packages/BEST/vignettes/BEST.pdf>

[2] Gallo, Amy, 2017, A Refresher on A/B Testing, Harvard Business Review, accessed 03 January 2021, <https://hbr.org/2017/06/a-refresher-on-ab-testing>

[3] Hussain, Noor Zainab and Sangameswaran, S., 2018, Global advertising expenditure to grow 4.5 percent in 2018: Zenith, Reuters, accessed 03 January 2021, <https://www.reuters.com/article/us-advertising-forecast/global-advertising-expenditure-to-grow-4-5-percent-in-2018-zenith-idUSKCN1M30XT>

[4] Lavorini, Vincenzo, Bayesian A/B Testing with Python: the easy guide, Towards Data Science, accessed 03 January 2021, <https://towardsdatascience.com/bayesian-a-b-testing-with-python-the-easy-guide-d638f89e0b8a>

[5] Mazareanu, E., 2019, Market research in U.S. — Statistics & Facts, Statista, accessed 03 January 2021, <https://www.statista.com/topics/4974/market-research-in-us/>

[6] NSS, 2016, Bayesian Statistics explained to Beginners in Simple English, Analytics Vidhya, accessed 03 January 2021, <https://www.analyticsvidhya.com/blog/2016/06/bayesian-statistics-beginners-simple-english/>
