AB Testing (sometimes referred to as Split Testing) is a method of ascertaining the better of two options: each player is assigned to one of two groups, each group is served only one of the options during the testing period, and the outcome is measured. It is a very powerful tool for optimizing the user experience (UX) of your game. Small tweaks can have a big impact on a player’s behavior, especially because the most effective design choices are often counterintuitive.
Let’s say you have a “Buy Space Milk” button in your game, but you’re unsure whether black with white text or white with black text will elicit the stronger response (Figure 4.2). To AB test the options, you first assign your users equally to one of two groups. You decide that players whose user IDs are odd will be in group A and those with even IDs will be in group B. You then serve group A the black button and serve group B the white button. The next day you look at the metrics and find that group A clicked or tapped the button 25 percent more than group B, so you subsequently set the button to black for everyone, lifting clicks, and likely revenue, accordingly.
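The parity-based group assignment and the clickthrough comparison described above can be sketched in a few lines. The click and impression counts here are hypothetical, chosen only to reproduce the 25 percent difference in the example:

```python
def assign_group(user_id: int) -> str:
    """Assign a player to a test group by user ID parity:
    odd IDs go to group A (black button), even IDs to group B (white button)."""
    return "A" if user_id % 2 == 1 else "B"

def click_through_rate(clicks: int, impressions: int) -> float:
    """Fraction of impressions that resulted in a click or tap."""
    return clicks / impressions if impressions else 0.0

# Hypothetical results from the testing period:
ctr_a = click_through_rate(clicks=1250, impressions=10_000)  # black button
ctr_b = click_through_rate(clicks=1000, impressions=10_000)  # white button

# Relative lift of A over B: 0.25, i.e. group A clicked 25 percent more.
lift = (ctr_a - ctr_b) / ctr_b
```

Because assignment depends only on the user ID, each player sees the same variant on every visit, which keeps the groups stable for the duration of the test.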
Figure 4.2. Buy Space Milk AB button options.
The usefulness of the AB testing process can also extend outside of the game to its marketing, testing different creative ads or even a game name. In addition, the process can even be used as the simplest form of MVP (minimum viable product). Mark Pincus, Zynga’s CEO, uses the technique for new game concepts: The company AB tests the marketing proposition, name, and artwork of a game concept in an ad against that of an existing title. This allows Zynga to compare the appeal of the proposed game against that of a known popular title by measuring the CTR (Clickthrough Rate). The more clicks the ad gets per impression, the more appealing it is to players and the more likely it is to find success in the market.
When more than two options are available, the same process is known as multivariate testing, though it is often, erroneously, still referred to as AB testing. Multivariate testing allows you to go much deeper by comparing a range of options. In fact, the number of options you can test is limited only by the time it takes to get a reasonable number of players to play through each one.
Imagine that same “Buy Space Milk” button, but this time you want to test black, white, and gray buttons, each in four different designs, giving you a total of 12 options (Figure 4.3). You again divide your players into 12 groups using sequential user IDs and serve each group a unique button, observing the results and choosing an option after a week. Why a week? Because the groups are smaller, exposure to each button over the same period is less, so testing requires a longer duration to get a reasonable quality of data.
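One simple way to bucket sequential user IDs into 12 stable groups is to take the ID modulo 12 and map each bucket to a color/design pair. This is a minimal sketch; the color and design names are placeholders, not options from the book:

```python
# Hypothetical variants for a 12-option multivariate test:
# 3 colors x 4 designs, indexed 0-11 by user ID modulo 12.
COLORS = ["black", "white", "gray"]
DESIGNS = ["design1", "design2", "design3", "design4"]

def assign_variant(user_id: int) -> tuple[str, str]:
    """Map a user ID to one of the 12 (color, design) combinations.
    Sequential IDs spread evenly across all buckets."""
    bucket = user_id % (len(COLORS) * len(DESIGNS))
    color = COLORS[bucket // len(DESIGNS)]
    design = DESIGNS[bucket % len(DESIGNS)]
    return color, design
```

With roughly sequential IDs, the modulo operation splits players evenly across the 12 variants, satisfying the equal-sample-size requirement discussed below.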
Figure 4.3. Buy Space Milk multivariate button options.
As with any scientific experiment, it is important that you yield results you can be confident in. Data that can be used to prove or disprove something with good certainty is of high quality, whereas inaccurate data is of poor quality. Therefore, it is essential during an AB or multivariate test that you expose the options to enough players that the observed behavior forms a sound basis for a decision. The more tests you run and the more your average absorbs anomalies, the more you can trust your results, a consequence of the law of large numbers (often loosely described as regression towards the mean).
Although there’s a great deal of math that you can use to calculate the certainty (or significance) of your results, it can be complex. However, if during your tests you see big swings in the average, such as the CTR in the Buy Space Milk button example, you can’t be sure your results are accurate. But if you see the CTR vary little over 100,000 impressions, you can be reasonably certain of the results. Additionally, it’s important that the sample size for each group be equal (e.g., each button is shown to the same number of players) so confidence can be equal; otherwise, you may be comparing good data against bad.
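A rough way to see how impressions narrow the uncertainty around a measured CTR is the normal approximation to the binomial proportion. This is a common rule of thumb, not a method prescribed by the text, and the click counts below are hypothetical:

```python
import math

def ctr_confidence_interval(clicks: int, impressions: int, z: float = 1.96):
    """Approximate 95 percent confidence interval for a CTR using the
    normal approximation to the binomial proportion."""
    p = clicks / impressions
    se = math.sqrt(p * (1 - p) / impressions)  # standard error of the proportion
    return p - z * se, p + z * se

# The interval shrinks as impressions grow, so the measured CTR
# can be trusted more:
wide = ctr_confidence_interval(125, 1_000)        # roughly +/- 2 points
narrow = ctr_confidence_interval(12_500, 100_000)  # roughly +/- 0.2 points
```

Both tests measure the same 12.5 percent CTR, but the 100,000-impression interval is about ten times tighter, which matches the intuition that little variation over many impressions means trustworthy results.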
You must also consider your test groups’ histories and how they may impact the data you are hoping to collect. For example, if you want to understand the relationship between price and demand for your IAP, it would be smart to run an AB test. In this case, let’s say you split all of your players evenly: The A group gets the current price of $2.99 and the B group gets the new price of $1.99. The results then show that those in the B group bought 50 percent more of the IAP than those in the A group over a week’s time and revenue increased; therefore, you set your IAP price to $1.99. However, after a month you find that the number of IAPs purchased is back to pre-change levels and revenue is actually less than before. Why?
In the test, those in the B group had been exposed to the old higher price, so their judgment was influenced by a comparison to it. The test actually confirmed that a discounted IAP increased uptake, which tells you little about the price/demand relationship.
If you run the test again for first-time players only and see that group B players buy more IAPs than group A players but only by 5 percent, you can derive a new hypothesis: At $1.99 there are more purchases of an IAP than when its price is at $2.99, but the price reduction results in overall lower revenue.
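The hypothesis above can be checked with quick arithmetic: a 5 percent uptick in purchases does not offset a roughly 33 percent price cut. The purchase rates here are normalized, hypothetical figures matching the example:

```python
# Per-player revenue comparison for the two price points.
price_a, price_b = 2.99, 1.99
purchases_a = 1.00   # baseline purchase rate (normalized)
purchases_b = 1.05   # 5 percent more purchases in group B

revenue_a = price_a * purchases_a   # 2.99 per baseline player
revenue_b = price_b * purchases_b   # about 2.09 per baseline player

# revenue_b / revenue_a is about 0.70: the lower price yields
# roughly 30 percent less revenue despite the extra purchases.
ratio = revenue_b / revenue_a
```

This is why the first-time-player test matters: only with the price anchoring removed does the true, modest price/demand relationship become visible.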
The quality of data is integral to analytics. Poor data from small or ill-suited sample groups can, and over time will, lead you to a false understanding of your players. This false understanding can lead to poor decisions that will harm your game. Yet with good data you can learn a great deal about your players.