
A-B test. Again. Testing a new design.

Author: Evgeny Bodyagin, https://dsprog.pro

Introduction

You need to test the effectiveness of a new web application design. There is the old design “A” and there is a new design “B”. Let’s run an A-B test. Note that there are many variations of A-B tests, and the nuances of the analyzed object or process must be taken into account in each individual case. This material shows the basic steps of an A-B test.

Experiment Design

  • Null Hypothesis: no difference in performance between design A and design B.
  • Alternative Hypothesis: there is a significant difference in performance between design A and design B.
  • Processing Object: two samples of site visitor sessions, mixed randomly. The sample sizes are equal.
  • Target Audience: site visitors over the last 30 days.
  • Metrics: content views (View) and banner clicks (Click).
  • Duration of the test: 30 days.
  • Level of significance: α = 0.05.
  • Campaigns run at the same time, and the audience is divided randomly (see the sketch after this list).
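As an illustration of the random split, here is a minimal sketch; the user IDs and group sizes are hypothetical and not part of the article's code:

import numpy as np

# Hypothetical illustration: divide an audience of user IDs
# into two equal groups, A and B, at random.
rng = np.random.default_rng(42)
user_ids = np.arange(1, 1001)                  # hypothetical audience of 1000 users
shuffled = rng.permutation(user_ids)
group_a, group_b = shuffled[:500], shuffled[500:]
print(len(group_a), len(group_b))              # 500 500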

Data structure

Each row of the data represents information about a user's visits to the web application.

  • User ID (User_ID): unique user identifier.
  • Campaign name (Campaign_Name): the campaign the user saw (design A or design B).
  • View content (View): the number of times the user viewed content on the site.
  • Clicks (Click): the number of clicks on the banner.

The numbers of users who used design A and design B are equal.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import sys

import warnings
warnings.filterwarnings('ignore')
from scipy.stats import mannwhitneyu

alpha = 0.05 

Loading data

alldata = pd.read_csv('input/Xab_group.csv')
df_userAgg = pd.read_csv('input/Xab_group.csv')  # a second copy of the raw data; not used in this published excerpt
alldata.head()
   User_ID Campaign_Name  View  Click
0        1    a campaign   4.0    0.0
1        2    b campaign  12.0    0.0
2        3    a campaign   1.0    0.0
3        4    a campaign   5.0    0.0
4        5    a campaign   8.0    0.0

Check for duplicates and null values

alldata.duplicated().sum()

0

alldata.isnull().sum()

User_ID 0
Campaign_Name 0
View 0
Click 0
dtype: int64

There are no duplicate values in the original data; there are no null values either.
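Had either check found anything, a standard cleanup step would look like the following (a no-op on this dataset):

alldata = alldata.drop_duplicates()  # drop exact duplicate rows, if any
alldata = alldata.dropna()           # drop rows with missing values, if any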

grouped = alldata.groupby('Campaign_Name')
a_group = grouped.get_group('a campaign')
b_group = grouped.get_group('b campaign')
desired_palette = ["#0099cc", "#009900"]
data_all = pd.concat([a_group, b_group])
fig, ax = plt.subplots(ncols=2, figsize=(18, 6))
# Bar plots of total Views and Clicks per campaign (error bars disabled).
ax1 = sns.barplot(data=data_all, x='Campaign_Name', y='View', errorbar=None, ax=ax[0], estimator='sum', palette=desired_palette)
ax2 = sns.barplot(data=data_all, x='Campaign_Name', y='Click', errorbar=None, ax=ax[1], estimator='sum', palette=desired_palette)
# this content is from dsprog.pro
ax1.set_title('View')
ax2.set_title('Click')

plt.show()

a_group_View_sum = a_group['View'].sum()
b_group_View_sum = b_group['View'].sum()
a_group_Click_sum = a_group['Click'].sum()
b_group_Click_sum = b_group['Click'].sum()

print('A group View sum: %s'%(a_group_View_sum))
print('B group View sum: %s'%(b_group_View_sum))
print('A group Click sum: %s'%(a_group_Click_sum))
print('B group Click sum: %s'%(b_group_Click_sum))

A group View sum: 149422.0
B group View sum: 151117.0
A group Click sum: 10423.0
B group Click sum: 12133.0

Judging by the raw totals, design B looks more efficient than design A: both indicators show a positive effect. But… are these findings statistically significant?
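For a quick sense of scale, the relative uplifts follow directly from the sums above:

view_uplift = (b_group_View_sum - a_group_View_sum) / a_group_View_sum
click_uplift = (b_group_Click_sum - a_group_Click_sum) / a_group_Click_sum
print('View uplift: %.1f%%' % (view_uplift * 100))    # about +1.1%
print('Click uplift: %.1f%%' % (click_uplift * 100))  # about +16.4%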

A-B test

We calculate the mean of each indicator for each group, A and B, as well as the difference between the means.

columns = ['View', 'Click']
obs_diffs = {}
for column in columns:
    mean_cont = alldata.loc[alldata['Campaign_Name'] == 'a campaign'][column].mean()
    mean_test = alldata.loc[alldata['Campaign_Name'] == 'b campaign'][column].mean()
    print('Mean of %s "a campaign": %s\n' % (column, round(mean_cont, 3)))
    print('Mean of %s "b campaign": %s\n' % (column, round(mean_test, 3)))

    # Observed difference in means (test minus control), stored for later use.
    obs_dif = mean_test - mean_cont
    obs_diffs[column] = obs_dif
    # this content is from dsprog.pro
    print('Diff means of %s : %s\n' % (column, round(obs_dif, 3)))
    print("**************")

Mean of View “a campaign”: 4.981
Mean of View “b campaign”: 5.037
Diff means of View : 0.056

Mean of Click “a campaign”: 0.347
Mean of Click “b campaign”: 0.404
Diff means of Click : 0.057

Let’s apply the bootstrap technique to estimate the variability of the difference in mean values between the groups “a campaign” and “b campaign”, repeatedly drawing random samples with replacement and calculating the difference in means. We then plot histograms of the results.

columns = ['View', 'Click']
boxes_data = {}

for column in columns:
    n_iterations = 10000
    group_control = alldata[alldata['Campaign_Name'] == 'a campaign'][column].values
    group_test = alldata[alldata['Campaign_Name'] == 'b campaign'][column].values

    diffs_view = np.empty(n_iterations)
    np.random.seed(200)
    # this content is from dsprog.pro
    for i in range(n_iterations):
        # Resample with replacement; the same index set is reused for both
        # groups, which works here because the group sizes are equal.
        random_indices = np.random.choice(len(group_control), len(group_control), replace=True)
        sample_control = group_control[random_indices]
        sample_test = group_test[random_indices]

        diffs_view[i] = np.mean(sample_test) - np.mean(sample_control)

    boxes_data[column] = diffs_view
    # Bounds of the central 95% of the bootstrap distribution.
    low, high = np.percentile(diffs_view, 2.5), np.percentile(diffs_view, 97.5)

    plt.figure(figsize=(6, 5))
    plt.xlabel(column + ': proportions of differences', fontsize=15)
    plt.ylabel(column + ': frequency', fontsize=15)
    plt.axvline(x=low, color='r', linewidth=2)
    plt.axvline(x=high, color='r', linewidth=2)
    plt.hist(diffs_view)
    plt.show()
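The red lines mark the central 95% interval of the bootstrap distribution. A quick way to read it: if zero lies inside the interval, the observed difference is not significant at α = 0.05. A minimal check over the stored distributions (an added snippet, not from the original notebook):

# Does the 95% bootstrap interval contain zero?
for column, diffs in boxes_data.items():
    low, high = np.percentile(diffs, 2.5), np.percentile(diffs, 97.5)
    print('%s: 95%% CI [%.3f, %.3f], contains zero: %s' % (column, low, high, low <= 0 <= high))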

For each indicator, we compare the p-value with the alpha level to decide whether to reject the null hypothesis.
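The code that computes the p-values below is part of the full materials and is not published here. Given the mannwhitneyu import at the top of the notebook, one plausible reconstruction is a Mann-Whitney U test per metric; the stored obs_diffs and the bootstrap distributions would also support a bootstrap-style p-value. Treat this sketch as an assumption, not the author's exact code:

for column in columns:
    # Hypothetical reconstruction: Mann-Whitney U test on each metric.
    stat, p_value = mannwhitneyu(a_group[column], b_group[column])
    print('The p-value for %s is: %s' % (column, round(p_value, 4)))

Whatever the exact computation, the published results are: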

The p-value for View is: 0.1218
Unable to reject the Null Hypothesis: no statistically significant difference between “a group” and “b group” (0.1218 > α) on View.

The p-value for Click is: 0.0
Reject the Null Hypothesis: statistically significant difference between “a group” and “b group” (0.0 < α) on Click.

Conclusions

For the View indicator, we cannot reject the 𝐻0 hypothesis: the new design did not bring a statistically significant change in the rate of page views by users of the web application. However, the new design had a statistically significant impact on the click rate. Thus, overall, there is a significant difference in performance between design A and design B.

======================
For the full content of this article, or for cooperation with the author, write on Telegram: cryptosensors.
The full content of this article includes files/data:

  • Jupyter Notebook file (ipynb),
  • dataset files (csv or xlsx),
  • unpublished code elements, nuances.

Distribution of materials from dsprog.pro is warmly welcomed (with a link to the source).