In the simplest terms, the art of web analytics is about finding meaningful differences in website visitor behavior and then exploiting those differences to make better marketing decisions.
Of course, there is the question of what a “meaningful difference” means. A “meaningful difference” is a difference that is not due to pure random chance. Web analytics data has a high degree of randomness. Accordingly, what appears to be a difference between segments may not be a difference at all.
So, then, how can we distinguish between pure randomness and a meaningful difference? Some statistical analysis is often helpful. But what kind of statistical analysis? Bayesian approaches are often very useful, and perhaps better than traditional Frequentist statistics.
Bayesian approaches to statistical inference offer several advantages over traditional Frequentist statistics. If you are not sure what Bayesian statistics means, there are plenty of good introductions available online.
Ok, let us explain what we mean using some hypothetical data. Suppose we have the data shown in Table 1.
| Category | Took The Desired Action | Did Not Take The Desired Action |
| --- | --- | --- |
| Category One | 5000 | 100000 |
| Category Two | 3000 | 100000 |
In Table 1, we have Category One and Category Two. Think of them as two segments of interest. By just eyeballing the data in Table 1, we can probably tell that there is a meaningful difference between Category One and Category Two; the difference is not due to pure chance. We can draw this conclusion because we have lots of data points: Table 1 shows over 100,000 data points for each category. With that amount of data collected, we can likely conclude that there is a meaningful difference between the two categories.
So, if you are the owner of Really Big Internet Company, then doing a statistical test on the data set in Table 1 is probably not worth your time.
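To see just how obvious the Table 1 difference is, here is a quick back-of-the-envelope check (our illustration, not part of the original analysis): a pooled two-proportion z-test using only the Python standard library. The totals come from Table 1 (5,000 of 105,000 versus 3,000 of 103,000).

```python
import math

# Table 1 data: successes and total observations per category
y1, n1 = 5000, 5000 + 100000   # Category One
y2, n2 = 3000, 3000 + 100000   # Category Two

p1, p2 = y1 / n1, y2 / n2      # observed proportions
pooled = (y1 + y2) / (n1 + n2) # pooled proportion under the null hypothesis
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se             # z-statistic; here it is enormous (around 22)
```

With a z-statistic this large, any test declares the difference real, which is why running a formal test on data this plentiful adds little.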
But suppose you are not the owner of Really Big Internet Company. Instead, suppose you are the owner of Ma and Pa Shop, which has a website that doesn’t get thousands of visitors per day. At Envision Analytics, much of our work is with Ma and Pa Shops, not with Really Big Internet Companies.
Table 2 shows a data set that Ma and Pa Shop might realistically have:
| Category | Took The Desired Action | Did Not Take The Desired Action |
| --- | --- | --- |
| Category One | 20 | 200 |
| Category Two | 6 | 200 |
The Ma and Pa Shop data set doesn’t contain 100,000 data points for each category. Instead, each category has a couple of hundred data points. With a couple of hundred data points for each category, can we be sure that there is a meaningful difference between Category One and Category Two? Probably not.
To see whether there is a meaningful difference between the two categories in Table 2, some statistical analysis would be helpful. Typically, with this sort of data, the Frequentist approach (the stats you probably learned in school, if you took a stats course) would use some sort of z-test (or perhaps a t-test).
But, there are a few problems with the Frequentist approach.
One problem with the Frequentist z-test approach is that you are supposed to decide how much data you will collect BEFORE you run the test. The Bayesian approach does not require you to decide how much data to collect before doing statistical inference.
In addition, as we accumulate more data, we often want to perform statistical tests multiple times to see if there is any meaningful difference between categories. However, if you repeatedly run tests under a Frequentist approach, hunting for a suitable p-value, you inflate the chances of what is known as a Type I error (a false positive). In short, repeatedly applying a z-test (or any other Frequentist hypothesis testing procedure, for that matter) as you collect more data is VERY bad practice.
The Bayesian approach has none of these problems. With the Bayesian approach, there is no need to fix the amount of data you will collect before performing statistical analysis. Also, multiple testing is not a problem for the Bayesian approach. The reason is that, unlike typical Frequentist approaches, Bayesian analysis conditions on the data actually observed.
Those trained in Frequentist statistics might point out that Frequentist statistics has methods for correcting for multiple testing and for conducting post-hoc inference. However, these procedures have issues of their own. Ultimately, Frequentist procedures don’t really answer what we are interested in. A Frequentist p-value does NOT tell us the plausible values of the parameter of interest. In our example, the parameter of interest is the difference between the proportions of Category One and Category Two. The Bayesian approach tells us exactly what we really want to know: given the available data, what is the difference in proportions between Category One and Category Two?
Ok, let’s get back to our example. Using the data from the Ma and Pa Shop, we can compute the values shown in Table 3:
| Category | % Taking Desired Action | 2.5% Quantile | 97.5% Quantile |
| --- | --- | --- | --- |
| Category One | 9.1% | 5.9% | 13.5% |
| Category Two | 2.9% | 1.4% | 6.2% |
So, how did we compute the values in Table 3? We used a simple beta-binomial model for each category, and we can use R, an open-source statistical package, to do the computations quickly for us. Here is the script:
yes = 20 # total number taking the desired action (Category One shown; substitute your own numbers)
number = 220 # total number of visits (substitute your own numbers)
a = yes + 1 # the alpha parameter of the posterior distribution
b = number - yes + 1 # the beta parameter of the posterior distribution
I = 10000 # the number of samples to draw from the posterior distribution
interval <- rbeta(I, a, b) # rbeta draws samples from the beta distribution, which is the posterior
quantile(interval, c(0.025, 0.975)) # use the quantile function to compute a 95% posterior interval
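If R is not handy, the same calculation can be sketched in Python using only the standard library (`random.betavariate` is the stdlib beta sampler); this is our equivalent of the R script above, shown here with Category One’s numbers from Table 2.

```python
import random

random.seed(42)  # fix the seed so the simulation is reproducible

yes, number = 20, 220        # Category One: successes and total visits
a = yes + 1                  # alpha parameter of the posterior beta distribution
b = number - yes + 1         # beta parameter of the posterior beta distribution
I = 10000                    # number of posterior samples to draw

draws = sorted(random.betavariate(a, b) for _ in range(I))
lo, hi = draws[249], draws[9749]  # approximate 2.5% and 97.5% quantiles
```

Running this reproduces, up to simulation noise, the Category One row of Table 3: an interval of roughly 5.9% to 13.5%.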
Ok, let us explain a little what the values in Table 3 mean. The second column (labeled % Taking Desired Action) shows the proportions computed from the Table 2 data for each category. Category One has a computed proportion of 9.1% (simply 20/220) and Category Two has a proportion of 2.9% (6/206). But here is the thing: how certain are we that these proportions mean anything? After all, we are dealing with random data. Also, what range could the true proportions plausibly fall in? We can answer these questions by computing what is known as a posterior interval. What we have done here is compute a 95% posterior interval. At the lower end (the 2.5% quantile), we have a value of 5.9% for Category One and 1.4% for Category Two. At the high end (the 97.5% quantile), we have 13.5% for Category One and 6.2% for Category Two.
Our R script computed the posterior interval for both categories. In a nutshell, the script draws 10,000 samples from a beta distribution (which is the posterior, under our model) and stores them in the variable “interval”. The quantile function then gives us the 2.5% and 97.5% quantiles. We ran the script once for Category One and once for Category Two.
Just looking at this table, and without doing anything further, we can probably be pretty comfortable that, yes, there is a meaningful difference between Category One and Category Two. However, you might notice that the two posterior intervals overlap somewhat. Accordingly, maybe you need a little more proof that there is a real difference between the two populations.
We can easily compute the differences. First, here is a histogram of the computed differences:
By just eyeballing our histogram, we can be pretty confident that the difference between Category One and Category Two is most likely not zero. But we can become a little more confident by computing the 95% posterior interval. Table 4 shows the posterior interval:
| 2.5% Quantile | 97.5% Quantile |
| --- | --- |
| 1.7% | 10.9% |
What Table 4 shows is that we can be 95% sure that, based on the data we have (and assuming our statistical model is reasonably realistic), the difference between Category One and Category Two is somewhere between 1.7% and 10.9%.
Table 4 was computed by assuming that each category can be modeled as a binomial random variable. Because we are doing a Bayesian analysis, we need to specify priors, and an appropriate prior here is a beta distribution. We assumed the two beta priors are independent of each other, an assumption that is valid for many applications. For each beta prior, we set both alpha and beta to 1, which gives a uniform prior: we are not making strong assumptions about the value of either binomial proportion.
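To spell out why the scripts use y + 1 and n − y + 1 as the posterior parameters (a standard conjugacy result, filled in here for completeness): with a Beta(1, 1) prior and a binomial likelihood for y successes out of n trials,

```latex
p(\theta \mid y) \;\propto\; \underbrace{\theta^{y}(1-\theta)^{n-y}}_{\text{binomial likelihood}}
\;\times\; \underbrace{\theta^{1-1}(1-\theta)^{1-1}}_{\mathrm{Beta}(1,1)\ \text{prior}}
\;=\; \theta^{(y+1)-1}(1-\theta)^{(n-y+1)-1}
```

which is exactly the kernel of a Beta(y + 1, n − y + 1) distribution, so the posterior can be sampled directly with rbeta.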
The computation is not really difficult. Again, we can use an R script to do the work for us. Here is the script:
n1 = 220 # total number of Category One visits
y1 = 20 # number of Category One visits taking the desired action
n2 = 206 # total number of Category Two visits
y2 = 6 # number of Category Two visits taking the desired action
I = 10000 # number of simulations
theta1 = rbeta(I, y1+1, (n1-y1)+1) # make I draws from the Category One posterior
theta2 = rbeta(I, y2+1, (n2-y2)+1) # make I draws from the Category Two posterior
diff = theta1 - theta2 # computed difference between the Category One and Category Two proportions
quantile(diff, c(0.025, 0.975)) # compute the 95% posterior interval of the difference
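For completeness, here is the same simulation sketched with the Python standard library. The extra quantity `prob_positive` (our addition, not reported in the post) estimates the posterior probability that Category One’s proportion really is higher than Category Two’s.

```python
import random

random.seed(7)  # fix the seed so the simulation is reproducible

I = 10000                          # number of simulations
n1, y1 = 220, 20                   # Category One: total visits, successes
n2, y2 = 206, 6                    # Category Two: total visits, successes

# Draw from each category's posterior, Beta(y + 1, n - y + 1)
theta1 = [random.betavariate(y1 + 1, n1 - y1 + 1) for _ in range(I)]
theta2 = [random.betavariate(y2 + 1, n2 - y2 + 1) for _ in range(I)]

diff = sorted(t1 - t2 for t1, t2 in zip(theta1, theta2))
lo, hi = diff[249], diff[9749]     # approximate 95% posterior interval
prob_positive = sum(d > 0 for d in diff) / I  # P(Category One > Category Two | data)
```

Up to simulation noise, the interval matches Table 4 (about 1.7% to 10.9%), and `prob_positive` comes out well above 99%, a direct answer to the question the z-test only addresses indirectly.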
Ok, those well versed in Bayesian stats who also know something about web analytics data might object to our technique. The basis of the objection would be that valid Bayesian statistical models rely on the concept of exchangeability. Most likely, Category One and Category Two visitors are composed of sub-populations (in web analytics lingo, “segments”) that are not exchangeable. Hence, the objection goes, our little beta-binomial models are not valid.
So ok, here is our reply. Our main concern is exploiting data as quickly as possible to drive better decision making, not getting into deep epistemological hand-wringing over exchangeability. We can all agree that web analytics data is a stochastic process, which means that what appears to be a “trend” may not be a trend at all, but just random chance. We need some way of separating real differences from random noise, and using our simple beta-binomial models is better than just guessing. What matters to us is that the model works well in practice, and simple beta-binomial models often do work reasonably well, even if they are not entirely correct. We can always switch to another model if this one is not reasonable under the circumstances.
Bayesian statistics are very helpful in determining whether there are “meaningful differences” in web analytics data. The Bayesian approach has significant advantages over Frequentist methods.
Our examples used two categories. In many situations, it is often desirable to compare more than two categories. There are Bayesian approaches to comparing more than two categories. In future posts, we will write about them. In this post, our intent was to give a simple example to show how Bayesian approaches can be used to analyze web analytics data. Hopefully, we got you interested in learning Bayesian Statistics, so you can better analyze your own web analytics data.