Understand the math behind it all: Normal Distribution
If the N-Armed bandit problem is the core struggle of each testing program, then the normal distribution and the related central limit theorem are the windmill that groups use to justify their attempts to solve the N-Armed bandit problem. The central limit theorem is something that a lot of people have experience with from their high school and college days, but very few people appreciate where and how it fits into real-world problems. It can be a great tool to act on data, but it can also be used blindly to make poor decisions in the name of being easily actionable. Using a tool well requires you to understand both its advantages and disadvantages, because without context you really achieve nothing. With that in mind, I want to present the normal distribution as the next math subject that every tester should understand much better.
The first thing to understand about the normal distribution is that it is only one type of distribution. Sometimes called a Gaussian distribution, the normal distribution is easily identifiable by its bell curve. The normal distribution comes into play because of the central limit theorem, which states that, given a sufficiently large number of independent random variables with continuous outcomes, the distribution of their mean will approximate a normal distribution. To put it another way, if you take any population of people, and they are independent of each other, then the means of unbiased samples of them will eventually settle into this attractor distribution, so that you can measure a mean and a standard deviation. This gives you the familiar giant clumping of data points around the mean, and as you move farther and farther away from that point, the density of the data falls off in a very predictable way. It guarantees that for an unbiased collection done over a long period of time, the mean will reach a normal distribution, but in any biased or limited data set, you are unlikely to have a perfectly normal distribution.
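If you would rather see the theorem in action than take it on faith, here is a minimal sketch (assuming Python with NumPy; the exponential source population and the specific sample sizes are arbitrary choices for illustration). As the sample size grows, the sample means clump around the true mean and the skew of the source data washes out:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_mean_distribution(sample_size, n_samples=5_000):
    """Draw many independent samples from a deliberately skewed (exponential)
    population and return the mean of each sample."""
    samples = rng.exponential(scale=10.0, size=(n_samples, sample_size))
    return samples.mean(axis=1)

for n in (2, 30, 500):
    means = sample_mean_distribution(n)
    standardized = (means - means.mean()) / means.std()
    # As the sample size grows, the means cluster around the true mean (10)
    # and the skewness heads toward zero -- the bell curve emerging.
    print(f"sample size {n:>3}: mean of means = {means.mean():6.2f}, "
          f"std of means = {means.std():5.2f}, skewness = {np.mean(standardized**3):+5.2f}")
```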
The reason that we love these distributions is that they are the easiest to understand and have very common, easy-to-use assumptions built into them. Schools start with these because they allow an introduction into statistics and are easy to work with, but just because you are familiar with them does not mean the real world always follows this pattern. We know that over time, if we collect enough data in an unbiased way, we will always reach this type of distribution. It allows us to infer a massive amount of information in a short period of time. We can look at the distribution of people to calculate p-values, we can see where we are in a continuum, and we can easily group and attack larger populations. It allows us to present and tackle data with a variety of tools and an easy-to-understand structure, freeing us to focus on using the data, not figuring out what tools are even available to us. Because of this, schools spend an inordinate amount of time in classes presenting these problems to people, without informing them of the many real-world situations where they may not be as actionable.
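As a concrete illustration of that convenience, here is a small sketch (assuming Python with SciPy; the revenue-per-visitor numbers are invented for the example) of how, once you assume normality, a single observation can be placed on the continuum and assigned a p-value in a couple of lines:

```python
from scipy.stats import norm

# Hypothetical metric: revenue per visitor, assumed (via the central limit
# theorem) to be roughly normal with the mean and spread observed so far.
observed_mean, observed_std = 3.20, 0.45
todays_value = 4.10

z = (todays_value - observed_mean) / observed_std   # where we sit in the continuum
percentile = norm.cdf(z)                            # share of days expected below this value
p_value = 2 * norm.sf(abs(z))                       # two-sided: how surprising is this by chance?

print(f"z = {z:.2f}, percentile = {percentile:.1%}, two-sided p-value = {p_value:.3f}")
```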
The problem comes when we force data into this distribution when it does not belong, just so that we can make those assumptions and act on a single measure of the “value” of the outcome. When you start trying to apply statistics to data, you must always keep in mind the quote from William Watt: “Do not put your faith in what statistics say until you have carefully considered what they do not say.”
There are a number of problems with trying to force real-world data into a normal distribution, especially over any short period of time.
Here is just a quick sample of real-world influences that can cause havoc when trying to apply the central limit theorem:
- Data has to be representative – Just because you have a perfect distribution of data for Tuesday at 3am, that has little bearing on whether it is representative of Friday afternoon.
- Data collection is never unbiased, as you cannot observe a negative action in an online context. Equally, you will have different propensities of action from each unique group, and an unequal collection of those groups, with nothing to even things out.
- We are also stuck with a data set that is constantly shifting and changing, from both internal and external changes over time. The more data we gather, the more time we take, and the time we take to acquire that data means the data from the earlier gathering period becomes less representative of current conditions.
- We have great but not perfect data capturing methods. We use representations of representations of representations. No matter what data acquisition technology you use, there are always going to be mechanical issues which add noise on top of the population issues listed above. We need to focus on precision, not become caught in the accuracy trap.
- We subconsciously bias our data, through a number of fallacies, which leads to conclusions that have little bearing on the real world.
In most real-world situations, we more closely resemble a multivariate distribution than a normal distribution. What this leaves us with is very few cases in the real world that reach the point where we can use the normal distribution with complete faith, especially over any short period of time. Using it and its associated tools with blind loyalty can lead groups to make misguided assumptions about their own data, and lead to poor decision making. It is not “wrong”, but it is also not “right”. It is simply another measure of the meaning of a specific outcome.
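One way to see why the aggregate rarely behaves is to mix even two well-behaved segments. The sketch below (Python with NumPy and SciPy; the two segments and their parameters are purely hypothetical) shows that while each segment is normal on its own, the blended traffic we actually measure fails a basic normality test:

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(7)

# Two hypothetical visitor segments with very different behavior, e.g. a
# low-spending weekday segment and a high-spending weekend segment.
segment_a = rng.normal(loc=20, scale=5, size=7_000)   # normal on its own
segment_b = rng.normal(loc=80, scale=10, size=3_000)  # normal on its own

combined = np.concatenate([segment_a, segment_b])     # the mix is what we measure

# D'Agostino-Pearson test: the blended population is emphatically not normal,
# even though every underlying segment is.
stat, p = normaltest(combined)
print(f"mean = {combined.mean():.1f}, std = {combined.std():.1f}, "
      f"normality test p-value = {p:.2e}")
```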
Even if the central limit theorem worked perfectly in real-world situations, you would still have to deal with the difference between statistical significance and significance. Just because something is not due to noise does not mean that it answers the real question at hand. There is no magical solution that removes the need for an understanding of the discipline of testing, nor for the design of tests that answer questions instead of just picking the better of two options.
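To make that distinction concrete, here is a hedged sketch (Python with SciPy; the traffic and conversion numbers are invented) of a pooled two-proportion z-test where a lift of a tenth of a percentage point comes out “statistically significant” simply because the sample is enormous, which says nothing about whether the change is worth making:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return p_a, p_b, 2 * norm.sf(abs((p_b - p_a) / se))

# Hypothetical numbers: a 0.1 percentage-point lift over half a million visitors per arm.
p_a, p_b, p_value = two_proportion_p_value(25_000, 500_000, 25_500, 500_000)
print(f"control {p_a:.2%} vs variant {p_b:.2%}: p-value = {p_value:.3f}")
# Statistically significant (p < 0.05), yet the lift may be worth far less than
# the cost of building and maintaining the change.
```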
So how then can we use this data?
The best answer is to understand that there is no “perfect” tool to make decisions. You are always going to need multiple measures, and some human evaluation, to improve the accuracy of a decision. A simple look at the graph and having good rules around when you look at or leverage statistical measures can dramatically improve their value. Not just running a test because you can, and instead focusing on understanding the relative value of actions, is going to ensure you get the value you desire. Statistics is not evil, but you cannot just have blind faith. Each situation and each data set represents its own challenge, so the best thing you can do is focus on the key disciplines of making good decisions. These tools help inform you, but they are not meant to replace discipline and your ability to interpret the data and the context for the data.