The Big Data High Horse (or Why Most Big Data Projects Fail)

Raschin Fatemi
Jun 15, 2021

This post was written in 2015.

“What we mean by information — the elementary unit of information — is a difference which makes a difference.” (Gregory Bateson, 1972)

Over the past few years, new data-infrastructure technologies have enabled all types of businesses to access, explore, transform, and display their web analytics and customer data. Data now seems to be at the core of every successful business, and everyone is eager to become data-driven. From healthcare services to e-commerce, businesses have started collecting customer data and measuring the impact of every change on customer behavior. By pouring in a wide stream of data on all intentional and unintentional customer behavior, these technologies have steadily broadened the number of ways an event can be described. The challenge for business users, however, is to pick out, from a gazillion possible descriptions, the piece of information that points to the root cause of an event with sufficient confidence.

Data, if used effectively, provides an objective mechanism for evaluating hypotheses about the decision that needs to be made. In information theory, the usefulness of a piece of knowledge is measured by ‘information gain’. Put simply, information gain measures how much knowing a piece of information helps us make the desired decision; it is a way to understand what makes a difference.

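IG(T, a) = H(T) - H(T | a)

The above formula states that information gain (IG) is the difference in ‘information entropy’ H between not knowing a and knowing a. Here T stands for the outcome we want to decide on and a for the piece of information in question; the symbols follow common textbook notation. In other words, information gain shows us the benefit of knowing a.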
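To make the definition above concrete, here is a minimal Python sketch. The function names `entropy` and `information_gain` and the toy labels are assumptions made for illustration, not part of any particular library; the sketch simply computes entropy before and after splitting the outcomes by an attribute.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of outcome labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, attribute):
    """Entropy of the labels minus their entropy once the attribute is known."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    # Conditional entropy: entropy of each group, weighted by the group's share.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: four outcomes and two candidate attributes.
outcome = ["slow", "slow", "fast", "fast"]
informative = ["x", "x", "y", "y"]    # splits the outcomes perfectly
uninformative = ["x", "y", "x", "y"]  # each group is still a 50/50 mix

print(information_gain(outcome, informative))    # 1.0
print(information_gain(outcome, uninformative))  # 0.0
```

An attribute that separates the outcomes completely recovers the full one bit of uncertainty, while an attribute that leaves every group as mixed as before gains nothing.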

To make this clearer, let's look at a business case. Imagine that we are a hospitality firm and our business problem is that the turnaround for preparing a room after checkout is too long.

One of our hypotheses about the root cause of this problem is that we have too few cleaning staff.

To test our hypothesis, we have access to rows and rows of data showing the size and number of rooms, the staffing size, the average time to clean a room, and so on. The challenge is to find which variable best predicts whether our hypothesis is true. For example, Graph 1A compares the average time to clean a room with the number of cleaning staff. This report contains no information that helps us test our hypothesis, so the information gain of Graph 1A is little or zero.

In another graph (1B), we show the same comparison, but this time we break down the average time by hour of the day. This time the report gives us a clue and increases our confidence in our hypothesis.
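The contrast between the two graphs can be sketched as a ranking of variables by information gain. Everything in the snippet below is hypothetical: the records, the column names (`staff_on_shift`, `checkout_hour`, `slow`), and the idea of flagging a turnaround as slow against some target are invented for illustration and do not come from real hotel data.

```python
from collections import Counter, defaultdict
from math import log2

# Hypothetical room-turnaround records; every value is invented.
# "slow" marks whether the turnaround exceeded an assumed target.
records = [
    {"staff_on_shift": 4, "checkout_hour": "morning", "slow": False},
    {"staff_on_shift": 4, "checkout_hour": "midday", "slow": True},
    {"staff_on_shift": 6, "checkout_hour": "morning", "slow": False},
    {"staff_on_shift": 6, "checkout_hour": "midday", "slow": True},
    {"staff_on_shift": 4, "checkout_hour": "morning", "slow": False},
    {"staff_on_shift": 4, "checkout_hour": "midday", "slow": True},
    {"staff_on_shift": 6, "checkout_hour": "morning", "slow": False},
    {"staff_on_shift": 6, "checkout_hour": "midday", "slow": True},
]


def entropy(values):
    """Shannon entropy, in bits, of a list of values."""
    total = len(values)
    return -sum(c / total * log2(c / total) for c in Counter(values).values())


def information_gain(rows, variable, target):
    """Drop in entropy of `target` once `variable` is known."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[variable]].append(row[target])
    remaining = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remaining


# Rank the candidate variables from most to least informative.
ranking = sorted(
    ((information_gain(records, v, "slow"), v)
     for v in ("staff_on_shift", "checkout_hour")),
    reverse=True,
)
for gain, variable in ranking:
    print(f"{variable}: {gain:.2f} bits")
```

In this invented data set the staff count alone says nothing about slow turnarounds, while the hour of checkout explains them fully, which mirrors the difference between a report like Graph 1A and one like Graph 1B.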

From a product design perspective, we should make sure that our interface allows business users to measure the information gain of each variable, which goes beyond slicing and dicing the data and presenting it in different graphical formats.
