The Fine Art of Customer Surveys: Using Apache Drill to Extract Insights From Your Survey Data

Contributed by

10 min read

Customer feedback is a valuable tool for every business, and one of the primary ways to get quality feedback is through surveys. However, asking customers to fill out lengthy surveys with 15+ questions will often result in a very low response rate. Most customers are not willing to take a long survey, and the ones who do often regret it after the first couple of questions.

The solution is to make the task of filling out a survey an easy and positive experience. Avoid long introductions to the survey – most people feel discouraged and will not continue and if they sense a cumbersome process ahead. Surveys that have a lot of irrelevant or broad questions are less likely to succeed. Try to focus the questions on specific things related to their experience rather than generic “one size fits all” questions. Make the survey accessible by structuring your questions carefully and clearly.

One way to gauge your customers’ loyalty is to use a tool called the Net Promoter, or Net Promoter Score (NPS®), which is a customer loyalty metric developed by Fred Reichheld, a Fellow of Bain & Company and a board member of Satmetrix. NPS measures the loyalty of a customer base based on the answers of a simple, direct question: "How likely is it that you would recommend our company/product/service to a friend or colleague?" The survey answers range from 0 to 10. The score is calculated using all the answers and its value will lie between -100, where no one would promote the service, to +100, where everybody would be a promoter.

NPS is used by industries across multiple verticals as a tool to identify which factors lead to better engagement, and to understand what can be done to improve customer loyalty if their company has a low score.

You can also measure the success of new business and marketing strategies by using an A/B testing approach: a group of similar users is exposed to a change that affects their service, and their feedback is compared against a control group that has not been exposed to a change. The results from this approach can help you measure your customers’ satisfaction or dissatisfaction with a particular business or marketing strategy, and will encourage proactive, data-driven decision making.

Another approach is to measure the effectiveness of local decisions made by each branch or territory. This is an opportunity to do an internal benchmarking and promote internal competition based on satisfaction, rather than just on revenue.

Technology advancements have been the driving force in the evolution of survey methods over the years. Recent technology developments in data collection, processing, analyzation and visualization can reduce the feedback loop and help scale and automate some of your decisions. In this blog post, I’ll review how can you can use big data technologies such as Apache Hive and Apache Drill to calculate Net Promoter Score and correlate that data with other customer data.

How to calculate

Let’s take a deep dive into the technical details of calculating NPS. It can be done with a fairly simple formula in any Excel spreadsheet. First, let’s assume we have a list of customer IDs and their answers to the relevant questions:

customer_id, answer

Let’s start by ignoring the ones that answered 7 and 8 – those are the undecided or neutrals. They are not unhappy with the company/product/service, but somehow it didn’t exceed their expectations enough to turn them into an enthusiastic Promoters. The ones that answered 9 or 10 are the real Promoters, and the Detractors are those with answers from 0 to 6.

In the example above, we have 10 answers from different customers, of which 3 are Detractors, 4 are Promoters and 3 are neutral. The Net Promoter Score consists of subtracting the percentage of customers who are Detractors from the percentage of customers who are Promoters.

NPS = %Promoters - %Detractors

Using the previous example we have:
% Promoters = 40%
% Detractors = 30%
NPS = 10

A score above zero is considered good.

The above calculation can be made using Apache Hive. Let’s assume we have data in a similar data as above in a file called nps.csv. This file contains the survey answers in a CSV format, the most popular structured format.

Initialize Hive by typing hive.
Create an external table pointing to the answers file as follows:

CREATE EXTERNAL TABLE nps(customer_id INT, answer INT)
LOCATION '/user/mapr/nps.csv'
tblproperties ("skip.header.line.count"="1");

To calculate the score, just run the following SQL query:

SELECT  100*(
            SUM(CASE WHEN answer >= 9 THEN 1\.  END) -
            SUM(CASE WHEN answer <= 6="" then="" 1.="" end)="" )="" count="" (*)="" as="" score="" from="" nps;="" <="" pre="">

The result will be the expected: 10.

**Drilling for more insight**

Let’s now take it a step further and attempt some ad-hoc querying. Ad-hoc querying is useful when exploring the underlying patterns in the data, and extracting more insights from it. To perform ad-hoc querying, you can use Apache Drill.

Apache Drill is an open-source distributed engine to perform low-latency querying in large datasets. It features schema-on-read capabilities, so there is no need to create a table like in Hive. It also supports other sources of data like <a href-"https:="""" products="" m7"="" target="_new">Apache HBase</a>, so you can extend the querying power by combining data from different systems.

Start Apache Drill and connect through sqlline as follows:

`bin/sqlline -u jdbc:drill:zk=local -n admin -p admin`

To calculate the NPS as above, we can execute a similar query:

SELECT 100( SUM(CASE WHEN CAST(columns[1] as INT) >= 9 THEN 1 ELSE 0 END) - SUM(CASE WHEN CAST(columns[1] as INT) <= 0="" 1="" 6="" then="" else="" end)="" )="" count()="" as="" score="" from="" dfs.="" user="" mapr="" nps.csv;="" <="" pre="">

Let’s now do some data exploration by combining it with other data such as demographics.

That data can be contained in an HBase table or in another CSV file. For simplicity, let’s assume the following demographics.csv file with customer_id, gender and dob columns:


Let’s combine both files using a JOIN operation by customer_id.

SELECT d.columns[0] as customer_id, d.columns[1] as gender, d.columns[2] as dob, CAST(nps.columns[1] as INT) as answer
FROM dfs.`/user/mapr/demographics.csv` d
JOIN dfs.`/user/mapr/nps.csv` nps ON nps.columns[0] = d.columns[0];

The result is the following:

| customer_id |   gender   |    dob     |   answer   |
| 1001        | M          | 1991-02-13 | 6          |
| 1002        | F          | 1993-01-15 | 6          |
| 1003        | M          | 1990-12-08 | 7          |
| 1004        | F          | 1986-11-30 | 9          |
| 1005        | M          | 1987-08-19 | 9          |
| 1006        | F          | 1988-06-21 | 10         |
| 1007        | M          | 1992-09-12 | 5          |
| 1008        | F          | 1986-07-05 | 8          |
| 1009        | M          | 1988-09-06 | 7          |
| 1010        | F          | 1988-05-13 | 10         |

To finish, let’s calculate the NPS of each segment:

SELECT a.age, 100*(
            SUM(CASE WHEN a.answer >= 9 THEN 1\. ELSE 0\. END) -
            SUM(CASE WHEN a.answer <= 6="" then="" 1.="" else="" 0.="" end)="" )="" count(*)="" as="" score="" from="" (="" select="" extract(year="" age(cast(d.columns[2]="" date)))="" age,="" cast(nps.columns[1]="" int)="" answer="" dfs.`="" user="" mapr="" demographics.csv`="" d="" join="" nps.csv`="" nps="" on="" nps.columns[0]="d.columns[0])" a="" group="" by="" a.age;="" <="" pre="">

The produced results are grouped by age:

+------------+------------+ | age | score | +------------+------------+ | 23 | -100 | | 22 | -100 | | 24 | 0 | | 28 | 50 | | 27 | 100 | | 26 | 66 | +------------+------------+

It’s easy to see that there are more Promoters who are 27 years old than those in their early twenties.

A similar query can be used to calculate the metric from different branches or other customer-related dimensions.

By using big data tools such as Apache Drill and Apache Hive, you can perform all kinds of analyses on your captured survey data. By being able to analyze your company’s NPS score, you can make real-time adjustments in your business operations in order to increase customer satisfaction, improve your relationships with your customers, and ultimately grow your business.

This blog post was published October 27, 2014.

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.

Get our latest posts in your inbox

Subscribe Now