Identify Your Data Breach with Apache Drill

Contributed by

5 min read

There are bad guys out there trying to steal your customers’ accounts right now. Numerous big data methods have been unable to eradicate fraud completely. It’s important to score customer transactions to prevent the takeover, but you may be ignoring crucial information about where the accounts were intercepted.

In just a few simple steps, you can analyze your data to find the source of compromise. Using Apache Drill means that you can analyze massive amounts of semi-structured transactions in seconds using the Map-Reduce model, and shut down a breach before it does real damage.

Instructions on how to read the sample data, load the UDF and run the Drill query are located in the following repository:

What does a Breach look like anyway?

It’s difficult to describe what a real breach looks like, but you will know it when you see it. To identify a potential breach, we assess the amount of activity of accounts later experiencing fraud at each merchant and then visualize the results. A real breach should stick out from the pack. The figure below (produced in Tableau) shows this process in action.

Over a million merchants contributed to the figure above. More than 800k (note the logarithmic Y-axis) had a breach score of about zero. But only one scored above 80. As the note explains, this merchant had malware on their POS system and was investigated by the authorities (i.e. an actual breach).

Derivation of the Breach Score

The Breach Score should reflect how many accounts visited each merchant before being labeled fraud in excess of what is typical. The Log-Likelihood ratio is one effective way to measure that excess. Here’s the setup for each merchant:

Based on this setup, the breach score for each merchant is calculated by the following:

Apache Drill Query

Here’s a sample of the JSON file used to develop and test the Drill query. Chances are your data may not be in this exact format, but it should boil down to a customer ID, merchant ID, and an indicator that shows whether the account turned out to be a fraud.


The following query will read that file and calculate a log-likelihood for each merchant – the higher a log-likelihood, the better the chances that too many fraud accounts visited that merchant before going fraud (i.e. the source of a breach):

select MERCH.merchant,
        MERCH.merchCountFraud as `n11`,
        MERCH.merchCountNonFraud as `n21`,
        COUNTS.countFraud as `n1dot`,
        COUNTS.countNonFraud as `n2dot`,
        loglikelihood(cast(MERCH.merchCountFraud as INT),
                cast(MERCH.merchCountNonFraud as INT),
cast(COUNTS.countFraud as INT),
                cast(COUNTS.countNonFraud as INT)) as `logLike` from (
        select 1 as `dummy`,merchant,
          sum(merchFraud) as `merchCountFraud`,
          sum(merchNonFraud) as `merchCountNonFraud` from
          (select merchant,
          case when fraud='1' then 1 else 0 end as `merchFraud`,
          case when fraud='0' then 1 else 0 end as `merchNonFraud`
          from `dfs`.`/Users/jblue/drillBreach/sample.json`
group by merchant) `MERCH`
JOIN ( select 1 as `dummy`,
        sum(totalFraud) as `countFraud`,
        sum(totalNonFraud) as `countNonFraud` from
        ( select case when fraud='1' then 1 else 0 end as `totalFraud`,
           case when fraud='0' then 1 else 0 end as `totalNonFraud`
           from ( select distinct acct,fraud from `dfs`.`/Users/jblue/drillBreach/sample.json`)
on MERCH.dummy=COUNTS.dummy
ORDER by loglike desc
limit 10

Most of the query is standard SQL; the function that does the work (“loglikelihood”) is an example of a Drill UDF. You can read more about creating custom functions with Drill here:

Merchant #5902 had visits from 16 accounts that later turned fraud and 95 that didn’t. According to the math, that was significantly more than we would have expected just by chance. A score of 7+ isn’t something to get too excited about, but simulated data doesn’t usually generate juicy results. Generally, a real breach might get twice the score of the next-highest merchant.

Running in Production

This script should be set up to run frequently. The period will be based on the number of transactions. You should always be updating your list of known frauds, , then vary the period of time you look back through to find where those fraud accounts converged on a merchant or ATM. For instance, you might take the frauds that occurred in one week and look at the 3 preceding weeks for a suspicious confluence of future frauds.

It may seem like a lot of effort to find just one compromised merchant, but it appears that eternal vigilance is the price of breach detection.

This blog post was published October 01, 2015.

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.

Get our latest posts in your inbox

Subscribe Now