
Building a good classification model requires leveraging the predictive power of your data, and that’s a challenge whether you’re looking at four thousand records or four billion. In machine learning parlance, this step is referred to as “feature extraction.” It applies whether you’re just starting to extract value from your data or you’re taking your modeling talents to Hadoop with tools like Apache Spark.

Most methods handle __numeric__ (i.e. interval) data natively, but some can’t accept __nominal__ values (i.e. discrete data with no inherent ordering). In those cases, we need a way to present the information contained in nominal variables to our modeling algorithm.

Some examples of nominal data are shown below:

**Table 1: Examples of Nominal Variables**

Zip code makes the list because the numbers don’t mean anything. The zip code 93402 isn’t twice as good as 46701, despite what the folks in Los Osos, Calif. might think.

Representing the categories as a series of binary variables (i.e. isRed, isBlue, etc.) works as long as the number of possibilities is small and the number of observations for each is large enough for your algorithm to recognize its value. Encoding nominal data with risk tables addresses both limitations and is a powerful tool for building good models.
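As a concrete sketch of the binary-variable approach, here is one way it might look in plain Python (the field and column names are hypothetical, chosen only for illustration):

```python
# Sketch of binary (one-hot) encoding for a nominal field. The "color"
# field and isRed/isBlue column names are hypothetical examples.
def one_hot(records, field):
    """Replace a nominal field with 0/1 indicator columns (isRed, isBlue, ...)."""
    categories = sorted({r[field] for r in records})
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != field}
        for c in categories:
            row["is" + c.capitalize()] = 1 if r[field] == c else 0
        encoded.append(row)
    return encoded

rows = [{"id": 1, "color": "red"}, {"id": 2, "color": "blue"}]
encoded = one_hot(rows, "color")
# encoded[0] -> {'id': 1, 'isBlue': 0, 'isRed': 1}
```

Note how one nominal column becomes as many new columns as there are categories, which is exactly why this breaks down for high-cardinality fields like zip code.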

The simplest form of risk is the ratio of the target rate in a given category to the overall rate. Using fraud as an example, the risk for a category (i) is:

Risk(i) = (frauds(i) / records(i)) ÷ (total frauds / total records)

The table below shows an example of simple fraud risk by city, calculated from the fraud and non-fraud counts using the formula above. The risks go into a lookup table; preparing an input record for your modeling algorithm then amounts to assigning the risk from the corresponding row of the table.

**Table 2: Simple Risk Table by City**

The fraud risk for Detroit is 1.84 because the fraud rate there (4.6%) is 84% higher than the overall rate (2.5%). A risk of 1.0 expresses a neutral rate. The risk in Albuquerque is less than 1.0 because the fraud rate is lower than 2.5%. Your algorithm gets the message that the likelihood of fraud in Detroit is higher than in Albuquerque. Sometimes, it is advisable to take the log of risk. This adjusts a neutral risk to 0.0 and converts high risk to positive numbers and low risk to negative. The neutral rate is important because that’s what you’ll assign to new categories when scoring.
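A minimal sketch of simple risk and log(risk) in Python follows; the city counts below are made up for illustration and are not the figures behind the article's table:

```python
import math

# Simple risk per category: (category fraud rate) / (overall fraud rate).
# A risk above 1.0 means riskier than average; log(risk) recenters the
# neutral point at 0.0. Counts are illustrative only.
def simple_risk(frauds, totals):
    overall = sum(frauds.values()) / sum(totals.values())
    return {city: (frauds[city] / totals[city]) / overall for city in totals}

frauds = {"Detroit": 46, "Albuquerque": 5}
totals = {"Detroit": 1000, "Albuquerque": 475}
risks = simple_risk(frauds, totals)
# Guard against zero risks before taking logs (the Sarasota problem).
log_risks = {city: math.log(r) for city, r in risks.items() if r > 0}
```

With these made-up counts, Detroit lands above 1.0 and Albuquerque below it, so their log risks come out positive and negative respectively.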

However, *there are some major limitations with this approach*. In the table, Sarasota has a fraud rate of 0.0 because we observed no fraud in only 3 opportunities. We’re getting a signal that the risk of fraud there might be low, but assigning a zero fraud risk based on 3 observations (compared to 475 for Albuquerque) doesn’t seem right. Even worse, if we want to take the log of a zero risk, we’re going to run into problems. We need to incorporate the size of the categories in calculating risk…

To address the limitations of simple risk (and protect against over-fitting), a smoothing parameter is introduced. With overall fraud rate r and smoothing parameter s, the risk for category (i) becomes:

Risk(i) = ((frauds(i) + s × r) / (records(i) + s)) ÷ r

The effect of the smoothing parameter is to pull the simple risk towards neutrality (1.0). When the smoothing parameter is zero, the formula defaults to the simple risk. As the parameter increases, the effect on categories with few observations gets larger. The table below shows the improved risk by city with smoothing parameter (s) set to 25 and 50:

**Table 3: Smoothed Risk by City**

Adding the parameter did little to affect the risk in Detroit where there were many observations. The impact comes in assigning practical risks to small categories. Instead of an unlikely risk of 0.0, Sarasota now gets a more reasonable smoothed risk of 0.89 when s=25. Using s=50 has a greater effect on its risk, pulling it even closer to neutrality (0.94). Even with a small s, you can calculate the log(Risk) with confidence. A rule of thumb for choosing the parameter is the following: “*set s to the number of observations where you would like the smoothing effect to taper off*.”
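Here is a short sketch of smoothed risk in Python. The formula is a reconstruction from the behavior described in the text, and the hypothetical "Elsewhere" bucket is contrived so that the overall rate works out to 2.5%; with Sarasota's 0 frauds in 3 records, the reconstruction reproduces the 0.89 and 0.94 values quoted above:

```python
# Smoothed risk: pull each category's risk toward neutrality (1.0) in
# proportion to the smoothing parameter s. The formula is reconstructed
# from the smoothing behavior described in the post.
def smoothed_risk(frauds, totals, s=50):
    overall = sum(frauds.values()) / sum(totals.values())
    return {c: ((frauds[c] + s * overall) / (totals[c] + s)) / overall
            for c in totals}

# Sarasota: 0 frauds in 3 records. The "Elsewhere" bucket is contrived
# so the overall fraud rate is exactly 2.5% (100 / 4000).
frauds = {"Sarasota": 0, "Elsewhere": 100}
totals = {"Sarasota": 3, "Elsewhere": 3997}
round(smoothed_risk(frauds, totals, s=25)["Sarasota"], 2)  # 0.89
round(smoothed_risk(frauds, totals, s=50)["Sarasota"], 2)  # 0.94
```

Setting s=0 recovers the simple risk (and Sarasota's problematic 0.0), which is why a nonzero s is what makes log(Risk) safe to take.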

Building smoothed risk tables from your training data and applying your model to the test set produces better generalization (i.e. your model learns that fraud is __unlikely__ in Sarasota, but it’s not __impossible__).

The real power of log(risk) is assessing the *collective* risk from situations involving multiple categories. The table below shows a smoothed risk table for healthcare claim overpayment by procedure codes:

**Table 4: Smoothed Risk Table of Claim Overpayment by Procedure Code**

A healthcare claim can be composed of multiple codes with varying risk, but the decision to deny is usually made at the __claim__ level. You can assess the overall risk of a claim overpayment by calculating the sum of log(risk). For example, a claim with a 90-min office visit __and__ a liver biopsy would have a combined risk of 0.060 (0.223 − 0.163). A liver biopsy alone is associated with a diminished risk of overpayment, but the stronger risk noted for the 90-min office visit ensures that the risk for this co-occurrence is interpreted by your algorithm as positive.
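Claim-level scoring is then just a sum over the codes on the claim. In this sketch, the two log-risk values are the ones quoted in the example above, while the code labels are hypothetical stand-ins for real procedure codes:

```python
# Claim-level risk: sum log(risk) over a claim's procedure codes.
# Log-risk values are from the example in the post; the string labels
# are hypothetical stand-ins for actual procedure codes.
log_risk = {"office_visit_90min": 0.223, "liver_biopsy": -0.163}

def claim_log_risk(codes):
    """Combined log-risk for all procedure codes on one claim."""
    return sum(log_risk[code] for code in codes)

combined = claim_log_risk(["office_visit_90min", "liver_biopsy"])
```

The biopsy's negative log risk partially offsets the office visit's positive one, but the net result stays positive, matching the interpretation in the text.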

Programming the risk tables is straightforward in any language, but those looking for sample data and a Python script can find it here: __SparkRiskTables__.

This approach is applicable to nominal data regardless of its provenance. But what happens when you have billions of records?

Big data problems are usually addressed with a distributed file system (such as Hadoop’s). The files are split among many nodes; to process a file as a whole, variations on the “map-reduce” programming model are used. You can write your own application to access this distributed data, but it’s much simpler to use tools (e.g. Pig, Hive, etc.) that take care of the framework for you.

Apache Spark has a library for training classification models in a Hadoop environment (MLlib, which includes many other machine learning tools too), but you still have to provide that algorithm with good features to consider. Smoothed risk tables are just as applicable here; we just need to use map-reduce to count frauds and non-frauds before calculating risk.

The pseudocode to create smoothed risk tables in Spark appears below:

- Read in data from the file system
- Count number of frauds per category (via map-reduce)
- Count number of total records per category (map-reduce again)
- Note: you could combine the steps above so only one MR job is needed

- Calculate total frauds & records (don’t need another MR because we can just sum over the category counts)
- Iterate through the categories and calculate smoothed risk using fraud & total counts
- Note: the default smoothing parameter is 50, but another value can be passed at runtime.
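The steps above can be sketched in plain Python. This is only a local stand-in for the distributed version: in Spark, the two counting steps would be map-reduce jobs (e.g. reduceByKey over per-category pairs), and the smoothing formula is the same reconstruction of the behavior described in the post:

```python
from collections import Counter

# Plain-Python sketch of the pseudocode above. A single pass over
# (category, is_fraud) records stands in for the two map-reduce
# counting jobs; in Spark these would be reduceByKey operations.
def build_risk_table(records, s=50):
    frauds, totals = Counter(), Counter()
    for category, is_fraud in records:      # "map" phase: count per category
        frauds[category] += int(is_fraud)
        totals[category] += 1
    # Overall totals need no extra job: just sum the per-category counts.
    overall = sum(frauds.values()) / sum(totals.values())
    return {c: ((frauds[c] + s * overall) / (totals[c] + s)) / overall
            for c in totals}

table = build_risk_table(
    [("Detroit", True), ("Detroit", False), ("Albuquerque", False)], s=0)
```

With s=0 this degrades to simple risk, so the tiny example above assigns Detroit a risk above 1.0 and Albuquerque a risk of 0.0; in practice you would keep a nonzero s for exactly the reasons covered earlier.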

Some sample data, an sbt file, and Scala code, plus instructions to help you get started with Spark, are located in the same repository noted above: __SparkRiskTables__.

This blog post was published May 06, 2015.
