Data Science: Do the Numbers - Part 1

Contributed by

11 min read

As a lifelong computational scientist (and now data scientist) I have always been fascinated with numbers, especially lists and tables of things (= databases!). For example, I thought early in life that I wanted to be a Math major in college and study number theory so that I could learn all of the amazing ways to do cool stuff with numbers. But I also wanted to study the wonders of the Universe as an astronomer, so I went on to get a PhD in astrophysics! That career path allowed me to study and apply all of the disciplines that I enjoy: math, physics, astronomy, computational modeling, and data science! It was numbers all the time!

One of the more entertaining and useful branches of mathematics (for applying numbers to data science problems) is combinatorics, which focuses on combinations of objects that belong to a finite (countable) set, subject to specific constraints or criteria. One mathematician describes it this way: "Combinatorics is mainly about counting. The nice thing about counting is that you learn something about the thing that you are counting." That is a perfect definition for demonstrating why combinatorics is useful in data science. I describe here a few examples of interesting numbers and/or combinations of numbers in data science applications (or else related to data science in some way).

Example 1: Girl Scout Cookies

I often gave this assignment to my computational data science students: if the Girl Scouts Cookie Catalog lists 25 different types of cookies (which is a finite set of objects), and you have enough money for only 5 boxes of cookies (which is a constraint), then how many different combinations of 5 different types of cookies must you consider if you want to examine all such combinations? Before we find the answer to this question, let us start with a simpler version of the problem: suppose that there are only 4 different types of cookies (A, B, C, and D), and you have funds for only 2 boxes. Here are the different unique combinations that exist in this simple case: AB, AC, AD, BC, BD, CD. Hence, there are 6 different unique combinations in this case.

The combinatorial theorem from mathematics provides the general solution to our problem for any set of numbers. We will apply that theorem to our case with "25 types of cookies, choose 5". To calculate this quickly, we need to make use of a special math function -- the factorial, expressed with an exclamation point (N!). N-factorial refers to the product of N multiplied times each positive integer less than N, until you reach 1. For example: 5! = 54321 = 120. And 25! = 252423...21, which is a very large number, greater than 10 to the power of 25 (that is, 1 followed by 25 zeros). So, the combinatorial theorem (which uses the factorial function) specifies the number of ways to choose k elements from a set of N: the value can be computed as (N!) divided by (k!)*(N-k)!.

For our girl scout cookie problem, the result of 25! divided by 5!20! is 53,130. We can validate this by applying the theorem to our simple case in which there were 4 types of cookies, choose any 2: the result is 4! (24) divided by 2!2! (4), which equals 6. So, if you are contemplating all possible combinations of 5 different types of Girl Scout cookies from the catalog of 25 types, then you need to consider over 53,000 combinations. If you spend 1 second thinking about each of those options, then you need nearly 15 hours to complete the task -- unless you perform the evaluations in parallel instead of in series. With a parallel computer working on the problem, perhaps with 100 compute nodes in the cluster, requiring only 1 millisecond to evaluate each combination, then the entire task (53,130 evaluations) can be completed in half a second! Ka-ching! Score one for distributed parallel computing!

Example 2: More Girl Scout Cookies

Now consider the same problem as above but with a twist. Here is the rationale: I don't necessarily want 5 different types of cookies. I might want 2 or more of some types of cookies. So, the new problem is this: given 25 different types of cookies, with funding sufficient for only 5 boxes (of any type), how many ways can I make the selection of 5 boxes? Let us start again with the simple example of 4 types (A, B, C, D), pick 2 (including duplicates). The 16 options are: AA, AB, AC, AD, BA, BB, BC, BD, CA, CB, CC, CD, DA, DB, DC, DD. Note that there are some duplicates in this list (e.g., BA and BA are called degeneracies: they correspond to the two different ways of making a selection that leads to the same final result) -- that's okay since it reflects the actual probability distribution. For example, think of rolling two dice (6 values, choose 2): there are 36 possible outcomes, but there are several degeneracies, such as two ways that you can get a "1" and a "6".

In general, to select k items (including duplicates) from a set of N, there are N choices for each of the k selections, so the total number of combinations is N to the power of k. Therefore, for our Girl Scout cookie problem, there are 2525252525 (= 25 to the 5th power = 9,765,625) combinations of "given 25 things to choose from, pick 5 (including duplicates)." This is a much larger number than in our first example above (more than 180 times larger), and that is a critical fact to remember (which we will revisit in Part 2 of this article). As we noted before, computing these types of combinatorial selections in parallel is much faster than computing them serially.

Example 2b: Blogging

While describing the above example, a related example came to mind. It occurred to me that there are at least four dimensions to writing good blogs: creative (original), interesting (engaging), informative (educational), and useful (applicable to a real use case). Let us assume for simplicity that there are 5 possible ratings in each of those dimensions: one-star (poor), two-star, three-star (average), four-star, or five-star (best). Thus, there are 625 (= 5 to power of 4) possible overall blog rankings since there are 5 rating choices in 4 categories (555*5). Only 16 of those overall blog rankings have either four-star or five-star rating in all four categories: given 2 ratings to choose from, pick 4, one for each of the 4 categories (= 2 to the power of 4 = 16). Hence, it is hard to write a blog that is 5-star in all categories: creative, interesting, informative, and useful. In fact, it is about 40 times (625 / 16) harder than a random-quality blog. This reminds me of the old adage "If 100 monkeys type on keyboards long enough, eventually they will produce Hamlet." It turns out that the likelihood of such a thing happening is infinitesimally small. We will examine that conjecture, plus some others, using more combinatorics in Part 2 of this article.

Example 3: Graph Analytics

Before we conclude this first part of the article, let us look at one more topic: graphs! In mathematics, graph theory is often listed as a subfield of combinatorics. Graphs are defined as "mathematical structures used to model pairwise relations between objects." Consider a social network graph as an example that satisfies the definition of combinatorics that we gave earlier: a social network graph has nodes (a finite set of objects) and edges between nodes (specific constraints). When we perform social network analysis (or any graph analytics), we measure (i.e., count) several things: the degree of connectivity among nodes, shortest distance between nodes, connectivity within subgraphs, interesting subgraphs (with higher-than-random connectivity), authoritative nodes (with high centrality in the graph), and isolated nodes (with low connectivity to other nodes). Dynamic network analysis further includes the following: transmission, feedback, cascading, contagion, tipping points, and other causal consequences of connectivity in the graph. All of these analyses involve counting: (a) counting nodes; and (b) counting edges.

Graph analytics is one of the most important applications of data science, and it will become even more universally applied in the future, since nearly everything in the world has some relationship to other things. This "network of things" is more general and even more universal than the sensor-based internet of things that we now see emerging everywhere. Networks are truly everywhere: people, computers, things, products, processes, etc. Consequently, applications of data science combinatorics to graph analytics problems will continue to grow in number and significance.

In many data science methods (including graph analytics, social network analysis, feature engineering, ensemble modeling, optimization, and prescriptive analytics) as well as many analytics applications (e.g., digital marketing campaigns, market basket analysis, recommender engines, fraudulent behavior in financial networks), we need to test different combinations of parameters, models, objects, data sources, and validation metrics. Sampling, modeling, and testing such a combinatorially-intensive problem space is thus a feature of advanced analytics.

In such applications, we have learned that fast parallel computations are not only important, they are imperative! Fast computational applications, architectures, and algorithms are essential in advanced analytics and data science. This is where we will find Spark and other distributed parallel data computing architectures (such as MapR Event Store and the MapR Data Platform) playing a major role in the number-crunching that current and future analytics applications demand as we penetrate the seemingly impenetrable mathematical world of combinatorics.

We invite you to continue reading Part 2 of this article for more examples of data science by the numbers.

This blog post was published January 22, 2016.

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.

Get our latest posts in your inbox

Subscribe Now