In a previous post, I presented a “__Big Data A to Z Glossary of my Favorite Data Science Things__”. I would like to focus here on one of those favorite things, which also happens to be the entry associated with my favorite letter.

When I learned to read and write (in first grade), my favorite letter of the alphabet soon became the letter K. Why? Well, it was primarily because my first name has a pair of them. In fact, 50% of my first name consists of K’s. Furthermore, very few of my classmates could boast even one K in their name! K was special to me.

Years later, when I was a PhD student in Astronomy at Caltech, I began to work in an exciting new field of astrophysics: studying the dynamical evolution of galaxies through their collisions and mergers with each other! In order to understand what happens when two galaxies crash into each other, I needed to investigate a sample of galaxy pairs (*i.e.,* two galaxies in close proximity to one another). As it turns out, one of the early pioneers in the field had just compiled a catalogue of double galaxies – his catalogue provided detailed data for several hundred examples of galaxies in pairs. That researcher’s name was __Igor Karachentsev__. So, when I began publishing research papers on my computational simulations and new observations of galaxy pairs, I used Karachentsev’s catalogue identifiers to label the specific pairs that I was investigating, including __K99__, __K175__, __K564__, etc. It was not long after this that I was confronted by another astronomer who accused me of being very self-centered to name these galaxies after myself: K-this and K-that! Of course, he misunderstood the origin of the K in the galaxy identifiers. I was amused by that coincidence – and so continued my interest in the letter K.

As I migrated from astrophysics research to data mining research, and ultimately data science, starting about 15 years ago, I began to see an interesting pattern – the occurrence of algorithms and concepts that began with the letter K. How kould this be happening? Kould it be just a koincidence? (*Note to editor*: my misspellings of these K words are intentional.)

Here are some examples of what I am talking about – notice the interesting pairing of a K-word with its corresponding data mining concept or technique:

- __KDD__ (Knowledge Discovery in Databases) = data mining
- __KNN__ (K-Nearest Neighbors) = classification algorithm
- __K-means__ = clustering algorithm
- __K-itemset__ = the unit of data in association rule mining
- __KNN-DD__ (K-Nearest Neighbors Data Distributions) = outlier detection algorithm
- __KD-trees__ = hierarchical (divide and conquer) indexing scheme for rapid search of high-dimensional data

In all of these cases except the first one (KDD), the K refers either to a group of K data entries (KNN, K-itemset, KNN-DD) or to K groupings in the data space (K-means, KD-trees). Most of the K-algorithms listed above are among the most representative, popular, and straightforward methods in their respective categories of machine learning: classification, clustering, and outlier detection, corresponding to supervised, unsupervised, and semi-supervised learning, respectively.
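To make one of these algorithms concrete, here is a minimal sketch of a K-Nearest Neighbors classifier in plain Python. The toy dataset, the labels "A" and "B", and the choice of K = 3 are invented for illustration:

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training neighbors.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Euclidean distance from the query point to every training point
    dists = sorted((math.dist(x, query), label) for x, label in train)
    # Majority vote among the k closest labels
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Tiny illustrative dataset: two well-separated groups of points
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(train, (2, 2), k=3))  # → A
```

Note that KNN does all of its "work" at query time (there is no training phase), which is why partitioning the distance computations across nodes parallelizes so naturally.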

Each of the above algorithms has at least one parallel implementation, which can be run on __Hadoop__ clusters, in the cloud, and/or __using MapReduce techniques__. For example, let's consider the ubiquitous (perhaps overused) __K-means clustering algorithm__.

K-means begins with the arbitrary specification of the number of clusters (number = K) and an assigned value for each cluster's mean: the data attribute values corresponding to the centroid (*i.e.,* the mean of the attribute values) for that cluster's members. These initial values are just guesses, though hopefully they are intelligent and informed guesses. After this step, the algorithm is run: scan the database, one item at a time, assigning each item to the cluster whose centroid is closest to it (*i.e.,* most similar, using some __distance or similarity metric__). After the first full scan of the database, all of the data items have been assigned to one of the K clusters. At that point, the real centroids of the clusters are calculated (by calculating the mean values of the cluster attributes for all of the cluster members). Then the database is scanned again, one item at a time, assigning each item again to the cluster whose centroid is closest to that item. Centroids are re-calculated, and the database scans and item-cluster assignments are repeated until the cluster centroids converge (either because the centroids do not change from one iteration to the next, or because the changes are smaller than some pre-specified threshold).

The final result of K-means clustering is a set of K clusters with fairly homogeneous memberships: the items in one cluster are more similar to each other than they are to the members of other clusters. How do we know if our initial choice of K was appropriate? And how can we determine how "good" the clustering is (*i.e.,* how compact the clusters are relative to their separations)? Fortunately, there are some simple but useful cluster evaluation metrics that can be used to validate the strength (purity and isolation) of the clusters, including this pair of metrics: the __Dunn Index and the Davies-Bouldin Index__.
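The iterative procedure described above (guess K centroids, assign items, recompute centroids, repeat until convergence) can be sketched in plain Python. The toy dataset, the choice of K = 2, and the convergence threshold are invented for illustration:

```python
import math
import random

def kmeans(data, k, tol=1e-6, seed=0):
    """Basic K-means: alternate assignment and centroid update until convergence."""
    rng = random.Random(seed)
    # Initial guess: K randomly chosen data points serve as the starting centroids
    centroids = rng.sample(data, k)
    while True:
        # Assignment step: each item joins the cluster with the closest centroid
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: math.dist(x, centroids[j]))
            clusters[nearest].append(x)
        # Update step: recompute each centroid as the mean of its members
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        # Converged when no centroid moves more than the threshold
        if all(math.dist(a, b) < tol for a, b in zip(centroids, new_centroids)):
            return new_centroids, clusters
        centroids = new_centroids

# Two well-separated groups of points; K-means should recover them
data = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(data, k=2)
```

The two nested loops here are exactly the "database scans" described above, which is where the parallelization opportunities discussed next come in.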

So, where is the parallelization in the above sequence of steps? It can actually be implemented in a pair of different places. First, it can be applied very effectively in the database scans that take place at each step of the cluster iteration process. The data items in the database can be partitioned across different nodes in the parallel computing environment, where the assignment of each data item to one of the K clusters is determined. That can be done in an embarrassingly parallel manner, with no reference to other data items. Second, parallelization can be applied in another stage of the K-means clustering algorithm: the calculation of the mean (centroid) of a massive set of data items (*e.g.,* within individual clusters) can also be computed quickly using a __MapReduce approach__, which usually carries out its task by invoking the "Map" function on __(key, value) pairs of data entries__.
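As a sketch of that second opportunity, the centroid (mean) calculation can be expressed in the map/reduce style, here using Python's built-in `map` and `functools.reduce` as single-machine stand-ins for the distributed versions; the toy cluster data are invented for illustration:

```python
from functools import reduce

def mapper(point):
    """Map step: emit a partial (sum_vector, count) pair for one data item.

    In a real cluster, each node would run this over its own data partition.
    """
    return (point, 1)

def reducer(a, b):
    """Reduce step: combine two partial results by adding sums and counts."""
    (sum_a, n_a), (sum_b, n_b) = a, b
    combined = tuple(x + y for x, y in zip(sum_a, sum_b))
    return (combined, n_a + n_b)

def parallel_mean(points):
    """Mean of many points via map + reduce over partial (sum, count) pairs."""
    total, count = reduce(reducer, map(mapper, points))
    return tuple(x / count for x in total)

cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
print(parallel_mean(cluster))  # → (3.0, 4.0)
```

Because the reducer is associative, the partial sums can be combined in any order and on any node, which is what makes the centroid calculation MapReduce-friendly.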

So, whether you are __Hadoop’ing and data mining some big data__ with the __MapR M7 Enterprise Database Edition for Hadoop__ or just exploring the __rich kollection of data science koncepts__, spending some quality time with the letter K is a good place to start.

This blog post was published March 27, 2014.
