March 21, 2014 | BY Dr. Kirk Borne
Nearly 50 years ago, one of the most popular musical movies of all time was released (“The Sound of Music”). Perhaps the most memorable song from that production is “My Favorite Things”. A remake of the production was shown on live television in December last year, and it inspired me to think about a few of my favorite things, particularly data science things. So, I started compiling a list, with the intention of writing an article about those favorites. But soon the list became too long for a single article, and the idea was born to put the list into an “A to Z Glossary” format. At that point, the fun challenge was thinking of interesting and useful big data science concepts that fit the glossary model. Data science is all about fitting models, so I accepted the challenge. The following is the result of those deliberations. It is a glossary that lists a few of my favorite things about big data and data science, from A to Z (actually, ZZ), one for each letter. There are no “raindrops on roses” or “whiskers on kittens” here, but ponies and elephants are fair game.
Of course, this glossary represents my own preferences, and there are many other possible choices. Please feel free to add some of your own favorite things in the comments. The descriptions provided below are very brief. This is just the first installment, and we will look more deeply into some of these glossary entries in future blog posts. So, here we go:
A – Association rule mining: unsupervised machine learning method for finding frequently occurring patterns (item sets) in discrete data (numeric or categorical).
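As a minimal sketch of the frequent item set idea, the snippet below counts how often single items and pairs occur together and keeps those that meet a support threshold. The shopping baskets, the function name `frequent_itemsets`, and the 60% threshold are all hypothetical choices for illustration. A real miner such as Apriori prunes candidates rather than enumerating every combination.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for item sets found in at least
    a min_support fraction of the transactions (brute-force count)."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # de-duplicate and fix the item order
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# Hypothetical market-basket data
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
    ["bread", "milk", "beer"],
]

frequent = frequent_itemsets(baskets, min_support=0.6)
for itemset, support in sorted(frequent.items()):
    print(itemset, support)
```

With these baskets, the only pair that survives the threshold is bread and milk, which appear together in three of the five transactions.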
B – Bayesian belief networks (BBN): probabilistic model that represents a network of conditional dependencies among a large number of variables, which enables prediction, classification, and missing value imputation.
C – Characterization: methodology for generating parameters that describe the behavior and characteristics of a data item, for use in any unsupervised learning algorithm to find clusters, patterns, and trends without the bias of incorporating class labels.
D – Deep learning: one of the hottest families of machine learning algorithms in recent years, useful for finding a hierarchy of the most significant features, characteristics, and explanatory variables in complex data sets. It is particularly useful in unsupervised machine learning of large unlabeled datasets.
E – Ensemble learning: machine learning approach that combines the results from many different algorithms, whose combined vote (from the ensemble) provides a more robust and accurate predictive output than any single algorithm can muster.
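A toy sketch of the combined vote, assuming three hypothetical rule-based classifiers that each make a different mistake when asked whether a number is at least 5. The rules and the helper name `majority_vote` are made up for illustration, and real ensembles combine trained models rather than hand-written rules.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Return the label that most classifiers predict for x."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

def truth(x):
    return x >= 5

# Three hypothetical classifiers, each wrong on exactly one input in 0..9
classifiers = [
    lambda x: x >= 4,             # wrong for x = 4
    lambda x: x >= 6,             # wrong for x = 5
    lambda x: x >= 5 and x != 8,  # wrong for x = 8
]

data = range(10)

def accuracy(predict):
    return sum(predict(x) == truth(x) for x in data) / len(data)

individual = [accuracy(clf) for clf in classifiers]
ensemble = accuracy(lambda x: majority_vote(classifiers, x))
print(individual, ensemble)
```

Because the three classifiers err on different inputs, the other two always outvote the one that is wrong, so the ensemble beats every member on this toy data.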
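F – Forests, random: an ensemble classifier that produces a “forest of trees”, yielding highly accurate models by training many decision trees, each on a random sample of the data and a random subset of the input variables, and then combining their votes. A useful by-product is a measure of variable importance: randomizing one input variable at a time reveals whether the randomization actually produces a less accurate classifier. If it doesn’t, then that variable is a candidate for removal from the model.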
G – Gaussian mixture models (GMM): an unsupervised learning technique for clustering that generates a mixture of clusters from the full data set using a Gaussian (normal) data distribution model for each cluster. The GMM’s output is a set of cluster attributes (mean, covariance, and mixing weight) for each cluster, thereby producing a set of characterization metadata that serves as a compact descriptive model of the full data collection.
H – ??: as a hat-tip to my MapR host here, I postpone “H” to the end!
I – Informatics: data science for data-intensive science. There are many examples: Bioinformatics, Geoinformatics, Climate informatics, Environment informatics, Health and Medical informatics, Biodiversity informatics, Urban informatics, Neuroinformatics, Cheminformatics, Astroinformatics, etc.
K – K-anything in data mining: K-Nearest Neighbors algorithm (for classification), K-Means (for clustering), K-itemsets (for association rule mining – see A, above), K-Nearest Neighbors Data Distributions (for outlier detection), KD-trees (for indexing and rapid search of high-dimensional data), and more KDD (Knowledge Discovery from Data) things.
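To make one of these K-things concrete, here is a minimal sketch of K-Nearest Neighbors classification: a query point takes the majority label of its k closest training points. The two-dimensional points, the labels, and the function name `knn_classify` are hypothetical, and the brute-force sort stands in for the tree indexes (such as KD-trees) that a real implementation would use.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label query by majority vote among its k nearest training points.
    train is a list of (point, label) pairs; distance is Euclidean."""
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    labels = Counter(label for _, label in neighbors)
    return labels.most_common(1)[0][0]

# Hypothetical labeled points forming two well-separated groups
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

print(knn_classify(train, (1.1, 1.0)))
print(knn_classify(train, (5.1, 5.0)))
```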
L – Locally linear embedding (LLE): a form of manifold learning that discovers the true topological shape of your data distribution, which might be quite warped and twisted when seen in the coordinate space that is represented by your easily available database attributes, though in fact the data may actually lie on a much simpler low-dimensional surface (i.e., in the natural coordinates of the data domain).
M – Multiple weak classifiers: an example of ensemble learning, applied to classification problems in which you can generate a large number of different classifiers, where none of them are particularly accurate (hence, they are weak), but when combined they can yield a strong voting (scoring) metric to determine a data item’s most likely classification. A research paper with a very interesting title was written on this subject: “Good Learners for Evil Teachers”.
N – Novelty detection: another name for outlier detection, or anomaly detection, or interestingness discovery, which I prefer to call Surprise Discovery—finding the novel, surprising, and unexpected data points or patterns in your data set that lie outside the bounds of your expectations. This even applies to social networks, in which you can find “interesting subgraphs” within the network.
O – One-class classifier: an efficient classification technique, which is used to test if a data item belongs to a particular class or not. This is useful in cases where there are a variety of alternative classes, but your attention is focused on only one of many possible outcomes. This is also used in novelty detection (see above).
P – Profiling (specifically, data profiling): a collection of data exploration methods that enable you to find the good, bad, and ugly parts of your data set. For example: examining the unique values for a database attribute (which is a great way to find typos in discrete categorical data, such as US state names—I once did this for a NASA project and I discovered that there were over 90 distinct US state names in the database).
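The unique-values check can be sketched in a few lines. The records, the column name `state`, and the reference set of valid names below are hypothetical stand-ins, not the NASA data mentioned above.

```python
from collections import Counter

def distinct_values(rows, column):
    """Count how often each distinct value appears in a column."""
    return Counter(row[column] for row in rows)

def unexpected_values(counts, valid):
    """Return the distinct values that are not in the reference set."""
    return sorted(v for v in counts if v not in valid)

# Hypothetical records with typos and inconsistent spellings
rows = [
    {"state": "Virginia"},
    {"state": "Maryland"},
    {"state": "Virgina"},
    {"state": "virginia"},
    {"state": "Maryland"},
    {"state": "MD"},
]
valid_states = {"Virginia", "Maryland"}

counts = distinct_values(rows, "state")
suspects = unexpected_values(counts, valid_states)
print(len(counts), "distinct values; suspicious:", suspects)
```

Six records that describe only two states yield five distinct values, which is the kind of mismatch that flags dirty categorical data.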
Q – Quantified and Tracked: the second half of my new definition of Big Data that I am promoting: “Big Data is Everything, Quantified and Tracked!” The quantification and measurement (tracking) of anything therefore allows data science to play a major role in nearly every application domain (hence, job security and huge job opportunities for data scientists).
R – Recommender engines: These are probably the most fun and most profitable applications of data science to big data collections. Learn more in these two articles: “Design Patterns for Recommendation Systems – Everyone Wants a Pony” and “Personalization – It’s Not Just for Hamburgers Anymore.”
S – SVM (Support Vector Machines): powerful Jedi machine learning classifier. Among classification algorithms used in supervised machine learning, SVM often produces some of the most accurate classifications. Read more about SVM in this article: “The Importance of Location in Real Estate, Weather, and Machine Learning.”
T – Tree indexing schemes: tree-based data structures, brilliantly implemented in the super-fast machine learning algorithms delivered by Skytree, developed at Georgia Tech’s FASTlab.
U – Unsupervised exploration: the purest form of data mining, exploring unlabeled datasets with unsupervised machine learning algorithms (e.g., Clustering, Association Mining, Link Analysis, PCA, Outlier Detection). One researcher expressed it this way: “unsupervised exploratory analysis plays an important role in the study of large, high-dimensional datasets that arise in a variety of applications.”
V – Visual analytics: exploratory and explanatory visualizations of large complex datasets. Visual storytelling is a critically important analytics component of a data scientist’s duties: to explore and explain discoveries in big data collections visually, because “a picture is worth a thousand words” (i.e., “a picture is worth 4 kilobytes”).
W – WEKA: free data mining package, for data exploration, profiling, mining, and visual analytics, containing hundreds of machine learning algorithms, techniques, and methods.
X – XML, specifically PMML: Predictive Modeling Markup Language, which is an XML language for describing and sharing (machine-to-machine) predictive models learned within a data mining process (such as Data Mining-as-a-Service, or Decision Science-as-a-Service).
Y – YarcData: an important new vendor in the field of big data science, specifically developing high-performance computing architectures for linked datasets. Their Urika™ product is an in-memory graph database, which can hold up to 0.5 petabytes (= 500 terabytes) of graph data in memory!
ZZ – Zero bias, Zero variance: two of the most common myths in big data analyses. These myths suggest that the sample bias and/or the variance in various parameter values should go to zero as the size of the data set gets larger and larger. This is simply not true. Sample bias is a feature of your data collection process, and the natural variance is a feature of the population you are measuring, no matter how much data you collect.
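A small simulation can illustrate the point, under stated assumptions: the true population is a standard normal distribution (mean 0, standard deviation 1), and a hypothetical collection process never records values below a cutoff of -0.5. The function name, the cutoff, and the sample sizes are all made up for this sketch.

```python
import random
import statistics

def collect_biased_sample(n, cutoff=-0.5, seed=0):
    """Draw from a standard normal population, but keep only the values
    above the cutoff, mimicking a collection process that misses the rest."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        x = rng.gauss(0.0, 1.0)  # true population: mean 0, std 1
        if x > cutoff:           # values at or below the cutoff are never recorded
            kept.append(x)
    return kept

for n in (100, 10_000, 100_000):
    sample = collect_biased_sample(n)
    mean = statistics.fmean(sample)
    spread = statistics.pstdev(sample)
    print(f"n={n:>7}  sample mean={mean:.3f}  sample std={spread:.3f}")
```

As n grows, the sample mean settles at roughly 0.5 instead of the true value of 0, and the spread of the recorded values stays well above zero. More data sharpens the estimate of the wrong number; it does not remove the bias.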
H – Hadoop (of course, Hadoop! Did you think that I forgot about the “H”?): Hadoop is the de facto big data computing paradigm. It enables distributed processing of large data sets across clusters of commodity servers. The Hadoop ecosystem now includes the compute engine, scripting language, file server, database, analytics tools, query language, and workflow manager. To learn more, check out the Executive’s Guide to Big Data and Apache Hadoop from MapR, which is available as a free download.
Come back next week to see which of my favorite big data and data science things receive more attention and deeper coverage. By the way, in case you missed the pony and elephant that were mentioned in the opening paragraph, look again at “R” and “H”.