Dimensionality reduction is a critical component of any solution dealing with massive data collections. Being able to sift through a mountain of data efficiently in order to find its key descriptive, predictive, and explanatory features is a fundamental capability for coping with the Big Data avalanche. Identifying the most interesting dimensions of data is especially valuable when visualizing high-dimensional (high-variety) big data and when telling your data’s story.
There is a “good news, bad news” angle here. First, the bad news: the human capacity for visualizing multiple dimensions is very limited: 3 or 4 dimensions are manageable; 5 or 6 dimensions are possible; but more dimensions are difficult-to-impossible to assimilate. Now for the good news: the human cognitive ability to detect patterns, anomalies, changes, or other “features” in a large complex “scene” surpasses most computer algorithms for speed and effectiveness. In this case, a “scene” refers to any small-_n_ projection of a larger-N parameter space of variables.
In data visualization, a systematic ordered parameter sweep through an ensemble of small-_n_ projections (scenes) is often referred to as a “grand tour”, which allows a human viewer of the visualization sequence to quickly spot any patterns, trends, or anomalies in the large-N parameter space. Even such “grand tours” can miss salient (explanatory) features of the data, especially when the ratio N/_n_ is large.
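To make the idea concrete, here is a minimal sketch of generating grand-tour “frames” with NumPy: each frame is a random orthonormal 2-D projection (a scene) of an N-dimensional data set. The function name and parameters are illustrative, not a reference implementation of any particular grand-tour algorithm.

```python
import numpy as np

def grand_tour_frames(data, n_frames=10, seed=0):
    """Yield a sequence of random 2-D projections ("scenes") of
    high-dimensional data, as in a simple grand tour."""
    rng = np.random.default_rng(seed)
    n_dims = data.shape[1]
    for _ in range(n_frames):
        # Draw a random N x 2 basis and orthonormalize it via QR.
        basis, _ = np.linalg.qr(rng.standard_normal((n_dims, 2)))
        yield data @ basis  # each frame is an (n_samples, 2) scene

# Example: 1,000 points in a 20-dimensional parameter space.
X = np.random.default_rng(1).standard_normal((1000, 20))
frames = list(grand_tour_frames(X, n_frames=5))
```

A true grand tour interpolates smoothly between projections rather than jumping randomly, but the essential output is the same: a stream of low-dimensional scenes for a human (or an algorithm) to scan.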
Machine learning algorithms (e.g., the random forest algorithm) are increasingly effective at finding the most explanatory (most predictive) features in big data. But that presumes that you already know what needs to be explained! That is a supervised learning approach (in which you know in advance the key classes of objects and events represented within your data). But what if you don’t know those key classes yet? How do you find the interesting features within your data in the first place? That requires an unsupervised learning approach along with some human understanding of what defines “interesting.”
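The supervised case mentioned above can be sketched in a few lines with scikit-learn (assuming it is available); the synthetic data and labels here are purely illustrative. Because the class label is known in advance, the random forest can rank the features that explain it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
# Only features 0 and 3 actually drive the (known-in-advance) class label.
y = (X[:, 0] + X[:, 3] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranked = np.argsort(model.feature_importances_)[::-1]
print(ranked[:2])  # the two most explanatory features
```

Without labels (`y`), this ranking is unavailable, which is exactly why the unsupervised, “what counts as interesting?” question is harder.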
Consequently, a cognitive analytics approach that combines the best of both worlds (machine learning algorithms and human perception) will enable efficient and effective exploration of large high-dimensional data. One such approach is to apply computer vision algorithms, which are designed to emulate human perception and cognitive abilities.
Computer Vision (CV) is a methodology (based on a set of algorithms) that enables computers to interpret what a sensor visually perceives. CV is not a new field, but it has traditionally been applied primarily to image processing and image analysis. CV algorithms include edge-detection, gradient-detection, motion-detection, change-detection, object-detection, segmentation, template-matching, and pattern recognition. Many of these same algorithms can be applied to high-dimensional data streams that are not images but are “scenes” (such as still frames in a grand tour) that are projections of high-dimensionality data into lower-dimension parameter spaces. This is truly a cognitive analytics approach.
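As a hedged illustration of that idea, a 2-D projection of non-image data can be binned into a density map and then run through a simple gradient (edge) detector, exactly as one would treat an image. The binning resolution and the finite-difference gradient here are illustrative choices, not a specific CV library’s method.

```python
import numpy as np

def scene_edges(xy, bins=64):
    """Treat a 2-D projection as an image: bin it into a density map,
    then apply finite-difference gradient (edge) detection."""
    density, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=bins)
    gy, gx = np.gradient(density)   # per-axis gradients of the density map
    return np.hypot(gx, gy)         # gradient-magnitude "edge map"

# Two well-separated clusters produce strong edges at cluster boundaries.
rng = np.random.default_rng(0)
xy = np.vstack([rng.normal(-3, 0.5, (500, 2)),
                rng.normal(3, 0.5, (500, 2))])
edges = scene_edges(xy)
```

The “edges” in such a map correspond to boundaries between data regimes, which is often exactly the structure a human eye would flag in a grand-tour frame.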
One possible outcome of using CV is the generation of “interestingness metrics” that signal to the data end-user the most interesting and informative features (or combinations of features) in high-dimensional data (or that are discovered in a grand tour). Interestingness can be measured using specific observable parameters or can be inferred via the detection of interesting patterns in the data. An example of the latter is latent (hidden) variable discovery.
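One very simple interestingness metric, sketched below under the assumption that clumpy, structured scenes are more interesting than diffuse featureless ones, is the variance of a scene’s binned density: the scoring function and its threshold-free ranking are illustrative, not a standard metric.

```python
import numpy as np

def interestingness(xy, bins=32):
    """Score a 2-D scene by how non-uniform its density is; clumpy,
    structured projections score higher than featureless ones."""
    density, _, _ = np.histogram2d(xy[:, 0], xy[:, 1],
                                   bins=bins, density=True)
    return float(np.var(density))

rng = np.random.default_rng(0)
structured = np.vstack([rng.normal(-2, 0.3, (500, 2)),
                        rng.normal(2, 0.3, (500, 2))])   # two tight clusters
diffuse = rng.normal(0, 2.0, (1000, 2))                  # one broad blob
scores = {"structured": interestingness(structured),
          "diffuse": interestingness(diffuse)}
```

Ranking grand-tour frames by such a score lets the machine pre-select the handful of scenes worth a human’s attention.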
Latent variables are not explicitly observed but are inferred from the observed features in a data set. Latent variables are inferred primarily because they are the variables that cause the all-important interesting descriptive, predictive, and explanatory patterns seen in the data set. Latent variables can also be concepts that are implicitly represented by the data (e.g., the “sentiment” of the author of a social media posting).
Some latent variables are “observable” in the sense that they can be generated through some “yet to be discovered” mathematical combination of several of the measured variables. Consequently, these cases offer an obvious opportunity for dimensionality reduction in the visual exploration of large high-dimensional data.
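Such a “mathematical combination” case can be demonstrated with a small sketch: a single hidden variable drives several observed variables, and PCA (computed here via NumPy’s SVD) recovers the combination as the first principal component. The weights and noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.standard_normal(1000)            # the unobserved driver
# Observed variables are noisy linear functions of the latent variable.
weights = np.array([2.0, -1.0, 0.5, 3.0])
X = np.outer(latent, weights) + 0.1 * rng.standard_normal((1000, 4))

# PCA via SVD: the first principal component recovers the latent direction.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
recovered = Xc @ Vt[0]

corr = abs(np.corrcoef(latent, recovered)[0, 1])
```

Four observed dimensions collapse to one recovered latent dimension: dimensionality reduction and latent-variable discovery in the same step.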
Latent variable models are used in statistics (e.g., in Bayesian statistics, or with Latent Dirichlet Allocation) to model variables that are not directly observed but are instead inferred from the variables that are observed. Latent variables are widely used in social science, psychology, economics, life sciences and machine learning. In machine learning, many problems involve collecting high-dimensional multivariate observations and then hypothesizing a model that explains them. In such models, the role of the latent variables is to represent properties that have not been directly observed.
After inferring the existence of latent variables, the next challenge for the data scientist is to understand them. This can be achieved by exploring their relationship with the observed variables (e.g., using Bayesian Networks, Linked Data, and Graph Models). Several correlation measures and dimensionality reduction methods such as PCA can also be used to measure those relationships. Since we don’t know in advance what relationships exist between the latent variables and the observed variables, more generalized nonparametric measures like the Maximal Information Coefficient (MIC) can be used.
MIC became popular in part because it provides a straightforward R-squared type of correlation strength estimate that objectively measures the dependency among variables within a high-dimensional data set. Since we don’t know in advance what a latent variable actually represents, it is not possible to predict the type of relationship that it might possess with the observed variables. Consequently, a nonparametric approach makes sense in the case of large high-dimensional data, for which the interrelationships among the many variables are a mystery. Exploring variables that possess the largest values of MIC can help us to understand the type of relationships that the latent variables have with the existing variables, thereby achieving both dimensionality reduction and the explicit specification of a parameter space in which to conduct visual exploration of high-dimensional data.
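MIC itself is typically computed with a dedicated library (e.g., minepy); as a rough, hedged stand-in, the sketch below estimates a binned mutual information score, which shares MIC’s key property of being nonparametric and sensitive to nonlinear relationships. The bin count and test data are illustrative.

```python
import numpy as np

def binned_mi(x, y, bins=16):
    """Rough nonparametric dependence score: mutual information
    estimated from a 2-D histogram (a simplified stand-in for MIC)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                              # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 5000)
nonlinear = binned_mi(x, x**2)                # strong nonlinear dependence
independent = binned_mi(x, rng.uniform(-1, 1, 5000))
```

Note that a linear correlation coefficient would score the `x` versus `x**2` relationship near zero, whereas a mutual-information-style measure (like MIC) flags it clearly, which is exactly why such measures suit latent-variable exploration.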
In big data analytics, there are at least 5 levels of analytics maturity:

1. Descriptive Analytics refers to hindsight (reporting) on your domain from your data.
2. Diagnostic Analytics refers to oversight (real-time reporting) of your domain through your data.
3. Predictive Analytics refers to obtaining and applying foresight from your data (for predicting outcomes).
4. Prescriptive Analytics refers to obtaining and applying insight from your data (for proactively achieving optimal outcomes).
5. Cognitive Analytics refers to delivering the “right sight” from your data = “learning the right question to ask your data, at the right time, at the right place, for the right object (or event), in the right context.”

Now, that is most interesting!
The cognitive analytics techniques described here can help data end-users to discover and understand hidden data patterns that may lead to the most interesting insights (and the right questions to ask) from their massive data collections.
In such applications of massive data exploration, data explanation, and data exploitation, we acknowledge that fast analytics architectures and algorithms are essential tools for the modern data scientist. Tools like Apache Spark and other fast data computing architectures (such as MapR Event Store and the MapR Data Platform) provide the number-crunching power that cognitive analytics applications demand, and tools like Apache Drill provide access to all of the variables and features across your data repositories in order to find what’s most interesting in your data.