A while back, we made a __list from A to Z__ of a few of our favorite things in big data and data science. We have made a lot of progress toward covering several of these topics. Here’s a handy list of these write-ups (as well as an added bonus on one of those topics at the end).

**A** – **Association rule mining:** described in the article “__Association Rule Mining – Not Your Typical Data Science Algorithm__.”

**C – Characterization:** described in the article “__The Big C of Big Data: Top 8 Reasons that Characterization is ‘ROIght’ for Your Data__.”

**H – Hadoop (of course!):** described in the article “__H is for Hadoop, along with a Huge Heap of Helpful Big Data Capabilities__.” To learn more, check out the __Executive’s Guide to Big Data and Apache Hadoop__, available as a free download from MapR.

**K – K-anything in data mining:** described in the article “__The K’s of Data Mining – Great Things Come in Pairs__.”

**L – Locally linear embedding (LLE):** described in detail in the blog post series “__When Big Data Goes Local, Small Data Gets Big__.”

**N – Novelty detection (also known as “Surprise Discovery”):** described in the article “__Outlier Detection Gets a Makeover - Surprise Discovery in Scientific Big Data__.” To learn more, check out the book __Practical Machine Learning: A New Look at Anomaly Detection__, available as a free download from MapR. As an added bonus (and because **Surprise Discovery is my most favorite** of all data science things), we provide below a few more insights into this all-important discovery method in big data analytics applications.

**P – Profiling (specifically, data profiling):** described in the article “__Data Profiling – Four Steps to Knowing Your Big Data__.”

**Q – Quantified and Tracked:** described in the article “__Big Data is Everything, Quantified and Tracked: What this Means for You__.”

**R – Recommender engines:** described in two articles: “__Design Patterns for Recommendation Systems – Everyone Wants a Pony__” and “__Personalization – It’s Not Just for Hamburgers Anymore__.” To learn more, check out the book __Practical Machine Learning: Innovations in Recommendation__, available as a free download from MapR.

**S – SVM (Support Vector Machines):** described in the article “__The Importance of Location in Real Estate, Weather, and Machine Learning__.”

**ZZ – Zero bias, Zero variance:** described in the article “__Statistical Truisms in the Age of Big Data__.”

Finally, we take another look here at **N – Novelty Detection**, which goes by many other names: outlier detection, anomaly detection, deviation detection, and (my favorite) surprise discovery! The goal of novelty detection is to find the rare thing in your data collection: the thing that is different from the rest, whose features fall outside the bounds of your normal (and/or statistical) expectations.

Outliers generally fall into one of four broad categories:

1. **Statistically explainable data points** that lie several standard deviations from the mean of the data distribution. You would not expect these in small data collections, but they will start popping up within big data collections that have millions or billions of data points.
2. **Data quality problems.** These outliers are important indicators that some data cleaning is required.
3. **Data pipeline errors.** These outliers indicate that something is wrong with the processing, wrangling, or analytics tools that you are using.
4. **Discoveries.** These outliers are the truly novel, interesting, unexpected, surprising, and potentially most insightful features in your big data collection: the proverbial “needle in the haystack!”
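The first category is easy to demonstrate for yourself. Here is a minimal sketch (a hypothetical example on made-up data, not from any of the articles above; the function name and the 4-sigma cutoff are our choices) showing why extreme-but-explainable points are routine at big data scale:

```python
import numpy as np

def sigma_outliers(x, n_sigma=4.0):
    """Flag points lying more than n_sigma standard deviations from the mean.

    In a million-point Gaussian sample, a handful of 4-sigma events are
    statistically *expected* (category 1 above); in a small sample, the
    same deviation would be genuinely surprising.
    """
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > n_sigma

rng = np.random.default_rng(0)
small = rng.normal(size=100)
big = rng.normal(size=1_000_000)
print(sigma_outliers(small).sum())  # almost certainly zero
print(sigma_outliers(big).sum())    # dozens of 4-sigma points
```

Of course, this z-score cut only handles the first category; telling a data quality problem from a genuine discovery still requires domain knowledge and characterization of the data.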

Note that novelty detection also applies to “__interesting subgraphs__” within a graph (network) database, such as social networks. A well-documented historical example of this is the anomalous (unexpected) __network connections among the 9-11 terrorists__.

Some people (including this author) would say that novelty detection is the best and most sought-after outcome of data science applications on big data. We hope and anticipate that very large data collections carry enormous potential for surprising discoveries. Such discoveries will span the full spectrum of statistics: ranging from rare one-in-a-million (or one-in-a-billion) types of objects or events (novelties), to the complete statistical specification of many classes of objects (based upon millions of instances of each class), as well as every use case in between those two ends of the statistical spectrum of discovery.

The growth in data volumes from all aspects of science, __government__, __healthcare__, __retail__, __financial services__, __telecommunications__, etc. (including data from social media, sensors, monitoring systems, and simulations) requires increasingly more efficient and more effective knowledge discovery and extraction algorithms. These algorithms are often applied in big data computing environments, such as __Hadoop clusters__. Among these algorithms are a __large variety of anomaly detection methods__ (for outlier/novelty/surprise discovery). Novelty detection algorithms enable data scientists to discover the most “interesting” objects, events, and behaviors embedded within large and high-dimensional datasets. These items are often labeled the “__unknown unknowns__.”

Effective novelty detection in data streams (including the __Internet of Things__) is essential for the rapid discovery of potentially interesting and/or hazardous events. Emerging unexpected conditions in hardware, software, or network resources need to be detected, characterized, and analyzed as soon as possible for obvious system health and safety reasons. Similarly, emerging unusual or anomalous behaviors and variations in customer behaviors, social events, mechanical devices, transportation systems, financial networks, natural environments, etc. must also be detected, characterized, and assessed promptly in order to enable rapid decision support in response to such events.

We have developed a new algorithm for novelty detection (__KNN-DD: K-Nearest Neighbor Data Distributions__) that defines an outlier as a point whose behavior (*i.e.,* whose location in parameter space) deviates in an unexpected way from the rest of the data distribution. Our algorithm evaluates the local data distribution around a test data point and compares that distribution with the data distribution within the sample defined by its K nearest neighbors. Since this KNN-DD thing is a bit sciencey, if you’re a practical-minded reader who is interested in Novelty Detection (and Surprise Discovery), please download a copy of a new book from MapR titled __Practical Machine Learning: A New Look at Anomaly Detection__, and let the discoveries begin!
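For readers who want a feel for the idea before diving into the book, here is a rough sketch of the KNN-DD concept in Python. This is our simplified illustration, not the reference implementation: the function name, the choice of K, and the use of a two-sample Kolmogorov–Smirnov test to compare the two distance distributions are all assumptions of this sketch.

```python
import numpy as np
from itertools import combinations
from scipy.stats import ks_2samp

def knn_dd_score(point, data, k=10):
    """Sketch of the KNN-DD idea: compare two distance distributions.

    d1: pairwise distances among the K nearest neighbors of `point`
        (the local data distribution).
    d2: distances from `point` to each of those K neighbors.
    A small KS-test p-value suggests `point` deviates from its local
    data distribution, i.e. it is a candidate outlier.
    """
    data = np.asarray(data, dtype=float)
    d_to_all = np.linalg.norm(data - point, axis=1)
    nn = np.argsort(d_to_all)[:k]             # indices of K nearest neighbors
    neighbors = data[nn]
    d1 = [np.linalg.norm(a - b) for a, b in combinations(neighbors, 2)]
    d2 = d_to_all[nn]
    _, p_value = ks_2samp(d1, d2)
    return p_value                            # low p-value => surprising point

rng = np.random.default_rng(1)
cluster = rng.normal(0, 1, size=(200, 2))
print(knn_dd_score(np.array([0.0, 0.0]), cluster))  # inlier: larger p-value
print(knn_dd_score(np.array([8.0, 8.0]), cluster))  # far outlier: tiny p-value
```

The appeal of this style of test is that it is local and nonparametric: it makes no assumption about the global shape of the data distribution, only that a normal point should look statistically like its neighborhood.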

This blog post was published October 02, 2014.
