Four Examples of Characterizations for Discovery from Big Data


We previously discussed the _“Top 8 Reasons that Characterization is Right for Your Data.”_ Here we move the discussion of characterization from the theoretical to the practical, by providing four simple examples of characterizations of data. In each of these cases, the set of characterizations that are generated can then be fed into different types of analytics algorithms for discovery from your data: predictive patterns, clusters (segments), associations, correlations, trends, and anomalies (outliers, surprises).

1. **Strengths of signals from sensors:** (Note that even social media is a sensor – of customer sentiment and customer engagement.) You can characterize the strength of the signal and then track it as a function of time. For example: you can measure customer sentiment during a real-time advertising campaign (perhaps Super Bowl ads, or ads during the World Cup games, or any other real-time events that you monitor and capture data from). You can rate the signal strength on a simple scale, such as N3, N2, N1, P0, P1, P2, P3, which ranges from strongly negative (N3) through negative (N2), mildly negative (N1), neutral (P0), mildly positive (P1), and positive (P2) to strongly positive (P3). This set of characterizations applied to the real-time data stream yields a condensed representation of the full data set, and yet it still provides the key signal that you are searching for, concisely and informatively. If you see a pattern developing (such as P1 P1 P0 P0 N1 N2 N2 N2), then you might want to take one type of action, whereas you may take a different action if you detect an opposite pattern (P1 P1 P2 P2 P3 P2 P3 P3). We can use these characterizations to find trends, predictive patterns, or correlations with external events.
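The binning described above can be sketched in a few lines of code. This is a minimal sketch, not a prescribed method: the score range, the threshold values, and the sample scores are all hypothetical assumptions chosen to illustrate the seven-level N3…P3 scale.

```python
def characterize(score):
    """Map a sentiment score in [-1, 1] to one of the seven symbolic labels.

    The bin boundaries below are illustrative assumptions, not fixed rules.
    """
    bins = [(-0.75, "N3"), (-0.4, "N2"), (-0.1, "N1"),
            (0.1, "P0"), (0.4, "P1"), (0.75, "P2")]
    for upper, label in bins:
        if score < upper:
            return label
    return "P3"

# Hypothetical per-interval sentiment scores captured during a campaign:
scores = [0.2, 0.25, 0.05, -0.02, -0.2, -0.5, -0.45, -0.55]
pattern = [characterize(s) for s in scores]
print(pattern)  # ['P1', 'P1', 'P0', 'P0', 'N1', 'N2', 'N2', 'N2']
```

The resulting symbolic pattern matches the downward-drifting example in the text, and a rule or model watching the stream can trigger on such sequences without ever seeing the raw scores.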

2. **Changes in a numerical sequence:** You can represent the relative change of data values in a numerical sequence with simple characterizations. For example, you can measure the changes in the trading prices of stocks in the financial markets over consecutive short time intervals (e.g., every minute, or every 10 seconds). Each change is either up (U), down (D), or no change (0). A stock’s behavior may be UUUDUU00UU over a period of 10 time increments (a temporal sequence that is trending upward). Another stock may be DD00UDD00D (a temporal sequence that is trending downward). This kind of characterization can be repeated for thousands of temporal patterns, and for more than 10 time increments. After you collect these temporal sequence patterns for thousands of stocks (or other events that you may be measuring), you can cluster them (to find similarly behaving stocks), or identify anomalous behaviors (that are completely different from other stocks), or identify oppositely behaving stocks (one goes up, the other goes down), or perhaps even discover predictive patterns (e.g., if you find that a certain pattern has a high probability of being followed by another specific pattern). These symbolic characterizations (U, D, 0) provide a condensed representation of the changes in numerical sequences without the burden of carrying around absolute numerical values (which are almost certainly different from one entity to another, and thus such complex numerical sequences would be very difficult to cluster and correlate). A similar set of temporal pattern characterizations was used in an article that asked this question: “What shape is the economic recovery in: U, V or W?” In this example, the characterization U, V, or W represents the behavior of several data points in the sequence (e.g., W represents DUDU, and V represents DDUU, while U represents D00U).
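The U/D/0 encoding described above is straightforward to implement. This is a minimal sketch with a hypothetical price series chosen so that it reproduces the first example sequence from the text.

```python
def encode_changes(prices):
    """Return a string of U/D/0 symbols, one per consecutive price change."""
    symbols = []
    for prev, curr in zip(prices, prices[1:]):
        if curr > prev:
            symbols.append("U")
        elif curr < prev:
            symbols.append("D")
        else:
            symbols.append("0")
    return "".join(symbols)

# Hypothetical prices over 11 consecutive time increments:
prices = [10.0, 10.1, 10.2, 10.3, 10.2, 10.3, 10.4, 10.4, 10.4, 10.5, 10.6]
print(encode_changes(prices))  # UUUDUU00UU
```

Because the encoding discards absolute price levels, sequences from a $10 stock and a $1,000 stock become directly comparable strings, which is exactly what makes clustering and pattern matching across thousands of entities tractable.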

3. **Categories of purchases:** You can use short labels to represent different categories of purchases by your customers. For example, suppose you categorize the purchases of 20 of your retail store customers in this way: {BP, HG, M5}, {DC, RM, M5}, {FC, M5, PS}, {PS, FC}, {DC, FC}, {FC}, {DC}, {DC, RM}, {HG, DC, AW}, {M5, CD}, {CD}, {HG, CD}, {RM, FC, PS}, {MS, MC}, {DC, MC, RM}, {RM, BP}, {RM}, {CO, DC}, {CO, DC, RC}, {CO, DC, FC, PS}. Each of these labels is a characterization of a particular category of product. For example, CO represents colas and soft drinks; M[?] represents some category of music CD’s (MC=classical; MS=soft rock; M5=boy bands); and so on. You can quickly mine these simple characterizations of customer purchase patterns for associations and patterns – you will discover that 20% (4 out of the 20) of the customers purchase PS and FC together, and 100% of those that purchase PS also purchase FC. In this case, PS is the label for Picnic Supplies, and FC is the label for Fried Chicken. If you have access to multi-channel customer data, such as social media, then the next time one of your customers updates their social media status with a statement that they are going on a summer picnic, you can send them a discount (or other offer) on fried chicken. They may consequently be encouraged to come to your store to buy the picnic supplies (at full price) and the fried chicken (at a discount price). This form of association mining works well if you aggregate and assign a single label to different products that belong to the same product category. This condensed representation of the longer, more complex product descriptions enables unstructured data to become more structured, and thus easier and more efficient to organize and mine for interesting patterns.
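The 20%-support and 100%-confidence figures above can be verified with a few lines of code. This is a minimal sketch using the 20 baskets from the example; `rule_stats` is a hypothetical helper name, and real work at scale would use an association-mining library rather than this brute-force count.

```python
# The 20 characterized market baskets from the example above:
baskets = [
    {"BP", "HG", "M5"}, {"DC", "RM", "M5"}, {"FC", "M5", "PS"}, {"PS", "FC"},
    {"DC", "FC"}, {"FC"}, {"DC"}, {"DC", "RM"}, {"HG", "DC", "AW"},
    {"M5", "CD"}, {"CD"}, {"HG", "CD"}, {"RM", "FC", "PS"}, {"MS", "MC"},
    {"DC", "MC", "RM"}, {"RM", "BP"}, {"RM"}, {"CO", "DC"},
    {"CO", "DC", "RC"}, {"CO", "DC", "FC", "PS"},
]

def rule_stats(baskets, antecedent, consequent):
    """Support and confidence for the association rule antecedent -> consequent."""
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    ante = sum(1 for b in baskets if antecedent in b)
    support = both / len(baskets)
    confidence = both / ante if ante else 0.0
    return support, confidence

support, confidence = rule_stats(baskets, "PS", "FC")
print(support, confidence)  # 0.2 1.0
```

Running this confirms the claim in the text: the rule PS → FC has 20% support (4 of 20 baskets) and 100% confidence (every PS buyer also bought FC).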

4. **Meta-tagging of data:** We discussed in another article how both machine-assisted tagging (perhaps using Machine Learning) and human-assisted tagging (perhaps using crowdsourcing) can lead to extraordinarily useful metadata for search, retrieval, reuse, and discovery from big data collections: “Collaborative Annotation for Scientific Data Discovery and Reuse.” Meta-tags are essentially metadata: “data about data.” The metadata are a condensed representation of the content and context of the data. For example: identifying a set of customers who buy sunscreen and who also buy running shoes versus identifying a set of customers who buy running shoes and DVDs may help to separate two categories of exercise-conscious customers: those who run outdoors and those who run indoors on a treadmill. A lot of other data about these customers has just been condensed into a couple of meta-facts that now tell you quite a bit. Such meta-tags can be used for association mining (e.g., for use in a recommender engine), or in anomaly detection, or in clustering (segmentation).
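The outdoor-versus-treadmill segmentation above amounts to simple set operations on tag sets. This is a minimal sketch in which the customer IDs and tag vocabulary are entirely hypothetical.

```python
# Hypothetical customers, each characterized only by their purchase meta-tags:
customers = {
    "c1": {"running_shoes", "sunscreen"},
    "c2": {"running_shoes", "dvd"},
    "c3": {"running_shoes", "sunscreen", "water_bottle"},
    "c4": {"dvd"},
    "c5": {"running_shoes", "dvd", "headphones"},
}

# Segment by tag combinations: <= is the subset test on Python sets.
outdoor_runners = {cid for cid, tags in customers.items()
                   if {"running_shoes", "sunscreen"} <= tags}
treadmill_runners = {cid for cid, tags in customers.items()
                     if {"running_shoes", "dvd"} <= tags}
print(sorted(outdoor_runners), sorted(treadmill_runners))
# ['c1', 'c3'] ['c2', 'c5']
```

Each customer record has been reduced to a handful of meta-facts, yet those few tags are enough to drive the segmentation, and the same tag sets could feed a recommender, an anomaly detector, or a clustering algorithm.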

In each of these four examples, we see an illustration of condensed representations of the big data. Such characterizations are compact surrogates for the full data stream. They enable us to run our discovery algorithms more efficiently (time-to-solution) and more effectively (on a larger portion of our data set – hopefully on the complete data set). Of course, this condensation comes at a price: a lot of information contained in the full data set is being discarded. This is why we need to be careful and do some exploratory analysis before we turn our characterizations into action within any automated data pipeline processing system. We should verify that the characterizations are truly representative of the data behaviors that are important for our business goals, and we should validate that the characterizations are efficacious in identifying the expected discoveries, events, classifications, and outcomes within an independent test data set. If we are careful, methodical, and scientific in our approach to selecting and measuring characterizations, then we should find them both an efficient and an effective means for data-driven discovery from our big data collections.

An excellent data science technique to apply to your data characterizations and to begin exploring your big data for discoveries is anomaly detection, which I prefer to call “Surprise Discovery”: finding the novel, unexpected, and surprising thing, or pattern, or behavior in your data – the unknown unknowns! For more information on this topic and some how-to advice, check out the new O’Reilly publication by Ted Dunning (Chief Application Architect at MapR) and Ellen Friedman: _“Practical Machine Learning: A New Look at Anomaly Detection.”_ Start characterizing and start discovering surprising patterns, trends, and associations in your big data collections today! If you need some help with that, try out the MapR App Gallery to get maximum value and maximum ROI (Return On Innovation) from your big data and all of those characterizations that you create.
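To tie the two ideas together, here is one simple way “Surprise Discovery” can operate directly on the symbolic characterizations from example 2. This is a minimal sketch under assumed inputs: the ticker names and sequences are hypothetical, and mean Hamming distance is just one of many possible dissimilarity measures.

```python
def hamming(a, b):
    """Number of positions at which two equal-length symbol strings differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical U/D/0 characterizations for four stocks:
sequences = {
    "AAA": "UUUDUU00UU",
    "BBB": "UUUDU0U0UU",
    "CCC": "UU0DUUU0UU",
    "DDD": "DD00UDD00D",  # behaves unlike the other three
}

def most_anomalous(seqs):
    """Return the name whose sequence is least similar to all the others."""
    def mean_dist(name):
        others = [s for n, s in seqs.items() if n != name]
        return sum(hamming(seqs[name], s) for s in others) / len(others)
    return max(seqs, key=mean_dist)

print(most_anomalous(sequences))  # DDD
```

The three upward-trending stocks sit close together in Hamming distance, while the downward-trending outlier stands far from all of them – the kind of unknown unknown that surprise discovery is meant to surface.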

This blog post was published June 12, 2014.
