Joining Streams and NoSQL Tables for Customer 360 Analytics in Spark


I’d like to share one of the demos I’m preparing to show at the Strata Data Conference in New York during the week of September 10th. My objective with this demo is to illustrate how certain database capabilities, such as secondary indexes and integration with Apache Spark, can improve the effectiveness of Customer 360 solutions. I really like the Customer 360 use case because it speaks to many of MapR’s strengths, even beyond the database, and it can be communicated in a GUI that is both visually engaging, as shown in the following screenshot, and technically deep.


The three points I’m trying to convey with this demo are:

  1. Flexible schemas make it possible for Customer 360 applications to save attributes that exist for some customers but not for others.
  2. The MapR Database connector for Apache Spark makes it possible to update large customer relationship management (CRM) databases quickly and without data movement.
  3. Analytical insights are easier to operationalize when production applications share the same data platform as analytical tools.

How are flexible schemas beneficial for Customer 360?

MapR Database is a scalable and resilient NoSQL database that allows different attributes for different customers to be saved in the same table. This enables you to save customer insights derived by joining datasets, regardless of whether those insights relate to all or a portion of your customer base.

This is useful because, to maintain a comprehensive view of your customers’ preferences, you might need to capture the data they expose through activities in different places, such as on social media or in your organization’s mobile app. Not all customers use social media or your mobile app, however. This leads to sparsity in columnar data tables, which can reduce the performance and usability of relational databases. A NoSQL database like MapR Database can store data for all customers in one table even when different columns are populated for each customer.
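To make that concrete, here is a minimal sketch of how two customers with entirely different attribute sets can be saved into the same MapR Database table through the Spark connector. The table path `/tables/crm_data` and all field names are illustrative placeholders, not part of the actual demo:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.mapr.db.spark._

val sc = new SparkContext(new SparkConf().setAppName("FlexibleSchemaSketch"))

// One customer exposes social-media attributes, the other mobile-app
// attributes. With a flexible schema, both land in the same table.
val docs = Seq(
  """{"_id": "c001", "name": "Ana", "twitter_handle": "@ana", "followers": 5200}""",
  """{"_id": "c002", "name": "Raj", "app_version": "3.1", "push_opt_in": true}"""
).map(MapRDBSpark.newDocument)

// No schema migration needed before inserting either shape of document.
sc.parallelize(docs).saveToMapRDB("/tables/crm_data", createTable = false)
```

In a relational table the unused columns would be NULL for every row; here each document simply carries only the fields it has.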

How do database connectors for Spark improve the agility of CRM analytics?

Apache Spark is a leading technology for processing large datasets. However, the speed with which you can apply analytical insights to Big Data is greatly compromised when those datasets must be moved into and out of Spark execution engines. The MapR Database connector for Spark solves this problem by enabling Spark to save and update data in MapR Database without data movement.

What’s the fastest way to operationalize insights from AI and advanced analytics?

When organizations attempt to build Customer 360 solutions, they often fail to operationalize analytical insights by saving them back into CRM tables. So, the third major talking point in my demo relates to how we can take an output like churn prediction from machine learning (ML) and load it back into master CRM tables, so those insights become instantly accessible to production applications. When ML processes and production applications share a common data platform like MapR, it is much easier to operationalize ML insights. This is part of what many people at MapR refer to as "the power of convergence."

Clickstream analytics in Zeppelin

I wrote a Zeppelin notebook (viewable here) that illustrates how to use Spark on MapR for clickstream analysis. This notebook walks through the following steps to demonstrate how Spark SQL, Spark Streaming, and Spark ML can be used with MapR Database, MapR Event Store, and the Distributed File and Object Store:

  1. Consume clickstream data from MapR Event Store with the Kafka API
  2. Analyze web traffic using Spark SQL
  3. Join clickstream and CRM data with the MapR Database Spark connector
  4. Predict churn based on the joined clickstream and CRM data and save churn predictions back into MapR Database tables so those insights can be acted upon immediately by production applications.

The code excerpts below show how those four tasks were implemented.

To download the code and our Customer 360 demo, see the GitHub repository at

Loading an RDD from MapR Database
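A minimal sketch of this step, assuming the demo's CRM data lives in a table at `/tables/crm_data` (the path, filter, and field names are illustrative). The connector's `loadFromMapRDB` API pushes the `where` filter and `select` projection down to the database, so only matching rows and columns reach Spark:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.mapr.db.spark._

val sc = new SparkContext(new SparkConf().setAppName("LoadCrmSketch"))

// Load the CRM table as an RDD of OJAI documents, filtering and
// projecting inside the database rather than in Spark.
val crmRdd = sc.loadFromMapRDB("/tables/crm_data")
  .where(field("churn_risk") >= 0.5)
  .select("_id", "name", "churn_risk")

crmRdd.take(5).foreach(println)
```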

Loading an RDD from a Topic in MapR Event Store (Formerly MapR Streams) Using the Kafka API
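A sketch of the streaming consumer, assuming MapR's Spark Streaming Kafka-0.9 integration package and a stream at `/streams/clickstream` with a `weblog` topic (both placeholders). MapR Event Store topics are addressed as `<stream path>:<topic name>`, and no broker addresses are needed because the stream path identifies the cluster resource:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka09.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val ssc = new StreamingContext(new SparkConf().setAppName("ClickstreamSketch"), Seconds(2))

val kafkaParams = Map[String, Object](
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "clickstream_reader",
  "auto.offset.reset"  -> "earliest"
)

// Topic addressed as <stream path>:<topic name>
val topics = Set("/streams/clickstream:weblog")

val messages = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)

// Each record's value holds one raw clickstream event (e.g., a JSON string).
val clicks = messages.map(_.value)
clicks.print()

ssc.start()
ssc.awaitTermination()
```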

Joining RDDs Loaded from MapR Event Store and MapR Database
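A sketch of the join, again assuming the `/tables/crm_data` table. Each OJAI document is keyed by its row `_id` (the customer id) so it can be joined against clickstream pairs; a static sample RDD stands in here for the pairs that would come from the stream in the notebook:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.mapr.db.spark._

val sc = new SparkContext(new SparkConf().setAppName("JoinSketch"))

// CRM side: key each OJAI document by its row _id (the customer id).
val crmById = sc.loadFromMapRDB("/tables/crm_data")
  .map(doc => (doc.getIdString, doc))

// Clickstream side: (customerId, pageUrl) pairs. In the notebook these
// come from the DStream; a parallelized sample stands in for them here.
val clicks = sc.parallelize(Seq(("c001", "/pricing"), ("c002", "/docs")))

// Standard RDD join: one record per click, with the CRM profile attached.
val joined = clicks.join(crmById)
joined.take(5).foreach { case (id, (page, profile)) =>
  println(s"$id visited $page; name=${profile.getString("name")}")
}
```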

Bulk Saving an RDD to MapR Database
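A sketch of writing model output back to the master table, with hypothetical churn scores standing in for the ML step's output. Keying each document by the CRM row's `_id` makes the save an in-place update of existing customer rows, and `bulkInsert` batches the writes; the table path and field name are again placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.mapr.db.spark._

val sc = new SparkContext(new SparkConf().setAppName("SaveChurnSketch"))

// Hypothetical (customerId, churnProbability) pairs from the ML step.
val churn = sc.parallelize(Seq(("c001", 0.82), ("c002", 0.11)))

// Wrap each score in a document keyed by the CRM row's _id so the save
// updates existing customer rows rather than creating new ones.
val docs = churn.map { case (id, p) =>
  MapRDBSpark.newDocument(s"""{"_id": "$id", "churn_risk": $p}""")
}

// Bulk save straight into the table -- no export/import hop, so
// production applications see the new churn_risk field immediately.
docs.saveToMapRDB("/tables/crm_data", createTable = false, bulkInsert = true)
```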


This blog post was published July 19, 2018.
