Potent Trio: Big Data, Hadoop, and Finance Analytics

Big data is a universal phenomenon. Every business sector and aspect of society is being touched by the expanding flood of information from sensors, social networks, and streaming data sources. The financial sector is riding this wave as well. We examine here some of the features and benefits of Hadoop (and its family of tools and services) that enable large-scale data processing in finance (and consequently in nearly every other sector).

Three of the greatest benefits of big data are discovery, improved decision support, and greater return on innovation. In the world of finance, these also represent critical business functions:

  • Discovery of fraud, customer insights, and new revenue sources

  • Better decisions regarding risk and investments

  • Reaping significant ROI on “low-hanging fruit” through innovative analytics on big data sources that are plentiful in the domain of finance, which is naturally a world of numbers!

When confronted with the inevitable avalanche of financial data from many business and customer channels, the modern data-driven firm can find help in the supporting technologies that comprise the Hadoop ecosystem. Hadoop provides much-needed functionality in several areas for the business data analyst. These functions include big data storage, access, warehousing, query, and processing (mining and analytics).

The Hadoop Family

The Hadoop Distributed File System (HDFS, for storage), HBase (for fast, random read/write access), Hive (for data warehousing and SQL-like queries), and Pig (for scripting data-processing workflows) have been around for a while. In addition to these, there are now some newer tools and techniques in the Hadoop toolkit.

One of the most recent additions to the Hadoop family is Spark, a fast, general-purpose engine for large-scale data processing. Spark speeds up processing by keeping data in memory, enabling complex, interactive calculations to run in parallel on big data. Spark also provides capabilities for interactive querying, machine learning, graph processing, and stream processing. As financial data streams grow not only in size but also in real-time response requirements, the opportunities to use Spark will only increase in the months and years ahead.
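To make the in-memory, chained style of computation concrete, here is a minimal plain-Python sketch of a filter-map-aggregate pipeline over made-up tick data; in Spark, the same chain of transformations would run in parallel across cluster memory rather than in a single process.

```python
from collections import defaultdict

# Hypothetical tick data: (symbol, price, volume)
trades = [
    ("AAPL", 150.0, 100),
    ("AAPL", 151.0, 300),
    ("MSFT", 300.0, 200),
    ("MSFT", 302.0, 200),
]

# Filter -> map -> aggregate: the chained transformation style a Spark job uses.
large = filter(lambda t: t[2] >= 200, trades)               # keep larger trades
keyed = map(lambda t: (t[0], (t[1] * t[2], t[2])), large)   # (symbol, (notional, volume))

totals = defaultdict(lambda: [0.0, 0])
for sym, (notional, vol) in keyed:
    totals[sym][0] += notional
    totals[sym][1] += vol

# Volume-weighted average price per symbol
vwap = {sym: n / v for sym, (n, v) in totals.items()}
print(vwap)
```

The point is not the arithmetic but the shape: each stage consumes the previous stage's output without touching disk, which is exactly what Spark scales out across a cluster.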

Another powerful member of the Hadoop stack is Drill. Drill lets financial data analysts do what they love most: interactive, self-service, ad hoc analysis! These analyses can now be performed at large scale, across billions of records. Drill's SQL interface provides a familiarity that we can all appreciate. But it doesn't stop there. When we mention "SQL," we usually think of relational (schema-based) databases, yet Drill can query schema-less datasets as well. This is referred to as NoSQL.

A common misconception is that NoSQL means "No SQL." That is not accurate. It is actually an abbreviation for "Not Only SQL," which perfectly expresses Drill's versatility. A common schema-less data format is JSON (JavaScript Object Notation), but any data object consisting of key-value pairs can be processed by Hadoop or queried by Drill. A simple key-value record may take the form (item_id, item_key, item_value). Here is an example: (blog001, "author", "Kirk Borne"), (blog001, "topic", "big data"), and so on. It is flexible, extensible, and scalable.
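The triples above pivot naturally into schema-less JSON documents. A small sketch (the record values are the ones from the example, plus one invented blog002 record) shows why no schema migration is ever needed:

```python
import json
from collections import defaultdict

# Key-value triples in the (item_id, item_key, item_value) form from the text
triples = [
    ("blog001", "author", "Kirk Borne"),
    ("blog001", "topic", "big data"),
    ("blog002", "author", "Jane Doe"),   # hypothetical extra record
]

# Pivot the flat triples into schema-less JSON documents, one per item_id.
# New attribute keys can appear at any time -- nothing to migrate.
docs = defaultdict(dict)
for item_id, key, value in triples:
    docs[item_id][key] = value

print(json.dumps(docs["blog001"], sort_keys=True))
```

Each `item_id` becomes one JSON document, and two documents need not share the same set of keys.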

A flat file containing key-value pairs can be easily constructed, incrementally updated, quickly edited, and readily partitioned across the processing nodes of a Hadoop cluster. All of this can be done without the time sink of rebuilding database indices, modifying the schema, or re-normalizing database relations. NoSQL and schema-less data processing tools are a great gift to big data users.
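The partitioning step can be as simple as hashing each record's item_id to a node, so records for the same item always land together. A minimal sketch (node count and records are assumed for illustration):

```python
# Assign each key-value record to one of N processing nodes by hashing its
# item_id. Records for the same item always land on the same partition, and
# no index rebuild or schema change is needed when new records arrive.
NUM_PARTITIONS = 4

def partition_for(item_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # A stable hash (unlike built-in hash(), which is salted per process)
    return sum(item_id.encode()) % num_partitions

records = [
    ("blog001", "author", "Kirk Borne"),
    ("blog001", "topic", "big data"),
    ("acct042", "balance", "1250.00"),   # hypothetical financial record
]

partitions = {}
for rec in records:
    partitions.setdefault(partition_for(rec[0]), []).append(rec)
```

A real cluster uses a better hash function, but the contract is the same: the partition is a pure function of the key, so no central index needs updating.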

The third relatively recent addition to the Hadoop ecosystem is YARN (Yet Another Resource Negotiator, sometimes called "MapReduce 2.0"). YARN is a cluster resource manager that handles resource allocation, job scheduling, and job monitoring far more flexibly than the original MapReduce framework did. When an analyst has many jobs to execute, or many partitions of a large dataset to process in parallel, YARN makes the workload much more manageable.

Applying these Hadoop tools and technologies to financial data assets is powerful medicine for the big data headache of high-volume, high-speed, and high-complexity datasets. We consider next two financial use cases: recommendations and risk analysis.

Recommendation Engines

Customer analytics is one of the greatest areas where big data touches everyone. It includes personalization, target marketing, and the customization of end-user products, offers, campaigns, and experiences. Identifying and responding to the specific needs and preferences of individual customers is now required of a modern digital business. The customer expects it!

Recommendation engines are one tool that data scientists have developed for customer analytics. The recommender offers products and services to a customer based on one or more of three factors: content, collaborative filtering, and context.

  • A content-based recommendation presents an offer (or campaign) whose content is similar to offers, products, or purchases the customer has responded to favorably in the past.

  • A collaborative filter-based recommendation presents an offer that similar customers have liked in the past.

  • A context-based recommendation presents an offer based on the customer’s context (such as time of day; geolocation; at work vs. at home; mobile vs. desktop; or responding to an email campaign vs. arriving at your business’s login page through organic search).
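A toy collaborative filter illustrates the second bullet: recommend to a customer the product that the most similar other customer liked but the target has not yet tried. The customers, products, and similarity measure (Jaccard overlap) are all assumptions for the sketch.

```python
from typing import Optional

# Customer -> set of financial products responded to favorably (made-up data)
likes = {
    "alice": {"savings_plus", "travel_card", "robo_advisor"},
    "bob":   {"savings_plus", "travel_card", "mortgage_offer"},
    "carol": {"crypto_fund"},
}

def jaccard(a: set, b: set) -> float:
    """Similarity of two customers by overlap of liked products."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(target: str) -> Optional[str]:
    # Rank the other customers by similarity to the target...
    neighbors = sorted(
        (c for c in likes if c != target),
        key=lambda c: jaccard(likes[target], likes[c]),
        reverse=True,
    )
    # ...then suggest something the nearest neighbor liked that is new to the target.
    for neighbor in neighbors:
        new_items = likes[neighbor] - likes[target]
        if new_items:
            return sorted(new_items)[0]  # deterministic pick
    return None

print(recommend("alice"))
```

Production recommenders use matrix factorization or learned embeddings over millions of customers, but this neighbor-based logic is the core idea being scaled.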

In financial services, these recommendation tools are likely to generate stronger customer loyalty, greater revenue, and improved services through digital marketing automation.

Risk Analytics

Risk appears in many dimensions: fraud, money laundering, market fluctuations and uncertainty, external (global) factors, internal factors, data theft, and more. Building predictive and prescriptive analytics models of these behaviors and outcomes can help guide the financial analyst toward solutions and away from danger zones.

Prescriptive analytics is one place where risk mitigation and customer recommendations come together. In the more common predictive analytics use cases, the analyst attempts to predict the behavior or outcome of a given situation based on prior history. In prescriptive analytics, the analyst tries to identify which actions will yield the best or most preferred outcome. Desired outcomes certainly include reduced risk, greater margins, and increased customer satisfaction.
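The predictive-to-prescriptive step can be sketched as expected-cost minimization: a model predicts the loss probability under each action, and the prescription is the action with the lowest expected cost. The actions, probabilities, and costs below are entirely hypothetical.

```python
# Hypothetical actions for a flagged transaction. Each carries an assumed
# fraud-loss probability (the predictive part) and the cost of the action
# itself (customer friction, review labor, etc.).
actions = {
    "approve":       {"p_loss": 0.08, "loss": 500.0, "action_cost": 0.0},
    "step_up_auth":  {"p_loss": 0.02, "loss": 500.0, "action_cost": 5.0},
    "manual_review": {"p_loss": 0.01, "loss": 500.0, "action_cost": 25.0},
}

def expected_cost(a: dict) -> float:
    # Expected fraud loss plus the fixed cost of taking the action
    return a["p_loss"] * a["loss"] + a["action_cost"]

# The prescriptive part: choose the action minimizing expected cost
best = min(actions, key=lambda name: expected_cost(actions[name]))
print(best, expected_cost(actions[best]))
```

Here the cheapest-looking action (approve) is not the best one once expected fraud loss is priced in, which is exactly the kind of trade-off prescriptive models surface.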

Hadoop for Financial Analytics

Finally, we note that many of the analytics functions we perform on data can be executed in parallel on small, compact partitions of a big data collection. The partitions may be apportioned by geographic region, internal division, customer category, type of financial instrument, or risk category, or segmented semantically in some other manner. In all of these cases, Hadoop clusters stand ready to crunch massive quantities of financial and customer numbers in fast, parallel processes, yielding informative answers to analysts anywhere. Scalability (to large data volumes), extensibility (to large numbers of processors), and versatility (to perform a variety of different inquiries into the data) are all strong reasons to tap into the big benefits of big data and Hadoop-based analytics in financial services. The MapR family of products and services delivers top tools to this potent trio: Big Data, Hadoop, and Financial Services.
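The per-partition pattern is easy to see in miniature: each partition is summarized by its own worker, and the results are gathered at the end. The regions and transaction amounts below are invented; a Hadoop cluster applies the same pattern with one partition per node.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical transactions already partitioned by region, each partition
# small enough to crunch independently
region_partitions = {
    "EMEA": [120.0, 80.0, 200.0],
    "APAC": [50.0, 150.0],
    "AMER": [300.0, 100.0, 100.0],
}

def summarize(item):
    """Aggregate one partition in isolation -- no shared state needed."""
    region, amounts = item
    return region, {"total": sum(amounts), "mean": sum(amounts) / len(amounts)}

# Each partition goes to its own worker; results are collected into one dict.
with ThreadPoolExecutor(max_workers=3) as pool:
    summary = dict(pool.map(summarize, region_partitions.items()))

print(summary["EMEA"]["total"])
```

Because `summarize` touches only its own partition, the workers never contend with each other, which is precisely what lets the pattern scale from three threads to thousands of cluster nodes.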

This blog post was published November 10, 2014.
