Game-Changing Real-time Use Cases with Apache Spark on Hadoop


Apache Spark on Hadoop is great for processing large amounts of data quickly. The story gets even better when you get into the realm of real-time applications. If your business depends on making decisions quickly, you should definitely consider the MapR distribution including Apache Hadoop, which packages the complete Spark stack: Spark Streaming, MLlib, GraphX, and Spark SQL.

Here are some amazing, game-changing uses for real-time big data processing with Spark on Hadoop.

Credit Card Fraud Detection

Your credit card is swiped, the receipt is signed, something is bought. Only it wasn’t you who made the transaction. Perhaps your wallet was stolen. Perhaps some hackers stole your information from another site. Maybe a credit card skimmer at the local gas station got your number. However it happens, it’s credit card fraud, and credit card companies want to know when and where it’s happening so they can stop it.

Banks and credit card companies are on the hook to resolve fraudulent charges quickly, so they want to detect and block as many of them as they can, as early as possible. They already have sophisticated mathematical models to detect possible bogus transactions, but these models are applied in a batch environment at a later stage. How does one deploy them in real time?

Apache Spark Streaming, running on Hadoop, makes it possible for banks to process transactions and detect fraud in real time against previously identified fraud footprints. Within Spark, incoming transaction feeds are checked against a database of known footprints. If there is a match, a real-time alert can be sent to call center personnel, who can then validate the transaction instantly with the credit card owner. If there is no match, the data is stored on Hadoop, where it can be used to continuously update the models in the background through deeper machine learning.
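To make the match-or-store flow concrete, here is a minimal sketch in plain Python. It is not Spark code and not any bank's actual system: the `Transaction` fields, the footprint set, and the `route_batch` function are all hypothetical names invented for illustration. In a streaming job, logic like this would run on each micro-batch of incoming transactions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    # Hypothetical fields; a real feed would carry many more attributes.
    card_id: str
    merchant_id: str
    amount: float


# Assumption: a "fraud footprint" is modeled here as a (card, merchant) pair
# that was flagged earlier. Real footprints are far richer than this.
KNOWN_FRAUD_FOOTPRINTS = {
    ("card-123", "merchant-9"),
    ("card-777", "merchant-2"),
}


def route_batch(batch, footprints):
    """Split one micro-batch into alerts (footprint match) and records to archive."""
    alerts, archive = [], []
    for txn in batch:
        if (txn.card_id, txn.merchant_id) in footprints:
            alerts.append(txn)   # would trigger a call-center alert
        else:
            archive.append(txn)  # would be stored on Hadoop for model updates
    return alerts, archive


batch = [
    Transaction("card-123", "merchant-9", 250.0),
    Transaction("card-555", "merchant-1", 12.5),
]
alerts, archive = route_batch(batch, KNOWN_FRAUD_FOOTPRINTS)
print("alerts:", alerts)
print("archive:", archive)
```

The set lookup keeps the per-transaction check cheap, which matters when the check has to keep up with a live feed. The archived records are the ones that would feed the background model updates.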

Network Security

Network security is at the top of nearly every business’s agenda, especially after all the high-profile data breaches of the last few years. For instance, hackers can commandeer thousands or even millions of computers over the Internet into “botnets” to launch Distributed Denial of Service (DDoS) attacks, steal credit card information, and otherwise wreak havoc with information.

Wouldn’t it be nice if there were a way to detect security problems in real-time? That’s exactly what a global managed security services provider is doing with its security service that runs on Hadoop.

The provider uses different components of the Spark stack to examine packets for traces of malicious activity in real time. At the front end, it uses Spark Streaming to check packets against known threats before passing them on to the storage platform, where the data is processed further using other packages such as GraphX and MLlib.
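The front-end check described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the provider's implementation: it assumes a "known threat" is simply a source address inside a blocklisted network, and it assumes every packet is stored for later analysis whether or not it was flagged. The names `BLOCKED_NETWORKS`, `is_known_threat`, and `screen_packets` are hypothetical, and the addresses come from ranges reserved for documentation.

```python
import ipaddress

# Hypothetical blocklist; 203.0.113.0/24 is a range reserved for documentation.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_known_threat(src_ip, blocked_networks):
    """Return True if the source address falls inside any blocklisted network."""
    addr = ipaddress.ip_address(src_ip)
    return any(addr in net for net in blocked_networks)


def screen_packets(packets, blocked_networks):
    """Flag packets from known-bad sources; keep all packets for later analysis."""
    flagged = [p for p in packets if is_known_threat(p["src"], blocked_networks)]
    stored = list(packets)  # assumption: everything goes on to the storage platform
    return flagged, stored


packets = [
    {"src": "203.0.113.5", "dst_port": 22},
    {"src": "198.51.100.7", "dst_port": 443},
]
flagged, stored = screen_packets(packets, BLOCKED_NETWORKS)
print("flagged:", flagged)
print("stored count:", len(stored))
```

Storing the unflagged traffic as well is what would let slower graph and machine learning jobs find threats the blocklist does not yet know about.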

Hackers move quickly, always trying to stay one step ahead of IT departments. With machine learning and stream-processing solutions, systems can keep learning about new threats as they evolve, protecting clients in real time.

Genomic Sequencing

The 20th century saw a staggering reduction in the number of people dying of various diseases. It was also the century that medicine discovered DNA. In the 21st century, genomic engineering looks to offer a new renaissance of medicine.

The only problem is that sequencing genomes requires vast amounts of computing power. For instance, the latest “next-gen” DNA sequencer, the Illumina XTen, produces 6 terabytes of data per day. For medical-grade data, scientists need to sequence 300 billion base pairs. That really puts the “big” in “Big Data.”

Not only is there a lot of data, it also takes a long time to process, even on the biggest and fastest HPC clusters available. Next-generation genomics companies are using the power of distributed storage and compute through Spark on Hadoop to drastically reduce the time needed to process genome data. For instance, it used to take them several weeks to align chemical compounds with genes. Now it takes geneticists only a few hours. Although this is not nearly as real-time as the other use cases, the dramatic reduction in processing time is still a major benefit for researchers.

Real-Time Ad Processing

In the series Mad Men, Harry Crane, the bespectacled, neglected head of the Media Department at Sterling Cooper & Partners, constantly complains that the firm lacks a computer of its own. The company eventually gets an IBM mainframe, but Crane would be even more jealous of the real-time processing capabilities that advertisers have today, which are far beyond what anyone in the 1960s could have imagined.

One advertising firm uses Spark, based on MapR Database, to build a real-time ad-targeting platform. The system matches user behavior with historical patterns and decides which ads to show users on the internet. Since advertising is so time-sensitive, advertisers have to move fast if they want to capture mindshare. Spark on Hadoop is one way to help them achieve that.
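As a rough illustration of what matching behavior against historical patterns could look like, the plain-Python sketch below picks the ad whose historical interest profile overlaps most with a user's recent activity. The article does not say which matching technique the firm uses. Jaccard similarity is swapped in here purely as a simple stand-in, and the ad names, interest labels, and `choose_ad` function are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical historical patterns: interests that past responders to each ad shared.
HISTORICAL_PATTERNS = {
    "outdoor-gear-ad": {"hiking", "camping", "travel"},
    "kitchen-ad": {"recipes", "cookware", "groceries"},
}


def choose_ad(recent_interests, patterns):
    """Return the ad whose historical pattern best overlaps the user's recent interests."""
    return max(patterns, key=lambda ad: jaccard(recent_interests, patterns[ad]))


user_interests = {"camping", "travel", "news"}
print(choose_ad(user_interests, HISTORICAL_PATTERNS))
```

A production system would score far more signals and do so within a tight latency budget, but the shape is the same: compare live behavior against stored profiles and pick the best match.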


While genomics offers one way to revolutionize the healthcare industry, healthcare providers are always looking to make healthcare more efficient. One way is to try to prevent hospital readmissions. One provider uses Spark to examine patient records combined with historical clinical information to find out in real time (at discharge) who is most likely to have complications after being released from the hospital. The provider can then deploy home healthcare services to prevent readmission, saving on costs for both patients and hospitals.
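The scoring step at discharge might look something like the plain-Python sketch below. It uses a logistic score with made-up feature names, weights, and threshold. None of these come from the provider or from clinical practice; they are hypothetical values chosen only to show the mechanics.

```python
import math

# Hypothetical, illustrative weights. NOT clinically derived.
WEIGHTS = {
    "prior_admissions": 0.6,
    "age_over_65": 0.8,
    "chronic_conditions": 0.5,
}
BIAS = -3.0


def readmission_risk(patient):
    """Logistic score in (0, 1); missing features are treated as 0."""
    z = BIAS + sum(weight * patient.get(name, 0) for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def needs_home_care(patient, threshold=0.5):
    """Flag a patient for home healthcare follow-up if the risk meets the threshold."""
    return readmission_risk(patient) >= threshold


high_risk = {"prior_admissions": 3, "age_over_65": 1, "chronic_conditions": 2}
low_risk = {"prior_admissions": 0, "age_over_65": 0, "chronic_conditions": 0}
print(needs_home_care(high_risk), needs_home_care(low_risk))
```

In practice the weights would be learned from historical clinical records, and the threshold would be tuned to balance the cost of home visits against the cost of readmissions.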


As you can see, there is a diverse range of real-time uses for the Spark stack with the MapR distribution for Hadoop. Feel free to share your real-time use case with Spark.

This blog post was published August 11, 2015.
