Apache Spark Use Case for Better Drug Discovery – Whiteboard Walkthrough


In this week's Whiteboard Walkthrough, Steve Wooledge, VP of Industry Solutions at MapR, talks about an Apache Spark + Hadoop use case for drug discovery that one of our customers is currently running in production.

Here's the unedited transcription:

Hi, my name is Steve Wooledge. I'm VP of Industry Solutions at MapR, and I want to talk about how a pharmaceutical company is using Hadoop to improve drug discovery through a couple of different use cases on the platform. The first use case involves next-generation sequencing (NGS) data. We've been collecting information about the human genome for a long time, and historically people have used high-performance computing applications to bring in this genome data and build models on top of it, using tools that let them process that information and look for trends. As people collect more and more data about individuals and their genomes, there's simply a lot more of it, so they're turning to Hadoop, which is a natural tool for collecting all of that information.

In the case of this customer, Novartis, they were able to bring this NGS data into a Hadoop platform; in this case, they chose MapR. One of the reasons is that MapR provides a POSIX NFS interface, which allowed them to take their existing high-performance computing application models and put them directly on top of the cluster, with a read/write, real-time file system underneath serving those models. Those models are then used by the bioinformaticians for things like variant analysis, proteomics (the study of proteins), and metagenomics. Here they were replacing an existing workflow in which it could take a lot of time to parallelize these models and write the complicated bookkeeping logic that parallelization requires. Instead, they brought the models onto a parallel platform like Hadoop and leveraged the POSIX NFS interface provided by MapR to run those models easily. That was phase one of what they were doing.
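The appeal of the POSIX NFS interface is that existing tools keep using ordinary file I/O rather than being rewritten against an HDFS-specific API. Here's a minimal sketch of that idea in plain Python; the cluster mount path is hypothetical, and a temporary directory stands in for it so the sketch runs anywhere:

```python
import pathlib
import tempfile

# In production this would be an NFS mount of the MapR cluster,
# e.g. /mapr/my.cluster.com/projects/ngs (hypothetical path).
# A temp dir stands in here so the sketch is self-contained.
cluster_root = pathlib.Path(tempfile.mkdtemp())

# An existing HPC tool can write results with normal POSIX file
# operations; no Hadoop-specific client library is needed.
sample = cluster_root / "sample_001.vcf"
sample.write_text("chr1\t12345\tA\tG\n")

# ...and any other process on the cluster reads them back the same way.
records = sample.read_text().splitlines()
print(len(records))  # 1
```

The point is not the file contents but the access pattern: read/write through the file system, so legacy models run on the cluster unmodified.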

The next phase addresses the large amount of heterogeneous data available through public data sets, which researchers can use to learn from previous genomics experiments: things like the 1000 Genomes Project, the National Institutes of Health's Genotype-Tissue Expression (GTEx) database, and The Cancer Genome Atlas. The goal was to bring all of this information together so that different types of researchers could access it and blend it with other genomic data they have. With the MapR distribution of Hadoop, they were able to bring all that information into one place, join it with their other genome data, store it in one location, and leverage their existing high-performance computing models. But in this case, they also wanted to create a very large graph through custom Spark logic, using the Spark framework. Spark gives them access not only to the RDD framework and all of the in-memory computing that can happen there, but also to a library of different components: Spark SQL for SQL access, a machine learning library, and GraphX for graph processing.
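The shape of that work, joining heterogeneous sources on a shared key and emitting graph edges, can be sketched in plain Python. This is a deliberately tiny stand-in: the real system would use Spark RDDs and GraphX at cluster scale, and every field name and value below is invented for illustration:

```python
# Toy records standing in for heterogeneous public datasets
# (all field names and values are invented for illustration).
genotype_records = [
    {"gene": "BRCA1", "variant": "rs799917"},
    {"gene": "TP53",  "variant": "rs1042522"},
]
expression_records = [
    {"gene": "BRCA1", "tissue": "breast"},
    {"gene": "TP53",  "tissue": "lung"},
    {"gene": "TP53",  "tissue": "colon"},
]

# Index one source by the shared join key.
by_gene = {}
for rec in genotype_records:
    by_gene.setdefault(rec["gene"], []).append(rec["variant"])

# Join on "gene" and emit (variant, tissue) edges -- the same shape
# of work GraphX performs at scale across trillions of edges.
edges = [
    (variant, rec["tissue"])
    for rec in expression_records
    for variant in by_gene.get(rec["gene"], [])
]
print(edges)
# [('rs799917', 'breast'), ('rs1042522', 'lung'), ('rs1042522', 'colon')]
```

In Spark, the index-then-join step would be a keyed join across distributed datasets, and the edge list would become the edge RDD of a GraphX property graph.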

What they're able to do is bring all this heterogeneous data in and create a very large graph, with trillions of edges, that serves that information to the bioinformaticians both through the Spark API and by exporting it into thousands of different endpoint databases, each of which might have a tailored schema that's perfect for the type of analysis a particular researcher wants to do. They can literally load billions of rows from that graph directly into those data sources. That feeds many of these same life-science researchers with information from these heterogeneous data sources, all from one data platform that doesn't require you to move data between multiple systems or store it permanently in a relational format. They have a lot more flexibility using Spark on top of the Hadoop distribution and leveraging the NFS or HDFS APIs that MapR exposes. You can run many different types of processing on a shared infrastructure to lower costs and get more value and performance from your analytics.
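The export side can be sketched the same way: select a narrow, researcher-specific slice of the graph and materialize it as rows for a tailored endpoint schema. Again this is plain Python with invented names; the real system bulk-loads billions of rows from Spark into the endpoint databases:

```python
# A few edges from the (much larger) graph: (variant, tissue) pairs.
# Values are invented for illustration.
edges = [
    ("rs799917", "breast"),
    ("rs1042522", "lung"),
    ("rs1042522", "colon"),
]

# A researcher studying lung tissue only needs a narrow schema:
# select just the rows and columns their analysis requires.
lung_table = [
    {"variant": variant}
    for variant, tissue in edges
    if tissue == "lung"
]
print(lung_table)  # [{'variant': 'rs1042522'}]
```

Each endpoint database gets its own such projection, which is why thousands of tailored schemas can be fed from the one shared graph.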

That's a summary of how Novartis is using Hadoop. If you want more information about Novartis, it's on our website. Thanks for tuning in.

This blog post was published September 10, 2015.
