Beyond Genome Sequencing - the Big Data Problem


Jaideep Joshi

Solutions Engineer, MapR

Pinakin Patel

Head of Solutions Engineering, MapR

Though next-generation sequencing (NGS) is reducing the cost of genome sequencing, downstream analysis of this data is still an enormous task.

How do you analyze genome data with other enterprise data to derive actionable insight? By using big data frameworks on commodity hardware with open source software and adopting data science tools like AI and machine learning.

We invite you to learn how MapR can help you solve the big data, big compute, big storage problems associated with NGS. In this webinar, we highlight a reference architecture that includes:

  • An open-source workflow definition language and execution engine for pipeline definition
  • Toolkits that exploit parallelism in the process
  • Advanced analytics including ML/AI


Jaideep: 00:01 Good afternoon, everyone. Thank you for attending this seminar today. My name is Jaideep Joshi, I'm a senior solutions engineer with MapR. And just from a housekeeping perspective, please if you have any questions type them in the chat space, we'll try to get as many as we can in the time that we have, if not we'll certainly follow up.

Jaideep: 00:26 Once again, today we are here to talk about genome sequencing, especially what it means in terms of the analytics space, not necessarily the sequencing space, and how technology such as MapR can enable high speed analytics at a much lower TCO.

Jaideep: 00:51 Just as a quick level set, a lot of people probably already know this, but just to get everybody on the same page, what is genome sequencing? Very simply, you take a sample from a specimen, there's some chemistry involved in prepping the sample, you process it through a sequencer, and you get these reads, or short reads, that make up the genome.

Jaideep: 01:19 Very simply put, this is a pretty inexpensive process nowadays. There is an institute that monitors the cost of genome sequencing activities, and as it stands we're at less than $1,000 per sequencing activity and falling. Clearly there is a technology push that has enabled the reduction in cost, and consequently more and more sequencing activities are being performed.

Jaideep: 01:55 Not just human genomes. Human genomes are clearly of immense interest, but there are other areas too, agriculture based or veterinary based, etc.

Jaideep: 02:09 There is also a huge amount of data being generated as more and more sequencing activities start to ramp up. There is an interesting paper that was published comparing genomic data with astronomy data and with internet data, and it's worth a read just to get a sense of the amount of data the world stands to produce in the near future as these sequencing activities become more prevalent.

Jaideep: 02:53 And as a data point, a standard, and by standard it would be debatable, 30X sequencing activity, that is, with 30 times the coverage, produces about 150 GB of data for a human genome. That's about 80 GB in FASTQ and 80 GB in BAM, if those formats are familiar to anyone, but that's typically what is produced by the sequencer before the analytics work.
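As a sanity check, that 30X figure can be reproduced with a back-of-envelope calculation. The genome length and bytes-per-base values below are rough assumptions for illustration, not numbers from the talk:

```python
# Back-of-envelope estimate of raw sequencing output for a 30X human genome.
# Assumptions: ~3.1 Gbp genome, and roughly 2 bytes per sequenced base in
# uncompressed FASTQ (1 byte for the base call, 1 for its quality score),
# ignoring read headers and separators.

GENOME_SIZE_BP = 3.1e9   # approximate haploid human genome length
COVERAGE = 30            # 30X depth, as in the example above
BYTES_PER_BASE = 2       # base call + quality score

sequenced_bases = GENOME_SIZE_BP * COVERAGE
raw_bytes = sequenced_bases * BYTES_PER_BASE
print(f"~{raw_bytes / 1e9:.0f} GB of uncompressed FASTQ")  # ~186 GB
```

Compression (gzip for FASTQ, and the compressed BAM encoding itself) is what brings this raw figure down toward the roughly 80 GB per format cited above.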

Jaideep: 03:29 If you also look at the different types and models of sequencers, here is one example from Illumina, which is a pretty reputable sequencer manufacturer, and this is what we hear from our customers when we interface with them as to the amount of data per run that these sequencers are producing. It's clear the amount of data is tremendous, based on the model as well as the coverage. And the downstream problem obviously then becomes: how do you analyze that data?

Jaideep: 04:07 That is where we truly believe that beyond sequencing, this is a typical big data problem, both from a storage perspective as well as from the analytics perspective. And we want to quickly show you how MapR, with its Converged Data Platform, helps solve some of these problems.

Jaideep: 04:27 But I want to take a step back and quickly illustrate the different processes that are involved when people talk about genome analysis. Very broadly speaking, there is a set of activities that constitute what would be called upstream analysis, and there is a set of steps called downstream analysis.

Jaideep: 04:54 Typically the upstream analysis involves tools and toolkits; in this case, for example, I'm using GATK as a reference. GATK is a pretty popular toolkit. And in the upstream analysis phase using GATK, we are talking about taking sequencing data, which is typically, as I mentioned, BCL, FASTQ, or BAM files, and then doing some kind of variant calling on it, whether germline, somatic, what have you, and creating that variant call file.
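For anyone who hasn't seen the FASTQ format mentioned here, each read is a simple 4-line record: a header, the base calls, a separator, and per-base quality scores. A toy parsing sketch (real pipelines read these files with dedicated tools, not hand-rolled parsers):

```python
# Minimal illustration of the FASTQ format: 4 lines per read.
from io import StringIO

def parse_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()   # base calls
        handle.readline()                 # '+' separator line
        qual = handle.readline().strip()  # one quality char per base
        yield header[1:], seq, qual       # drop the leading '@'

sample = "@read1\nGATTACA\n+\nIIIIIII\n"
for read_id, seq, qual in parse_fastq(StringIO(sample)):
    print(read_id, seq, len(qual))       # read1 GATTACA 7
```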

Jaideep: 05:28 A lot of storage, a lot of transformation, a lot of cleanup happens in this stage and you end up with that variant call file.

Jaideep: 05:37 This is a very typical, representative workflow; the many steps involved in such a pipeline are the work that is considered the analytics problem.

Jaideep: 05:51 Now there is still downstream analysis, which is where you take the VCF and possibly some other cohort VCFs and do some analysis with them. There are again some tools like Hail, etc., that are out there, but a lot of people also build their own tools. We shall leave that for another day.

Jaideep: 06:11 Today our focus is going to be the problem of doing that analysis to get into typically that variant calling end state.

Jaideep: 06:23 Just as a touch point, if people are not familiar with what GATK is, it's the Genome Analysis Toolkit released by the Broad Institute; as of 2018, or late 2017, it is now fully open sourced. It is a pretty well established toolkit with a lot of different tools, and it integrates easily with other toolkits like Picard and what have you.

Jaideep: 06:53 Because of its open source licensing, and the fact that the pipelines you can build with this toolkit are very simple, using an easy-to-implement and easy-to-understand workflow definition language, it's gaining more and more popularity. By no means is this the only toolkit; it is one. There is ADAM, which is another. And there are of course commercial ones.

Jaideep: 07:20 But just as an example, we're using GATK4 for this discussion. A typical GATK4 pipeline, if people are not familiar, looks something like this. There is a data pre-processing phase and then there's a variant discovery phase. And in each phase there is a set of steps. These steps are typically sequential, and you need to do some quality control as they are being performed so that you are confident, when you get to the end state, that what you have is pretty accurate.
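The two phases just described can be sketched as the sequence of commands a pipeline engine would hand to the cluster. The tool names below are real GATK4 tools covering the core of each phase, but this is a simplified subset of the 20-odd steps, all file names are hypothetical placeholders, and the commands are only constructed here, not executed:

```python
# Sketch of the sequential GATK4 phases: pre-processing, then variant
# discovery. Each entry is a command line a workflow engine would run.

def gatk4_pipeline(sample_bam, reference, known_sites, out_vcf):
    dedup_bam = sample_bam.replace(".bam", ".dedup.bam")
    recal_table = sample_bam + ".recal.table"
    recal_bam = sample_bam.replace(".bam", ".recal.bam")
    return [
        # --- data pre-processing phase ---
        ["gatk", "MarkDuplicates", "-I", sample_bam, "-O", dedup_bam,
         "-M", dedup_bam + ".metrics"],
        ["gatk", "BaseRecalibrator", "-I", dedup_bam, "-R", reference,
         "--known-sites", known_sites, "-O", recal_table],
        ["gatk", "ApplyBQSR", "-I", dedup_bam, "-R", reference,
         "--bqsr-recal-file", recal_table, "-O", recal_bam],
        # --- variant discovery phase ---
        ["gatk", "HaplotypeCaller", "-I", recal_bam, "-R", reference,
         "-O", out_vcf],
    ]

for cmd in gatk4_pipeline("NA12878.bam", "hg38.fa", "dbsnp.vcf", "out.vcf"):
    print(" ".join(cmd))
```

The strictly ordered list is the point: each step consumes the previous step's output, which is exactly why the end-to-end run is so long and why parallelizing within steps matters.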

Jaideep: 07:54 It's a pretty robust, yet sequential and compute intensive process that essentially has to happen each time you want a variant call file at the end.

Jaideep: 08:10 So what is the analysis bottleneck, and what is it that we are trying to solve in this talk today? As an example, using GATK it takes about 30 hours on a 24-core, CPU-based machine to run these 20 or 22 steps on a whole genome to produce what would be considered your variant call file.

Jaideep: 08:38 Now there have been a lot of advancements in certain areas, and two are of interest for this talk. By no means are these the only two things that affect the overall timeline, there could be others, but two are of interest here: Apache Spark and the advancement of the GPU/FPGA.

Jaideep: 09:04 If we were to just take a second and talk about Apache Spark: there are a lot of steps within the pre-processing phase, or in general within the GATK4 pipeline, which could greatly benefit from the parallelism and in-memory processing that Apache Spark has to offer. And by that we mean the benefit is in the reduction of time. Instead of sequential processing in some of the tools, if we can parallelize some of that processing, there is a considerable reduction in the time these tools take to finish their tasks.
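The scatter-gather idea behind that parallelism can be shown in miniature. The sketch below uses a local Python process pool purely to illustrate the pattern of splitting work by genomic interval and merging the results; the actual Spark-enabled GATK tools distribute this same pattern across a cluster, and `call_variants` here is just a hypothetical stand-in for per-interval work:

```python
# Scatter-gather in miniature: instead of walking the genome sequentially,
# split the work by interval (here, by chromosome), process the pieces
# concurrently, then merge the per-interval results in order.

from concurrent.futures import ProcessPoolExecutor

def call_variants(interval):
    # Hypothetical stand-in for per-interval work such as duplicate
    # marking or variant calling on one chromosome.
    return interval, f"variants.{interval}.vcf"

intervals = [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY"]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        shards = dict(pool.map(call_variants, intervals))  # scatter
    merged = [shards[i] for i in intervals]                # gather, in order
    print(len(merged), "shards merged")                    # 24 shards merged
```

The same trade-off the talk describes applies here: only steps whose work is independent per interval can be scattered this way, which is why Spark helps some pipeline stages and not others.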

Jaideep: 09:50 A lot of the tools, let's say within the GATK toolkit or within competing toolkits, are becoming what I would consider Spark-aware, or "Sparkified", that is, being modified to exploit Spark's capabilities.

Jaideep: 10:11 That is one data point.

Jaideep: 10:13 Another is hardware acceleration. There is a lot of work being done to apply GPU based or FPGA based acceleration to some of these tasks in a given pipeline, again to speed up the outcome of a particular tool. There are open source options available, and there are also commercial and proprietary options that enable you to use this kind of hardware acceleration where appropriate.

Jaideep: 10:49 By no means will Spark solve all problems, and by no means can FPGA based or GPU based acceleration solve all problems, but where applicable the two show tremendous promise.

Jaideep: 11:04 Of course, there is the cost factor when we talk about GPU/FPGA because of the hardware that's involved.

Jaideep: 11:14 It can be easily summarized that the optimization in this analysis phase definitely points to the benefits that parallelism as well as hardware acceleration can bring to the table. But it is also important to take one step back: while that is great for the analysis phase, what is the big picture? Why does all this matter? At the end of the day, the argument could be made that all this is being done with the goal of providing some kind of predictive insight in the life science and healthcare realm, or within some concept of a personalized medicine structure.

Jaideep: 12:08 And we believe, and this is based on the industry, that this kind of predictive insight and these kinds of personalized medicine activities clearly depend on more than just genomic data. They really depend on all the pertinent data. And when we say all the pertinent data, what could that be?

Jaideep: 12:31 It is probably EMR/EHR data, it is probably external cohort genome data, it is imaging data, all of which can probably be correlated to produce actionable insight in the form of predictive insight or personalized medicine recommendations.

Jaideep: 12:52 And truly you want, or need, a platform that can enable not just the analysis phase as it applies to the sequencing activities, but also the correlation activities as they pertain to other data.

Jaideep: 13:10 So what does a MapR solution look like? To start off, we want to store sequencing data. In the intermediate phase, we want to do quick analysis, do what is termed variant discovery. We want to move on to some kind of correlation activities, which are typically the downstream analytics and the ML/AI phase. We want to potentially develop some applications that can then deliver that insight to a user or to a provider, and then what have you in terms of tying all these parts and pieces together.

Jaideep: 13:53 It starts off with a storage service on an infrastructure that can store the data. This data will essentially come from your sequencer, which will use standard protocols to write data to a storage service provided by what we call the MapR Converged Data Platform. The storage services are provided in heterogeneous formats and can be accessed using a variety of APIs; we support a variety of computational methods. We also support a variety of programming languages.

Jaideep: 14:34 We provide the enterprise features that you'd expect when you are storing and transforming data, as it pertains to security, auditing, HA/DR, etc.

Jaideep: 14:52 There is the concept of another set of compute nodes, like you would have in a cloud environment, which are used for processing and analyzing the data in question. There is a variety of tools, toolkits, and things like Docker that are essentially used to do the analysis of this data and produce the insight.

Jaideep: 15:21 In a typical workflow, a pipeline that has been created by a scientist gets sent to a cluster of what you would call worker nodes, and these nodes will then run through, let's say, the data pre-processing phase. The data is essentially transformed from one stage to another in that sequential manner, all the way into variant discovery, to produce that variant call file.

Jaideep: 15:51 In these cases, where applicable, Spark and Spark-on-YARN will be used to run the Spark jobs within the MapR infrastructure, the typical big data infrastructure, and wherever needed, GPU/FPGA resources will be deployed.

Jaideep: 16:12 There is the concept, like we mentioned, of correlating data. MapR, being a pretty robust big data platform, can accept data ingest from a variety of external sources; these would then essentially be correlated using ML/AI techniques in a typical data science like environment to get that insight, whether it's a patient 360 view or something similar.

Jaideep: 16:41 What is also critical is that a lot of this intermediary data that gets produced is not necessarily useful on a daily basis, but people don't want to go through the intermediary steps again, so that data needs to be archived and kept so it can be recalled in a pretty easy manner. That is something we do out of the box, using our tiering capabilities, and that data can be tiered off to object storage or a similar low-cost storage medium, what have you.

Jaideep: 17:12 We also support Kubernetes. For DevOps environments that are building new applications to deliver new or actionable insight to the business, also in the MapR space we have what we call a data fabric plug-in that provides persistent storage to all the Docker containers, etc., that run in your DevOps environment, making development life cycles much shorter and easier.

Jaideep: 17:41 And for the Spark-based downstream pathways that are already being developed, things like Hail, we also have the means to run those within the same cluster.

Jaideep: 17:53 What MapR offers is a pretty elaborate and comprehensive platform to not only take the data from a sequencer and analyze it in the immediate phase, but also help in the overall process of correlating data and delivering the insight we have to whoever needs it, whether it's the researcher, the payer, the provider, or the patient.

Jaideep: 18:24 Truly, from a value proposition perspective, we believe we break down all the silos associated with data storage; as the data is essentially morphed, transformed, and correlated, we now have a single repository for all analytic purposes.

Jaideep: 18:44 We also tie in and use the existing open source tools as well as Apache Spark, so the analysis can be done as and when needed by any researcher, without trapping you in some kind of proprietary mechanism to do so.

Jaideep: 19:06 If hardware acceleration is of interest, then we definitely support that through partnerships with [inaudible 00:19:14] Nvidia, or whatever hardware provider of your choice.

Jaideep: 19:19 And then, everything today runs better and is more manageable with tools like Docker and Kubernetes; we support that out of the box and have a pretty robust offering around Docker and Kubernetes as well.

Jaideep: 19:34 And then, ML/AI are becoming increasingly ubiquitous in all areas of analytics, but especially in these cases, when you have to correlate multiple types of data and data sets, and we support that out of the box with a robust offering in terms of capabilities, programming language support, notebooks, etc. The benefits obviously speak for themselves, but in terms of roles within the organization, scientists in particular want to deploy something that is easy, that is what you would consider low-touch IT, that can bring in open source innovation, and that gives the scientist or researcher access to all the data as opposed to just certain parts of it.

Jaideep: 20:25 From IT stakeholder perspective, it's important to have reproducible architectures with enterprise features. People are not trying to build something that is hard to support, hard to maintain.

Jaideep: 20:39 And clearly from a business perspective, you want to have predictable scale, predictable performance, and integration with other data sources and data sets, whatever your industry requirements are, and to produce that in a quick fashion. With that, we'll open up for questions if there are any. I see some of them in the chat.

Jaideep: 21:16 All right, there's one question that we typically get: what is the improvement in speed, or reduction in time, in this process with an infrastructure component like MapR? It is very hard to predict exactly what the reduction in time is, because at the end of the day it is a performance and scale problem that we're trying to solve, and the more resources we have available to the different tasks within the toolkit, that is where the mileage varies. But we do see a significant reduction in the overall TCO as compared to the traditional approach of running monolithic storage systems and some kind of large HPC cluster. While the reduction in time will depend on the size of the cluster and the toolkit, the reduction in TCO is pretty consistent.

Jaideep: 22:20 All right, I think with that, oh, there's another question: would you comment on how many users make use of this platform? I'm going to assume that you mean how many users can concurrently make use of this platform, and if so, this is a typical big data platform that is deployed at many, many customer sites today, with the capability to handle hundreds or thousands of concurrent users, and not only users but users of different types. This is certainly technology that is established and proven in the field today.

Jaideep: 23:11 Another question we have is how active the open source community is in developing processing frameworks around these solutions. Well, with the release of GATK as an open source toolkit there is more and more interest. There have been parallel tools and toolkits; the University of California at Berkeley has developed the ADAM project along with a toolkit that is pretty robust and is gaining just as much traction. This is an area that continues to explode with innovation, especially in open source, because sequencing costs have dropped drastically.

Jaideep: 23:57 For sure we'll see more aligners, more mappers, more variant callers become open source; there's probably one or two, if not more, released every quarter, and it's just a matter of which industry and which interests are appropriate for adopting one or the other.

Jaideep: 24:19 And I think that gets back to the correlation: if you have that much open source innovation, then rightfully, as somebody has commented, you want a platform that is capable of integrating with these open source communities and tools so you can exploit and get the best of both.

Jaideep: 24:43 We believe that with MapR and the Converged Data Platform you truly have an enterprise class platform with enterprise class features that can accommodate the open source community pretty easily. Thank you, and I urge you to please continue the conversation with MapR; we can be reached using a variety of means and methods, and we look forward to hearing from you.

Jaideep: 25:14 Thank you again.