Industry Solutions Architect, MapR
In the past, big data was interacted with in batch on a once-a-day basis. Now data is dynamic and data driven businesses need instant results from continuously changing data. Data Pipelines, which combine real-time Stream processing with the collection, analysis and storage of large amounts of data, enable modern, real-time applications, analytics and reporting.
In this talk we will:
Carol McDonald: 00:01 Okay. Can you see this okay?
David: 00:04 Yes. We can.
Carol McDonald: 00:06 Okay, so this is what we're going to go over today. First we're going to go over a brief overview of the Kafka API. Then we're going to go over Spark structured streaming continuously reading from the Kafka topic, then transforming the data write to the MapR document database, and analyzing the continuously arriving data using Spark SQL.
Carol McDonald: 00:27 We're going to look at two data sets in this talk. First, we're going to look at the open payment data set, and this is the open payment in the federal programs that collect information about the payments drug and device companies make to physicians and hospitals for things like travel, research, speaking fees, and meals.
Carol McDonald: 00:47 We're also going to look at another data set, the Inpatient Payment Dataset or IPPS, and this one is since 2013, an effort to increase price transparency to CMS price medicare payment data for the most common hospital inpatient diagnoses. And this data shows dramatically different charges and payments, which makes it interesting.
Carol McDonald: 01:11 First, just a brief overview of what is streaming data. So what is a stream? A stream is a continuous sequence of events that go from producers to consumers. And these events with Kafka are key-value pair. Some examples of streaming data industry learning from the latest news, so this is an example from computer weekly about data for monitoring devices such as blood pressure, pulse rates, and oxygen levels combined with algorithms can provide alerts for Sepsis, which is one of the leading causes of death in hospitals.
Carol McDonald: 01:49 Here's another example from Apple. They have this EKG app on the Apple watch that can check heart rhythms and send a notification if there an irregular heart rhythm to be identified.
Carol McDonald: 02:01 And this similar to the example I did with a data scientist at MapR, and you can download this and read about this study about applying machine learning to live patient data for EKGs, so to track arrhythmia.
Carol McDonald: 02:17 Next, I'm going to go over the Kafka API and streaming data. So looking at this part of our application. Kafkas are logical collections of messages which organize your events into categories and decouple the producers from the consumers. The topics are partitioned for throughput and scalability. These partitions make the topic scalable by spreading the load for a topic across multiple servers. You can see a partition like an even log or a queue. The new messages are added at the end. However, so the messages are like a queue in that they're delivered in the order that they are received. So here we see the consumers they can be reading different parts in the partition, and they're reading in the order that they're received.
Carol McDonald: 03:12 Unlike a queue however, events are not deleted when they're read. They remain on the partition available to other consumers. These messages can be persisted forever or they can be automatically expired based on time to live. Not deleting these messages when they are read, minimizes the disk reads and writes which can stop your performance.
Carol McDonald: 03:37 Not deleting the messages just when they're read also allows for processing of the same messages by different consumers for different purposes or different views. So here we see that messages are providing different views, provided by MapR Database HBASE, [inaudible 00:03:55] or searching.
Carol McDonald: 04:00 Next looking at SPARK structured streaming, before we look into structured streaming with the datasets, we'll just give a brief overview of what is a dataset. A dataset is a distributed collection of objects partitioned across the multiple nodes in a cluster. And these operated all in parallel and they can be cached. Here, we see the status that is being read from disk by task into the partitions in memory.
Carol McDonald: 04:30 A data frame is a dataset of row objects. You can think of this like a table, with the columns and rows like a table except it's partitioned across a cluster. So here we see that these partitions are in memory across the nodes on the cluster.
Carol McDonald: 04:48 And again the dataset is a collection of objects, so the difference between a dataset and a data frame is with a dataset you can use Spark SQL and functions. And with a data frame, you just use Spark SQL.
Carol McDonald: 05:06 So now moving on to structured streaming, this shows how the data spread from now from topic partitions by sort task into the partitions of memory, so the cache. So the cache and is the process and memory.
Carol McDonald: 05:15 Spark structured streaming is a scalable stream processing engine built on the Spark SQL engine. It enables you to view the data coupled from the cache in an unbounded data frame and process this data frame with same data frame data set and SQ options API's use for batch processing. You can see it as an unbounded table or data frame with structured streaming.
Carol McDonald: 05:54 As the streaming data continues to arrive, the Spark SQL engines incrementally and continuously processes it. In this example we see events coming in for accounts, causes, and withdrawals. So positive 100 and withdrawal thirty for example. And then we have a query that continuously processes to get the sum of these amounts which is stored in a database. So here we just have the sum of all these events that are in the stream continuously being processed.
Carol McDonald: 06:28 The stream processing events is useful for real time ETL. Filtering, transforming, and working with machine learning, and writing about the file databases or through different topics or data type lines.
Carol McDonald: 06:39 This is what we're going to do with our use case. We have this data coming in, streaming is CSP format, so comma separated values, and there's a lot of the data that we're not going to use. So what we're going to do is we're going to transform this into JSON format and store it in a database for continuous processing with SQL, and this is just what it looks like when we're going to store it. So this is what we're going to store. We're going to store the provider ID, the date, the payer, the payer state, the provider specialty, the provider state, the amount, and the nature of the payment.
Carol McDonald: 07:22 This is how it would be to stream with the Kafka data source. So you specify option site the format is Kafka, and you specify the topic location, where we want to start reader, we're going to read from the earliest offset, and how many methods that we will treat at a time. And the result of this is going to be a Kafka data frame, which looks like this in key value format.
Carol McDonald: 07:56 So just putting out the schema again for the Kafka dat frame, which we see that it consists of key value format, and also the name of the topic, the partition number, the offset number, and a time stamp.
Carol McDonald: 08:03 So the next thing we want to do is we want to get the message from the value, from this key value. In order to do this, what we're going to do is we're going to put it in a payment class. So here we're just trying the payment class, which corresponds to this JSON format that you saw here. So this payment class corresponds to that. And we're deciding the function to parse the comma separated values and to move spaces and other things from the string, and to put it into this payment class.
Carol McDonald: 08:43 So we're using then the function, the parse payment function that we just saw. We're using this in a UDF, so we're registering this UDF and we're going to call this deserialize. That's the name of our function. And it's gonna take in a string and pass it to the parse payment and turn that into a payment class. So then what we're doing here with the data frame that we're getting from Kafka. We're selecting this expression, and we're going to use the UDF deserialize on this message value to parse that and that's going to then return this message as a payment class. So that's just putting this into a data set of payment classes. So that's going to lead to a data set of payments, which we can then analyze with SQL.
Carol McDonald: 09:35 So what we can do just for decluttering purposes is we can put some of this into memory, and then start querying it in memory for decluttering purposes. An example of this would be we can select them from memory to show the payments. So this is showing our data set of payments, four payments, and that's useful for decluttering purposes. And you can't do this very much because then you're going to run out of memory.
Carol McDonald: 10:02 So here's another example, we can do queries for example, here we're selecting the physician specialty. We're getting a cache from the payment data set, and we're grouping this by physician specialty and ordering it by the count.
Carol McDonald: 10:18 So that's just a couple of examples. But what we really want to do is we want to write this to the MapR database so that we can persist this and we can also continuously query it. Because like I said, you're very limited in what you can do with memory.
Carol McDonald: 10:37 This is this is from the Spark MapR Database Connector. This architecture has the connection object here, and every spark object here. So we see here the task which is which is read can connect to this, and this on the left would distribute in parallel write making this fast. With MapR Database this tape would automatically partition across a cluster by key name so that each service would source a subset of a table. Grouping the data by key names provides for really fast data reads and writes by row key.
Carol McDonald: 11:12 This is showing one more of our data space data in the document store, and here we see that the data again is gonna be automatically partitioned and sorted by the row key. In this case, a row key consists of the state, so the physician state and also the physician id and record number. That's the record number. Sorry.
Carol McDonald: 11:41 This is what two rows look like in this MapR Database of this data. And this is what it looks like for writing to the MapR Database's sink. So we have our data frame or data set, we're writing to the write stream and then the options for the table name we specify the primary key, the ID, which is right here. And then I check put location and other specifications. And then we start the query in the way it works determining.
Carol McDonald: 12:24 And so this test that we're doing is we're reading from these topic partitions and this streaming data set, and writing these then into the tablet partition. So this is gonna be happening continuously in parallel, which makes the data rapidly available.
Carol McDonald: 12:41 The next, looking at querying this data in Spark SQL. So next we'll look at this using Spark SQL connecting to the MapR database. So we what specify is we're loading from MapRDB, we're specifying this schema here in that payment class that we had before. And this is gonna return a data set of payment objects which we can then use with SQL. And this just shows how it's being loaded from those partitions from the database partitions into partitions in memory.
Carol McDonald: 13:27 And here's an example, so here what we're doing is we're creating a table view of this so that we can use SQL. So we have two different ways that we can query this data. We can use the data frame methods like select. So these are SQL like methods that you can use. Or we can use SQL. That's because we're specifying this table. So this is just showing, the ID and the payer, and the amount. And it's also showing, here it's sorted by the ID row key, so we're getting all the Arkansas ones first.
Carol McDonald: 14:05 So know we can ask questions like which physician specialties or states are getting the most payments by counts, the total amount, the average amount, or the standard deviation? Which companies or company states are making the most payments by counts, the total amount, or the standard deviation? What are the Top nature of Payments by count and amount? And what are the payments over for example two million dollars which would specifically the large ones.
Carol McDonald: 14:46 This just shows some language integrated queries that you can use on data banks, so aggregations, distinct, filter, where, group By, join, sorting, and select. And we're going to look at this with the MapR data science refinery and the Spark and assessment notebook. We're exiting the spark query. So just some examples, and then I'm going to go to the notebook to show these examples.
Carol McDonald: 15:13 So here's a query for the top five nature patients by count. So taking the data set and we're grouping by the nature of payment. We're getting the count and then we're ordering it by the count. And we show that the food and beverage has the highest count. Then the travel and lodging, consulting fees ...
Carol McDonald: 15:35 Another example, here we have the same query but this one before, that was that query. Here, we're showing it with an SQL and we're also showing that we can share the results in different types of diagram. So here we have a pie chart for, so here we're getting the nature of payment, we're summing the amount as the total, and then we're grouping by the nature of payment.
Carol McDonald: 16:02 So that was by the count, and this one is by the total amount. So they're similar queries. So here we see that the highest amount was for royalty or licenses. Here we have the top physician specialties by total amount. So we're getting the physician specialties and we're summing the amount and then we're grouping by physician specialty, and here we see that Gastroenterology has the highest amount with 51%.
Carol McDonald: 16:33 So then now I'm gonna move on to the second notebook. So here I have the second notebook.This is running in a MapR cluster with the data science refinery. And here I have the code again for the finance schema, the payment and schema reading this from the MapR database into a data set of payments classes. Then I'm getting the count. So here we have five million rows, eight hundred thousand rows. And then I'm creating a view of this, so I can also do SQL coding on this, and I'm caching the data set.
Carol McDonald: 17:27 So here is a select all that shows you what is in the payment data set. So we have the physician id, the date, the payer, the payer state, the amount, the physician specialties, the physician type, and the nature of payment. Here I have the query for the top ten payers by amount and count. And we're selecting the payer space and we're getting the count and also the sum of amounts. And we're grouping it by payer and payer state. And we can see this in a different format, so here for example we have the payer, the payer state, and then we're showing this in the order by the total. So these are the payers that paid the most amount. So the one that paid the most in this case by total amount was DePuy and then Jackson. And we can look at this also by a pie chart. So we can see DePuy paid 35% of the payments. Jackson Pharmaceuticals. We can also have a bar chart for example.
Carol McDonald: 18:42 So the top physician specialty by count and amount. So here we're grouping again by physician specialty and ordering it by the total, the sum of the amount paid, and again we can show this in a table. Here we have orthopedic surgery is getting the most. And a pie chart. Here again orthopedic surgery is getting 33%, neurologic surgery is getting 12%.
Carol McDonald: 19:18 Here the data set we're doing is describe an amount, so this is gonna produce statistics by amount. So here we see the average payment, the mean is 311 dollars. And the standard deviation is about 18 thousand. The minimum was zero, and the max was, I'm not sure exactly what that one is.
Carol McDonald: 19:43 Here we have the top 10 physician specialties by average amount. So this is average amount, so the total amount, which we can also look at in the bar chart. So by average is the body imaging, the doctor of osteopathy.
Carol McDonald: 20:04 Here we have the top physician states by the total amount. And this is showing a bar chart, so here the top physician state is California, then Texas. This is the top payer state by amount, the top was Massachusetts. Here's the top ten nature of payments by count so the top nature payment was food and beverage. Here's the top nature of payments by total amount instead of the counts. Here we have royalty and license was 28%. Compensation for services was the highest. Then we have some more by nature of payments with payments greater than a thousand, and then here the final one is what are the payments over two million dollars. So here was the specific ID, the payer, and the physician specialty for payments which were over two million.
Carol McDonald: 21:25 So then I want to move on to the second data set. So again, this second data set for this one is, this is the inpatient charges and payments by diagnosis for hospitals and medicare. So here again, I'm defining the schema, and taking this from MapR database, and this is what the data set looks like then. So we have the diagnostic code, the definition of diagnosis, so for example heart transplant, the year, I was going to try a count, so here we have four hundred rows in this data set. So this is the year, the provider id, the provider name, the provider's address, state, zip, then we have the total discharges, so that's how many times it occurred, then the average charges, so this is how much the hospital charged, this is how much Medicare paid, this is the average amount that medicare paid. This is the average total payment, so that's including medicare paid and also it's what patients paid. This is what the hospital charged, that's what medicare paid, and that's what medicare and the patient paid for the averages based on this diagnostic codes.
Carol McDonald: 23:01 So now what we can ask are some questions. I had a slide for that too. The inpatient payment data. So here are some questions that we can ask for this one. So here we have questions for example, what are the top charges and payments by hospital or by hospital state? What are the top charges and payments by diagnosis or what's the most common diagnosis? What are the statistics for payments for the most common diagnosis, which is Sepsis.And what are the most expensive states for the most common diagnosis, which is Sepsis.
Carol McDonald: 23:45 So looking at some of these questions. First, we're getting the statistics for the total of those charges, total payments and medicare payments. So here we see the mean for the average charges, the standard deviation, the min and the max. And then this for the total payments and for the medicare payments. Here's the top ten hospital charges and payments by state, and it's ordered by the average total payment. So here we see that the most payments were for Arkansas and the District of Columbia.
David: 24:40 That's Alaska, Carol.
Carol McDonald: 24:42 Oh, DC is Alaska? So then this one is in the top hospital charges and payments by diagnosis and state, so here we see that, so we're grouping this one by the diagnosis definition and the provider state, so here we see that the most expensive one is the heart transplant in California, and this is then ordered by the average total payment.
Carol McDonald: 25:19 That one gives you a lot, but here we have a different way of looking at it. So here we see that, this is grouped by California, so here we see the highest one was in California with the heart transplant. So we were just looking at that one. This is showing the statistics for the average total payment. So you see the min and the max, there's a large distance between the two.
Carol McDonald: 25:58 Here we have the top average hospital charges and payments by diagnosis ordered by the total payments. Here we have the heart transplant has the highest total payment and the counts. So then the highest was the heart transplant, then heart assists, ECMO, extensive burns or frills. So that's average hospital charges by payments. This one then is showing the top charges by diagnosis ordered by the total payment. So here we're grouping by the definition and then ordering it by the average total payment.
Carol McDonald: 26:54 Then here's the most common diagnosis by number of total discharges. So the most common one was major joint replacement or Sepsis, heart failure and shock. Here's the most common medicare discharge and this is by year, so 2013 it was the major joint replacement, here and the next one was Sepsis again joint replacement Sepsis. Here we're seeing how it changes by year, Sepsis and then 2016. So here's the statistics for the payments for the most common diagnosis Sepsis. So this is using the same diagnosis code but seeing what the average total payments are. So you see the average was 13 thousand with a standard deviation of three thousand dollars. And the max was 58, and the min was seven thousand.
Carol McDonald: 28:04 And here is the most expensive state just for Sepsis, which shows that the most expensive was Arkansas and then Maryland. And here's another bar chart for the most common diagnosis by total discharge, which we see that 871, which was the Sepsis was the most common.
David: 28:34 Hey Carol. This is David.
Carol McDonald: 28:38 Yes.
David: 28:38 We are coming up at the 10:30 time, so there are a couple questions that I wanted to try to squeeze in here, if that's okay with you.
Carol McDonald: 28:49 Sure. Yeah. I was just finished. I was just gonna show where you could download it next. But so the site will be provided afterwards and you can download the codes and read more about this from this URL which will be in the slides. So there you can download the code. And then also just wanted to mention that there's a new Spark Ebook that talks about Spark and data stuff like that. And then there's some other Ebooks that you can download about streaming architecture and streaming logistics. And then a few more resources. We have this Spark 2.0 training, and then the MapR blog. So yeah, that's all I wanted to talk about. And these slides will be provided. Now we can go to any questions.
David: 29:41 Yeah, just a couple here, and you can see them Carol in the question tab in your browser. So one question that came in is the Medicare data available without the patient information for analysis? If yes, where is it available?
Carol McDonald: 30:04 The patient data, where is it available? Yes. Speaker 2: 30:10 Yeah, so in one of the examples, you showed the medicare data and the patient information next to each other. And so they're asking if you could see the medicare data without the patient information and do analysis on that.
Carol McDonald: 30:28 Yes, well both of the data sets I talked about in addition to a couple of other data sets available on the CMS website, so the way to download that is also available in that blog that I talked about. So those two separate data sets that I talked about, they're both available online. Speaker 2: 30:51 Okay. Another question that just came in, is there some way to control the number of digits displayed after the decimal on zeppelin tables?
Carol McDonald: 31:04 Yes. Yes. You could definitely do that with UDF. I'm not sure about, well you could definitely do it with EDF. I'm not sure how to do it by default. But yes, you can do that. Speaker 2: 31:19 Okay. Speaker 2: 31:27 I'll let you take a look through those other ones, see if there's-
Carol McDonald: 31:30 I see one, one is about the performance if you do this in a large data set. So you saw that that one data set I had was five million rows, so with the spark database you see that it's partitioned across the clusters so this is running on, I'm not sure how many nodes, but you saw that the table split out across a cluster and more queries in parallel and then putting that also into memory in parallel with Spark SQL, so that you can then, and that analyzes the memory.
Carol McDonald: 32:04 So it just wouldn't be possible to do some of this with a typical relational database because the way relational database works is, generally with a relational database is you're gonna join the sets together and you're going to query it. And that's all based on indexes and the way that the data is stored in a relational database, it can't handle those types of numbers. Also you can't handle that type of data and memory unless you partition it.
Carol McDonald: 32:37 So there's a question about high level explanation about how to Spark versus Hive. So Hive was originally designed for performing batch queries and Hive runs mass produced in the background. And then Hive has been updated now. So sometimes you can use Hive with either, now you can use the Hive Nest or specifically with spark. So you can use your meta store, which is storing the table information for your data. You can use that with Spark and also perform Hive SQL with spark. And that's gonna do it, like we just talked about, in memory and it's much faster than performing these analyses with mass produced which is based on running these queries. It reads and writes the disks a lot, so that's not very fast.
Carol McDonald: 33:38 So that's basically, when you want to use Spark instead of Hive or with Hive SQL, it's when you want interactive analysis. You want to be able to do that faster.
Carol McDonald: 34:01 Is the time stamp provided with the Kafka API, and yes, now with the structured streaming, with the Kafka the latest API, you do have the events time stamped. So you can have time stamp that's part of when the message has been received and when it was processed.
Carol McDonald: 34:38 Are you expecting any to write these Spark in Zeppelin, or is there a more friendly way to get this data? Well if you're storing, that's a good question, so if you're storing the data in the MapR database you have different ways of then analyzing this data actually. So you can use Spark SQL. And another possibility you can use also Apache Trail which also provides kind of an interactive SQL engine. And the advantage of Apache Trail is that you can use this with JTPC and a lot of things like Tableau which makes it easier for analysis if you want to have data analysis ease. And so that's another possibility is to use Apache Trail with data and MapR database and tools like Tableau.
Carol McDonald: 35:35 And then the next question is does Kafka guarantee order of operations. So the events are ordered in the partition as they are sent, and then they're ordered also, you read them in the order that they were written. So in the partition, there's a guaranteed order of yes in that partition.
Carol McDonald: 36:08 And so then there was a question of does the consumer need to remember the offset? So the with Spark what happens is you're, Spark is processing data's offset to support you. And yes, if you want to start and stop, then you would want to save that offset yes. So the consumer is controlling the offset from where you're reading.
Carol McDonald: 36:43 So then there's another question. If you want to reduce free processing, where can the results of the intermediate aggregates be stored. So you have different options for that if you want to, so you can fill data pipelines, where you're, like I said, on one slide, where you can have different views of your data. You can have a view where you're sending some aggregates into a different topic for processing. Or you can save some to the MapR database or to a file. So you can have different views of your partitions and your data pipelines basically.
Carol McDonald: 37:35 And so you can read more about also the pipelines, I showed you the, did I show? Yeah, I showed you the two books that you can download about streaming architectures and logistics. That goes more into the different types of pipelines that you can have that you might want to have.
Carol McDonald: 38:00 Okay, so I think that's about all the questions, and definitely if you have more questions, then you can also email me later and refer to the resources that I referred to.