Publishers Clearing House: Performing Large-scale Analytics with Real-time Data


Gino Kelmenson

Director of Enterprise Data Systems, Publishers Clearing House

Jordan Martz

Director of Technology Solutions, Attunity

Ronak Chokshi

Product & Solutions Marketing Lead, MapR Technologies

Presented by Attunity and MapR

Join experts from Publishers Clearing House, MapR and Attunity to learn how Publishers Clearing House established an enterprise-grade, unified data connector to stream real-time, user-driven events into its new Data Hub solution (for real-time and batch-driven processes).

Learn how Publishers Clearing House chose MapR on the cloud as their platform and Attunity Replicate as their data ingest software to synchronize their real-time data lake with their mainframe transactional system. You’ll hear how the new unified data connector became the standard data connector across all of Publishers Clearing House’s online web and mobile applications. With all data in the same cluster, the real-time data supports additional use cases such as a monetization engine, liquid offers management, fraud management, real-time scoring, and more, all of which can be implemented with a single, authoritative source of data.


David: Hello, and thank you for joining us today for our webcast, Publishers Clearing House: Performing Large-scale Analytics with Attunity and MapR. In today's webinar, you'll hear how Publishers Clearing House established an enterprise-grade unified data connector to stream real-time, user-driven events into its new data hub solution for real-time and batch-driven processes. Then you'll hear from Attunity and MapR on the solutions they worked on together to enable Publishers Clearing House to do this. Our speakers today will be Gino Kelmenson, Director of Enterprise Data Systems at Publishers Clearing House; Ravi Jannu, Big Data Lead at Publishers Clearing House; Jordan Martz, Director of Technology Solutions at Attunity; and Ronak Chokshi, Product & Solutions Marketing Lead at MapR Technologies. Our presentation today will run approximately one hour, with the last 15 minutes dedicated to addressing any questions. You can submit a question at any time throughout the presentation via the chat box in the lower left-hand corner of your browser. With that, I'd like to pass the ball over to Gino to get us started. Gino, it's all yours.

Gino Kelmenson: Hi everyone. Very excited to be here. We have a lot to present, and hopefully you will enjoy the presentation. Publishers Clearing House is a leading interactive media company, and we offer a very broad range of products, digital entertainment and services to consumers. As some of you may not be aware, we were founded in 1953 by the Mertz family, and we came to be known as a sweepstakes company whose famous Prize Patrol surprises screaming winners on their doorsteps with oversized checks. Those checks can range from $1,000 up to $10 million, and everything is on TV. We have awarded over $350 million in prizes and have greatly evolved our offerings through many different channels.

Gino Kelmenson: What is the key to our success? We're one of the largest and best free-to-play destinations. We have a very large base of very loyal customers across all channels. Effective data management is very critical to us. Our objective is to serve our customers with the most relevant content across all channels. Our customers trust us to deliver a relevant experience while keeping their data secure and safe. Maintaining that trust is our core objective. It drives everything we do.

Gino Kelmenson: Let's discuss some of the key factors. Strong sponsorship. In order to achieve success, as many of you know, strong sponsorship is key, and as a data-driven company we are very happy to have strong backing from our executive team to support all of our business-driven initiatives. Resources. When you start a big data project, if you don't have the right resources, you might have an issue. Ensure that you train your staff accordingly, and if you need support, engage professional services. Since we're discussing Attunity and MapR here today, I would like to mention that we received great service from both companies, and it was actually one of the key attributes of our successful delivery. When you work on big data, you cannot take any shortcuts. When you define your enterprise-level architecture, you must spend as much time as possible logically defining what you're trying to achieve and how it's going to meet the business goals. No shortcuts: stay focused and take the big picture into consideration.

Gino Kelmenson: Data security. One of the guiding principles for PCH is to keep only the data which is useful and relevant and purge the rest. When you bring in new data, whether it's into a traditional BI environment, an operational environment, or a big data environment, ensure that you apply anonymization and encryption. Another key component is that you really drive to enforce the data security policies which have been outlined by your internal compliance officer or data security office. If you have the ability, implement monitoring tools which can help you enforce your data security objectives. The same philosophies and practices that you operate under, you must extend to all third-party vendors that you're working with, because the chain is only as strong as its weakest link. Another very important key factor is data governance. In the last couple of years, data governance really became one of the trending terms on the market, and not a lot of people understand what data governance is.

Gino Kelmenson: It's very important that when you look into some driven implementation, that you take data governance very seriously. Invest as much as possible into data governance practices. Based on our experience and based on the, what's the current on the market, it will pay itself many times over in a very short period of time. The other key factor is the data quality. If you don't establish the trust in your data and the processes, then it will be very hard for you in the later phases to promote your data products across the business community. Try to as much as possible to engage your subject area experts in depth, in security automation. Do a lot of data profiling and if you will follow all of the best practices around the data quality, I promise to you, you will do have very successful implementation and you will have a lot of trust from the stakeholders business community and you will drive a very, a lot of business value.

Gino Kelmenson: As I mentioned about the business community, there should be no silos. You need to always engage with business users throughout all of the project phases, because everything we do and work on is delivering some business product. The business has to fully understand and be engaged. They need to know what you're delivering, how you're delivering it, and how it's going to meet their objectives.

Gino Kelmenson: Now, let's discuss a little bit about Attunity, the Hadoop data lake, and one of the key challenges that we were trying to address by using Attunity. Publishers Clearing House, as I mentioned, has been around for a very long time, and we have very effective data management practices. Our environment, like many enterprises out there, relied for many years on traditional implementations: relational databases with different types of implementations, an OLTP layer, a data warehouse, the mainframe environment. Three years ago, we understood that in order to start facing new challenges and be very successful, we needed to invest in a big data platform. We'll get to the actual selection in a little bit, but one of the use cases was to bring the mainframe data into our Hadoop data lake. As we all know, it may sound very simple, but there is a lot of complexity that can come in, as the data is in a different format.

Gino Kelmenson: The data which is stored, for example, on the DB2 system on the mainframe might not be readily available for your incremental data pulls. You don't want to impact your production DB2 systems. By trying to implement this without the right tools, you might face some very big challenges, so it's very important that you make the right decision on the solution and how you're going to bring this data into your Hadoop data lake.

Gino Kelmenson: When we started to discuss how we could do this, one of the things we outlined as part of the goals was how we could reduce our development cycle and how we could limit the impact on the source system. We did not want to spend a lot of time preparing our DB2 systems on the mainframe, as an example, to be able to support the data pulls into our Hadoop data lake. Is there any solution on the market where we can very easily capture the CDC and the data lineage when we bring the data into the Hadoop data lake? We want to know everything: what triggered the change on our source system, on DB2, and what was changed. Next, how easy is it? Is there a tool that can allow us to very easily manage how we bring in this data? You don't want a situation where you need to spend a lot of time writing scripts and a lot of time on development. You want a graphical interface with just click-and-load, which gives you different options for different types of workloads. Do you want to do a full load, do you want to do an incremental load, do you want to only add some columns or remove some columns? That was one of the keys, and of course providing secure, automated end-to-end replication.

Gino Kelmenson: When we outlined all of these goals, we started looking at the different vendors on the market, and at that time we didn't have a lot of time; I believe everything was done in just a matter of two and a half months. We outlined a set of vendors and reviewed what capabilities they have. We identified Attunity as the front runner. Within a week of identifying them, we were able to set up a very extensive POC. It was not just a POC where they came in and showed you, using a sample dataset, how you can bring the data in. We actually ran it against our mainframe environment: we set up the infrastructure, and we were able to use the Attunity product as part of the POC to fully test all of the features offered by the Attunity solution. It was a great experience, and within two months from the time that we started the discussion, we actually implemented the solution and started integrating it into our Hadoop data lake.

Gino Kelmenson: Now, this slide, I know, may be a lot of information, but here is what we're trying to outline. Let's just talk about the CDC data and why the Attunity product was so easy to implement without impacting our backend system. As many of you know, "replicate" implies that you're not actually connecting directly to the tables, in this case in DB2. What you do is actually read the transaction logs as the updates occur on your source system. When you look at your source system, there are different types of datasets and tables that you can have. You can have reference tables, transactional tables, and summary tables, and each type of object requires a different approach for how you manage this data when you bring it into your Hadoop data lake.
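The log-based CDC pattern described here, reading change events from the transaction log instead of querying the source tables, can be sketched roughly as follows. The event shape, keys, and table contents are hypothetical illustrations, not Attunity Replicate's actual record format:

```python
# Minimal sketch of log-based CDC: change events read from a source's
# transaction log (in log order) are applied to a target replica
# without ever querying the source tables directly.

def apply_change_events(replica, events):
    """Apply insert/update/delete events to a dict replica keyed by primary key."""
    for ev in events:
        op, key, row = ev["op"], ev["key"], ev.get("row")
        if op in ("insert", "update"):
            replica[key] = row          # latest row image wins
        elif op == "delete":
            replica.pop(key, None)      # tolerate deletes of unseen keys
    return replica

# Hypothetical journal entries for a customer table.
events = [
    {"op": "insert", "key": 1, "row": {"name": "Ann", "tier": "gold"}},
    {"op": "insert", "key": 2, "row": {"name": "Bob", "tier": "silver"}},
    {"op": "update", "key": 1, "row": {"name": "Ann", "tier": "platinum"}},
    {"op": "delete", "key": 2},
]
replica = apply_change_events({}, events)
```

The point of the sketch is the shape of the workload: the source only ever pays the cost of writing its journal, and the replica stays current by replaying those entries.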

Gino Kelmenson: The other beauty of Attunity and the process that we have developed internally is that as soon as the change occurs on the source system, it becomes available in your data lake. Just imagine you run a batch process on the mainframe. As soon as the batch process is finished, all of the changes recorded in your transaction log, which on the mainframe is the DB2 journal, get pushed directly to your environment. Now you have ETL jobs, which you can schedule on a trigger. As soon as the data arrives, you start processing the data. Some teams may decide to process the data on an hourly basis. Some may decide to process it on a daily basis. When you start thinking about how you should manage the data, you should really try to evaluate the different approaches you should use when you bring this data in and how it's going to be processed.
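The arrival-triggered ETL scheduling mentioned here can be approximated with a simple landing-directory poll. This is a generic sketch under assumed conventions (directory layout and function names are made up for illustration), not PCH's actual trigger mechanism:

```python
import os

def find_unprocessed(landing_dir, processed):
    """Return landing files not yet processed, in name order, so an ETL
    job triggered on arrival picks up each batch exactly once."""
    return sorted(f for f in os.listdir(landing_dir) if f not in processed)

def process_batch(landing_dir, processed, handler):
    """Run the handler over every new file, then mark it processed.
    Calling this on a timer gives hourly/daily batching; calling it
    from a file-arrival trigger gives near-real-time processing."""
    for name in find_unprocessed(landing_dir, processed):
        handler(os.path.join(landing_dir, name))
        processed.add(name)
```

The same loop body serves both cadences the speaker mentions; only what invokes it (a trigger versus a schedule) changes.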

Gino Kelmenson: Here at this stage, we have four different approaches for how we recognize the data and how the data goes through the different environments in our data lake. One of the key factors, as I mentioned, is data quality; as you can see from this picture, we have data quality that spans all the different layers in our Hadoop data lake, from the source to the archive. When the data lands, it goes through very rigorous data quality processes. When the data moves to our common model, it goes through data quality controls, and when it finally ends up in our interpretive model, which is the final data product, it goes through even more steps of data quality. This is very important if you really want to have the trust of your business: that you have a golden state of your data that they can use, with a comfort level that the information they see is correct.

Gino Kelmenson: In regards to data profiling, it's very hard to see here, but we do have data profiles running between Hadoop and the source system, in this case, as I mentioned, DB2. In order for us to fully reconcile the data and verify there were no issues in the ETL, we set up processes that connect to the DB2 system, run a real-time profile against the data, and then match that profile against the golden state of the data in the Hadoop data layer. If we recognize issues, we have a process in place which will prohibit the data from moving between the layers to the consumers and will sound the alerts, and the team will start addressing the issue as soon as possible.
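The reconciliation idea, profiling the DB2 source and the Hadoop golden layer and blocking promotion on a mismatch, might look something like this in miniature. The row shape and checksum scheme are illustrative assumptions, not PCH's actual profiling logic:

```python
import hashlib

def profile(rows):
    """Row count plus an order-independent checksum over full row images."""
    digest = 0
    for row in rows:
        h = hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()
        digest ^= int(h[:16], 16)   # XOR makes the checksum order-independent
    return {"count": len(rows), "checksum": digest}

def safe_to_promote(source_rows, target_rows):
    """Only allow data to move to consumers when the profiles match;
    a mismatch would instead sound the alert and halt the pipeline."""
    return profile(source_rows) == profile(target_rows)
```

Comparing counts plus a content checksum catches both dropped rows and silently corrupted values, which is the gate the speaker describes between layers.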

Gino Kelmenson: I believe we can probably stop here, because it might take us a very long time to step through all of the different approaches. Ravi and I would be very glad to discuss this if you have any questions at the end of the presentation, or feel free to contact us directly, and we will be very happy to share some of the lessons we learned as part of the implementation: why we made certain decisions and how we're transforming the data. We can share some of those findings with you, so when you start, you can avoid some of the pain points.

Gino Kelmenson: Now, in regards to MapR: why MapR? A very good question, right? There are so many different distributions on the market. Right now, when you search the Internet for Hadoop distributions, you'll probably find at least five or seven major Hadoop distributions on the market. And I believe every six months there is a new flavor of Hadoop coming out, a new distribution, a new company supporting it. Three years ago, as I mentioned, we decided that Hadoop was the platform of choice that we would like to utilize for our big data needs. We started a very rigorous evaluation across all distributions, and we had a series of meetings and discussions; we evaluated based on the materials available on the market and the Internet, and based on direct communication with customers using each distribution.

Gino Kelmenson: There was, it was a pretty rigid, evaluation process. One thing which is really, I believe, played a key role in selection of the MapR is because we felt that this is probably the best Hadoop solution on the market that is enterprise ready. When we say enterprise ready, is that it has the key definitions, what defines the platform as the enterprise that can answer a lot of- Ravi Jannu: data driven initatives.

Gino Kelmenson: ... Right, data-driven initiatives. And it's not just in the sense that you can develop on top of it, but also in the high-availability sense: the ability to integrate various data sources, to support multi-tenancy, to do point-in-time recovery, and a lot of other things which I've outlined here on this slide. I know you can just go on the Internet and pull this list together, but everything outlined here, we actually currently use as part of our implementation. Some of these features play a very critical role in why we're so successful. Just the ability to do point-in-time snapshots is an amazing capability.

Gino Kelmenson: You can apply the time series view of the data and when the analytics team is asking guys, we want to have an ability to just drill down, not just for a specific transaction of what was the state of the user at the time that the transaction occurred, and we want to adjust on daily basis, really see that transition. You can apply this across your whole data layer. If you take the snapshots on a daily basis and you have ability to directly query those snapshots, this is just amazing, it provides an amazing value to us.

Gino Kelmenson: Now, from the data mirroring, that's another very key factor is that it's really done underneath. It's not impacting your performance. You can set up another cluster where you want to mirror the data and it's done very seamless and level and it plays a very good role.

Gino Kelmenson: Now, so as you can see, like I said, so many different factors. Again, we'd be very happy if you guys would send some of your questions and we can provide some of our evaluation decisions on why MapR was the platform of choice. Now, I will pass this guys to MapR team. Ronak Chokshi: Yes, thank you Gino for the really good description and your perspectives. It was really, really good. I will now describe MapR's offering in the platform and explain how we are helping customers like PCH achieve business success. All right. Ronak Chokshi: If I have to summarize the MapR story on one slide, it will be about all data, one platform, every cloud. They're uniquely positioned in the market today to facilitate enterprise applications, existing interactive exploratory and the more advanced, all of them. It's all the data that's available across all the silos throughout the enterprise, right? We handle the diversity of data types, compute engine, all of that is needed on one converged data platform and we make the platform available on multiple clouds making it the no lock in choice for creating a global data fabric as you can see here on the slide. We deliver all of this with speed, scale, and the liability needed for business, critical environments for customers like PCH. All of this combined is in our opinion, makes MapR Data. Technologies very unique and successful in the market today. Ronak Chokshi: They've been around for almost a decade now and have had really good successful track record with customers across many industries that you can think of, most industries that you can think of. We like to think ourselves as the data technology company that allows our customers operationalize their data assets, build smart applications and improve the business value. 
It could be reducing the development cycle, as an example here for PCH, or reducing fraudulent medical claims for a leading health care company, or being the largest biometric database in the world for the project in India that's now scaling to more than 20 billion biometric records pertaining to more than 1 billion residents in India. When our customers look to tap into the data stored in silos, as well as have a secure solution to ingest real-time streaming data, we become the data platform of choice and power their existing and next-gen applications, as I mentioned earlier.

Ronak Chokshi: The alternative for them really would be to have siloed data and siloed applications, which is very inefficient at the scale that some of these customers that you see on the right operate. In a nutshell, our software data platform is the fundamental underpinning that enables them to operationalize data and extract business value from it.

Ronak Chokshi: All right, so this isn't exactly news: data is growing by orders of magnitude year over year. 30 years ago, we thought we had big data, but that was three exabytes of data; almost 25 years later, we had 300 exabytes, and then 2 zettabytes, and so on. Obviously growing exponentially. There are two interesting things about this data in the recent trends. A, the diversity of data: there's machine data, there's sensor data, there's IoT data, videos, social media, and it's endless. And the most interesting thing about data now is that the business models are changing, right? Every company, from smaller organizations to the unicorns of the world, Uber, Airbnb, and so on, their business models are centered around data. These companies differentiate themselves through data and analytics processes and strategies.

Ronak Chokshi: All right.
The biggest trend that favors MapR in this journey, apart from just the explosion of data, is the re-platforming that's in progress as we speak. This is being driven by the diversity of data and processing engines and all of that that I mentioned earlier. Frankly, this simply cannot be done with older architectures. The challenge at every company that we witness is: how do companies harness this data for their business advantage, as PCH is doing here? Moreover, as organizations grow, their silos also grow; the number of silos increases. The infrastructure and storage providers that you see on the left-hand side here have a lot of current and last-generation technology, previous-generation software infrastructure installed, which is great. They have also tried newer big data technologies, but it'll be fair to say that none of these alone, or even as an integrated approach, has really helped achieve success with the newer data trends that we're seeing for our customers in the market today. I mean, they have relational databases, IoT sensors, lots of sensor data sitting at the edge, streaming data, big data, and all of it sits in silos, and hence it is obviously difficult to integrate, analyze, and process, and really unable to solve any business challenge effectively.

Ronak Chokshi: At MapR, we would like to think that we were built with a vision of converging the essential data management and processing technologies on one single platform for every cloud, as you see on the slide here.

Ronak Chokshi: With the alternative technologies that you see here on the left, we believe that you will experience what we call a crisis of complexity. Really, alternative offerings attempt to stitch together multiple point products and essentially result in lower performance, limited scale, and increased cost.
Essentially, our message is that you can't just connect and federate hardware and get the same level of performance as convergence. We believe that our converged data platform is the only solution engineered from the ground up to work with all your data types of today and the technologies meeting all your business needs.

Ronak Chokshi: All right, a global data fabric for us at MapR means a few things. Number one, it means being able to ingest data from a variety of databases, ranging from mainframes, like PCH is doing here, to private clouds and multiple public clouds. It means being able to store that data in MapR files, tables, and streams in the same cluster. Then, more importantly, add to that the ability to do in-place analytics and machine learning, so not having to move data around to facilitate advanced data mining, right? Customers are, again, saving costs by powering new applications with the existing data. It means being able to use interfaces like POSIX and NFS so applications can write data directly into the cluster. Last but not least, it means organizing data in volumes and then layering in multi-tenancy, data security and governance, and you heard Gino talk a lot about this, so that you can apply different data protection strategies to these types of data across clusters for high [inaudible 00:31:38] and faster disaster recovery.

Ronak Chokshi: Finally, here's a high-level diagram of our converged data platform. We essentially combine the capabilities of working across every major cloud provider, as I mentioned earlier, as well as edge nodes, with ingestion of all data types imaginable, HA/DR, and a Global Namespace, which is instrumental in mapping all those datasets that we talked about at the source.
The top of the diagram shows our ability to feed all these available datasets into existing applications and new applications that help analysts build BI reports and explore data, as well as help our customers build brand-new applications by leveraging machine learning, AI, and things of that nature. Right? This is how we become the data platform of choice for our customers and accelerate their business transformation journey. With that, I'd like to pass it on to Jordan from Attunity for their portion of the slides. Thank you.

Jordan Martz: Perfect. Thanks to both of you guys, Gino, Ronak, and all the rest of the folks involved. When I look at it, it's an incredible story of how the integration of different tools and different pieces, and an understanding of what they wanted to do, really set out a vision. It sounds like they started with a clear understanding, and as they evolved with the right tool sets, it became part of the overall story of what they wanted to build. When we talk about data lake design at Attunity, one of the things that we do that's very complementary to the message of MapR is that we're sourcing in real time and they're sourcing on a global scale.

Jordan Martz: Three slides ago, there was a picture of the globe. You want to make sure that all your systems, as they're integrating into a data lake, can be coordinated, not only down to the minute and the timestamp that we have on our phones and wherever we are, but also so that as things change, they're actually understood in a very clear way, so that each part can be set and organized, and all real-time applications can then be coordinated back to the operational systems. That's where we fit.

Jordan Martz: We're in that operational data that's core: the operational databases and the mainframes; in this use case, the mainframe. How did that fit? In terms of databases we've moved, we've moved over 40, I think it's now 50,000, databases into many of the largest clouds in the world, as well as many on-prem, inside your own data centers across the globe. This is designed to give you an easy developer experience: as a developer, you want to be able to point and click and move the data in a way that hides the complexity, and the ability to deliver with time-to-value is the paramount facet of what you're trying to do. We have a number of different key products. We've got the Replicate tool.

Jordan Martz: The replicate tool takes and automates the data delivery, composes that second step of that when you take and generate, for instance, hybris spark the logical ways to ingest that data to be able to generate and load into the MapR cluster. Containers have been optimized, and in the code to get it into not a data warehouse, but to get a replica of what was in the source and put it into the destination. You can generate all that logic using compose.

Jordan Martz: Then furthermore, we took enterprise manager and we built a world class and it doesn't exist in the industry today. It's a unique tool that gives you understanding of how people are using the source systems. It tells by usage pattern and by analysis how things are going, but the core today is how they evaluate it and the team from Publishers was looking at different technologies and we look at the different source technologies, two things you consider. It's the ability to move the data and it's the ability to properly connect to it. I think often times people flood to CDC based on the ability to not impact the source, but also the ability to get it there. If you're distributing across the cloud or across a number of different locations, what you want to be able to do is to be able to handle that transformation and the ability to migrate that in an efficient way and that's part of that intermediate zone which is file channel and in-memory.

Jordan Martz: Now what you're looking at is the ability to load into MapR from both the MapR Event Store and the MapR file system, and how those integrate together to deliver this result. One of the things that you can look at in the next part is all the different sources and targets. It's not just the mainframe itself in this use case; it's Oracle, it's SQL Server, it's MySQL and PostgreSQL: having a universal way across all these different systems. When customers are using us, they're looking at it from their EDW and their operational systems, and how they can bring that to an IoT use case and combine it, and then maybe start to look at what to offload from their existing EDW, or what from their operational SAP systems is going to be running, and what applications are already built in the cloud. How do we figure out what's hot and cold and move it over? This is part of the overall system: the ability, when you're loading MapR, to handle different types of sources and targets.

Jordan Martz: With MapR, we also load the streams, which is an easy way to get into MapR-DB, and it's very seamless when you're using Drill and other tools in that ecosystem to deliver value very quickly, using that clickstream to interact and bring that data together.

Jordan Martz: What we have right here is these different systems and how to integrate them together. This is a diagram of how we can do bulk loads: we've got a highly parallelized bulk load component that allows you to ingest that part of the diagram we saw earlier. Furthermore, once you're streaming data, you can process it very efficiently, similar to what Gino was talking about with his views, and you also have the ability, as you're bringing it in, to merge the increments together; that was the other part of it.

Jordan Martz: Furthermore, you can merge that view back into MapR-DB, creating more logic as you go. All of these are systematically integrated; each tool fits very nicely with the others, symbiotic in nature. As you look at the different types of tools and the ways that you integrate, DB2 on the mainframe is part of the foundation of all this. When you're looking at mainframes, you're looking at the way that you connect. Oftentimes, people can relate to a relational world where they have transaction logs, but on the mainframe the journaling process fills that role: the transaction log is the journal. You have the iSeries and the z systems, which have different facets that you need to consider.
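The merge-back step mentioned here, folding a batch of streamed changes into a bulk-loaded baseline, can be sketched as below. Resolving conflicts by a log sequence number is my illustrative assumption, not necessarily how Replicate or MapR-DB resolves them:

```python
def merge_latest(baseline, changes):
    """Fold an unordered batch of change rows into a bulk-loaded baseline;
    when a key changed more than once in the batch, the row with the
    highest log sequence number (lsn) wins."""
    winners = {}
    for c in changes:
        prev = winners.get(c["key"])
        if prev is None or c["lsn"] > prev["lsn"]:
            winners[c["key"]] = c

    merged = dict(baseline)          # leave the baseline image untouched
    for key, c in winners.items():
        if c["op"] == "delete":
            merged.pop(key, None)
        else:                        # insert or update carries a row image
            merged[key] = c["row"]
    return merged
```

Picking a winner per key first, then applying, is what lets the bulk load and the stream run in parallel: the batch can arrive out of order and the merge still converges on the latest state.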

Jordan Martz: Further, DB2 has some history with VCM and IMS and we handle for all those in the historical mainframes themselves. In this use case is DB2 I Series, but there's lots of different forms of how to connect and how to make sure that impact on the mainframe is as light as possible. As you're looking at that light as possible impact, look at the different opportunities of how efficiently you can extract and get that latency to a near real time scenario and that gets you to that use case we talked about because a lot of historical key product bombs build the materials, all the different key things that you need to do to drive that becomes part of the overall sharing of real time information.

Jordan Martz: Lastly, before we get to questions: we're taking it into the data lake, and as they've been sourcing the different DB2 data and running Sqoop, what we do is ask Gino. We came back to this diagram.

Gino Kelmenson: Right.

Jordan Martz: Do you have any more comments on this?

Gino Kelmenson: Yeah. Going back to the comment I made: it's really important to think through your overall architecture. Even though big data has been around for a while, it may still be new for some companies, so they might not have staff with the experience to implement such a large solution. Engage professional services, because putting the right architecture in place matters. As you can see here, you have very defined layers within your data lake; you don't want a swamp, and you want to make sure you follow best practices. You need to build ETL pipelines and ingestion solutions that can meet the demands of constantly growing data, because on the right side you have a lot of business solutions that now depend on this big data platform.

Gino Kelmenson: If you do not architect this with the right approach, you will have a lot of issues. As I mentioned, this is the age of the data-driven company. In just the last couple of years, as we've gone down this road of using big data across the enterprise, we've already built up a lot of dependencies: what we build in house for our advanced analytics team, our omnichannel solutions, and business intelligence. We have automated systems using the data, and we have a lot of users who depend on this environment being up and running every day in order to perform their daily tasks.

Gino Kelmenson: To do that, select the right distribution, select the right ingestion tools, get your staff up to date, and make sure your staff embraces the new technology. We were very successful here: all the internal staff embraced it, and we were able to convert them very quickly to start using MapR, for example. They went from being SQL Server application developers to big data developers. There is plenty of training available online, or you can engage with the distribution of your choice; they also offer training. It's very complex, but when you do it the right way, you gain a lot of business value. That's what we tried to outline here, and that's how PCH was able to deliver this very complex but at the same time very streamlined architecture. I'll pass it back now, and we'll get to the questions if anyone has any.

David: Great. Thank you Gino. Thank you Jordan and Ronak. We do have a few questions that have come in before we get to those, just a reminder, if you have any questions, please submit them in the chat box in the lower left hand corner of your browser. All right. Jordan, let's start with you. Can you tell us more about continuous data capture?

Jordan Martz: Sure. Continuous data capture, to make it really generic: you're clicking on a website, trying to submit something, and you multiply that impact by 100 people all trying to run the same transaction. That causes a lot of contention, a lot of problems for the operational side of putting data in. I call it pushing and pulling, right? If you separate that latency out and break it into a couple of moving pieces, you'd use a tool like a CDC production tool, which does not impact the operation of pushing data in; and when you're pulling data out, you want to do it with the lightest footprint possible. That's what CDC does at a basic level. But when you're looking at a mainframe, you've got to be able to handle a lot of distributed nodes.

Jordan Martz: There are multiple nodes inside the mainframe, just as there are in a MapR distribution, and when you're moving across those partitions, which is one of the ways the system is divided up, you want that operation handled well. The journal is the mainframe's analog of the transaction log, and it governs the way that data is collected. You want to make sure that when you're collecting that data, it's being brought together efficiently, and that the data, and the way it's defined, comes through in a very seamless process.

Jordan Martz: When I look at CDC on a mainframe, it's more complex than it looks from the outside, and the ability to make it efficient and well understood comes back to our history. We've been around about 22 years; for the first 10 we were helping people migrate mainframe technologies, so we've been doing this a long time, and that's one of the key things.
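As a rough illustration of the CDC concept Jordan describes, the sketch below shows a log-based capture loop that reads new entries from a transaction journal and forwards them as change events, never touching the source tables. All names and the journal shape here are invented for illustration; this is not Attunity's API.

```python
import json

def read_journal(journal, last_position):
    """Return journal entries appended after last_position.

    Stands in for reading a DB2 journal or a relational transaction log;
    a real CDC tool reads the log directly rather than querying tables,
    which is what keeps the footprint on the source light.
    """
    return journal[last_position:], len(journal)

def capture_changes(journal, last_position, emit):
    """One polling cycle: forward each new change event downstream."""
    entries, new_position = read_journal(journal, last_position)
    for entry in entries:
        emit(json.dumps(entry))  # e.g. publish to a streams topic
    return new_position

# Simulated journal: the source system appends; CDC only ever reads.
journal = [
    {"op": "INSERT", "table": "ENTRIES", "key": 101},
    {"op": "UPDATE", "table": "ENTRIES", "key": 101},
]
events = []
pos = capture_changes(journal, 0, events.append)
journal.append({"op": "DELETE", "table": "ENTRIES", "key": 101})
pos = capture_changes(journal, pos, events.append)  # picks up only the delete
```

The point of the position bookmark is the decoupling Jordan mentions: the operational system keeps pushing data in, while extraction pulls from the log at its own pace.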

David: Great. Thanks Jordan.

Jordan Martz: Okay, go ahead. Yes.

David: Let's see. Okay. I'm going to throw this out to the whole team, though I think it's probably a Gino question. Can you discuss the data comparison and reconciliation between DB2 and Hadoop in real time? When you said golden state, is that the source raw or the common model?

Gino Kelmenson: Okay. It's actually between all of the layers, right? Let's talk about how we reconcile, close to real time. When we say real time, I want to emphasize that it has a different meaning for different teams and different implementations. From our perspective, real time means that at any point we can get a data profile from the source system. There are very different ways you can implement this. You can have a process, invoked from your Hadoop data lake, which runs a stored procedure on your DB2 system to produce some kind of profile information, and then compares that profile against your Hadoop data lake. If you're managing your ETL, at any point in the ETL you can invoke this profiling function, and the function gives you a pass or fail. Different test cases have different thresholds, depending on how your data governance or data quality teams set them up. If you pass the thresholds, you move along.

Ronak Chokshi: Not only that, we have different strategies for how we deal with the data. We categorize the data, classifying it as simple, moderate, or complex, and based on that we decide what needs to be applied: a full refresh or full reconciliation of the data, a merge, or a partition refresh. We follow different strategies; the same strategy cannot be applied to every data set. Some of it is micro-batch and some is near real time, depending on the business request, and we publish the results to the final layer.

Gino Kelmenson: I think there was also a question about how we compare the data as it moves between the different layers. Again, it's the same approach. You have different automated scripts that validate that the data types are mapped correctly, that the content is correct, that record counts match, and that your default values are set properly. We could have a separate discussion about the different types of data quality controls, but the objective is that as the data moves between layers, you constantly handshake between each layer, and it passes your automated quality controls.
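A minimal sketch of the profile-and-compare handshake Gino describes might look like the following. The profile fields, tolerance values, and sample rows are illustrative assumptions, not PCH's actual checks.

```python
def profile(rows):
    """Build a simple data profile: row count plus per-column null counts."""
    nulls = {}
    for row in rows:
        for col, val in row.items():
            if val is None:
                nulls[col] = nulls.get(col, 0) + 1
    return {"row_count": len(rows), "nulls": nulls}

def reconcile(source_profile, lake_profile, tolerance=0.0):
    """Pass/fail handshake between two layers, with a per-test threshold."""
    src = source_profile["row_count"]
    lake = lake_profile["row_count"]
    if src == 0:
        return lake == 0
    drift = abs(src - lake) / src
    return drift <= tolerance

# Source profile (e.g. produced by a DB2 stored procedure) vs. the lake layer.
db2_rows = [{"id": 1, "email": "a@x"}, {"id": 2, "email": None}]
lake_rows = [{"id": 1, "email": "a@x"}, {"id": 2, "email": None}]
ok = reconcile(profile(db2_rows), profile(lake_rows))  # counts match, so True
```

The same `reconcile` call can be invoked at each ETL stage, which is the layer-to-layer handshake the answer refers to; the threshold is what the data quality team would tune per test case.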

David: Okay, great. We have a lot of questions here, so I'm going to go through them pretty quickly. Can you please elaborate on the thought process behind putting TIBCO in a previous slide, in the left-hand diagram? Was that you, Ronak?

Ronak Chokshi: Yes, that was me. TIBCO is an infrastructure software company, right? They have a real-time communications aspect, and they have some really good middle-layer products and so on. But when it comes to things such as data governance, converging data from silos, and being able to do machine learning and in-place analytics, it is our opinion and belief that they fall short and wouldn't be superior in those areas, which are unique to our platform. That's what I meant.

David: Okay. Thank you Ronak. Two part question from the same person. On which infrastructure is the MapR cluster running for PCH? Is it cloud? Which cloud service? Then the second part is, what were the decision factors for the MapR underlying infrastructure services choice?

Gino Kelmenson: If you don't mind, I'll skip the first question; I want to be careful about disclosing certain things, such as where it's hosted and on what infrastructure. As to why we decided to go with the MapR distribution, as I mentioned on a prior slide, it's the enterprise features. At the core of it is the file system that MapR provides, with its NFS access. It's different from your typical Hadoop distributions, and by having their own file system they were able to offer many very powerful enterprise features, which were really the driver behind our decision.

Gino Kelmenson: Again: the stability, the high availability, the ability to mount the file system from your external systems, the ability to ingest data very easily. As I said, point-in-time recovery snapshots are just an amazing feature, along with the disaster recovery components. I could go on and on, but I believe each company, as they go down the path of selecting a distribution, should do a very strong evaluation. For us, what I just mentioned were the key attributes of our selection.

David: Great. Thanks Gino. Let's see. Jordan, this one is for you: does Attunity have a data quality or data governance component? If it does, can you walk us through one use case?

Jordan Martz: Sure. There are different tools in the market for data governance and data catalogs, whether in the cloud or on premises, and there are different partners that MapR works with in this space. For example, we collaborated with one of our customers to integrate our metadata into their governance layer. Our data governance capability lives inside our Enterprise Manager, which gives you collections of record counts and similar statistics. That's a very basic level of what activity is going on; but when you're looking at lineage, and at data governance in the sense of mastering the data, we've worked with many other partners in this space.

Jordan Martz: There are different tool vendors as well. In the particular use case you asked me to walk through, we loaded into a partner ETL tool and its data catalog. We integrate with multiple data governance tools you'll see on the market today, around three different ones, all connected via the REST API coming out of our Enterprise Manager. That way you could use our tool for what it is, and the DBAs loved being able to understand what's happening at the source and across the ecosystem, while also loading into the partner ETL and other tools in the market. We were able to do all of that via a very easy-to-use REST API; it literally just exposes the data as JSON, not even that big of a deal. Hopefully that answers the question, but if you have more, we can continue.
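To illustrate the kind of REST integration Jordan describes, here is a hedged sketch that flattens a task-metadata JSON payload into per-table record counts, the sort of summary a downstream governance tool might ingest. The payload shape and all field names are invented for illustration; the actual Enterprise Manager API will differ, so consult its documentation.

```python
import json

def summarize_task_metadata(payload):
    """Flatten a hypothetical task-metadata payload into per-table
    record counts, a basic lineage-style summary for a governance tool."""
    summary = {}
    for task in payload.get("tasks", []):
        for table in task.get("tables", []):
            name = table["name"]
            summary[name] = summary.get(name, 0) + table.get("records_loaded", 0)
    return summary

# Example response body, as if fetched from a metadata REST endpoint.
raw = json.dumps({
    "tasks": [
        {"name": "db2_to_lake", "tables": [
            {"name": "ENTRIES", "records_loaded": 12000},
            {"name": "ACCOUNTS", "records_loaded": 3400},
        ]},
    ]
})
summary = summarize_task_metadata(json.loads(raw))
```

Because the interchange is plain JSON over REST, each governance tool can apply its own parsing like this without any vendor-specific driver, which is the "not that big of a deal" point in the answer.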

Gino Kelmenson: All right, if you guys don't mind, I'll add at least one more item from PCH's perspective. You cannot call Attunity a full-blown data governance solution, right? As Jordan said, it has a lot of integrations with other data governance solutions. But data governance means different things at different companies, and just having a tool where, through one simple interface, you can see all of the objects you're bringing from the source system in a very simple way, how many records you brought, which columns and attributes are included, with the ability to filter and exclude things you do not need, is, I think, part of the definition of data governance. You want your data governance person to be able to just log in through an interface and see everything coming through the pipeline. I just wanted to add that piece.

David: Great. Thanks Gino. Jordan, another one for you. How does Attunity connect to various clouds and puts the data into S3, Google, Azure storage layers?

Jordan Martz: Well, I have a whole slide for that, and if you want to contact us directly, I can walk you through it. What's really paramount, and interesting, is the way we load into Blob storage and S3, and how we handle the in-memory file there. We're closely partnered with Amazon and Microsoft on their cloud migrations, and this also helps with loading into MapR, because of the way we load the cache into S3; we're the only one that's truly an in-memory file in some of those use cases. I think that gives us tremendous capability when you're looking at how you're going to pull into a cloud use case.

David: Okay. Let's see. Another one that came in: what products are used as the storage layer? The MapR NFS file system is read-write, so it offers a tremendous advantage, but is that what's being used?

Gino Kelmenson: Yes. Is that a question for PCH, or for MapR or Attunity?

David: For you Gino.

Jordan Martz: I think that's for PCH. Gino?

Gino Kelmenson: Yes, we are using the MapR file system.

David: Okay. Let's see. Gino, here's another one for you. Where do you store your business metadata such that your common model and interpretive model is known?

Gino Kelmenson: It's stored in the Hive metastore; we're using Hive as the data processing engine.

David: Okay. Guys, I think that's all the time we have today. I know there are a number of additional questions; if we did not get to your question, we will answer it and get back to you offline via email. I want to thank all the presenters today, Gino, Jordan, and Ronak, for putting together the slides and taking the time to present. I think this was a very helpful presentation.

David: Thank you everybody for joining online. Watch your inbox for an email in the coming days with a link to the recording and for additional resources on similar topics to this, visit Thank you again and have a great rest of your day.