Machine learning (ML) and artificial intelligence (AI) enable intelligent processes that can autonomously make decisions in real-time. The real challenge for effective ML and AI is getting all relevant data to a converged data platform in real-time, where it can be processed using modern technologies and integrated into any downstream systems.
Running a business in real time means being able to react to important business events as they happen. The applications that support day-to-day operations, however, are often scattered across the organization, making it difficult to enable real-time movement of data.
In this session, MapR and StreamSets will discuss how change data capture (CDC) can be used to enable real-time workloads to drive success with ML and AI. You’ll see live demonstrations of technologies that enable CDC, and specifically learn how to:
David: Hello, and thank you for joining us today for our webcast, Enabling Real-Time Business with Change Data Capture, featuring StreamSets and MapR. Our speakers today will be Rupal Shah, solution engineer from StreamSets, and Audrey Egan, who is an engineer for MapR Technologies. Our presentation today will run approximately one hour, with the last 15 minutes of that hour dedicated to addressing any questions. You can submit a question at any time throughout the presentation via the chat box in the lower left-hand corner of your browser.
David: With that I'd like to pass the ball over to Rupal to get us started. Rupal, it's all yours.
Rupal Shah: Thank you. Thank you, David. Hi everyone, very nice to be chatting with you today. My name is Rupal Shah, a solution architect here at StreamSets. I've been at the company for over a year and a half, but I'm very familiar with the space of the [inaudible 00:00:59], looking through StreamSets' capabilities alongside the MapR Converged Platform. And I'll show the new capabilities that are coming up with the CDC functionality. Here's Audrey.
Audrey Egan: Thank you, Rupal. My name is Audrey Egan, and I'm a solutions engineer at MapR. I'm going to start our conversation today and lead into Rupal's demonstration of MapR Change Data Capture with StreamSets.
Audrey Egan: So to start off, some of the topics we're going over: we'll talk about what it means to have a real-time business application and what that looks like with MapR 6.0, utilizing the change data capture feature that we'll demonstrate today. Rupal will go into how StreamSets is enabling that functionality, and then we'll wrap up with questions.
Audrey Egan: So currently, this is a common picture of the complexity that we see today. Typically, multiple operational databases or sources of information are coming in via cloud, via on-premises [inaudible] systems, and they're all converging into some kind of ETL structure and then outputting to an analytics layer to provide more meaningful information for end business users. And typically this happens in a batch fashion, and that was okay, but business users and customers are expecting more real-time access to information. They want graphs, they want charts, they wanna know about sales when they happen, for example, as opposed to the next day.
Audrey Egan: You'll see some of the challenges with traditional warehouses like we just saw. One is definitely increasing costs. These can be costs based on licensing, costs based on proprietary hardware; it could be the inability to scale out, being limited to vertical scaling as opposed to horizontal scaling; or hardware budgeted for a future capacity rather than current needs.
Audrey Egan: It could also just be a lot of unused data. It may be too expensive to move the data, so you're not making use of all the data sources that you have today. So there's a lot of unused monetary potential in your current data sets. And there are other issues as well: capacity, different data sources, and different data types.
Audrey Egan: So switching to the way MapR sees a real-time analytics application taking place: we can start outside the cluster with an operational database like Oracle, then invoke change data capture, pushing that into the MapR platform, leveraging, say, MapR Database CDC to control for real-time updates, and then allowing that to go to your business application to influence real-time decisions.
Audrey Egan: So what MapR offers in that new world is MapR Database. You're listening to this presentation at a pretty exciting time within MapR; we continue to drive innovation at the platform level, and MapR Database is an example of that. The multi-model database is a major component within the Converged Data Platform. You can see, in addition to the cloud-scale data store that we have, the real file system structure, we implement the database component on top as well. There is no additional process to manage; it leverages the same architecture as the rest of the platform, with very minimal additional overhead, but the bottom line is you receive all the benefits of what the platform has to offer.
Audrey Egan: For example, high availability, unified security, disaster recovery, a global namespace. These are all benefits of the platform at the file system level that extend to the database and to global streaming as well.
Audrey Egan: So just a little bit about what MapR Database is, and what MapR Database 6.0 has to offer you. Like I said, it's a distributed database, so when we talk about scale, MapR Database allows for horizontal scale: add a node, add capacity. MapR Database is a JSON and binary database, so it allows you to store data that's nested and that can evolve over time. You can read and write to individual document fields without having to update the entire document, for example. Or you can build Java applications with the MapR Database JSON API library. We offer many different ways to access data within MapR. For example, in this diagram we have OJAI APIs, HBase APIs, the JDBC/ODBC drivers; there's different Hadoop and Spark connectivity as well that we'll go into; and CDC is another way to access data within MapR Database.
Audrey Egan: One of the functionalities that we're proud to announce with MapR Database 6.0 is the introduction of secondary indexes. Secondary index functionality is built into MapR Database JSON. It's a very flexible and efficient way to query on many fields. As it stands today, documents in MapR Database have some form of an underscore ID field, and if you query on this field, it's easy enough to retrieve data; however, many queries search for documents based on non-primary fields. This is where secondary indexes come in, allowing you to create indexes on those fields.
Audrey Egan: And in addition to those indexes, you can have covering indexes for more efficient queries. You can see on the right-hand side an example of a query that would leverage a secondary index. On the left, the primary table has the underscore ID and a couple of extra columns, and on the right you can see that we've created an index based on the age and on the state, with an included, covered field labeled activity. That would be a covering index. And in the example query at the bottom, you can see we'll leverage those indexed fields based on operators like WHERE, AND, and OR.
Audrey Egan: Some of the other capabilities are composite indexes, different data types, hashed and non-hashed; those are all different indexing functions that you can leverage based on your queries, based on how you want to access that data. All these indexes update themselves as well, so if you change something within the table, the index will automatically, asynchronously be updated in the index table. So that's something you don't have to manage.
Audrey Egan: You can also see, based on this example index, that you can index on any column family, and indexes can be created with most scalar types, like numerics, dates, and timestamps. At this time arrays are limited, but maybe in the future.
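To make the secondary-index and covering-index ideas above concrete, here is a minimal Python sketch. It is purely illustrative (plain dicts, not MapR Database's implementation or the OJAI API): the index maps a composite (age, state) key to entries that also carry the covered "activity" field, so the example query never has to touch the primary table.

```python
# Minimal sketch of a covering secondary index. Illustrative only;
# not MapR Database's implementation.

# Primary table: documents keyed by _id.
table = {
    "u1": {"_id": "u1", "age": 34, "state": "CA", "activity": "running"},
    "u2": {"_id": "u2", "age": 29, "state": "NY", "activity": "cycling"},
    "u3": {"_id": "u3", "age": 34, "state": "CA", "activity": "swimming"},
}

# Secondary index on (age, state), with "activity" as a covered field.
index = {}
for _id, doc in table.items():
    key = (doc["age"], doc["state"])
    index.setdefault(key, []).append({"_id": _id, "activity": doc["activity"]})

# A query like: SELECT activity FROM t WHERE age = 34 AND state = 'CA'
# is answered entirely from the index entries (a "covering" index),
# instead of scanning every document in the primary table.
hits = index.get((34, "CA"), [])
activities = sorted(entry["activity"] for entry in hits)
print(activities)  # ['running', 'swimming']
```

The point of the covered field is visible in the last step: the answer comes straight from the index entries, with no lookup back into `table`.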
Audrey Egan: So I'll [inaudible 00:09:49] a little bit about the Spark connectivity as another way to access data within MapR. At the top you can see that Spark is the transformation and processing engine that's kind of become the new standard when you're looking at development, for accessing features, and you can use Spark with MapR as well. What we mean by that is we have binary connectors and JSON connectors to MapR Database, so it doesn't have to be just MapR Database JSON. These connectors allow you all the normal integration points with Spark, but are very helpful for moving data or writing applications that make use of data within MapR, or between MapR and other sources.
Audrey Egan: This could be from MapR Streams (now called MapR Event Store) to MapR Database. It could be from MapR Database JSON to another MapR Database JSON as well. It also allows you to create programs with the OJAI connector and MapR Database JSON, leveraging Scala.
Audrey Egan: That brings us to what the MapR change data capture functionality is. All these capabilities we've been addressing, the secondary indexes, the Spark connectivity, are access points, new ways to access data within MapR Database. Change data capture is one of those: another way to get insight into your data. It allows you to consume all changes that happen within MapR Database tables. The way it's structured is that we've used our own technology, MapR Streams, to catch, so to speak, the changes that happen in MapR Database. So MapR Database writes the changes, and is the only one that can write to that MapR Streams component.
Audrey Egan: And because this is a MapR stream, it has all the available benefits of a MapR stream within the MapR platform. You have security, you have replication, and you have the ability to push this information to different consumers. Maybe it's a remote MapR Database table, maybe it's another stream processing engine, maybe it's Elasticsearch. And we'll see in the demonstration today how that plays out.
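The CDC flow just described can be sketched in a few lines of Python. This is an illustration of the pattern only, with hypothetical names and an in-memory queue standing in for the MapR Streams change-log topic; it is not a MapR API.

```python
# Illustrative sketch of the CDC pattern: table mutations emit change
# events onto a stream, and a consumer replays them into a downstream
# copy (a remote table, Elasticsearch, ...). Names are hypothetical.
from collections import deque

stream = deque()     # stands in for the change-log topic
source_table = {}    # the table being changed

def put(_id, doc):
    source_table[_id] = doc
    stream.append({"op": "insert", "_id": _id, "doc": doc})

def delete(_id):
    source_table.pop(_id, None)
    stream.append({"op": "delete", "_id": _id})

# A consumer replays the change events in order.
replica = {}
def consume():
    while stream:
        ev = stream.popleft()
        if ev["op"] == "insert":
            replica[ev["_id"]] = ev["doc"]
        elif ev["op"] == "delete":
            replica.pop(ev["_id"], None)

put("15", {"amount": 150})
put("16", {"amount": 20})
delete("16")
consume()
print(replica)  # {'15': {'amount': 150}}
```

Because the events are ordered and self-describing, any number of consumers can read the same stream and each rebuild its own view of the table.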
Audrey Egan: Lastly, the access point: Drill for MapR can connect not only to MapR-FS, as you can see on the upper right-hand side, but also to MapR Database. So this is a single SQL query engine that can access those files and tables. Based on your MapR cluster, it's a distributed, in-memory query engine, meaning you can run the Drill bit on all nodes in your cluster, making use of that horizontal scale. It has optimizations for leveraging the secondary indexes, and also the tablet distribution, to avoid hotspotting on input/output.
Rupal Shah: Great. Alright, thank you, Audrey. Let's start from the top with how StreamSets fits into this world of the MapR Converged Platform, alongside the new functionality for MapR Database CDC. StreamSets, as you can see, is basically a continuous ingestion platform that does both continuous movement of data as well as data operations for the platform itself. In the traditional ETL world, so to speak, it's common to have data sets or data movement flows catering to very rigid structures and schemas, as well as specific infrastructure, suited to more traditional data movement.
Rupal Shah: In the big data world, this is not exactly the same. You get data coming from every different possible kind of origin, whether it's simple files and tables, but also log devices, IoT-type sensor devices, different types of streams, network data, and so on. So with traditional ETL tools it's really challenging to keep up with the different structures that come into play, alongside the different infrastructure as well.
Rupal Shah: So where StreamSets fits into this world is that we enable data movement in the big data ecosystem such that a user doesn't need to worry about the exact structure, the exact semantics, or the infrastructure when building a specific data flow. The way StreamSets works, there's a concept we refer to as Data Drift, wherein you can design a data flow without having to know the schema. For example, if you want to read data from a file, whether it's simple CSV data or JSON data or Avro data, StreamSets can automatically parse that data set and push it directly into the MapR file system, or the JSON tables, and so on.
Rupal Shah: So it's very flexible in that you can create any type of flow; we cater to changing semantics and changing schemas, such that if there is a change in the incoming data set, an additional field is added, or a data type is changed, StreamSets can automatically handle that. It can also notify a user when those different types of drift occur.
Rupal Shah: There are two different products in the market that StreamSets provides today. One is the Data Collector, which we see on the left-hand side. This is the open source tool for building data ingestion pipelines, as well as processing data within the platform. This is primarily used by, let's say, a data engineer, data scientists, even business analysts who want to get their hands on data out of some system faster to help them develop their reports.
Rupal Shah: On the right-hand side is the Dataflow Performance Manager. This is the enterprise version of the product, which basically gives you a control center for the multiple pipelines and multiple data collectors that you have in different environments. It basically acts as your air traffic controller for managing deployments across different environments for the different pipelines. It gives you versioning, HA capabilities for pipeline failovers, and quite a few more capabilities that, again, I'll show you within the demo itself.
Rupal Shah: Think of the Data Collector as the data plane, where data movement primarily happens. Then, moving towards the right side, from the control plane you get the governance and lineage information on how the data's moving, not just point-to-point but in its entirety, for the entire application.
Rupal Shah: Now, following up on what Audrey showed in terms of the different sources and the different capabilities for moving data into the MapR Converged Platform: you could pull data out, whether it's files, relational data via Sqoop, or APIs coming through, Kafka APIs for example. All of these could be done with external scripting, maybe Sqoop jobs and so on, or by pushing data directly onto NFS, which the MapR file system can read off of.
Rupal Shah: StreamSets also caters to enabling this data movement automatically, without having to write any code. There are connectors that can read data in a continuous fashion, whether it's coming from different directories and files or from relational data, and there are various capabilities for reading out of relational data while keeping continuous data movement in check with the MapR Converged Platform. So it fits really nicely for getting data into and out of the MapR Converged Platform, and if there is data already on the platform itself, StreamSets can also read from it, do simple transformations and aggregations, and push data back, whether to other ecosystems within the platform or outside the platform. It can handle both.
Rupal Shah: For change data capture, assuming you have different relational databases that are enabled for change data capture, StreamSets caters to reading from those types of relational sources, with its built-in change data capture functionality, into the Converged Platform. It can push data directly to MapR Streams, to MapR Database, both binary and JSON, or to the MapR file system. It also caters to working with Spark Streaming and Spark out of the box within the platform, and allows moving data into any type of external component.
Rupal Shah: So for example, into Elasticsearch, into Hive, or just sitting on the MapR Converged Platform so that you can query data directly via Drill. This use of, let's say, MapR Streams works really well with a microservices architecture, wherein you may want to decouple different endpoints or different movements. Say from CDC you push data into MapR Streams, and then from MapR Streams you may wanna write a Spark job that does different things for different consuming endpoints. StreamSets and MapR work together nicely there in enabling different versions, catering to different types of consuming endpoints, for all these microservices.
Rupal Shah: Alright, so I want to show you a demo which is gonna walk through the end-to-end flow for change data capture and ... let me share the screen.
Rupal Shah: So the scenario we're looking at here is going to be two-fold. One is getting data out from external relational systems via CDC, change data capture, and pushing data into MapR-ES, which is MapR Streams on the MapR Converged Platform. From there we will see how that persists automatically into the MapR JSON tables which are enabled for CDC. The next piece will be processing the data: reading the data out of the CDC consuming endpoint, seeing what type of change data we're getting out of the CDC consumer, and what we can do with the data that's coming out of it.
Rupal Shah: Let's take the example of an Oracle database which is enabled for change data capture. As actions happen and are committed on the database, we will leverage the audit logs from Oracle itself and push them downstream into the MapR JSON table.
Rupal Shah: On the left side, I have my connection made directly to the Oracle database, and here ... let me start it up again ... give me one second, I'm just gonna log in again with ...
Rupal Shah: So on the left side, I've connected to the Oracle database, and here let's start putting in some transaction data. Let's do an insert into this table; for example, let's do 15. I'll put in a credit card number and a transaction that is made for a specific amount, and insert that. Let me redo the credit card number. Okay, a small mistake.
Rupal Shah: Alright, so one row has been inserted. It's not gonna be captured by the log unless a commit has been made. Now this data set has been pushed into the CC transaction attempt table, and on the right side, we're querying the MapR Database JSON table to see the same transaction being captured on the MapR side. Let's also commit that. Give me one second.
Rupal Shah: I'm connecting to Drill, which is what we're gonna use to query that JSON table, and ... for the row that we just entered in Oracle, we'll query the same table on the Drill side. Here the ID is 15; do a select star from the table that captured it, and look at what has been inserted. You can see the ID was captured as the same user ID that we added in the Oracle table, the amount, the card number, which we've masked as we process the data, and I'll show you how we do that in StreamSets. And you can see all the different fields that have been captured by the JSON table.
Rupal Shah: Let's go ahead and add an update for the same record, so we can do an update ... on the table. Let's change the amount in this case to 1,500, for example, for the same record we just inserted. We'll commit that and repeat the same query on the JSON table, and you can see the transaction amount has been updated to 1,500 for the same record, for user ID 15. So what's happening here is that within StreamSets there is a pipeline that is basically taking in data from the Oracle CDC LogMiner, processing that data depending on what type of transaction is coming through, and then pushing it through MapR Streams. And there is another pipeline in StreamSets which is pulling the data out of MapR Streams, for the same CDC data, and pushing it into the JSON table.
Rupal Shah: Let's double-click on each of these specific icons to see what they look like. So I have the Data Collector open. What you're seeing here on this page is the Dataflow Performance Manager, which gives you the entire topology for how data is moving in the entire application.
Rupal Shah: The actual movement, represented by these rectangles that you're seeing, is being done in the Data Collector. So I'm just gonna jump into the Data Collector where the pipeline is running. For the first phase, which is Oracle CDC into the MapR file system, I'm getting data from the origin, the Oracle CDC Client; it gets data from the tables configured there, and you can see from the origin side that I'm taking in data from both tables.
Rupal Shah: This example is from the first table, but you can have multiple tables captured by StreamSets for all the different transaction types as well. And then you saw the credit card being masked, so I've set up masking of basically all the digits but the last four. But before I do that, in the output for the JSON table you saw that I'm also getting the card type, and that can be done in StreamSets as well. So I've put in a very simple script that understands the credit card number: if it starts with a four, I'm allocating Visa as the credit card type. So it's a new field that I'm adding as part of the payload itself.
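The two processing steps just described (mask all but the last four digits, derive a card-type field) can be sketched as plain Python. The "starts with 4, so Visa" rule is the one stated in the demo; everything else here, including the field names, is illustrative rather than the actual StreamSets script.

```python
# Sketch of the demo's record processing: mask the card number and
# add a derived card-type field to the payload. Field names are
# hypothetical; only the "4 => Visa" rule comes from the demo.

def mask_card(number: str) -> str:
    """Replace every digit except the last four."""
    return "x" * (len(number) - 4) + number[-4:]

def card_type(number: str) -> str:
    if number.startswith("4"):
        return "Visa"
    return "Unknown"  # the demo only showed the Visa rule

record = {"id": 15, "card_number": "4111111111111111", "amount": 150}
record["card_number_masked"] = mask_card(record["card_number"])
record["card_type"] = card_type(record["card_number"])  # new payload field
del record["card_number"]  # never forward the raw number downstream

print(record["card_number_masked"])  # xxxxxxxxxxxx1111
print(record["card_type"])           # Visa
```

Dropping the raw number before the record leaves the pipeline is the point of doing the masking in-flight rather than in the destination.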
Rupal Shah: And because we're reading data from Oracle CDC, we also capture all the metadata information. So if it's an insert operation, for example, we'll capture that as part of our record metadata. If it's coming from a specific table, we'll also capture that as part of the metadata. So it gives you flexibility in that you could push this to different topics on MapR Streams without having to know what the specific table is.
Rupal Shah: We could create specific topics for different table names in MapR Streams as well, and it can automatically be handled with a single pipeline. So this first pipeline takes care of taking all of your different operations and pushing them through MapR Streams. The second part is done by the second pipeline, which takes data from the same MapR Streams topic and does simple conversions. I'm getting the ID so that I can push it into the JSON table, but I'm also getting the table name from which it is coming, so that I can plug it in for the JSON table as well.
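The table-name-driven topic routing described above can be sketched as follows. The record shape and the topic naming convention are made up for illustration; the point is only that the table name captured in the record metadata decides the destination topic, so one pipeline serves every table.

```python
# Sketch of routing CDC records to per-table topics using the table
# name from the record metadata. Record shape and topic names are
# hypothetical, not StreamSets' actual header attributes.

records = [
    {"header": {"table": "CC_TXN_US", "op": "INSERT"}, "body": {"id": 15}},
    {"header": {"table": "CC_TXN_EU", "op": "UPDATE"}, "body": {"id": 7}},
    {"header": {"table": "CC_TXN_US", "op": "UPDATE"}, "body": {"id": 15}},
]

topics = {}
for rec in records:
    # One topic per source table, derived from the metadata at runtime,
    # so the pipeline never hard-codes a table name.
    topic = "cdc." + rec["header"]["table"].lower()
    topics.setdefault(topic, []).append(rec["body"])

print(sorted(topics))  # ['cdc.cc_txn_eu', 'cdc.cc_txn_us']
```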
Rupal Shah: So here, I'll show you the configuration. It's just simple information, like where that table name resides. If that table name changes, StreamSets can also automatically create a table for you on the MapR Converged Platform. So it's very flexible in that you don't have to create these tables up front.
Rupal Shah: It also caters to all the [inaudible 00:29:31] operations: create, replace, update, delete, and [inaudible 00:29:36] operations. When they happen on the Oracle side, they will automatically be applied or catered to on the JSON side. You saw that insert and update are taken care of; we'll also show you how it works with delete.
Rupal Shah: Now for the third phase: now that we have data going into the JSON table, how does StreamSets work by taking data out of the CDC stream for MapR Database tables as well? With the new functionality alongside MapR 6.0, we're also gonna release in StreamSets a new capability where we can read data out of CDC for MapR Database. You'll see a new origin coming up where we can read data out of different tables, whether binary or JSON tables, which are enabled for change data capture. When they're enabled for change data capture, you will automatically get, similar to MapR Streams, a topic which is basically capturing all the change events coming in from all those different tables.
Rupal Shah: When they're enabled, you can just plug in the different tables and the different topics which are giving you that change log stream, and then again perform the same operations you want and plug them into the different endpoints you have, the different destinations that you care about.
Rupal Shah: So what I'm doing here in this pipeline is taking in the same change data stream coming from MapR Database, from the tables which are, again, enabled for change data capture. I've also enabled the same table that you're seeing here, CC Transaction Attempt US, for change logs. So it's enabled there. As and when these transactions happen on the CDC side (we did an insert, we did an update), this specific pipeline is taking in that data set and pushing it into Elastic. Elastic is doing the same thing: inserting and updating the same data set in the document for that ID. At the same time, I'm also pushing it into an audit table as JSON in my MapR file system.
Rupal Shah: So the table on this side is gonna overwrite with the different transactions that are happening, but I also wanna keep track of all the different changes which are happening, without overwriting. So I'm creating a separate destination to cater for that.
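The two destinations just described follow a common CDC pattern that a short sketch makes explicit: one store keeps only the latest state per ID (upsert/overwrite, like the Elasticsearch destination), while a second store appends every change event so the full history survives (the audit table). The code is illustrative only.

```python
# Sketch of the dual-destination pattern: latest-state table plus an
# append-only audit log fed from the same change stream. Illustrative;
# not StreamSets or MapR code.

current = {}  # latest state per _id (the upsert destination)
audit = []    # append-only change log (the audit destination)

def apply_change(op, _id, doc=None):
    audit.append({"op": op, "_id": _id, "doc": doc})  # history, never overwritten
    if op in ("insert", "update"):
        current[_id] = doc                            # overwrite latest state
    elif op == "delete":
        current.pop(_id, None)

apply_change("insert", "15", {"amount": 150})
apply_change("update", "15", {"amount": 1500})

print(current["15"]["amount"])  # 1500 (only the latest value survives)
print(len(audit))               # 2    (full history retained)
```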
Rupal Shah: So if I go back to Drill, I do have that other table which is capturing the audits, so let's do a quick select on that. The audit table is called change log audit. Let's see what happened with the ID 15, which we inserted and then updated, as a log.
Rupal Shah: I've done previous transactions before for 15, but you can see in the last few transactions there was an insert done for 15. First there was a delete done, but I inserted it again for 999, and then you can see there was an update transaction done as well for the ID, for 215. So you can keep a nice history of all the different audits from the change log stream for CDC as well.
Rupal Shah: And to show you how Elastic is also capturing this data set, let's jump back into Elastic; I'll get all the documents for this, for the ID 15 ... and the same Elasticsearch destination has captured the same operations which are being captured by CDC. In the Elastic configuration you can provide the default operation, which can be insert, update, delete, or index, which is the same as insert-or-update logic. So upsert logic as well.
Rupal Shah: Let's go back to ... Now, with StreamSets you saw the capability with the CDC origin for Oracle, but there's also functionality available for different types of databases. We provide CDC functionality for MySQL and for SQL Server, as well as for non-SQL databases like MongoDB; for Elasticsearch as well, you can get something similar to change tracking coming out of Elastic, and so on.
Rupal Shah: So you saw the functionality for MapR Database with CDC, and this can cater to both [inaudible 00:34:23] and binary. And then there's also new functionality: if you're getting data with Sqoop-related commands into the MapR file system or MapR Converged Platform today, there is a specific DSL that StreamSets has provided where you can leverage the same Sqoop import command. Using the DSL we've catered to, you run a StreamSets Sqoop import and provide the same parameters you provide to Sqoop, and what that'll do is automatically create a StreamSets Data Collector pipeline, and then you can just start running that pipeline in the Data Collector itself.
Rupal Shah: Now the benefits of doing that versus Sqoop are gonna be multifold. One is that, unlike Sqoop, with the Data Collector you can just keep that pipeline running continuously. And it will keep running, not just in a continuous fashion, but it can do automatic incremental pulls as well. What you get out of that is automated monitoring for the data set; as you can see, by default for all the different pipelines you get a plethora of different metrics as data is moving through. So you don't have to worry about scheduling the different Sqoop import jobs with different schedulers; StreamSets can take care of that automatically.
Rupal Shah: And the second biggest piece is the data drift awareness. If there are changes happening on the database table side, if somebody changes a column or data type or adds different fields, StreamSets automatically caters to that by pushing the additional column changes into the downstream systems, whether it's MapR Database, whether it's Hive, and so on. It's very flexible in that as well.
Rupal Shah: Typically with Sqoop you would have some scaffolding done around the import commands, whether you're scheduling them or putting in different control tables to see where the offsets were; all of this is taken care of automatically by StreamSets Data Collector. So do have a look at this; there is a link provided in the slide as well for more details.
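The offset-tracking idea behind those continuous incremental pulls can be sketched in a few lines. This is a generic illustration of the technique, with hypothetical column names, not StreamSets' internal mechanism: remember the highest offset seen, and on each pass fetch only rows beyond it, instead of rescheduling a full import.

```python
# Sketch of offset-based incremental pulls: each pass fetches only
# rows with an id greater than the last offset seen, the equivalent
# of "WHERE id > :offset". Table and column names are hypothetical.

rows = [  # stands in for a source table with a monotonically increasing id
    {"id": 1, "val": "a"}, {"id": 2, "val": "b"},
    {"id": 3, "val": "c"}, {"id": 4, "val": "d"},
]

last_offset = 0

def incremental_pull():
    global last_offset
    batch = [r for r in rows if r["id"] > last_offset]
    if batch:
        last_offset = max(r["id"] for r in batch)  # persist for the next pass
    return batch

first = incremental_pull()                 # initial pass pulls all 4 rows
rows.append({"id": 5, "val": "e"})         # a new row arrives at the source
second = incremental_pull()                # next pass pulls only the new row
print(len(first), len(second))  # 4 1
```

Keeping the offset inside the pipeline is what removes the external control tables and scheduler scaffolding mentioned above.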
Rupal Shah: So, in summary, based on what we showed you with the MapR platform and the StreamSets capabilities: on the MapR side we're introducing MapR Database as a multi-model database, particularly the JSON database. We introduced the secondary indexes and the ability to increase your query speed by an order of magnitude by leveraging them. And then we looked at change data capture and the ability to access data and create real-time applications based on the evolving source of truth from the stream.
Rupal Shah: Similarly, StreamSets is fully integrated with the MapR Converged Platform, so it has capabilities to get data out of the MapR file system, MapR JSON tables, or, via CDC, MapR Streams as well, and also to push data back into all of these different ecosystems within the MapR platform. And it does that in multiple ways. StreamSets generally runs in standalone mode, within its own JVM, but it can also leverage MapReduce or Spark Streaming on the MapR cluster.
Rupal Shah: Alongside that, StreamSets provides a variety of different connectors to pull data out of and push data into, and you can do different types of basic transformations on the data set as data is in motion between the different connectors. It also works really nicely in a microservices architecture with MapR Streams: you could provide different pipelines, leveraging MapR Streams, for different types of consuming endpoints. And that enables real-time applications with StreamSets on top of the MapR Converged Platform.
David: Okay. Thank you, Rupal and Audrey. So we're gonna do some Q&A. There are a number of questions that came in throughout the presentation, and we'll try to get to all of those right now. Just a reminder: if you do have a question, go ahead and submit it via the chat box in the lower left-hand corner of your browser. All right, let's get started with a couple of quick ones first so we make sure to get to these.
David: So Audrey, this one's for you. What is MapR-ES?
Audrey Egan: MapR-ES is just referring to MapR Streams, now called MapR Event Store.
David: Rupal, this one's for you. Where does StreamSets run? Does it run on Spark or its own platform?
Rupal Shah: Great question, yeah. So StreamSets' Data Collector is basically a Java-based application that runs within its own JVM. But it does have flexibility in that each pipeline could either be spun up in that JVM, or the pipeline can be run as a MapReduce or a Spark Streaming job on the cluster as well. So it's flexible.
David: Great. Okay, another one. We'll have to try to interpret this, but: how to enable multi-table CDC in a single pipeline?
Rupal Shah: Yeah, that's for StreamSets I believe. All right, so within the pipeline that I showed before, there is a new origin in StreamSets called MapR DB CDC. When you enable the changelog for multiple tables on the MapR system, whether they're JSON or binary tables, each of those tables is tied to a specific topic under a MapR stream. In the origin you can then provide those different topics. Whether you have multiple topics for different tables, or one topic for multiple tables, it's flexible, because the origin can cater to multiple topics and it is also multi-threaded. So it allows you to scale out accordingly within a single pipeline.
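Those per-table changelogs are configured on the MapR side before the StreamSets origin can consume them. A minimal sketch, with hypothetical stream and table paths:

```shell
# Create a stream to hold the change logs (paths are illustrative).
maprcli stream create -path /apps/cdc_stream \
    -produceperm p -consumeperm p -topicperm p

# Tie each table's changelog to its own topic in that stream; the
# MapR DB CDC origin in StreamSets then subscribes to these topics.
maprcli table changelog add -path /apps/db/users \
    -changelog /apps/cdc_stream:users
maprcli table changelog add -path /apps/db/orders \
    -changelog /apps/cdc_stream:orders
```

These commands need a live MapR cluster; the topic names here are only examples of the table-to-topic mapping Rupal describes.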
David: Okay. Audrey, this one's for you. Can we get MapR Streams alone, or does it have to be combined with other MapR components like MapR-DB?
Audrey Egan: Yeah, you can segment out the components. It depends on your application and use case. In most cases the streaming component will be used with MapR Database, or it will be used with the MapR file system component. So most likely you'd be leveraging Streams with some other component. But yeah, it's possible.
David: Does MapR Database provide multi-data center replication?
Audrey Egan: Yes, it does. So how that works: within the MapR platform, we have these enterprise functionalities like snapshots, mirroring, and table replication. What that means is, for example, you have cluster one on the West Coast and cluster two on the East Coast. You can set up table replication to replicate exact copies of the table to cluster two on the East Coast. This can be done at the volume level as well, a volume being MapR's logical construct for organizing your files, tables, and directory structure. So at the volume level you can also create mirrors that copy data to another cluster. And that second cluster could serve read operations, or it could serve as a disaster recovery location as well.
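As a sketch of both styles Audrey mentions (cluster names, volume names, and paths here are made up), table replication and volume mirroring can be set up from the MapR CLI:

```shell
# Replicate one table from the local cluster to a remote cluster.
maprcli table replica autosetup \
    -path /apps/db/users \
    -replica /mapr/east-cluster/apps/db/users

# Or mirror a whole volume: create a mirror volume sourced from the
# original volume on the other cluster, then start the mirror.
maprcli volume create -name apps.mirror -path /apps-mirror \
    -type mirror -source apps@west-cluster
maprcli volume mirror start -name apps.mirror
```

Both require live clusters with cross-cluster connectivity configured, so treat this as an outline of the commands rather than a runnable recipe.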
David: Great, thank you. Rupal, this one's for you. Does StreamSets work on VSAM mainframe systems?
Rupal Shah: So StreamSets cannot directly connect to mainframe systems. But we do have specific customers who have an indirect way to read data out of [inaudible 00:43:19]. So there's no direct connectivity, but we do have a partner we can work with that can work with the mainframe system.
David: Does multi-table CDC support Microsoft SQL Server? They tried but were not able to set it up. Not sure what the question is there.
Rupal Shah: All right.
David: Maybe you can rephrase that question.
Rupal Shah: Yeah. So StreamSets does support MS SQL Server for CDC. From version 2.7 onward there is a new origin that works specifically for multiple tables within a single pipeline with a single consumer. So do try that out. If you weren't able to set it up, do reach out to the support channels; we have a Slack channel and a Google Group as well. Do reach out with whatever issues you found.
David: Let's see. How does the Oracle CDC in StreamSets work? Does it read the redo logs on Oracle?
Rupal Shah: Right, so StreamSets leverages the LogMiner capability of Oracle. And it can read either from the redo logs or from the online catalog. So if you do expect changes in schemas, then yes, it caters to reading from the redo logs. If you generally don't expect too many schema changes, then it is faster to work with the online catalog. But StreamSets does allow both.
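For context, LogMiner-based CDC generally requires supplemental logging on the source database before any changes can be mined. A sketch of that prerequisite, with placeholder connection details and table names:

```shell
# Enable the minimal supplemental logging that LogMiner-based CDC
# needs to reconstruct row changes (run as a privileged user).
sqlplus / as sysdba <<'SQL'
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;
-- Or per table, capturing all columns of interest:
ALTER TABLE sales.orders ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
SQL
```

This needs a live Oracle instance and DBA privileges; the schema and table are illustrative only.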
David: You sort of already answered this, I believe, but can StreamSets connect to mainframes via Attunity to get CDC data?
Rupal Shah: Right, so StreamSets works in a very complementary way with Attunity. And I've seen plenty of customers that use Attunity to get data that was mainframe or other, let's say DB2 CDC, for example. Those customers generally leverage that to push data into some sort of staging area, like a stream, that StreamSets can pull data out of. So it is possible to work alongside Attunity, yes. I would say yes; there's just no direct connectivity from StreamSets, but we leverage Attunity for that.
David: Okay. How do DDL changes on tables get handled?
Rupal Shah: I'm assuming this is for the MapR CDC origin, or is it for Oracle itself? Let's answer both. For Oracle itself, we cater to all the operations: insert, update, delete, and update-for-insert as well. That is handled automatically. For MapR Database CDC, with the different operations applied to the table, whether insert, update, or delete, the MapR CDC functionality currently stores logs based on the specific attributes which have changed. So if you're inserting a new document, for example, the CDC origin will give you that new document alongside the underscore-ID field.
Rupal Shah: If you're updating two fields out of five in that document, the CDC functionality for MapR Database will give you just those two fields which got updated. If it's a delete operation from MapR Database CDC, it's going to give you just the underscore ID of the document that was deleted, alongside the operation in StreamSets for what happened to it.
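To illustrate the record shapes Rupal describes (these documents are made up, not captured from a cluster; the operation type itself travels in a record header, not the body):

```shell
# Illustrative shapes of what the MapR DB CDC origin surfaces per operation.
cat <<'EOF'
insert: {"_id":"user123","name":"Ana","city":"Lisbon","age":34,"plan":"pro"}
update: {"_id":"user123","city":"Porto","age":35}
delete: {"_id":"user123"}
EOF
```

An insert carries the full new document, an update carries only the changed fields plus `_id`, and a delete carries just the `_id`.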
David: Let's see. I think there were a lot of questions like this, but: what source systems can StreamSets do CDC on?
Rupal Shah: All right, so StreamSets can do-
David: I was just culminating all the questions into one.
Rupal Shah: Yeah, fine. These are good questions. So StreamSets can handle CDC out of Oracle via LogMiner. It can do CDC out of Microsoft SQL Server as well, and with SQL Server it can do both change tracking and CDC, so that's pretty beautiful in StreamSets. It can also do CDC from MySQL: there is a MySQL binlog origin available. It can also do it from Mongo, leveraging the Mongo oplog. And, while not pure CDC, we can do a similar thing with the Elasticsearch origin: if you have changes that are occurring there, we can continuously pull those as well.
Rupal Shah: And you'll soon see more CDC connectors, let's say for example Postgres and so on. Keep a note of that too.
David: Thank you for that. Does StreamSets CDC have a significant performance impact on the Oracle source database?
Rupal Shah: Right, so in general, if you're familiar with how LogMiner works (and I'd really recommend reading up on it if you're not), when you enable LogMiner, it offers up data based on how big your transactions are. And that can impact how much memory is being used on the database server. So how StreamSets tackles that is, we provide a configuration such that you can buffer those same transactions on the client side, meaning where StreamSets Data Collector is installed. And that can be buffered in memory or on disk, so that you avoid that memory usage on the database side itself. So StreamSets does handle that memory limitation that you may see on the database side.
David: That is the last one. Does StreamSets have any concept of modularity? I need to build quite complex workflows with multiple steps and dependencies. What features in StreamSets should I use to do it?
Rupal Shah: Very good question. So, yes. In general, in most scenarios, you'll see that StreamSets works in a continuous fashion. By that I mean it pulls from whatever origin you've plugged in, whether it's CDC or anything else, and pushes to the destination endpoint continuously. Now if you do want dependencies, where one pipeline kicks off another pipeline, or you start another stream based on specific data sets, it is possible to do that with the connectors: using REST APIs that can call the next pipeline, or using the executors we have that can be triggered based on different events.
Rupal Shah: So for example, if you don't want continuous movement, we do have origins and destinations that produce events for different types of scenarios. Let's say you only want to read one file, and after that file is read; or not even a file, just a table which is pretty small. Once that table is loaded into, let's say, MapR, you want to kick off another job, like a MapReduce or Spark job, that can do the other processing logic on the MapR system.
Rupal Shah: So StreamSets does cater to that: you can have stages produce events on the origin or the destination side. When the data is fully read from the table or from a file, it generates an event, and you can use that event either to start another pipeline or to kick off a MapReduce job, a Spark job, or a different pipeline as well. So it's pretty flexible in how you can orchestrate different pipelines to work together.
David: Awesome. We're gonna do one last one. Just a reminder, we will be sending out a link to the recording and to the slides, and some additional resources, shortly after the event, so you can look for those. Okay, last question: how can StreamSets help if it is batch ETL? I have no real CDC requirement.
Rupal Shah: So I think I just covered that in a previous answer. But if it is batch-oriented, which you generally see with a lot of relational data movement, StreamSets does handle that too. As I explained before, there are origins for the different relational sources: there is a JDBC origin that generates an event when there's no more data to read from the database, whether it's a single table or all the tables.
Rupal Shah: You can leverage that same event to stop the pipeline, so that becomes your batch type of pipeline. In order to schedule these, you can just plug the REST APIs or CLIs that StreamSets provides into whatever scheduler you have. It's flexible.
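As a sketch of that scheduling pattern (the host, credentials, and pipeline ID below are hypothetical), an external scheduler such as cron can drive a batch pipeline through Data Collector's REST API:

```shell
# Kick off a batch pipeline from cron or any external scheduler.
SDC=http://sdc-host:18630            # Data Collector URL (hypothetical)
PIPELINE=MyBatchLoadPipeline         # pipeline ID (hypothetical)
curl -u admin:admin -X POST \
     -H "X-Requested-By: sdc" \
     "$SDC/rest/v1/pipeline/$PIPELINE/start"

# The pipeline can stop itself on the JDBC origin's no-more-data event,
# or the scheduler can stop it explicitly once the batch is done:
curl -u admin:admin -X POST \
     -H "X-Requested-By: sdc" \
     "$SDC/rest/v1/pipeline/$PIPELINE/stop"
```

These calls require a running Data Collector instance, so take them as an outline of the REST-driven scheduling Rupal mentions rather than a turnkey script.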
Audrey Egan: So for additional resources, we have one upcoming webinar just a couple of days from now. It continues this story of real-time applications, looking at what you can do on the data science side to leverage different machine learning libraries and AI for your data. We also have a workshop coming up the following week, working with different models. And then we have excellent books as well. You can go online, and those links will take you to downloads of those books. They're not necessarily MapR-specific, but there's a lot of good content and context in there for the topics we talked about today.
David: Great. Thank you, Audrey; thank you, Rupal. That is all the time we have for today. Thank you to everyone for joining us. As Audrey said, we do have a couple more webinars coming up very shortly on very similar topics. Please visit the link in the deck or in the follow-up email that I'll send out. Thank you for all the great questions, and have a great rest of your day. Thank you.