Getting Started with CDC

This topic describes the end-to-end flow for setting up and using Change Data Capture (CDC). It assumes that you create a new table and dataset, although you can also use an existing table that contains data.

End-to-End Workflow

The following diagram shows the end-to-end workflow for using the Change Data Capture (CDC) feature.
Note: Steps 2 and 3 are interchangeable. You can start the consumer application for CDC changed data records before performing CRUD operations on the table.
[Diagram: end-to-end CDC workflow. Set up the CDC environment; perform CRUD operations (using dbshell or client applications for MapR Database JSON tables, or using HBase or client applications for MapR Database binary tables); consume CDC changed data records.]
  1. Set up the CDC environment.
    1. If you are propagating changed data from a source table on a source cluster to a destination stream topic on a remote destination cluster, you must set up a gateway. To set up gateways, install the gateway on the destination cluster and specify the gateway node(s) on the source cluster. See Administering MapR Gateways and Configuring Gateways for Table and Stream Replication.
    2. If you have a secure cluster, you must set up the secure configuration. See Configuring Secure Clusters for Cross-Cluster Mirroring and Replication.
    3. Establish a MapR Database table (JSON or binary) with data. You can create a new table and add data or use an existing table with data. See maprcli table create for creating a new table or use the MCS. If you are using an existing table with data, skip to the next step.
    4. Create a MapR Event Store For Apache Kafka stream for the propagated changed data records, using the -ischangelog parameter of the maprcli stream create command. See maprcli stream create or use the MCS.
    5. Create a MapR Event Store For Apache Kafka stream topic for the changed data records. You can use the maprcli stream topic create command, the maprcli table changelog add command (this command creates a changelog relationship between the source table and the destination stream topic), or the MCS when creating either a stream topic or a table changelog.
    6. Create a changelog relationship between the source table and the destination stream topic with the maprcli table changelog add command or use the MCS. By creating a changelog relationship, you are creating an environment that propagates changed data records from a source table to a MapR Event Store For Apache Kafka topic.
      Note: Propagation of existing table data is enabled by default. If you do not want to propagate existing source table data, set the -propagateexistingdata parameter to false.
      Note: Propagation begins as soon as the table changelog relationship is added. If you do not want propagation to begin immediately, set the -pause parameter to true. The change data records are then stored in a bucket until you resume the changelog relationship, at which point the stored change data records are propagated to the stream topic. See table changelog resume for more information.
    7. Verify that the changelog exists. See table changelog list for information about your changelogs.
  2. Perform CRUD operations (inserts, updates, and deletes) on the source table. You can use dbshell or a client application for MapR Database JSON tables, or HBase or a client application for MapR Database binary tables.
  3. Write a consumer with the Apache Kafka and OJAI API libraries that subscribes to the topic and consumes the change data records. Several interfaces are available for writing a CDC consumer. See Consuming CDC Records for a list of interfaces and Building Consumers for CDC for an example.
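
The setup commands named in step 1 can be sketched as an ordered sequence of maprcli invocations. The Python sketch below only assembles the command lines and does not run them. The helper name, the paths /apps/cdc_table and /apps/cdc_stream, and the topic name cdc_topic are hypothetical. The -tabletype and -useexistingtopic flags do not appear in this topic; they are included on the assumption that the table is a JSON table and that the topic was already created with stream topic create, so check the maprcli reference for your release before relying on them.

```python
import shlex


def _flag(value):
    """Render a Python bool as the true/false string used on the command line."""
    return "true" if value else "false"


def build_cdc_setup_commands(table, stream, topic,
                             propagate_existing=True, pause=False):
    """Return maprcli argument lists for steps 1.3 through 1.7, in order.

    Nothing is executed here. Each list could be passed to subprocess.run
    on a cluster node where maprcli is installed.
    """
    changelog = f"{stream}:{topic}"  # destination as <stream path>:<topic>
    return [
        # Step 1.3: create the source table (assumed to be a JSON table).
        ["maprcli", "table", "create", "-path", table, "-tabletype", "json"],
        # Step 1.4: create the stream that holds changed data records.
        ["maprcli", "stream", "create", "-path", stream, "-ischangelog", "true"],
        # Step 1.5: create the destination topic in that stream.
        ["maprcli", "stream", "topic", "create", "-path", stream, "-topic", topic],
        # Step 1.6: add the changelog relationship.
        # -useexistingtopic is an assumption (topic created in step 1.5).
        ["maprcli", "table", "changelog", "add", "-path", table,
         "-changelog", changelog,
         "-useexistingtopic", "true",
         "-propagateexistingdata", _flag(propagate_existing),
         "-pause", _flag(pause)],
        # Step 1.7: verify that the changelog exists.
        ["maprcli", "table", "changelog", "list", "-path", table],
    ]


if __name__ == "__main__":
    # Hypothetical paths and topic name, for illustration only.
    for cmd in build_cdc_setup_commands("/apps/cdc_table",
                                        "/apps/cdc_stream",
                                        "cdc_topic"):
        print(shlex.join(cmd))
```

Keeping the commands as argument lists rather than one shell string makes it easy to switch -propagateexistingdata or -pause per environment without re-quoting paths.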

Use Cases

Table 1.
Columns: Scenario, Setup Task, Notes
Scenario: You want a CDC stream topic to contain all of the table data as changed data records.
Setup Task: Set up CDC in the following manner before performing operations on the source table and consuming the change data records.
  1. Create an empty source table.
  2. Create the changelog stream.
  3. Create the changelog stream topic.
  4. Add the table changelog relationship. In this case, it doesn't matter whether the -propagateexistingdata parameter is set to true or false because you're starting with an empty source table.
  5. Verify that the changelog exists and that replicaState is REPLICA_STATE_REPLICATING. See table changelog list for more information.
Notes: In this case, all table data is propagated to the stream topic as change data records, and the operation type is identified on each individual data record.
Scenario: You want a CDC stream topic to contain all of the existing table data and changed data records.
Setup Task: Set up CDC in the following manner before performing operations on the source table and consuming the change data records.
  1. Create a source table and add data, or alternatively, use an existing table that contains data.
  2. Create the changelog stream.
  3. Create the changelog stream topic.
  4. Add the table changelog relationship. Be sure that the -propagateexistingdata parameter is set to true. If you are using the command line to add the changelog, you do not need to specify this parameter because the default is true.
  5. Verify that the changelog exists and that no error is reported in the changelog list. When all the existing data in the table is delivered to the changelog, the replicaState becomes REPLICA_STATE_REPLICATING. See table changelog list for more information.
Notes: In this case, the existing table data is propagated to the stream topic, and that data's operation type is identified as a SET operation. Subsequently, operations on the source table are propagated as changed data records, and the operation type is identified on each individual data record.

You can consume data at any time; however, there may be a delay before all of the existing table data is completely propagated, especially if you have a large dataset. Be sure to check the copyTableCompletionPercentage field.

Scenario: You want a CDC stream topic to not contain any original table data and to capture only subsequent changed data records.
Setup Task: Set up CDC in the following manner before performing operations on the source table and consuming the change data records.
  1. Create a source table and add data, or alternatively, use an existing table that contains data.
  2. Create the changelog stream.
  3. Create the changelog stream topic.
  4. Add the table changelog relationship. Be sure that the -propagateexistingdata parameter is set to false.
  5. All new data operations applied to the source table after the replicaState becomes REPLICA_STATE_REPLICATING are not treated as original data and are delivered to the changelog. See table changelog list for more information.
Notes: In this case, the existing table data is not propagated to the stream topic, and the operation type is identified on each individual data record.
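
Each scenario above ends by checking the changelog state. The following is a minimal sketch of that check, assuming the output of table changelog list has been captured as JSON text (for example with maprcli's -json option) and that a top-level data array holds one entry per changelog with the replicaState and copyTableCompletionPercentage fields named above. The function name, the sample payload, and the exact JSON layout are assumptions for illustration, not captured output.

```python
import json

REPLICATING = "REPLICA_STATE_REPLICATING"


def changelog_status(list_output):
    """Summarize each changelog entry found in the JSON text.

    Returns one dict per entry with the raw replicaState, the raw
    copyTableCompletionPercentage (None if absent), and a boolean
    that is True only when the state is REPLICA_STATE_REPLICATING.
    """
    payload = json.loads(list_output)
    summary = []
    for entry in payload.get("data", []):
        state = entry.get("replicaState")
        summary.append({
            "replicaState": state,
            "copyTableCompletionPercentage":
                entry.get("copyTableCompletionPercentage"),
            "replicating": state == REPLICATING,
        })
    return summary


# Hypothetical payload; the real output may carry more fields or
# represent the percentage differently.
SAMPLE_OUTPUT = json.dumps({
    "status": "OK",
    "total": 1,
    "data": [{
        "replicaState": "REPLICA_STATE_REPLICATING",
        "copyTableCompletionPercentage": 100,
    }],
})

if __name__ == "__main__":
    for item in changelog_status(SAMPLE_OUTPUT):
        print(item)
```

In the second scenario, a consumer that needs the full existing dataset could poll this check and wait until the replicating flag is set before treating the topic as complete.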