Mirroring Topics from a MapR Cluster to an Apache Kafka Cluster

You can use MirrorMaker to mirror data continuously from MapR streams in MapR clusters to Apache Kafka clusters.

  • Because this procedure requires that MirrorMaker be run from the MapR cluster, ensure that the mapr-kafka package is installed on the node that you choose to run MirrorMaker from.
  • Configure the node as a MapR client.
  • Ensure that the ID of the user who runs MirrorMaker has the consumeperm permission on the MapR stream.

After you start MirrorMaker, it launches a configurable number of consumer threads to read topics that are in a stream in a MapR cluster and a number of producers to write the messages from those topics into topics in an Apache Kafka cluster.

Figure: Mirroring from MapR-ES to Apache Kafka

Before running MirrorMaker, you create a file that contains the required configuration parameters for the consumers that read from the stream in the MapR cluster. You also create a file that contains the required configuration parameters for the producers that publish to the Apache Kafka cluster. You point to these files in the MirrorMaker command.

You can either specify the topics to mirror or the topics not to mirror. In the former case, you use the whitelist parameter to provide a Java-style regular expression that matches the names of the topics that you want to mirror. In the latter case, you use the blacklist parameter to provide a Java-style regular expression that matches the names of the topics that you do not want to mirror.
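As a rough illustration of how such a filter behaves, the following Java sketch applies a whitelist-style or blacklist-style expression to topic names. It assumes that commas are rewritten to '|' (as the parameter descriptions in this procedure state) and that the expression must match the entire topic name, as with Java's Matcher.matches(). The class name, method name, and topic names are hypothetical; this is not MirrorMaker's own code.

```java
import java.util.regex.Pattern;

class TopicFilterSketch {
    // Assumption: commas act as the regex-choice symbol '|' and the
    // expression must match the whole topic name (full-match semantics).
    static boolean matches(String rawRegex, String topic) {
        Pattern pattern = Pattern.compile(rawRegex.trim().replace(',', '|'));
        return pattern.matcher(topic).matches();
    }

    public static void main(String[] args) {
        String filter = "orders_.*,billing_.*";   // hypothetical filter value
        String[] topics = {"orders_us", "billing_eu", "audit_log"};
        for (String topic : topics) {
            boolean hit = matches(filter, topic);
            // With a whitelist, a hit means the topic is mirrored;
            // with a blacklist, a hit means the topic is skipped.
            System.out.println(topic + " matches filter: " + hit);
        }
    }
}
```

Under full-match semantics, an expression has to cover the whole topic name, so matching on a prefix needs a trailing .* after the prefix.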

  1. Create a file that contains the required properties and values for consumers to use. When you run MirrorMaker, you point to this file by using the consumer.config parameter.
    Property                           Description
    streams.record.strip.streampath    Set the value of this property to true. In messages that are written to MapR streams, the names of topics include the paths and names of the streams in which those topics are located. Apache Kafka needs only the names of the topics. This property removes the path and name of the stream that the topics are mirrored from.
    streams.consumer.default.stream    Specifies the path and name of the stream that the topics will be mirrored from.
    group.id                           A unique string that identifies the consumer group that the consumers started by MirrorMaker belong to.
  2. Create a file that contains the required properties and values for producers to use. When you run MirrorMaker, you point to this file by using the producer.config parameter.
    Property             Description
    bootstrap.servers    A list of host/port pairs to use for establishing the initial connection to the Kafka cluster, in the form host1:port1,host2:port2,.... The producers use all servers in the cluster regardless of which servers are listed here; this list affects only the initial hosts used to discover the full set of servers. Because cluster membership can change dynamically, the list need not contain every server, but you may want more than one in case a server is down.
    producer.type        Specifies whether the messages are published asynchronously in batches or as data is received by producers. The values are async and sync.
    compression.codec    Specifies the compression codec for all messages that are generated by producers. The possible values are none, gzip, snappy, and lz4.
  3. Run MirrorMaker with this command to start mirroring topics from MapR-ES to Apache Kafka:

    Syntax

    bin/kafka-run-class.sh kafka.tools.MirrorMaker
    --consumer.config <File that lists consumer properties and values>
    --num.streams <Number of consumer threads>
    --producer.config <File that lists producer properties and values>
    [--whitelist=<Java-style regular expression for specifying the topics to mirror>]
    [--blacklist=<Java-style regular expression for specifying the topics not to mirror>]
    Parameter          Description
    consumer.config    The path and name of the file that lists the consumer properties and their values.
    new.consumer       Specifies that MirrorMaker use consumers that are based on the Apache Kafka 0.9.0 API library.
    num.streams        The number of mirror consumer threads to create. If you start multiple MirrorMaker processes, check the distribution of partitions on the source cluster. If the number of consumption streams per MirrorMaker process is too high, some of the mirroring threads will be idle because the consumer rebalancing algorithm leaves them without any partitions to consume.
    producer.config    The path and name of the file that lists the producer properties and their values.
    whitelist          A Java-style regular expression for specifying the topics to copy. Commas (',') are interpreted as the regex-choice symbol ('|'). If you use this parameter, do not use the blacklist parameter.
    blacklist          A Java-style regular expression for specifying the topics not to copy. Commas (',') are interpreted as the regex-choice symbol ('|'). If you use this parameter, do not use the whitelist parameter.
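The syntax above can also be assembled programmatically. The following Java sketch builds the argument list in the order shown in the Syntax block and rejects a call that supplies both a whitelist and a blacklist. The class name, method name, and the filter value in main are hypothetical, and the sketch only prints the command; it does not start MirrorMaker.

```java
import java.util.ArrayList;
import java.util.List;

class MirrorMakerCommandSketch {
    // Builds the argument list in the order shown in the Syntax block.
    // Pass null for whichever of whitelist/blacklist is not used.
    static List<String> buildCommand(String consumerConfig, String producerConfig,
                                     int numStreams, String whitelist, String blacklist) {
        // The parameter descriptions say not to combine the two filters.
        if (whitelist != null && blacklist != null) {
            throw new IllegalArgumentException("Use either whitelist or blacklist, not both");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("bin/kafka-run-class.sh");
        cmd.add("kafka.tools.MirrorMaker");
        cmd.add("--consumer.config");
        cmd.add(consumerConfig);
        cmd.add("--num.streams");
        cmd.add(Integer.toString(numStreams));
        cmd.add("--producer.config");
        cmd.add(producerConfig);
        if (whitelist != null) {
            cmd.add("--whitelist=" + whitelist);
        }
        if (blacklist != null) {
            cmd.add("--blacklist=" + blacklist);
        }
        return cmd;
    }

    public static void main(String[] args) {
        // "sales_.*" is a hypothetical filter used only for illustration.
        List<String> cmd = buildCommand("consumers.props", "producers.props", 2, "sales_.*", null);
        System.out.println(String.join(" ", cmd));
    }
}
```

Because each argument is a separate list element, the regular expression needs no shell quoting if the list is later handed to java.lang.ProcessBuilder.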

Example

In this example, the file that lists the properties and values for the consumer that will read messages from the topics in MapR-ES is named consumers.props. It contains this list:

streams.record.strip.streampath=true
streams.consumer.default.stream=/myStream
group.id=cg1
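
Before starting MirrorMaker, it can be useful to confirm that the consumer file contains the three properties described in step 1. The following Java sketch is one way to do that, assuming the file follows the standard Java properties format that java.util.Properties reads. The class and method names are hypothetical, and the check is an illustration rather than anything MirrorMaker itself performs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class ConsumerPropsCheck {
    // Returns a list of problems found in the given properties text;
    // an empty list means all three step 1 properties look usable.
    static List<String> problems(String propsText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propsText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> problems = new ArrayList<>();
        if (!"true".equals(props.getProperty("streams.record.strip.streampath"))) {
            problems.add("streams.record.strip.streampath must be set to true");
        }
        for (String key : new String[] {"streams.consumer.default.stream", "group.id"}) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                problems.add(key + " is missing or empty");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        String text = "streams.record.strip.streampath=true\n"
                + "streams.consumer.default.stream=/myStream\n"
                + "group.id=cg1\n";
        List<String> found = problems(text);
        System.out.println(found.isEmpty() ? "consumer properties look complete" : found.toString());
    }
}
```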

The file that lists the properties and values for the producers that will publish messages to topics in Apache Kafka is named producers.props. It contains this list:


bootstrap.servers=10.10.83.93:9092
producer.type=sync
compression.codec=none

The topics to mirror all have names that begin with na_west. Because the whitelist parameter takes a Java-style regular expression, use "na_west.*" to match every topic name that starts with that prefix. (The expression "na_west*" would match only na_wes followed by zero or more t characters.)

bin/kafka-run-class.sh kafka.tools.MirrorMaker --new.consumer \
--consumer.config consumers.props --num.streams 2 --producer.config producers.props \
--whitelist="na_west.*"