Here's the unedited transcription:
Hi, I'm Terry, I'm an engineer from the MapR Streams team. Today, I'm going to talk about a few tuning tips for MapR Streams. In this example, I'm using an Apache Flink application running on top of MapR Streams. First, we're using a MapR 5.1.0 cluster with 10 nodes, with YARN installed on the cluster, and we're spinning up Apache Flink 1.0 directly to run against MapR Streams. The reason is that MapR Streams supports the Kafka 0.9 API, and any application that supports the Kafka 0.9 API will be able to run on MapR Streams seamlessly. Flink being such an application, it's a good pick for us to start with.
We first created a baseline with the application in hand to see how it works, and then we started from there to tune the application. In the first round, we spun up 10 Flink task managers in the cluster, so that's one Flink task manager per node. We had 72 partitions created for the topics that we're producing to in the stream, and we had our MapR Streams replicated three ways on disk, so each message being produced has 3 copies of its data saved on disk for data protection and high availability.
Each of the task managers also spins off 72 stream producers, internally within the single process, to produce messages to MapR Streams. It also has 72 consumers spun off in each process to consume the messages that have been produced to the stream.
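To make the baseline numbers concrete, here is a small back-of-the-envelope sketch in Python. It only restates the figures from the talk; the class and field names are illustrative, and the per-node partition figure assumes partitions are spread evenly across nodes, which the talk does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineLayout:
    """Baseline deployment figures quoted in the talk (names are illustrative)."""
    nodes: int = 10
    task_managers_per_node: int = 1
    partitions: int = 72
    replication_factor: int = 3
    producers_per_task_manager: int = 72
    consumers_per_task_manager: int = 72

    @property
    def task_managers(self) -> int:
        return self.nodes * self.task_managers_per_node

    @property
    def total_producers(self) -> int:
        # Every producer lives inside a task manager process.
        return self.task_managers * self.producers_per_task_manager

    @property
    def total_consumers(self) -> int:
        return self.task_managers * self.consumers_per_task_manager

    @property
    def partitions_per_node(self) -> float:
        # Assumption: partitions are distributed evenly over the nodes.
        return self.partitions / self.nodes


baseline = BaselineLayout()
print("task managers:", baseline.task_managers)
print("producers in the cluster:", baseline.total_producers)
print("consumers in the cluster:", baseline.total_consumers)
print("partitions per node:", baseline.partitions_per_node)
print("on-disk copies per message:", baseline.replication_factor)
```

The point to notice is how many producer and consumer instances share one process in this layout: 72 of each per task manager.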
In the first round, after analyzing the results, we didn't feel like they were very good, so we did some profiling on the application, as well as on the stream side, to see where the bottlenecks were. The first bottleneck that we found was in the producer itself, because our producer, the stream client, actually uses an asynchronous model, which means that you have more threads in the same process. We were actually creating thread contention for our producers. It basically means that your application threads are fighting with the stream client's internal threads. To eliminate that problem, instead of spinning up one task manager per node, we spun up 3 task managers per node for Flink, and in 2 of those task managers, we're running 1 producer each, so each physical process has only 1 producer instance running.
For the leftover task manager, we're running the stream consumers directly on the task manager, so in this case, we separate the processes out so that they no longer contend with each other and create a CPU bottleneck. The initial number of partitions that we had was 72, but we did a thorough analysis and found out that it was not giving us enough parallelism, so we increased that to 300. That's about 30 partitions per node. This combination gave us a really good result in the end: we were able to get about a 250% increase versus the initial application, after tuning it ourselves.
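The tuned layout can be summarized the same way. This is a sketch that only recomputes the figures mentioned above; the function and key names are hypothetical, and the per-node partition count again assumes an even spread across nodes.

```python
def tuned_layout(nodes=10, producer_tms_per_node=2,
                 consumer_tms_per_node=1, partitions=300):
    """Summarize the tuned deployment described in the talk (illustrative names)."""
    return {
        # 3 task managers per node: 2 for producing, 1 for consuming.
        "task_managers": nodes * (producer_tms_per_node + consumer_tms_per_node),
        # Each producer task manager runs exactly one producer instance.
        "producer_instances": nodes * producer_tms_per_node,
        "consumer_task_managers": nodes * consumer_tms_per_node,
        # Assumption: partitions are distributed evenly over the nodes.
        "partitions_per_node": partitions / nodes,
    }


for name, value in tuned_layout().items():
    print(f"{name}: {value}")
```

Compared with the baseline, the number of producer instances per process drops from 72 to 1, while the partition count goes up from 72 to 300.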
Some key takeaways for stream tuning. The first thing is the number of partitions. This is where you achieve parallelism on the server side, so it is very important. The default number, 10, is a good start to go with, but always measure how fast it is going according to your application needs and your cluster sizing. In this case, our 72 was not good enough, but 300 gave us a good result. The second thing is that the producer, which is the stream client, is asynchronous inside, which means it is able to handle a very, very high rate of concurrency internally.
Your application does not need to use a multi-threaded model to produce messages to our stream client. Instead, most of the time, a single thread will be good enough to produce messages to our stream client. I hope these key takeaways can help you tune your application to run much faster, much more reliably, and better on MapR Streams.
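The single-thread advice can be illustrated with a toy model. The sketch below is not the MapR Streams or Kafka client API; `ToyAsyncClient` is a hypothetical stand-in built only from the Python standard library. It shows the general shape of an asynchronous client: `send` returns immediately, and an internal thread does the delivery, so one application thread is enough to feed it.

```python
import queue
import threading


class ToyAsyncClient:
    """Hypothetical stand-in for an asynchronous producer client.

    This is NOT the MapR Streams API; it only mimics the pattern of a
    non-blocking send() backed by an internal sender thread.
    """

    def __init__(self):
        self._buffer = queue.Queue()
        self.delivered = []
        self._sender = threading.Thread(target=self._drain, daemon=True)
        self._sender.start()

    def send(self, message):
        # Returns right away; the internal thread handles delivery.
        self._buffer.put(message)

    def _drain(self):
        while True:
            message = self._buffer.get()
            if message is None:  # sentinel pushed by close()
                break
            self.delivered.append(message)

    def close(self):
        # Flush everything queued so far, then stop the sender thread.
        self._buffer.put(None)
        self._sender.join()


def produce_single_threaded(client, count):
    """One application thread pushes every message to the client."""
    for i in range(count):
        client.send(f"message-{i}")


client = ToyAsyncClient()
produce_single_threaded(client, 1000)
client.close()
print("delivered:", len(client.delivered))
```

Adding more application threads on top of a client like this would only add threads that compete with the client's own internal thread for CPU, which is the contention described earlier in the talk.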
If you have any questions or comments about our video, please leave your comments below. Thank you!