MEP 6.3.0 has been released!


We are pleased to announce the release of MEP (MapR Ecosystem Pack) 6.3 on our award-winning MapR data platform. As always, we provide the latest in open-source, community-driven innovation through an incremental release vehicle called MEP. This release marks the first ecosystem release since MapR became part of the HPE family, and we are super excited about what's in the pipeline for 2020!

HBase is back!

The feedback we get from customers is the most important input to our product roadmap. You talked and we listened! HBase is being reintroduced in MEP at version 1.1.13, up from the 1.1.8 release we shipped in MapR 5.2.2. The community has made numerous fixes and enhancements since 1.1.8. For a complete list of these fixes, see the following links:

What's new in HBase 1.1.9?
What's new in HBase 1.1.10?
What's new in HBase 1.1.11?
What's new in HBase 1.1.12?
What's new in HBase 1.1.13?

In addition, HBase integration with our platform runs deep. HBase is secure by default and supports MapR security, Kerberos, and over-the-wire SSL encryption. It is managed by Warden and integrates with our built-in MapR Monitoring service. You can also run HBase applications alongside MapR Database applications in the same cluster.

Spark upgraded to 2.4.4

One of the major features added in this release of Spark is the ability for queries to use MapR Database JSON secondary indexes, via the OJAI 3.0 APIs, to provide fast interactive responses. In addition to supporting secondary indexes, Spark also gives you the ability to hint which secondary index to use (when multiple indexes are pushdown eligible). Here is the newly added method:

setHintUsingIndex(indexPath: String)

Here's a Scala example of how this works:

val result = sparkSession
      .setHintUsingIndex(indexPath)
      .loadFromMapRDB(tableName)
      .select(columnNameForIndexing)

MapR Database JSON supports both buffered and unbuffered writes. Buffered writes load the data into memory and periodically flush the writes to disk. They are faster, but the job needs to be restarted from the beginning in case of failure. Unbuffered writes flush the data to disk on every write operation. They are slower, but a Spark task need not restart from the beginning if a load fails. Up until the previous release, Spark's MapR Database JSON connector defaulted to buffered writes. In this release, customers can choose between the two write modes using the following newly added method:

setBufferWrites(bufferWrites: Boolean)

Both of the methods above can be called on any of these classes: SparkSession, SparkContext, MaprdbJavaSparkSession, and MaprdbJavaSparkContext.
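As a minimal sketch of how this looks in practice (assuming an existing DataFrame named df and a hypothetical table path /apps/user_profiles; both names are illustrative, not from the release notes):

import com.mapr.db.spark.sql._  // MapR Database OJAI connector implicits

// Switch to unbuffered writes: every write is flushed to disk, so a
// failed load does not have to restart from the beginning.
sparkSession.setBufferWrites(false)
df.saveToMapRDB("/apps/user_profiles")  // illustrative table path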

You can also join a DataFrame that's in memory with a MapR Database JSON table without performing the join in Spark's memory. This way of pushing JOINs down into MapR Database not only makes joins run faster by using secondary indexes when they exist, it also saves memory by not building large in-memory hash tables. You can learn all about this feature, and how to use it, here; a rough sketch follows.
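The linked post covers the exact API; the sketch below only illustrates the idea. The method name joinWithMapRDBTable, its signature, and the table and column names are all assumptions for illustration, not the confirmed connector API:

// Hypothetical sketch of a pushed-down join.
val users = spark.read.json("/data/users.json")  // small in-memory DataFrame

val enriched = users.joinWithMapRDBTable(  // assumed helper; see linked post
  "/apps/orders",  // MapR Database JSON table, ideally indexed on the key
  "user_id"        // join key; the lookup runs inside MapR Database
)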

You can also read multiple MapR Database JSON tables at once into a single unioned Spark DataFrame, as though they were one table. Consider the following example:

You want to create a separate table to store each day's data and have the following table structure in MapRFS:

/web_logs/20191201
/web_logs/20191202
/web_logs/20191203

...

Organizing the data this way, with each day's data in a separate volume, makes deleting the data much easier: deleting a volume is far more efficient than deleting individual records in a table. But it makes querying harder; you would have to loop through all the tables and read them one at a time. Not anymore! This release adds support for wildcards in table names. If you want to read all the tables, you can simply wildcard the table name. For example:

val web_logs = spark.loadFromMapRDB("/web_logs/*")

Or load only the tables that match a pattern:

val web_logs = spark.loadFromMapRDB("/web_logs/2019[1-2]*")

New functionality called sendToKafka has been added to RDD[A], DataFrame, Dataset[A], and DStream[A] via the package org.apache.spark.streaming.kafka.v2.producer._, allowing you to write any of these Spark data structures to MapR streams in a safe manner.
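As a minimal sketch (assuming sendToKafka accepts a stream:topic path, that webLogs is an existing DataFrame, and that any required producer configuration is already in scope; all names here are illustrative):

import org.apache.spark.streaming.kafka.v2.producer._  // brings sendToKafka into scope

// "/streams/web_events:clicks" is a hypothetical MapR stream and topic.
webLogs.sendToKafka("/streams/web_events:clicks")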

For more detail on how these new features work in the MapR Database JSON connector for Apache Spark, check out the blog maintained by Nicolas Perez, our professional services expert, here.

In addition, this version of Spark is read/write compatible with the HBase 1.1.13 release shipping in this MEP. The open-source community has also been actively making enhancements and fixes, as it always has. For a full list of new Spark features and fixes since the previous MEP, see the Apache links below.

What's new in Spark 2.4.1?
What's new in Spark 2.4.2?
What's new in Spark 2.4.3?
What's new in Spark 2.4.4?

Updates to Drill 1.16

This version of Drill includes fixes on top of the Drill 1.16 release that shipped in MEP 6.2, among them a fix that improves output compatibility with Hive, as explained below.

When Drill reads Parquet files generated by Hive or Spark that contain complex types, it displays the results differently from how Hive displays them. Consider the following example:

Let's say I have a table like the following:

CREATE TABLE `ORDER_DETAILS_1`(
  `order_id` bigint,
  `order_items` map<bigint,struct<Item_amount:bigint,Item_type:string>>)
STORED AS PARQUET;

When I load data and query using Hive, here is the output:

{101:{"Item_amount":2,"Item_type":"Pencils"},102:{"Item_amount":1,"Item_type":"Eraser"}}
{102:{"Item_amount":1,"Item_type":"Eraser"},103:{"Item_amount":1,"Item_type":"Coke"}}

When I query using Drill, this is the output I get, which, although correct, is different from how Hive displays it:

0: jdbc:drill:drillbit=xx.xx.xx.xxx>
select * from dfs.`/user/hive/warehouse/order_details_1`;
+----------+-------------+
| order_id | order_items |
+----------+-------------+
| 1 | {"map":[{"key":101,"value":{"Item_amount":2,"Item_type":"Pencils"}},{"key":102,"value":{"Item_amount":1,"Item_type":"Eraser"}}]} |
| 2 | {"map":[{"key":102,"value":{"Item_amount":1,"Item_type":"Eraser"}},{"key":103,"value":{"Item_amount":1,"Item_type":"Coke"}}]} |
+----------+-------------+
2 rows selected (0.37 seconds)

This release of Drill includes a fix that displays the output the same way Hive does.

MEP 6.3.0 is available for install from our software repository and documentation can be found at https://mapr.com/docs/61/MEPs/MEP_6.3.0_reference.html.

Thank you for reading. Stay tuned for more exciting product releases and announcements planned for 2020!


This blog post was published December 17, 2019.
