6 min read
Hive has been using ZooKeeper as distributed lock manager to support concurrency in HiveServer2. The ZooKeeper-based lock manager works fine in a small scale environment. However, as more and more users move to HiveServer2 from HiveServer and start to create a large number of concurrent sessions, problems can arise. The major problem is that the number of open connections between Hiveserver2 and ZooKeeper keeps rising until the connection limit is hit from the ZooKeeper server side. At that point, ZooKeeper starts rejecting new connections, and all ZooKeeper-dependent flows become unusable. Several Hive JIRAs (such as HIVE-4132, HIVE-5853 and HIVE-8135 etc.) have been opened to address this problem, and it recently got fixed through HIVE-9119.
Let's take a closer look at the ZooKeeperHiveLockManager implementation in Hive to see why it caused a problem before, and how we fixed it.
ZooKeeperLockManager uses simple ZooKeeper APIs to implement a distributed lock. The protocol that it uses is listed below.
Clients wishing to obtain a shared lock should do the following:
Clients wishing to obtain an exclusive lock should do the following:
Clients wishing to release a lock should simply delete the node they created in step 1. In addition, if all child nodes have been deleted, delete the parent node as well.
The above lock and unlock protocols are simple and straightforward. However, the previous implementation of this protocol did not use the ZooKeeper client properly. For each Hive query, a new ZooKeeper client instance was created to acquire and release locks. That causes a lot of overhead to the ZooKeeper server to handle new connections. In addition, in a multi-session environment, it is easy to hit the ZooKeeper server connection limit if there are too many concurrent queries happening. Furthermore, this can also happen when users use Hue to do Hive queries. Hue does not closes the Hive query by default, which means the ZooKeeper client created for that query is never closed. If query volume is high, the ZooKeeper connection limit can be reached very quickly.
Do we really need to create a new ZooKeeper client for each query? We found that it is not necessary. From the above discussion, we can see that the ZooKeeper client is used by HiveServer2 to talk to the ZooKeeper server to be able to acquire and release locks. The major workload is on the ZooKeeper server side, not the client side. One ZooKeeper client can be shared by all queries against a HiveServer2 server. With a singleton ZooKeeper client, the server overhead of handling connections is eliminated. And Hue users do not suffer from the ZooKeeper connection issue any more.
Singleton ZooKeeper client is able to solve the lock management problems. However, we still need to handle some extra things by using ZooKeeper client directly, such as:
Apache Curator is open source software which is able to handle all of the above scenarios transparently. Curator is a Netflix ZooKeeper Library and it provides a high-level API-CuratorFramework that simplifies using ZooKeeper. By using a singleton CuratorFramework instance in the new ZooKeeperHiveLockManager implementation, we not only fixed the ZooKeeper connection issues, but also made the code easy to understand and maintain.
Thanks to the Hive open source community for including this fix in Apache Hive 1.1. This fix has also been included in the latest Hive 0.12 and Hive 0.13 releases and the coming Hive 1.0 release of the MapR Distribution.
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.