Earlier this year, I published a series of posts on the deployment of Apache Drill to Azure. While the steps covered in those posts work, I’d like to speed up the process significantly. With the MapR Data Platform available in the Azure Marketplace, I can have a Drill-enabled MapR cluster up and running much faster and with much less effort.
In this post, I’ll tackle how to spin up such a cluster quickly. In a subsequent series of posts, I’ll tackle integration of the cluster with a few key Azure resources. Together, I hope these entries will help accelerate the work of those interested in working with Drill on MapR in the Azure cloud.
If you have an Azure subscription, log in to the Azure Portal at https://portal.azure.com. If you do not have a subscription, you should contact your organization’s Azure administrator to obtain one or otherwise set up a pay-as-you-go account.
If you haven’t yet run anything sizeable in your Azure account, you should be aware that the account has a default quota limit of 20 virtual machine cores available to it. This cap is intended to keep users from inadvertently running up a large bill. If you intend to deploy a small, i.e. 3-node, cluster as shown in this post, you shouldn’t hit the quota limit. Should you need to go larger, you may want to go ahead and request the quota be increased to accommodate your needs.
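The quota arithmetic is simple enough to check before you deploy. The sketch below is only an illustration: the 20-core default comes from the paragraph above, while the figure of 4 cores per node is a hypothetical VM size, so substitute the core count of the size you actually choose.

```python
DEFAULT_CORE_QUOTA = 20  # default per-account core quota described above


def cores_needed(node_count: int, cores_per_node: int) -> int:
    """Total cores a cluster of identical nodes will consume."""
    return node_count * cores_per_node


def fits_quota(node_count: int, cores_per_node: int,
               quota: int = DEFAULT_CORE_QUOTA, cores_in_use: int = 0) -> bool:
    """True if the cluster fits in the quota alongside cores already in use."""
    return cores_in_use + cores_needed(node_count, cores_per_node) <= quota


if __name__ == "__main__":
    # Assumption: 4 cores per node (hypothetical VM size).
    for nodes in (3, 6):
        needed = cores_needed(nodes, 4)
        verdict = "fits" if fits_quota(nodes, 4) else "needs a quota increase"
        print(f"{nodes} nodes x 4 cores = {needed} cores: {verdict}")
```

Remember to count cores already consumed by other virtual machines in the subscription via the `cores_in_use` argument.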
Once signed into the portal, the dashboard populates with a bunch of options. To access the Azure Marketplace, simply click the +New option in the left-hand navigation bar.
In the resulting panel, enter MapR into the search box and press Enter. You should see a few options from MapR, including a series of stand-alone Sandbox VMs, which are great for training and demo purposes. But as our goal is to deploy a cluster, select the MapR Data Platform v5.2 item.
Click on this item to bring up a panel within which you can review your selection and access supporting documentation. Click the Create button at the bottom of that panel to start the deployment process.
In the portal, the deployment process is handled through a series of forms. The first three forms require input, while the last two forms provide validation and confirmation prior to the actual deployment.
In the first form, provide some basic information about the cluster:
Click OK to proceed to the second form.
On the second form, provide the information for the cluster infrastructure:
Click OK to proceed to the Network Information form. Here you can select an existing Virtual network or create a new one. In either case, choose the Subnet within that virtual network into which the virtual machines for this cluster will be deployed.
Click OK to proceed to the Summary/Validation form. Once validation is passed, click the OK button again to proceed to the Buy/Purchase form. As always, carefully read the language on this form, but the gist of the legalese is that expenditures for this cluster infrastructure come out of the billing mechanisms associated with your Azure account, and any expenditure with MapR is outside of those agreements. Keep in mind that the Marketplace image you are using comes with a free 30-day license from MapR for non-production use of the technology. Once you are ready to proceed, click Purchase to start the deployment.
While the deployment is in motion, you will see a spinning tile on the Azure dashboard. In my tests, deployment took between 30 and 60 minutes. Your mileage may vary, but once deployment is finished, you should have a tile pointing to the resource group containing your cluster.
Clicking on that tile takes you into the Resource Group associated with the MapR cluster. You will see a virtual network and a storage account as well as one public IP address, one network interface, and one virtual machine for each node in your cluster. Assuming you have a 3-node cluster, you might think of your assets as being organized something like this (though the names of your Azure assets will likely vary):
Coming out of the deployment, your cluster should be up and running. You will now want to connect to the Drill Console to verify that it is operational.
To do this verification, you must first locate the fully qualified, publicly addressable name associated with one of the nodes of the cluster. I suggest doing this for cluster node0, the VM named maprclusternode0 in my example.
To locate the public name for cluster node0, click on the tile for the resource group containing your cluster (as described above). Locate the public IP address associated with that node, the asset named maprcluster-publicIP0 in my example. Click on that resource to open its panel.
Click on the Configuration option in the left-hand portion of that panel, and in the resulting Configuration panel, locate the value assigned to the DNS name label. Copy that name along with the domain name presented under the textbox. If you don’t like the name assigned to the node here, you can modify it; just be sure that the name is unique within the assigned domain.
Armed with the fully qualified name of the node, you should now be able to connect to the Drill Console. Open a modern browser and navigate to port 8047 on the machine using HTTP. (Your address will look something like this: http://maprcluster-3xrrusnk-node0.westus.cloudapp.azure.com:8047.)
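If you connect to several nodes, it can help to assemble the address programmatically. The following is a minimal sketch that follows the address pattern shown in the example above (DNS name label, then region, then `cloudapp.azure.com`); the function names are my own and are not part of any Azure or MapR tooling.

```python
DRILL_CONSOLE_PORT = 8047  # Drill Console port used in this post


def node_fqdn(dns_label: str, region: str) -> str:
    """Join the DNS name label and the domain shown in the Configuration panel."""
    return f"{dns_label}.{region}.cloudapp.azure.com"


def service_url(dns_label: str, region: str, port: int, scheme: str = "http") -> str:
    """Build the URL for a service listening on a cluster node."""
    return f"{scheme}://{node_fqdn(dns_label, region)}:{port}"


if __name__ == "__main__":
    # Label and region taken from the example address in this post.
    print(service_url("maprcluster-3xrrusnk-node0", "westus", DRILL_CONSOLE_PORT))
```

Passing a different port and `scheme="https"` yields addresses for services that are served over HTTPS.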
You should now be presented with the Drill Console’s default page. Verify that the number of running Drillbits matches the number of nodes in your cluster.
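For larger clusters, comparing the two lists by eye gets tedious. As a rough sketch, assuming you have copied the Drillbit hostnames shown in the console into a list, a helper like the one below reports which expected nodes are missing. The node names other than maprclusternode0 are assumptions that simply extend the naming from my example.

```python
def missing_drillbits(expected_nodes, running_drillbits):
    """Return the expected nodes that do not appear among the running Drillbits."""
    running = {host.lower() for host in running_drillbits}
    return sorted(node for node in expected_nodes if node.lower() not in running)


if __name__ == "__main__":
    # Hypothetical node names following the maprclusternode0 pattern.
    expected = ["maprclusternode0", "maprclusternode1", "maprclusternode2"]
    running = ["maprclusternode0", "maprclusternode2"]
    print(missing_drillbits(expected, running))
```

An empty result means every node in the cluster is running a Drillbit.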
Should any nodes not be running Drill properly, you can connect to the MapR Dashboard on one of the running nodes via port 8443 using HTTPS. Again, use a modern browser to navigate to that address, and log in with the SysAdmin user name (the default is mapradmin) and the password you provided in the first form in the deployment process. Once logged in, locate the Services pane on the right-hand side of the dashboard’s default page.
In the Services pane, click on Drillbit, and on the resulting page, change the filter from Running Services is Drillbit to Running Services is not Drillbit.
If any nodes are configured for Drillbits (under the Configured Services column) but are not running Drillbits, click on the Hostname to open the host’s page. Scroll to the bottom of the resulting page to locate the Manage Node Services pane, locate the Drillbit service listing in it, and click on the associated Stop/Start button to start the service.
The cluster as it exists in its post-deployment state is fairly exposed on the Internet. This situation may be fine if you intend to limit your work to tutorials or publicly available data sets, but before doing anything potentially sensitive with the cluster, you will want to start exercising more control over inbound communications with it. This alteration can be done rather easily by implementing a network security group on the virtual network containing the cluster.
To set up a network security group, click on the +New option in the left-most pane of the Azure Portal dashboard. Enter network security group in the search box and select the Network security group item in the Results list.
On the panel for the Network security group, make sure the deployment model is set to Resource Manager and then click the Create button to configure the deployment.
In the resulting form, enter a Name for the network security group, set its Subscription as before, and assign it to the same (Use existing) Resource group and Location used for your cluster deployment. Click Create to deploy the network security group. Deployment should take less than a minute.
Once the network security group deployment is completed, click on the resource group tile for your cluster. Note the presence of the network security group item in the list of resource group assets.
Click on the network security group item. Click on the Inbound security rules item in the left-hand navigation of the network security group’s panel, and click +Add at the top of the resulting pane to create a new rule.
In the resulting Add inbound security rule panel, set the Name to SSH and leave Priority at 100. For the Source, you can leave the rule open to Any IP address or specify a CIDR block that corresponds to your environment. Leave Service set to Custom (though SSH is preconfigured in that drop-down). For Protocol, select TCP, enter a Port range of 22, and set the Action to Allow. Click OK. You now have a rule that allows SSH through the network security group. (You may need to give the Portal a few seconds for the new rule to appear in the list of Inbound security rules.)
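If you opt for a CIDR block, it is worth confirming that the block really covers the addresses you will connect from. Here is a minimal sketch using Python's standard `ipaddress` module; the addresses come from ranges reserved for documentation and stand in for your own, and the function name is hypothetical.

```python
import ipaddress


def source_allowed(client_ip: str, source: str) -> bool:
    """True if client_ip matches a rule Source of 'Any' or a CIDR block."""
    if source == "Any":
        return True
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(source)


if __name__ == "__main__":
    office_block = "203.0.113.0/24"  # placeholder CIDR block (documentation range)
    for ip in ("203.0.113.25", "198.51.100.7"):
        print(ip, source_allowed(ip, office_block))
```

A block that is too narrow will lock you out of the cluster, so check the public address your traffic actually leaves from, not the private address of your workstation.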
Repeat these steps to add inbound security rules for Drill Console (TCP port 8047), Drillbit Connections (TCP port 31010), and the MapR Dashboard (TCP port 8443). (With each rule, allow the Priority to increment as suggested by the interface.)
With the rules in place, you now need to associate the network security group with the public network interfaces tied to your cluster nodes. To do so, click on the Network interfaces item under the Inbound security rules item in the left-hand navigation of the network security group panel. Click the +Associate item in the resulting pane and select the first of the network interfaces associated with your cluster. Once it is associated, repeat this process for each of the remaining network interfaces in the cluster.
With the network configuration changes documented here, you have limited some of the exposure of your cluster to the Internet. The network security group you implemented allows traffic to enter on only four TCP ports (and potentially from a limited IP address range if you used the CIDR block option). All other ports are now blocked. If you need even tighter control over network access, consider blocking all Internet traffic into your cluster and implementing a VPN to reach it.
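The net effect of the four rules can be modeled in a few lines. This is a simplified sketch of the behavior described in this post, not a reproduction of how Azure evaluates rules internally: the rule names are my own, the priorities assume the portal suggests increments of 10 after 100, and the model ignores source addresses and Azure's built-in default rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundRule:
    name: str       # hypothetical rule name
    priority: int   # lower numbers are evaluated first
    protocol: str
    port: int
    action: str = "Allow"


# Ports from this post; priorities beyond 100 are an assumption.
RULES = [
    InboundRule("SSH", 100, "TCP", 22),
    InboundRule("DrillConsole", 110, "TCP", 8047),
    InboundRule("DrillbitConnections", 120, "TCP", 31010),
    InboundRule("MapRDashboard", 130, "TCP", 8443),
]


def inbound_decision(protocol: str, port: int, rules=RULES) -> str:
    """Return the action of the first matching rule, or 'Deny' if none match."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.protocol == protocol and rule.port == port:
            return rule.action
    return "Deny"


if __name__ == "__main__":
    for protocol, port in (("TCP", 8047), ("TCP", 3306), ("UDP", 22)):
        print(protocol, port, inbound_decision(protocol, port))
```

Keeping the rule list as data like this also gives you a record of which ports you opened and why.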
To control cost, you may want to shut down your cluster when it is not needed. To take this step, return to the resource group associated with your cluster. Click on each virtual machine, starting with the highest numbered one, to access its panel.
At the top of the panel, click on the Stop button. Click Yes to confirm your selection and wait for the virtual machine’s status to change from Running to Deallocated. Once deallocated, the Azure meter is no longer running on the virtual machine. Repeat this process for the remaining virtual machines, working from the highest numbered one to the lowest.
To restart your cluster, simply return to the resource group and select Start for each virtual machine, working from the lowest numbered virtual machine to the highest. Once each virtual machine is in a running state, give the cluster a few minutes for services to start and become responsive before attempting to reconnect to it.
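If you script the shutdown and restart, the ordering matters more than the mechanism. The sketch below only computes the order; it assumes the virtual machine names end in the node number, as maprclusternode0 does in my example, and leaves the actual stop and start calls to whatever tool you prefer.

```python
import re


def node_index(vm_name: str) -> int:
    """Extract the trailing node number from a VM name such as 'maprclusternode0'."""
    match = re.search(r"(\d+)$", vm_name)
    if match is None:
        raise ValueError(f"no trailing node number in {vm_name!r}")
    return int(match.group(1))


def stop_order(vm_names):
    """Highest numbered node first, as recommended for shutdown."""
    return sorted(vm_names, key=node_index, reverse=True)


def start_order(vm_names):
    """Lowest numbered node first, as recommended for restart."""
    return sorted(vm_names, key=node_index)


if __name__ == "__main__":
    # Names beyond node0 are assumed to follow the same pattern.
    vms = ["maprclusternode1", "maprclusternode0", "maprclusternode2"]
    print("stop: ", stop_order(vms))
    print("start:", start_order(vms))
```

Sorting on the parsed integer rather than the raw name keeps node10 after node9 on larger clusters.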