Pentaho Data Integration (PDI) provides ETL capabilities for capturing, cleansing, and storing data in a uniform, consistent format that makes it accessible and useful to end users and IoT technologies.
Apache Drill is a schema-free SQL-on-Hadoop engine that lets you run SQL queries against data sets in a variety of formats, e.g. JSON, CSV, Parquet, and HBase. By integrating Drill with PDI, you gain the flexibility to do serious data integration work against all of those sources. The Drill Tutorials pages in MapR’s documentation can help you get familiar with Apache Drill.
You’ll need administrator permissions to complete these steps. Make sure you meet the following software requirements:
Before you get started, you should also make sure that the PDI client system can resolve the hostnames of the nodes in your Drill cluster.
The first step is to get the Drill cluster ID and use it to construct a custom URL string, which you’ll need later to make the JDBC connection from PDI. Run the following query from any Drill client:
select string_val from sys.boot where name ='drill.exec.cluster-id';
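To sketch what the custom URL string looks like, here is a small helper that assembles a Drill JDBC connect string from the cluster ID returned by the query above. The hostnames, the ZooKeeper port (5181 is the MapR default), and the `-drillbits` suffix shown here are assumptions based on the typical Drill connect-string format; substitute the values from your own cluster.

```java
// Sketch: build a Drill custom JDBC URL of the form
//   jdbc:drill:zk=<zk-host>:<zk-port>[,...]/drill/<cluster-id>-drillbits
// Hostnames, port, and the "-drillbits" suffix are placeholders/assumptions.
public class DrillUrlBuilder {
    static String buildUrl(String[] zkHosts, int zkPort, String clusterId) {
        StringBuilder sb = new StringBuilder("jdbc:drill:zk=");
        for (int i = 0; i < zkHosts.length; i++) {
            if (i > 0) sb.append(',');          // comma-separate the ZK quorum
            sb.append(zkHosts[i]).append(':').append(zkPort);
        }
        sb.append("/drill/").append(clusterId).append("-drillbits");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example with two hypothetical ZooKeeper nodes on the MapR default port
        System.out.println(buildUrl(new String[]{"node1", "node2"}, 5181, "mycluster"));
    }
}
```

For a single-node cluster, the quorum collapses to one `host:port` entry; the cluster ID is whatever `sys.boot` reported for `drill.exec.cluster-id`.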
Once you have your custom URL string, follow these steps in PDI to create the connection:
If your connection test fails, verify that your custom URL string is correct, and make sure the hosts file on the PDI client can resolve the private hostnames of the cluster nodes.
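When troubleshooting, a quick way to rule out name resolution is to check each cluster hostname from the PDI client itself. This is a minimal sketch using the standard Java `InetAddress` lookup; the hostnames below are hypothetical placeholders for your own Drill nodes.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Sketch: check that this machine can resolve each Drill node's hostname.
// Replace the placeholder hostnames with your cluster's actual node names.
public class ResolveCheck {
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);    // succeeds if the name resolves
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String host : new String[]{"drill-node1", "drill-node2"}) {
            System.out.println(host + " -> " + (resolves(host) ? "resolves" : "NOT resolvable"));
        }
    }
}
```

Any hostname that prints `NOT resolvable` needs an entry in the PDI client's hosts file (or a DNS fix) before the JDBC connection will work.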
By the end of this process, you should have successfully connected your Pentaho Data Integration client to your MapR cluster using Apache Drill. Have fun with your data!