MapR 5.0 Documentation : Impala User-Defined Functions

User-Defined Functions Overview

A user-defined function (UDF) is a SQL function that you create to encapsulate code that processes column values during an Impala query. UDFs are called from within a SQL statement, like a regular function, and return a single value. Starting in Impala 1.2.3 for MapR, you can create UDFs to perform custom calculations and transformations that built-in SQL operators and functions do not provide.

After you download and install the UDF development package, you can write your own UDFs and invoke them in the queries that you run on Impala. Impala can run scalar UDFs and UDAFs (user-defined aggregate functions).

UDFs and UDAFs

UDFs and UDAFs differ in the number of rows that they accept as input. A UDF operates on a single row and produces a single row as the output. When you use a UDF in a query, the UDF is called once for each row in the result set. Mathematical and string functions are examples of UDFs. When you create a UDF, you issue the CREATE FUNCTION statement. After you create a UDF, you can use it in the expression of a SQL statement. 

A UDAF operates on multiple input rows and produces a single row as output. The COUNT(), MAX(), SUM(), and AVG() aggregate functions are examples of UDAFs. You might use a UDAF in a query with a GROUP BY clause to produce a result set with a separate aggregate value for each combination of values from the GROUP BY clause. When you create UDAFs, use the CREATE AGGREGATE FUNCTION statement. After you create a UDAF, you can use it the expression of a SQL statement.

When you create UDFs and UDAFs, they cannot have the same name as any of the built-in functions.

UDFs in C++ and Java

Impala accepts UDFs and UDAFs written in C++, as well as Hive UDFs written in Java. A Java UDF may cause a query in Impala to run much slower than the equivalent native UDF written in C++. If you use Hive UDFs when you query Impala, the Hive UDFs must meet the following conditions:

  • Parameters and return values must all use data types that Impala accepts. Impala does not accept nested and composite types.
  • UDFs cannot accept or return the TIMESTAMP data type.
  • The return type must be a writable type, like Text or IntWritable. UDFs return a NULL value for Java primitive types, like string or int.
  • Impala does not accept Hive UDAFs or UDTFs.

For more information about Hive UDFs, refer to the Hive Language Manual UDF.

Get Started with UDFs

Before you can write UDFs for your Impala queries, you must install the Impala development package that contains header and build configuration files, and download the sample source code.

The development package provides a header file that is required to develop UDFs and UDAFs. The header contains the types that must be used and the FunctionContext object. This object is the interface object between the UDF or UDAF and the Impala process. You can access the header in /usr/include/impala_udf/udf.h

The sample source code provides sample files that you can view to see the declarations required to write UDFs and UDAFs, and sample source code used to create a simple UDF and UDAF example.

After you have installed the development package and sample source code, you can build the samples to create an environment that you can use to write your UDFs and UDAFs. Once you have completed your UDF or UDAF, you can deploy it.

To summarize, you must complete the following steps to create UDFs and UDAFs for use with Impala:

  1. Download and install the development package.
  2. Download the sample source code and build scripts.
  3. Build the samples.
  4. Write your UDF or UDAF.
  5. Deploy your UDF or UDAF from the impala-shell.

Installing the Development Package

Use the appropriate commands for your operating system to install the development package.

To install the development package, complete the following steps:

  1. Install the gcc-c++ compiler and boost-devel using the command appropriate for your environment:

    RedHat and CentOS
    $ sudo yum install gcc-c++ cmake boost-devel
    Ubuntu
    $ sudo apt-get install g++ libboost-devel cmake
  2. Install the development package, using the command appropriate for your environment:

    RedHat and CentOS
    $ sudo yum install mapr-impala-udf
    Ubuntu
    $ sudo apt-get install mapr-impala-udf

    Impala is not required for UDF development. You can create UDFs on a minimal development system and then deploy the UDFs to a machine with Impala. 

Downloading Sample Source Code

To download the sample source code and build scripts, issue the following command:
git clone https://github.com/mapr/impala-udf-samples 

The following table provides the source code file names and their descriptions:

File Name

Description

udf-sample.h

Header file that declares the signature for a scalar UDF.

udf-sample.cc

Sample source for a simple UDF that adds two integers.

udf-sample-test.cc

Basic unit tests for the sample UDF.

uda-sample.h

Header file that declares signature for sample aggregate functions.

uda-sample.cc

Sample source for simple UDAFs that show how to manage state transitions as underlying are called during various phases of query processing.

uda-sample-test.cc

Basic unit tests for the sample UDAFs.

Building the Samples

To build the samples, complete the following steps:

  1. Issue the cmake configuration command.
    This command reads the file CMakeLists.txt, and generates a Makefile customized for your directory paths.
  2. Issue the make command.
    The make command runs the build steps based on rules in the Makefile. Impala loads the shared library from a MapR-FS location.

The samples get built to build/. This contains test executables that you can run locally, without the Impala service installed. It also contains shared object artifacts that you can run on Impala.

At this point, you can write your UDFs or UDAFs. When you write your functions, build a shared library to contain them. Use the mapr or hadoop commands to copy the binary files to a MapR-FS location that Impala can read from.

Writing UDFs

When you write UDFs, use function-oriented programming best practices, and note data type differences when transferring values from high-level SQL to low-level UDF code.

The UDF code samples that you downloaded include the udf-sample.h and udf-sample.cc files that you can reference when you write your UDFs. The udf-sample.h file provides the basic declarations required to write a scalar UDF. This file defines a simple function named AddUdf() that you can reference. The udf-sample.cc file provides sample C++ code for a simple function named AddUdf().

Function Argument/Return Value Data Types

Every value that a UDF accepts as an argument or returns as a result, must map to a SQL data type that you can specify for a table column.

Every data type has a corresponding structure defined in the C++ and Java header files with two member fields and some predefined comparison operators and constructors:

  • is_null indicates if a value is/is not NULL. When non-NULL, val holds the argument or return value.
  • null () is a member function that constructs an instance of the struct with the is_null flag set.
  • <,>=,BETWEEN,ORDER BY are built-in SQL comparison operators and clauses that work automatically based on the SQL return type of each UDF.
  • Every struct within your UDF code defines == and != operators for comparisons with structs of the same type for typical C++ comparisons within your own code. Each kind of struct one or more constructors that define a filled-in instance of the struct.
  • Every type of struct has a null() member function that returns an instance of the struct with is_null flag set.
  • Impala cannot process UDFS that accept or return composite or nested types. This applies to UDFs written in C++ and Jave-based Hive UDFs.
  • You can create multiple functions with the same SQL name and different argument types to overload functions, however you must use different C++ or Java entry point names in the underlying functions.

The following table lists the data types defined for C++ in /usr/include/impala_udf/udf.h:

Data Type

Description

IntVal

Represents an INT column.

BigIntVal

Represents a BIGINT column.

SmallIntVal

Represents a SMALLINT column.

TinyIntVal

Represents a TINYINT column.

StingVal

Represents a STRING column. It has a len field that represents the length of the string and a ptr field that points to the string data. It also has a constructor that creates a new StingVal struct based on a null-terminated C-style string or a pointer plus a length. It also has a constructor that takes a pointer to a FunctionContext struct and length, which does not allocate space for a new copy of the string data that you can use in UDFs that return string values.

BooleanVal

Represents a BOOLEAN column.

FloatVal

Represents a FLOAT column.

DoubleVal

Represents a DOUBLE column.

TimestampVal

Represents a TIMESTAMP column. It has a 32-bit integer date field that represents the Gregorian date and a 64-bit integer time_of_day field that represents the current time of day in nanoseconds.

 

Variable-Length Argument List

Each argument is named explicitly in the signature of your C++ function. A UDF can take a fixed number of these named arguments, however a function can accept additional arguments if they are all of the same type.

You must code the signature of your function using the following format if you want your UDF to accept a variable-length argument list:

StringVal Concat(FunctionContext* context, const StringVal& separator,
int num_var_args, const StringVal* args);

The SQL query call must pass at least one argument to the variable-length portion of the argument list. Impala calls the function and fills in the initial set of required arguments and then passes the extra arguments and a pointer to the first of the optional arguments.

NULL Values

Each UDF that you write must have the ability to handle NULL values. If a UDF receives a NULL value, it typically also returns a NULL value.

UDF Memory Allocation

Any memory allocated to a UDF is taken back by the system after the UDF exits. Input arguments remain allocated for the lifetime of a function. You can refer to the input arguments in expressions for return values. When using temporary variables to construct all new string values, use the StringValue() constructor, and copy the data into the newly allocated memory buffer. The StringValue() constructor takes an initial FunctionContext* argument followed by a length.

UDF Error Handling

Call functions that are members of the initial FunctionContext* argument passed to your function to handle UDF errors.

Use the following signature in the function to record warnings for conditions that indicate minor, recoverable problems that do not cause the query to stop:

bool AddWarning(const char* warning_msg);

Use the following signature if you want the UDF to set an error flag that prevents the query from returning any results in serious cases where the query cancellation must occur.

void SetError(const char* error_msg);

Writing UDAFs

A UDAF must maintain a state value across subsequent calls in order to accumulate a result across a set of calls, instead of deriving it purely from one set of arguments.

The underlying functions represent a UDAF:

  • Initialization function. Sets counters to zero, creates empty buffers, and performs any other initial setup for a query.
  • Update function. Processes the arguments for each row in the result set and accumulates an intermediate result for each node.
  • Merge function. Combines the intermediate results from different nodes.
  • Finalize function. Passes through the combined result unchanged, or does one final transformation.

Deploying UDFs and UDAFs

To deploy your UDFs or UDAFs, complete the following steps from the impala-shell:

  1. Issue a USE statement to the database that you want to associate your new function with.
  2. Issue a CREATE FUNCTION statement in the impala-shell to make Impala aware of the new function.
    • To deploy a UDF, issue the CREATE FUNCTION statement.
    • To deploy a UDAF, issue the CREATE AGGREGATE FUNCTION statement. Specify the entry points of the underlying C++ functions using the clauses INIT_FN, UPDATE_FN, and FINALIZE_FN.

You can also issue the CREATE FUNCTION with a fully qualified database name to skip the USE statement step. 
Example: CREATE FUNCTION <db_name.function_name>

Hive UDFs

To use Hive UDFs in Impala, get the applicable JAR files from the Hive UDFs, and use the Impala UDF deployment process to create new UDFs with new names.

Data types of arguments must match the function signature exactly when reusing Hive Java code for built-in functions.

To use a Hive UDF with Impala, complete the following steps:

  1. Get a copy of the Hive JAR file with the UDFs that you want to use with Impala.
  2. Issue the following command to see a list of classes inside the JAR file:
    jar -tf <jar_filename>
  3. Copy the JAR file to a MapR-FS location that Impala can read.
  4. From the impala-shell, create a database to use with the UDF.
  5. To identify the data base that you want to query using the UDF, issue the USE statement to through the impala-shell for that particular database, or specify the SQL function name as db_name.function_name.
  6. Issue a CREATE FUNCTION statement for each UDF that you want to use with Impala. The CREATE FUNCTION statement should contain a LOCATION clause with the full MapR-FS path to the JAR file and a SYMBOL clause with a fully qualified name of the class. Use dots as separators. Do not use the .classpath extension.
  7. Issue a query and call the function. Pass the correct type of arguments to the function.

UDF Limitations

Impala UDFs have the following limitations:

  • Impala cannot work with UDFs that accept or return composite, nested, or types not available in Impala tables.
  • UDFs must produce the same output each time the same argument value is passed.
  • UDFs cannot spawn other threads or processes.
  • UDFs become undefined when you restart the catalogd process. In this scenario, you must reload the UDFs.
  • You currently cannot include user-defined table functions in Impala queries.

UDF Security

If you enable the Impala authorization feature, consider the following:

  • You must have the required read privilege for database and tables used in the query to call a UDF in the query.
  • Only an administrative user can create UDFs because improperly coded UDFs can cause performance and capacity issues. Issuing the CREATE FUNCTION statement requires the ALL privilege on the server.