During the early days of developing Apache Drill, the Drill team realized the need for an efficient way to represent complex, columnar data in memory. Projects like Protobuf provided an efficient way to represent data with a predefined schema for transmission over the network, and the Apache Parquet project had implemented an efficient way to represent complex columnar data on disk (based on the original Dremel paper by engineers at Google).
However, neither format was sufficient for the needs of a complex query processing engine like Drill. To overcome those limitations, the Drill team developed Value Vectors which, as it turns out, fulfill the needs of a large number of other similar data processing engines. Value Vectors now form the basis for the Apache Arrow project.
The Design of Value Vectors
For any data processing engine, the main use of Value Vectors is to allow operators to perform their function without any overhead in accessing specific data values. In addition, for performance, passing data from one operator in a (possibly distributed) pipeline to another requires that we make no additional copies of the data. Finally, the representation has to be efficient in representing null and complex types and to be language independent.
Random access in constant time
Value vectors represent data such that an individual value of a column can be accessed in constant time. Row-oriented stores keep data organized by rows. For example:
0,ALGERIA 1,ARGENTINA 2,BRAZIL
An application can access the second record (1, ARGENTINA) simply by keeping an offset that tracks where, in memory, each record begins.
In contrast, a column-oriented store keeps all values of a single column together. The same data in a column store will look like this:
0 1 2
ALGERIA ARGENTINA BRAZIL
Because values such as these country names vary in length, an application can no longer compute where a particular value begins. To solve this, Value Vectors keep offsets of the beginning of each value for every column. This increases the overhead a little, but allows accessing elements in constant time.
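The offsets idea can be sketched in a few lines. This is a minimal illustration of the layout, not the actual Drill or Arrow implementation; the class name `VarCharVector` is borrowed for flavor, and the details are simplified:

```python
# Minimal sketch of a variable-width value vector: all values of a
# column are packed into one contiguous data buffer, and a parallel
# offsets buffer records where each value begins. Value i can then be
# read in constant time as data[offsets[i]:offsets[i + 1]].
class VarCharVector:
    def __init__(self, values):
        self.data = bytearray()
        self.offsets = [0]
        for v in values:
            self.data += v.encode("utf-8")
            self.offsets.append(len(self.data))

    def get(self, i):
        # Constant-time random access via the offsets buffer.
        return bytes(self.data[self.offsets[i]:self.offsets[i + 1]]).decode("utf-8")

nations = VarCharVector(["ALGERIA", "ARGENTINA", "BRAZIL"])
print(nations.get(1))  # ARGENTINA
```

The extra offsets buffer is the small overhead the text mentions: one integer per value, in exchange for constant-time lookup into otherwise variable-width data.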
Efficient subsets of value vectors
Most operations on data output a different number of records than they receive as input. A simple example is a filtering operation that selects a certain set of records and removes the rest. To avoid expensive copying of data, Value Vectors provide an abstraction that builds a Vector on top of another Value Vector. The outer Vector does not copy the underlying Vector; instead, it keeps a list of offsets of values in the underlying vector. Any values that need to be removed from the original vector are simply not included in this list.
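The wrapping-without-copying idea is sometimes called a selection vector, and can be sketched as follows. The names here are hypothetical, not the actual Drill or Arrow API:

```python
# Sketch of filtering without copying: the filter's output is a thin
# wrapper holding (a) a reference to the underlying vector and (b) a
# list of indices of the surviving values. The data itself is shared.
class SelectionVector:
    def __init__(self, underlying, indices):
        self.underlying = underlying   # shared with the input, never copied
        self.indices = indices         # which underlying values survive

    def get(self, i):
        return self.underlying[self.indices[i]]

    def __len__(self):
        return len(self.indices)

values = [5, 12, 7, 30, 2]                            # underlying vector
keep = [i for i, v in enumerate(values) if v >= 7]    # filter: v >= 7
filtered = SelectionVector(values, keep)

print(len(filtered))    # 3
print(filtered.get(0))  # 12
```

Downstream operators read through the wrapper exactly as they would read the original vector, so a whole pipeline of filters can run without ever duplicating the underlying buffers.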
An early design requirement was to keep the representation independent of programming language and even machine architecture. Value Vectors provide a canonical list of datatypes which are defined in a language-independent manner. A value vector generated by a Java application can be read by an application written in C++, Python, or any other language. In fact, the underlying buffer of a value vector can be serialized directly over a network to another application, irrespective of the language it was written in. All the application needs is an implementation of the Value Vectors' public interfaces, and the Apache Arrow project aims to provide that.
For machine independence, Value Vectors assume that the underlying integer representation is little-endian.
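What "assume little-endian" buys can be shown with the standard library's struct module. This is an illustration of the principle, not the actual wire format:

```python
# Sketch: a fixed-width int32 vector's buffer written in little-endian
# byte order. "<i" forces little-endian int32 regardless of the host
# architecture, so any reader interprets the bytes identically.
import struct

values = [0, 1, 2]
buffer = b"".join(struct.pack("<i", v) for v in values)
print(buffer.hex())  # 000000000100000002000000

# A reader on any machine, whatever its native byte order, decodes the
# same values by committing to the same "<i" layout:
decoded = [struct.unpack_from("<i", buffer, 4 * i)[0] for i in range(3)]
print(decoded)  # [0, 1, 2]
```

Fixing the byte order in the specification means the buffer itself can cross the network untouched; no per-machine conversion step is needed on either end.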
Efficient representation of null and complex types
Value Vectors represent nulls such that sparse data sets (those that contain a large number of nulls) do not take up too much memory. Value Vectors maintain a set of bits to keep track of whether a particular value is null or not. If a value is null, it takes no additional space.
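A validity bitmap of this kind can be sketched as below. This is a simplified model, assuming one bit per value and a placeholder slot for nulls in a fixed-width vector; the real Arrow layout differs in detail:

```python
# Sketch of null tracking with a validity bitmap: one bit per value
# records whether the value is present. A null costs a single bit in
# the bitmap, so sparse (null-heavy) data stays compact.
class NullableIntVector:
    def __init__(self, values):          # values may contain None
        n = len(values)
        self.validity = bytearray((n + 7) // 8)   # 1 bit per value
        self.data = []
        for i, v in enumerate(values):
            if v is None:
                self.data.append(0)               # placeholder slot
            else:
                self.validity[i // 8] |= 1 << (i % 8)
                self.data.append(v)

    def get(self, i):
        bit = (self.validity[i // 8] >> (i % 8)) & 1
        return self.data[i] if bit else None

v = NullableIntVector([7, None, None, 42])
print(v.get(0), v.get(1), v.get(3))  # 7 None 42
print(len(v.validity))               # 1 -- one byte covers all four values
```

Checking a value's presence is a single bit test, so operators can skip nulls cheaply while scanning a column.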
Value Vectors represent complex types particularly elegantly. A Value Vector can consist of values of a primitive type or complex types. When the underlying type is a complex type, the value vector holds the value in another value vector, which in turn could hold any primitive or complex type.
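The vector-inside-a-vector idea can be sketched with a list type. The names are illustrative, not the actual classes; the point is that a complex vector is just offsets over a child vector, so nesting composes:

```python
# Sketch of a complex (nested) type: a list vector keeps its elements
# flattened in a child vector, plus offsets marking where each list
# begins and ends. The child could itself be another complex vector.
class ListVector:
    def __init__(self, lists):
        self.child = []        # child vector: all inner values, flattened
        self.offsets = [0]     # where each list starts in the child
        for lst in lists:
            self.child.extend(lst)
            self.offsets.append(len(self.child))

    def get(self, i):
        return self.child[self.offsets[i]:self.offsets[i + 1]]

cities = ListVector([["Algiers"], ["Buenos Aires", "Cordoba"], []])
print(cities.get(1))  # ['Buenos Aires', 'Cordoba']
print(cities.get(2))  # []
```

Because the nesting is expressed entirely through offsets, the inner values stay in one contiguous, columnar buffer even for deeply nested documents.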
While most Value Vectors will hold values of a single type, Value Vectors go one step further and support a Union Value Vector that can hold values of many types together.
Flexibility of this kind allows data processing engines to deal with unstructured and complex data, especially the data that is produced by internet-enabled devices. Value Vectors can handle missing fields, fields that appear late in the data set, and even fields that change type midway through the data set.
For more details, see the Drill team's original design goals document for Value Vectors.
Value Vectors in Apache Drill
Value Vectors form the basis of all in-memory operations in Drill, and are particularly well adapted for the needs of a schema-free query engine like Drill. Every Drill operator can process data efficiently, because Value Vectors allow random access with no overhead. Drill operators avoid expensive copy operations by using the ability of Value Vectors to provide subsets without making copies. Drill's RPC protocol, used both between Drillbits and between Drillbits and clients, transmits data in the Value Vectors format. Drill's exchange operators can therefore move data from one Drillbit to another without any conversion overhead: the Value Vector format allows the underlying buffer to be transmitted as-is, and both the Drill Java and C++ clients can deserialize data sent as Value Vectors.
Perhaps the most compelling use of Value Vectors in Drill is the ability to read data from formats like JSON that are inherently complex document formats, have a mutating schema, and can mostly be read only serially. By reading complex JSON into Value Vectors, Drill can process the same data in columnar form, execute the query in highly parallel fashion, and seamlessly manage the mutating schema of a JSON document.
Integrating Complex Projects
One of the early challenges for the Drill team was integration with other projects that also dealt with complex and columnar data. Since there was no standard for in-memory representation, each project read data into memory in its own representation and provided APIs that allowed other projects to access its data. This created a barrier to the integration of otherwise complementary projects, as adapters had to be written to convert from one format to another. Worse, the conversion itself is an expensive operation that consumes precious memory and CPU time.
With the availability of Apache Arrow, projects that share complementary goals can integrate very efficiently. One example already underway is a vectorized reader for Parquet files. This effort, part of the Apache Parquet project and carried out together with the Apache Spark and Apache Drill teams, will provide efficient reading of Parquet files into Value Vectors, and projects like Apache Drill and Apache Spark will take advantage of it. Down the road, the integration of Drill and Spark itself will become more efficient, and data interchange between the two will require no additional processing.
Because Value Vectors are easily serialized, they can even be used to persist data temporarily to disk or to exchange data over a network. It will be interesting to see the uses the Open Source community comes up with for this format.