Your database schema defines the data in your tables and how they are related. The choices you make when specifying how to arrange your data in keys and columns, and how those columns are grouped in families, can have a significant effect on query performance.
Schemas for MapR-DB tables follow the same general principles as schemas for standard Apache HBase tables, with one important difference. Because MapR-DB tables can use up to 64 column families, you can make more extensive use of the advantages of column families:
- Segregate related data into column families for more efficient queries
- Optimize column-family specific parameters to tune for performance
- Group related data for more efficient compression
Naming your identifiers: Because the names for the row key, column family, and column identifiers are associated with every value in a table, these identifiers are replicated potentially billions of times across your tables. Keeping these names short can have a significant effect on your tables' storage demands.
Access times for MapR-DB tables are fastest when a single record is looked up based on the full row key. Partial scans within a column family are more demanding on the cluster's resources. A full-table scan is the least efficient way to retrieve data from your table.
Row key design
Because records in Apache HBase tables are stored in lexicographical order, using a sequential generation method for row keys can lead to a hot spot problem. As new rows are created, the table splits. Since the new records are still being created sequentially, all the new entries are still directed to a single node until the next split, and so on. In addition to concentrating activity on a single region, all the other splits remain at half their maximum size.
With MapR-DB tables, the cluster handles sequential keys and table splits to keep potential hotspots moving across nodes, decreasing the intensity and performance impact of the hotspot.
To spread write and insert activity across the cluster, you can randomize sequentially generated keys by hashing the keys, inverting the byte order. Note that these strategies come with trade-offs. Hashing keys, for example, makes table scans for key subranges inefficient, since the subrange is spread across the cluster.
Instead of hashing the entire key, you can salt the key by prepending a few bytes of the hash to the actual key. For a key based on a timestamp, for instance, a timestamp value of
1364248490 has an MD5 hash that ends with
ffe5. By making the key for that row
ffe51364248490, you avoid hotspotting. Since the first four digits are known to be the hash salt, you can derive the original timestamp by dropping those digits.
Be aware that a row key is immutable once created, and cannot be renamed. To change a row key's name, the original row must be deleted and then re-created with the new name.
Rows in a MapR-DB table can only have a single row key. You can create composite keys to approximate multiple keys in a table. A composite key contains several individual fields joined together, for example userID and applicationID. You can then scan for the specific segments of the composite row key that represent the original, individual field.
Because rows are stored in sorted order, you can affect the results of the sort by changing the ordering of the fields that make up the composite row key. For example, if your application IDs are generated sequentially but your user IDs are not, using a composite key of userID+applicationID will store all rows with the same user ID closely together. If you know the userID for which you want to retrieve rows, you can specify the first userID row and the first userID+1 row as the start and stop rows for your scan, then retrieve the rows you're interested in without scanning the entire table.
When designing a composite key, consider how the data will be queried during production use. Place the fields that will be queried the most often towards the front of the composite key, bearing in mind that sequential keys will generate hot spotting.
Column family design
Scanning an entire table for matches can be very performance-intensive. Column families enable you to group related sets of data and restrict queries to a defined subset, leading to better performance. When you design a column family, think about what kinds of queries are going to be used the most often, and group your columns accordingly.
You can specify compression settings for individual column families, which lets you choose the settings that prioritize speed of access or efficient use of disk space, according to your needs.
Be aware of the approximate number of rows in your column families. This property is called the column family's cardinality. When column families in the same table have very disparate cardinalities, the sparser table's data can be spread out across multiple nodes, due to the denser table requiring more splits. Scans on the sparser column family can take longer due to this effect. For example, consider a table that lists products across a small range of model numbers, but with a row for the unique serial numbers for each individual product manufactured within a given model. Such a table will have a very large difference in cardinality between a column family that relates to the model number compared to a column family that relates to the serial number. Scans on the model-number column family will have to range across the cluster, since the frequent splits required by the comparatively large numbers of serial-number rows will spread the model-number rows out across many regions on many nodes.
For a list of the properties that you can set when you create a column family, see the documentation for the
table cf create.
MapR-DB tables split at the row level, not the column level. For this reason, extremely wide tables with very large numbers of columns can sometimes reach the recommended size for a table split at a comparatively small number of rows.
In general, design your schema to prioritize more rows and fewer columns.
Because MapR-DB tables are sparse, you can add columns to a table at any time. Null columns for a given row don't take up any storage space.