13 min read
Data has become the new oil, enabling enormous change in commerce and industry. Companies today are working with petabytes of data that can be historic, generated in real time, or undergoing rapid or constant change; they generate insights with machine learning, SQL, graph processing, and distributed algorithms, using a wide-variety of open source tools. They are building complex and integrated data pipelines and bringing together massive amounts of data from a wide variety of sources. The easy access to vast amounts of data and cheap processing is opening up new applications/use cases that wouldn’t have been possible earlier. For example, in the auto industry, self-driving cars require analysis of months of driving data (video feeds, sensor data) from a variety of driving conditions (highways, busy streets, pedestrians, nighttime, rain/snow, etc.). This collection and use of data would not have been possible without the big data systems of today. Data systems are like the oil that fueled economic expansion in the 20th century.
Companies like Facebook, Google, Yahoo, and LinkedIn led the wave and developed massive, distributed data-processing frameworks internally. Many of these frameworks were later open-sourced. For example, Apache Hadoop and Apache Kafka have become extremely popular with the broad developer community.
Hadoop was among the first distributed data-processing products to be broadly adopted. It enabled users to crunch through massive amounts of data on commodity hardware using MapReduce.
Specifically, Hadoop had HDFS as the distributed file system and MapReduce as the compute engine, which together allowed analysts to crunch through terabytes (and later petabytes) of data to provide rapid insight that would have otherwise taken days, weeks, or sometimes just not been possible with traditional systems.
Initially, Hadoop was being used as a ‘static’ database, which worked great – until the need outgrew Hadoop’s capability. As users saw the need for doing analytics on data that is evolving, static files in HDFS didn’t suffice. The need for a database for big data emerged. This was addressed by Apache HBase and multiple other stand-alone NoSQL databases, like Cassandra and Couchbase. Each made different trade-offs to accomplish scale-out.
You have probably heard the adage: The three most important features of a database are 1) performance, 2) performance, and 3) performance. There is a lot of truth to that – no matter how rich the functionality of your database, users will not adopt it if it doesn’t provide high performance.
Broadly speaking, users look for these features in a big data system:
Strict Data Consistency: For a mission-critical application, data loss is unacceptable. Given that NoSQL databases run on commodity hardware, it is expected that hardware and software failures are relatively common. A big data system must expect and overcome expected hardware and software failures.
Strict data consistency in the database is essential for an enterprise to develop mission-critical applications. This allows the application developer to focus on the business logic, instead of worrying about data consistency and failure scenarios. The database system in conjunction with the underlying distributed platform must ensure data consistency with minimal impact to latency in all cases. Database systems that trade off consistency for performance or scale limit the type of applications that can be built.
To ensure no data loss in spite of a complete node or rack failure, databases ensure that any update to the database is replicated to multiple nodes on different racks before being acknowledged to the client. One of the nodes serves as the master for serving the data at all times. In case of failure, a new node is elected as the master. The data is sharded across multiple nodes to minimize impact due to the failure of a single node.
Performance and Strong SLAs: Database activity falls into three basic types: a) load or ingest of data: either loading historical data from files or data coming in from streaming sources; b) update data from operational systems; and c) access data from different compute engines on end-user dashboards.
In today’s 24x7 business, these activities are happening all the time and often overlap each other and are typically being performed by a large number of users. It is critical that the database provide extremely fast performance AND consistent SLAs. It is unacceptable for a business to see a drop in SLA on one application if other applications are also running at the same time.
YCSB (Yahoo! Cloud Serving Benchmark) is used as a gold standard for measuring the performance of big data systems. Real world workloads have a lot more variability than a typical benchmark; administrators have to ensure that the database is running on a hardware that is appropriately sized for the workload.
Architecturally, a database can achieve high performance and SLA support through a combination of hierarchical log-structured merge-trees (LSM trees) on disk and smart memory caches to a) minimize I/O during reads and b) consolidate I/O for writes and updates. Multi-threaded database servers with a lockless architecture prevent thread contention while proper design ensures that no single thread becomes a bottleneck, while still ensuring workloads can scale.
Data grows continuously, and the database has to ensure that a) the data is organized across nodes in small shards to retain high performance without hotspotting, b) data reorganization happens dynamically without any administrator involvement, and c) data growth and reorganization have no impact on end-application workload SLAs.
In an environment where multiple applications are pushing updates to the database, the database has to provide a consistent view of the data. For a NoSQL database, data is typically denormalized, and a row represents a complete unit of a typical transaction. Databases have to provide row-level atomicity. That is, under all conditions (failure, concurrent updates), the entire row is self-consistent.
To unlock the full potential of the data stored in the database, it needs to be very easy to get data in and out of the system by the application layer. There are a few dimensions to this.
In the big data world, a common use case is a multi-threaded middleware application, inserting and retrieving rows at large scale to be served to a front-end application. The database API needs to support fast, asynchronous and synchronous, primary key-based data ingest and retrieval.
As the complexity of application logic increases, the expectation is for the database to support rich query capabilities, like filters and aggregations. And to do it in a performant way. One key challenge is to ensure the application does not retrieve large portions of the data back to the application layer. This can have a devastating effect on the overall system performance and other concurrent workloads that might be running on the cluster.
Applications developers use a wide variety of programming languages. Often the choice is determined by the broader ecosystem the application resides in. A wide variety of languages allows multiple application use cases to be served from the same dataset. Otherwise, the user might be required to make a copy of the data to another system. Java, C, Python, node.js, and REST cover a vast majority of the needs. In addition, a simple network protocol can provide users and partners to add new languages easily.
In addition to serving applications, a big part of big data systems is to serve complex analytic queries. Products like Apache Drill and Apache Spark provide rich query functionality. Apache Drill, for example, can do complex ad hoc SQL queries and join data across tables, files, and Kafka topics.
For the query engine to execute queries quickly and efficiently, the query optimizer relies on a few primitives from the data layer: 1) a fast data access API with the ability to push down basic operations, like filters and aggregations, that can be accomplished more efficiently closer to the data, 2) secondary indexes on column(s) that are most frequently used as predicates in the query conditions, and 3) statistics about the data distribution that can aid in selecting the most efficient query plan.
It is good practice architecturally to decouple the data/storage NoSQL layer from the query layer as two independent distributed services.
First, by delegating the query capability to an independent query engine like Apache Drill, the NoSQL data layer can remain architecturally simple and more resilient. The distributed query layer can combine data from multiple sources and provide much richer query semantics. Second, this gives the administrator flexibility to scale the data and query layers independently, given the types of workloads. For example, a more analytic-heavy workload can be served by provisioning the query engine with more core and memory.
Integration with other analytic engines, like Apache Spark and Apache Hive, supports different use cases and analytics paradigms.
What I have described so far is a robust NoSQL database with rich application development and analytic capabilities. As your application moves from development to production, a few additional capabilities are very critical for production success in an enterprise.
Data is a critical enterprise asset and security of data is extremely critical. Core security capabilities, like authentication and data encryption, are typically provided as part of the larger data platform the database resides in. For data-access control, a database needs to provide fine-grained access control, so that administrators can limit which users have access to different subsets of the data. A more advanced capability would be to have the access controlled by tags or actual data values.
In a modern enterprise, data is expected to be updated and accessed from across the globe. Support for multi-master databases that reside in arbitrary cluster topologies across the globe is essential, both in the cloud or within the enterprise. This allows applications in every region to process data locally, while updates from anywhere in the globe are seamlessly transmitted to all locations. The underlying infrastructure should manage this reliably without any effort from the administrator or application developer.
Changes to data in a database often trigger business process in an enterprise. It could be as simple as notifying a data owner of the change or something more involved, such as pushing selective changes to another data system for specialized processing. Databases can provide a Change Data Capture (CDC) capability that allows for reliable and programmatic notification of downstream systems of changes in the database.
A database in the big data ecosystem has to provide capabilities similar to those we have gotten accustomed to in the traditional, relational world: data consistency, rich data access APIs and language bindings, and broad analytics-engine support with a focus on query optimization. In addition, for production deployment, fine-grained data access control and multi-master configurations are crucial for the enterprise. Given the scale of data, newer algorithms and approaches have to be baked into the architecture and design of the database.
Needless to say, without blazingly fast performance, none of these features would be compelling for an enterprise to use as their data platform.
Looking forward, containerized deployment of platforms and applications promise access to a limitless pool of resources that can be provisioned instantaneously. This can potentially add another dimension to workload management. Compute frameworks will also have to build deeper pushdown capabilities to allow faster processing.
There are a lot of exciting capabilities that will emerge to support the growing data scale and evolving data access patterns driven by AI/ML, analytics, and mission-critical applications of the future.
In subsequent blogs, you will learn about more details on the database internals.
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.