This is the first installment in our blog series about deep learning. In this series, we will discuss deep learning technology, the available frameworks and tools, and how to scale deep learning using big data architecture.
Machine learning is a branch of computer science that studies algorithms that learn from observational data without being explicitly programmed. In other words, the goal is to design algorithms that can learn automatically, without human intervention. In general, machine learning can be considered a subfield of AI.
Machine learning, in a broader sense, can be categorized as:
In the data analytics field, both of these categories are heavily used, particularly with the advent of big data. As more and more data is generated and stored, predictive analytics and machine learning are being applied in autonomous vehicles, healthcare, banking, telecommunications, smart cities, smart factories, and many other verticals to solve a wide range of use cases.
Neural networks, or as they are more appropriately called, artificial neural networks (ANNs), were introduced in the 1940s through the work of McCulloch, Pitts, and Hebb. ANNs were largely based on the behavior of the neurons and axons found in the human brain. An ANN is composed of a large number of highly interconnected neurons (processing units) that work in unison to solve a specific problem. ANNs learn to solve a problem by first learning from examples, as humans do. ANNs are problem specific, as their learning is restricted to the examples they are trained on, but they can be applied to all sorts of classification and pattern recognition problems.
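To make the idea of a neuron as a processing unit that learns from examples more concrete, here is a minimal sketch of a single threshold neuron trained with the classic perceptron rule. It is illustrative only: the function names, the learning rate, and the choice of the logical AND task are assumptions made for this example, not something taken from a particular library.

```python
def predict(weights, bias, inputs):
    """Fire (return 1) when the weighted sum of the inputs plus bias is positive."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train_perceptron(examples, epochs=20, learning_rate=1.0):
    """Learn weights from (inputs, target) examples using the perceptron rule."""
    n_inputs = len(examples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            # error is +1, 0, or -1; weights move only when the neuron is wrong
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical training set: the logical AND of two binary inputs.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_EXAMPLES)
predictions = [predict(weights, bias, inputs) for inputs, _ in AND_EXAMPLES]
```

A single neuron like this can only separate classes with a straight line, which is one reason real networks connect many of them together.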
ANNs were famous for their remarkable ability to derive meaning from complicated and imprecise data. They have been widely studied and widely used since their inception. Over time, however, their efficacy was questioned because of two limitations: utility and hardware requirements. In terms of utility, ANNs were restricted to solving “toy” problems and could not handle real, more complex ones. In terms of hardware, a large amount of memory and computational power was required to implement large ANNs effectively. There were also theoretical objections that ANNs do not necessarily reflect how neurons function in the human brain. These restrictions paved the way for different algorithms, such as Support Vector Machines, Random Forests, and Compressed Sensing, but recent research in deep learning has brought attention back to neural networks.
Deep learning is part of a broader family of machine learning methods and uses a cascaded structure of neural network layers known as hidden layers. The difference between shallow and deep neural networks is the number of hidden layers: shallow networks have few, while deep networks have many. Although (shallow) artificial neural networks have existed since the 1940s, research on deep learning has taken off only recently. The reasons are threefold:
Deep learning has been made possible thanks to the above factors. In recent years, it has been applied to a wide range of problems: object recognition in images, automated machine translation, drug discovery, recommendation engines, biomedical informatics, NLP, and more.
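The distinction between shallow and deep networks described earlier comes down to how many hidden layers the input passes through. The sketch below, in plain Python, builds both kinds of network from a list of layer sizes and runs a forward pass. The layer sizes, the random initialization, and the tanh activation are arbitrary assumptions chosen for illustration; the networks are untrained.

```python
import math
import random


def build_network(layer_sizes, seed=0):
    """Create one (weights, biases) pair per connection between adjacent layers."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, inputs):
    """Pass the inputs through every layer in turn, applying tanh at each one."""
    activations = list(inputs)
    for weights, biases in layers:
        activations = [
            math.tanh(sum(w * x for w, x in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


# Sizes are [inputs, hidden..., outputs]; hidden layers = len(sizes) - 2.
shallow = build_network([4, 8, 2])           # one hidden layer
deep = build_network([4, 8, 8, 8, 8, 2])     # four hidden layers

sample = [0.5, -0.2, 0.1, 0.9]
shallow_out = forward(shallow, sample)
deep_out = forward(deep, sample)
```

Both networks accept the same input and produce two outputs; the deep one simply cascades more transformations in between, which is also why it needs more data and compute to train.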
Although deep learning lends itself naturally to distributed computing, it has mostly been run on single machines with high-end GPUs. These multi-GPU machines are very expensive, and as data grows, a single machine can no longer hold it. The big data world, and MapR in particular, offers the alternative of distributed computing (on GPUs or CPUs) using commodity hardware. Petabytes or even zettabytes of data can be stored easily, providing a platform where deep learning can run at scale.
Many open-source libraries offer deep learning functionality. The most prominent ones are listed below:
Among the available libraries, TensorFlow and Caffe provide a higher level of abstraction. This is helpful to people who lack in-depth expertise in deep learning or machine learning. A higher level of abstraction also spares developers the intricacies of tuning hidden layers and keeps them from reinventing the wheel. TensorFlow is an open-source project from Google, while Caffe comes from Berkeley’s BVLC lab.
Deeplearning4j is well known among developers. However, presumably because it supports few languages beyond Java and Scala, its popularity hasn’t grown.
TensorFlow is becoming more popular among developers and in industry alike, as it provides a higher level of abstraction, is stable, and is perceived as production ready.
Apache Spark is the de facto choice of distributed computational framework in the big data world. It is an open-source framework that has been adopted across all verticals. The current Spark stack includes Spark SQL, Spark Streaming, GraphX, and MLlib, a native library for conventional machine learning algorithms. Spark doesn’t have a deep learning library in its stack, though.
Spark’s lack of deep learning support has been a challenge for members of the big data community who want to explore deep learning frameworks. On a positive note, TensorFlow introduced HDFS support in October 2016. Even so, it still needed a separate cluster to run the application. This inherently raises the cost of cluster installation, and at scale it means significant latency.
The open-source community has started several initiatives to address these limitations by running TensorFlow on top of the Spark framework. One such initiative is SparkNet, which launches TensorFlow networks on Spark executors. More recently, TensorFrames (i.e., TensorFlow + DataFrames) was proposed. It is a seemingly great workaround, but in its current state it is still under development, and migrating existing TensorFlow projects to the TensorFrames framework demands significant effort. In addition, Spark is a synchronous computing framework, whereas deep learning is typically asynchronous; TensorFrames therefore imposes a significant change on how computation is done with TensorFlow.
A recently published framework called TensorFlowOnSpark (TFoS) addresses the above-mentioned problems. TFoS enables TensorFlow to run in a distributed fashion on Spark and Hadoop clusters. TFoS can read data directly from HDFS, via TensorFlow’s file readers or using QueueRunners, and then feed it to a TensorFlow graph. This flexibility helps TensorFlow users migrate to the big data environment relatively easily.
TFoS supports all sorts of TensorFlow programs. It enables both synchronous and asynchronous training and inference, and it supports both model parallelism and data parallelism. It also supports TensorFlow ecosystem tools, such as TensorBoard, on a Spark cluster. TFoS is expected to work seamlessly with the Spark stack (i.e., Spark SQL, MLlib, etc.). The biggest advantage of TFoS is that the TensorFlow programming paradigm hasn’t changed, so migration from TensorFlow to TFoS is easy. Another advantage, in terms of scaling, is that the architecture is designed so that process-to-process communication doesn’t involve the Spark driver, which enables the program to scale easily.
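To illustrate what synchronous data parallelism means in general, here is a conceptual sketch in plain Python. It does not use the TFoS or TensorFlow APIs; the model (a single weight `w` fitting `y = w * x`), the data, the shard layout, and all function names are assumptions invented for this example. Each shard stands in for the data held by one worker: every worker computes a gradient on its own shard, the gradients are averaged, and one shared update is applied.

```python
def shard_gradient(w, shard):
    """Gradient of the mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def synchronous_step(w, shards, learning_rate):
    """One synchronous update: wait for every shard's gradient, then average."""
    # On a real cluster each gradient would be computed on a separate worker.
    gradients = [shard_gradient(w, shard) for shard in shards]
    average = sum(gradients) / len(gradients)
    return w - learning_rate * average


# Hypothetical data generated from y = 3x, split across two "workers".
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(200):
    w = synchronous_step(w, shards, learning_rate=0.01)
```

In an asynchronous variant, each worker would apply its gradient to the shared weight as soon as it finished, without waiting for the others. That removes the waiting but means workers may compute gradients against slightly stale weights.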
TFoS appears to provide a scalable solution that combines the advantages of a big data compute framework with the best available deep learning library, TensorFlow.
Recent advances in container technology have given rise to microservices architecture. The same technology has also been used to implement customized applications on top of big data frameworks, as containers are application agnostic. Deep learning can benefit from containerization as well. Programs built on deep learning libraries can be containerized and given access to petabytes of data residing on big data file systems such as MapR XD. To distribute the learning, multiple containers running on multiple machines can be orchestrated using technologies such as Kubernetes. Container-based scaling has a significant advantage over Spark-based scaling: beyond a certain size (>100 nodes), a Spark cluster requires significant tuning, whereas tools such as Kubernetes use containers to address those limitations. Kubernetes, through containers, also has the potential to exploit heterogeneous hardware (GPUs and CPUs). We’ll be writing more about scaling deep learning using containers in the next blog post.