The Network File System (NFS) protocol provides remote access to shared disks across networks. An NFS-enabled server can share directories and files with clients, allowing users and programs to access data on remote systems as if they were stored locally. NFS has become a well-established industry standard and a widely used interface. It provides important advantages including greater interoperability between distinct systems, easier data sharing across multiple users and applications, and flexibility from the physical decoupling of storage from compute resources.
The MapR Converged Data Platform is the only big data platform that provides the full power of NFS. MapR Direct Access NFS™ offers usability and interoperability advantages, and makes big data radically easier and less expensive to use. MapR allows files to be modified and overwritten at high speeds in real time from remote servers via an NFS mount, and enables multiple concurrent reads and writes on any file. Here are some examples of how MapR customers have leveraged NFS in their production environments:
You can easily load data in real time into MapR by mounting the MapR cluster from a remote node via NFS and using operating system tools to copy the data into the NFS mount point. No coding or special commands are required, and data can be loaded from any server that supports NFS, including hardware running Microsoft Windows. This works well even in high-speed, high-volume environments. NFS is fast and reliable on MapR and is commonly used by customers in production environments.
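As a minimal sketch of this workflow: with the cluster mounted over NFS, loading data is an ordinary file copy. The mount point and directory names below are hypothetical, and a temporary directory stands in for the mount so the sketch runs anywhere:

```shell
# MAPR_MOUNT would normally be the NFS mount point of the cluster,
# e.g. /mapr/my.cluster.com (hypothetical); a temporary directory is
# used as a stand-in here so the sketch is self-contained.
MAPR_MOUNT="${MAPR_MOUNT:-$(mktemp -d)}"

mkdir -p "$MAPR_MOUNT/data/incoming"

# Create a sample file, then load it with a standard OS tool -- cp here,
# but rsync, tar, or a Windows copy would work the same way.
printf 'id,value\n1,42\n' > /tmp/events.csv
cp /tmp/events.csv "$MAPR_MOUNT/data/incoming/"

ls "$MAPR_MOUNT/data/incoming"
```

No Hadoop commands or client libraries are involved; the copy is visible to the whole cluster as soon as it completes.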
USE OF EXISTING APPLICATIONS
From an application standpoint, accessing data in MapR over NFS works identically to accessing data on a local drive. This means MapR can be used as a real-time network-attached storage (NAS) device for applications and systems that read and write data. Third-party engines like Nginx, MySQL, and HP Vertica are certified to run on MapR with no special plugins or customizations, treating MapR as a NAS for storing persistent data.
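To illustrate, an application can persist data with plain POSIX file operations and no MapR-specific API. The directory below is a hypothetical stand-in for a path on the NFS mount:

```shell
# APP_DIR stands in for a directory on the mounted cluster, such as
# /mapr/my.cluster.com/apps/logs (hypothetical); a temp dir keeps the
# sketch self-contained.
APP_DIR="${APP_DIR:-$(mktemp -d)}"

# Ordinary shell redirection -- the same file I/O an application would
# perform against a local disk or any other NAS device.
echo "request served" >> "$APP_DIR/access.log"
tail -n 1 "$APP_DIR/access.log"
```

Because the calls are standard, existing applications need no code changes to store their data on the cluster.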
NFS VERSUS THE HDFS API
With its long history of handling Apache™ Hadoop® workloads, MapR naturally supports the Hadoop Distributed File System (HDFS) API in addition to NFS. The HDFS API is used by many open source tools for big data such as Apache Drill, Apache Hive™, Apache Sqoop™, etc. To access a MapR cluster via the HDFS API, the MapR Client must be installed on the client node. MapR provides easy-to-install clients for Linux, Mac, and Windows. The HDFS API is built on Java, so in most cases client applications are developed in Java and linked to the Hadoop-core-*.jar library.
Supporting both NFS and the HDFS API may appear redundant, but in MapR they serve two separate purposes. The HDFS API has advantages with regard to security and parallelized data access, and is preferred for applications that support it. NFS, on the other hand, is ideal for all other applications that read and write files using standard operating system calls. As mentioned earlier, NFS on MapR provides usability and interoperability advantages. For data access that combines the seamlessness of NFS with the advantages of the HDFS API, the MapR POSIX Client is the ideal technology (discussed later in this paper).
NFS was not initially used as part of Hadoop because of the "append-only" limitation in HDFS. Since the NFS protocol does not guarantee file packets are delivered in order, it is very inefficient to implement NFS on top of HDFS. So while an NFS interface was later added to HDFS, it is completely different than the implementation in MapR. The differences and the limitations of the HDFS implementation are discussed later in this document.
Each node in the MapR cluster has a FileServer service, whose role is similar in many ways to the DataNode in the Hadoop Distributed File System (HDFS). In addition, there can be one or more NFS Gateway services running in the cluster. In many deployments the NFS Gateway service runs on every node in the cluster, alongside the FileServer service. See Figure 1.
To access a MapR cluster over NFS, the NFS client on the remote node mounts any of the NFS Gateway servers in the MapR cluster. There is no need to install any software on the client node, because every common operating system (except for some lower-end Windows versions) includes an NFS client. In Windows, the MapR cluster becomes a drive letter (e.g., M:, Z:, etc.), whereas in Linux and Mac the cluster is accessible as a directory in the local file system (e.g., /mapr).
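On a Linux client, the mount itself is a single standard command. This is a setup sketch, not a runnable example: the gateway hostname and mount points are hypothetical, and the exact NFS options vary by environment:

```shell
# Mount the cluster through any NFS Gateway node (requires root).
# nfsgw1.example.com and the /mapr paths are hypothetical; MapR exports
# the cluster over NFSv3, hence vers=3.
sudo mkdir -p /mapr
sudo mount -t nfs -o vers=3,nolock,hard nfsgw1.example.com:/mapr /mapr

# The cluster now appears as an ordinary directory tree.
ls /mapr
```

An equivalent entry in /etc/fstab makes the mount persistent across reboots.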
The MapR Platform supports random reads and writes and multiple simultaneous readers and writers. This provides a significant advantage over HDFS-based data platforms, which only provide an append-only storage system (similar to a CD-ROM). See Figure 2.
Support for random reads and writes is necessary to provide true NFS access, and more generally, any kind of access for non-Hadoop applications. In a MapR cluster, the NFS Gateway service receives requests from the client and translates them into the corresponding RPCs to the FileServer services.
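The difference is easy to picture with a small sketch: an in-place overwrite that is routine on a random read/write file system such as MapR-FS, but impossible through an append-only store. The file path is arbitrary:

```shell
FILE="${FILE:-/tmp/random_write_demo.txt}"
printf 'hello world\n' > "$FILE"

# Overwrite five bytes in the middle of the file in place -- a random
# write that an append-only system like HDFS cannot perform.
printf 'MAPR!' | dd of="$FILE" bs=1 seek=6 conv=notrunc 2>/dev/null

cat "$FILE"   # -> hello MAPR!
```

On an append-only system, achieving the same result would require rewriting the entire file.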
MapR POSIX Client is an add-on product that provides seamless data access to MapR from remote nodes, just like with NFS. It gives you the added benefits of authentication, encrypted transmission, compressed transmission, and parallelized communications. Application servers, web servers, and other applications/systems can read and write directly and securely to a MapR cluster with significantly faster throughput. The MapR POSIX Client leverages the "Filesystem in Userspace" (FUSE) interface to provide the seamless data access, so it is technically not NFS, but behaves exactly the same way from the user/application perspective. An earlier version of the MapR POSIX Client used NFS (the "MapR POSIX Loopback NFS Client"), but that version has been deprecated in favor of the FUSE-based implementation. See Figure 3.
The MapR POSIX Client is available in two versions, Basic and Platinum. The only difference is maximum throughput: Basic supports up to 1 GB/s and Platinum up to 3 GB/s. For comparison, MapR Direct Access NFS supports up to 500 MB/s.
HDFS-based data platforms claim to support NFS, which is technically true but not practically useful. As alluded to earlier in this document, the HDFS implementation of NFS has significant limitations: it does not provide the performance or ease of use required for real-time production environments. When importing a file through the HDFS implementation of NFS, the system first writes the file in its entirety to the gateway node's local file system, so there must be enough local disk space to hold the entire file. This becomes a problem when importing many large files simultaneously. This intermediate step is necessary because the NFS protocol is stateless on the server side: every request is treated independently of the others, so the packets the NFS server receives from the client are almost certainly out of sequence, which conflicts with HDFS's requirement that writes be sequential. Properly aggregating all write operations on a file requires a random read/write file system, hence the temporary staging file on the local file system.
The system then imports that temporary file into HDFS. This overhead of creating a temporary staging file, and the monitoring of free local disk space, make the HDFS implementation of NFS impractical as a means of interchanging data in a production environment.
Because of this bottleneck, NFS on HDFS cannot support multiple concurrent users: the gateway can quickly run out of local disk space, and performance becomes unusable because all NFS traffic is staged on the gateway's local disks. In fact, the HDFS documentation recommends using the HDFS API or WebHDFS "when performance matters."
The MapR Converged Data Platform uniquely provides a robust, enterprise-grade data store that leverages commodity hardware in a massively scalable architecture. It exposes the standard NFS interface so that remote nodes, applications, and systems get full read and write access to data. This capability makes big data and all of its associated tools (including Hadoop and Spark) much easier to use, and enables new classes of applications and environments that need to interoperate easily with MapR.