Hadoop comes with its own distributed filesystem, HDFS. As the use of the internet is increasing, we are producing data at astonishing rates. With the increase in the size of data, we face issues of storage as well as processing such huge datasets. We can increase the limit of a single machine to a finite extent (Vertical scaling). It becomes costly and unreliable to use the single device after some point. This is when it becomes necessary to partition data across many machines. Such file storage systems are known as distributed file systems.
HDFS stands for Hadoop distributed filesystem. It is designed to store and process huge datasets reliable, fault-tolerant and in a cost-effective manner. HDFS helps Hadoop to achieve these features. In this article, we are going to take 1000 foot overview of HDFS and what makes it better than other distributed filesystems.
The Design Goals of HDFS
The Hadoop distributed filesystem has significant advantages over other distributed filesystems. Following are design goals which make it suitable for storing and processing massive datasets.
Hadoop distributed filesystem is designed to run on commonly available hardware. This makes HDFS much more cost-efficient and easy to scale out.
Write once read many times
Hadoop distributed filesystem is designed for datasets which are written once and read many times. After a file is written to HDFS, it should not be changed. This assumption enables high throughput data access and also works well with Map-reduce model.
Recovering Hardware Failures
Hardware failure is the norm rather than the exception. As we are using commodity hardware for HDFS, chances of hardware failure are high. But HDFS is designed in a way that it can recover from such failures quickly and maintain high availability. This is achieved by data replication. We are going to learn about it in the next tutorials.
HDFS is used for large datasets in the range of gigabytes or terabytes or even petabytes. It works well with the small number of large files.
Streaming Data Access
When we are performing analysis on HDFS data, it involves a large proportion, if not all, the dataset. So the time to read the whole dataset is more important than latency in reading the first record in case of Hadoop distributed filesystem. That is why HDFS focuses on high throughput data access than low latency.
Limitations of HDFS
Hadoop distributed filesystem works well for many of large datasets as the distributed filesystem. But we should know there are some limitations of HDFS which makes it a bad fit for some applications.
Low latency data access
Applications that require low latency data access, in range of milliseconds will not work well with HDFS. It is designed to provide high throughput at the expense of low latency.
Lots of small files
Namenode holds data about file location in HDFS cluster. If there are too many files, Namenode will not have enough memory to store such metadata about each file. We will learn about HDFS architecture in the next tutorial.
Arbitrary data modification
Hadoop distributed file system does not support updating data once it is written. We can append data to end of files but modifying arbitrary data is not possible in HDFS. Hence application which needs data to be changed will not work well with HDFS.
We have taken the overview of HDFS and what makes it efficient than other distributed filesystems. We have also seen some limitations to it. We will start going in details of HDFS in the next tutorials. You can read about HDFS architecture in this post.
I am passionate about data analytics, machine learning, and artificial intelligence. Recently I have started blogging about my experience while learning these exciting technologies.