Hadoop Cloudera or Others

(An open-source software framework for storing data and running applications on clusters of commodity hardware)


The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.

Features of Hadoop

Support for Very Large Files:

Big data files can run to gigabytes, terabytes, or even petabytes, which makes them very difficult to process for analytics on a single machine. Hadoop spreads such files across the cluster and processes the pieces in parallel, so it can handle them comfortably and often produce results within a few hours.
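
To make the parallelism concrete, the short sketch below estimates how many pieces a large file is divided into. It assumes a block size of 128 MB, which is the HDFS default in Hadoop 2.x and later and is configurable per cluster; the function name count_blocks is purely illustrative and not part of any Hadoop API.

```python
import math

# Assumption: 128 MB blocks, the HDFS default in Hadoop 2.x and later.
BLOCK_SIZE_MB = 128


def count_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Return how many blocks are needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


one_tb_in_mb = 1024 * 1024          # 1 TB expressed in MB
blocks = count_blocks(one_tb_in_mb)
print(blocks)                        # 8192 blocks that can be processed in parallel
```

Each of these blocks can be handed to a different worker, which is why adding machines shortens the time needed to scan the whole file.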

Commodity Hardware: 

No high-end hardware or software is required to store and process such large files; low-end machines are sufficient. Hadoop does not use a client/server architecture. Instead, it uses a master/slave architecture in which a master node coordinates many inexpensive worker nodes.

Sequential File Accessing / Streaming Access:

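HDFS is built for streaming access: data is typically written once and then read sequentially many times, which favors high throughput over low-latency random reads. This makes it practical to scan and analyze large volumes of incoming data, such as data collected from a live feed, in bulk.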

Fault Tolerance: 

Hadoop is resilient to failure, and this is one of its key advantages. When data is written to an individual node, it is also replicated to other nodes in the cluster, so if one node fails, another copy is still available for use.
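
The effect of replication can be illustrated with a small simulation. This is a minimal sketch, not how HDFS itself is implemented: it assumes a replication factor of 3 (the HDFS default) and places replicas on randomly chosen nodes, whereas real HDFS uses a rack-aware placement policy. The node names, block names, and helper functions are all hypothetical.

```python
import random

# Assumption: 3 replicas per block, the HDFS default replication factor.
REPLICATION_FACTOR = 3


def place_replicas(block_ids, nodes, replication=REPLICATION_FACTOR, seed=0):
    """Assign each block to `replication` distinct nodes (random placement,
    a simplification of HDFS's rack-aware policy)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in block_ids}


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}


nodes = [f"node{i}" for i in range(1, 6)]          # hypothetical 5-node cluster
placement = place_replicas(["blk_1", "blk_2", "blk_3"], nodes)

# Two nodes fail; with 3 replicas on distinct nodes, every block survives.
survivors = readable_blocks(placement, failed_nodes=["node1", "node2"])
print(sorted(survivors))                            # ['blk_1', 'blk_2', 'blk_3']
```

With three copies on distinct nodes, a block becomes unreadable only if all three of its holders fail at the same time.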

High Availability: 

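Because Hadoop uses a master/slave architecture, the master (the NameNode in HDFS) would otherwise be a single point of failure. Since Hadoop 2, HDFS high availability addresses this with a standby master that is kept in sync with the active one and takes over if the active master fails, so processing continues with little interruption. This standby should not be confused with the Secondary NameNode, which only performs periodic checkpointing and does not act as a failover node.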


Scalability:

Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel. Extra nodes can be added whenever they are required, without modifying any existing components.

Apache Hadoop framework

Hadoop Common: contains libraries and utilities needed by other Hadoop modules

Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster

Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications

Hadoop MapReduce: a programming model for large-scale data processing
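
The MapReduce programming model can be sketched in a few lines of plain Python. The example below is only an in-memory illustration of the map, shuffle, and reduce steps using the classic word count. It does not use the Hadoop API, and the function names are chosen for this example. In a real cluster, the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because each map call depends only on its own line and each reduce call only on its own key, the work can be split across machines without the pieces needing to coordinate.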