
*Big Data* is a term for collections of data sets so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. The challenges include capturing, curating, storing, searching, sharing, transferring, analyzing and visualizing this data. Big Data is characterized by the 5 V's, namely Velocity, Volume, Veracity, Variety and Value.

*Importance of Big Data:*
Facebook revealed some striking big data statistics: its system processes 2.5 billion pieces of content and 500+ terabytes of data each day. It pulls in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data every half hour.
Big Data is thus emerging as an opportunity for organizations, which have realized the benefits it can bring. By examining large data sets, they can uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.
These analytical findings help organizations achieve more effective marketing, new revenue opportunities and better customer service, while also improving operational efficiency and gaining competitive advantages over rival organizations, among other business benefits.

*What is Hadoop?*
Hadoop is a framework that allows you to store Big Data in a distributed environment, so that you can process it in parallel. There are basically two components in Hadoop:
1. HDFS (Hadoop Distributed File System) for storage, which allows you to store data of various formats across a cluster (a short write example follows this list).
2. YARN for resource management in Hadoop, which allows parallel processing over the data stored across HDFS.
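To make the storage half concrete, here is a minimal sketch of writing a file into HDFS with Hadoop's Java FileSystem API. The NameNode URI and file path are placeholders; substitute your own cluster's values:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; use your cluster's fs.defaultFS value.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        // HDFS is write-once-read-many: create the file once, read it freely afterwards.
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```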
*Hadoop Architecture:*
The main components of HDFS are NameNode and DataNode.

NameNode
It is the master daemon that maintains and manages the DataNodes (slave nodes). It records the metadata of all the files stored in the cluster, e.g. location of blocks stored, the size of the files, permissions, hierarchy, etc. It records each and every change that takes place to the file system metadata.
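For illustration, a client can query the NameNode for exactly this metadata through the same FileSystem API. The sketch below (the path is a placeholder) prints which blocks make up a file and which machines hold them:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockMetadataExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/hello.txt"));

        // The NameNode answers this from its metadata alone: which blocks
        // make up the file and which DataNodes hold each replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```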
DataNode
These are the slave daemons that run on each slave machine. The actual data is stored on DataNodes. They are responsible for serving read and write requests from clients. They are also responsible for creating, deleting and replicating blocks based on the decisions taken by the NameNode.
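A matching read sketch (again with a placeholder path): when the client opens a file, the NameNode supplies the block locations, and the bytes themselves are then streamed from the DataNodes:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() consults the NameNode for block locations; the data itself
        // is served by the DataNodes that hold the replicas.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/demo/hello.txt")),
                        StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```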
For processing, we use YARN (Yet Another Resource Negotiator). The components of YARN are the ResourceManager and the NodeManager.
ResourceManager
It is a cluster-level component (one for each cluster) and runs on the master machine. It manages resources and schedules applications running on top of YARN.
NodeManager
It is a node-level component (one on each node) and runs on each slave machine. It is responsible for managing containers and monitoring resource utilization in each container. It also keeps track of node health and log management. It continuously communicates with the ResourceManager to remain up to date.
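As a small illustration of this heartbeat relationship, Hadoop's YarnClient API can ask the ResourceManager for the node reports it aggregates from the NodeManagers. This is a sketch, assuming a reachable ResourceManager is configured in your environment:

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodesExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // The ResourceManager builds these reports from the heartbeats it
        // receives from every NodeManager in the cluster.
        List<NodeReport> nodes = yarnClient.getNodeReports();
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " state=" + node.getNodeState()
                    + " used=" + node.getUsed()
                    + " capacity=" + node.getCapability());
        }

        yarnClient.stop();
    }
}
```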
Map-Reduce
MapReduce is a software framework that helps in writing applications which process large data sets using distributed and parallel algorithms inside the Hadoop environment.
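The canonical WordCount application, the standard example shipped with Hadoop, shows the shape of a MapReduce program: a map phase that runs in parallel over input splits and a reduce phase that aggregates the intermediate results. Input and output paths come from the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel on each input split, emitting (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receives all counts for one word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, it is submitted with something like `hadoop jar wordcount.jar WordCount <input dir> <output dir>`, and YARN schedules the map and reduce tasks across the cluster.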

*Hadoop as-a-solution:*

•HDFS provides a distributed way to store Big Data. Your data is stored in blocks across the DataNodes, and you can specify the size of the blocks (see the configuration sketch after this list).
•With HDFS you can store all kinds of data, whether structured, semi-structured or unstructured, since HDFS has no pre-dumping schema validation. HDFS also follows a write-once-read-many model: you write the data once and then read it multiple times to find insights.
•We move processing to data, not data to processing. What does this mean? Instead of moving data to the master node and processing it there, MapReduce sends the processing logic to the various slave nodes, and the data is processed in parallel across them. The processed results are then sent to the master node, where they are merged, and the response is sent back to the client.
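As a minimal sketch of the block-size point above, the client-side dfs.blocksize property sets the block size for files the client creates; the 256 MB value here is just an example (recent Hadoop versions default to 128 MB):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // dfs.blocksize applies to files created by this client:
        // 256 MB here instead of the common 128 MB default.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        fs.create(new Path("/user/demo/large-dataset.bin")).close();
    }
}
```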
*Where is Hadoop used?*
Hadoop is used for:
*Search – Yahoo, Amazon
*Log processing – Facebook, Yahoo
*Data Warehouse – Facebook, AOL
