Concept Of Parallelism In Hadoop Upload — A myth

Himanshi Kabra
3 min read · Sep 7, 2022

→ Does the Hadoop client upload data to the DataNodes serially or in parallel? Find out! ←

🔥Let’s research this, and let the world know about the myths of Hadoop🔥

TASK-DESCRIPTION:-

🔷According to popular articles, Hadoop uses the concept of parallelism to upload the split data while solving the Velocity problem.

👉🏻 Research with your team and confirm or refute this statement with proper proof

✴️Hint: tcpdump

*******************************************************************

>>tcpdump is one of the most powerful and widely used command-line packet analyzer tools. It is used to capture or filter TCP/IP packets received or transferred over a network on a specific interface. It also gives us the option to save captured packets to a file for later analysis.
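>>For example (a quick sketch; the interface name and the file name below are just placeholders), a capture can be written to a file with the -w option and read back later with -r —

# tcpdump -i eth0 -w upload_capture.pcap

# tcpdump -r upload_capture.pcap -n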

Let’s start to perform the task:-

  1. Create an account on AWS
  2. Launch four EC2 instances on AWS
  3. Configure one instance as the NameNode, one as the Client and the remaining two as DataNodes.
  4. Install the JDK and the Hadoop package on all instances
  5. Configure the “hdfs-site.xml” and “core-site.xml” files on both DataNodes and on the NameNode. (Reminder: there is no need to configure the “hdfs-site.xml” file on the Hadoop client; only configure “core-site.xml”. A minimal sample configuration is sketched just after step 9.)
  6. Format the NameNode
  7. Start the Hadoop daemon services on both DataNodes and on the NameNode, and verify them with the “jps” command
  8. Check the DataNodes available to the Hadoop cluster by using the command “# hadoop dfsadmin -report”
  9. The Hadoop client uploads a file to the Hadoop cluster by using the command:-

# hadoop fs -put <file_name> /
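>>For step 5, here is a minimal sketch of the configuration, assuming a Hadoop 1.x-style setup (the directory paths /nn and /dn, the port 9001 and the file location /etc/hadoop/ are assumptions/placeholders; adjust them to your own install) —

On the NameNode:

# cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
EOF

On each DataNode:

# cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
EOF

On all four nodes, including the Hadoop client (replace NameNode-IP with the NameNode's IP):

# cat > /etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://NameNode-IP:9001</value>
  </property>
</configuration>
EOF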

10. Check the file in the Hadoop cluster by using the command:-

# hadoop fs -ls /

11. While the file is uploading, RUN the tcpdump command on the NAMENODE and on both DATANODES:

>>First, install the tcpdump package —

# yum install tcpdump

>>Run the tcpdump command to check the packets transferred between the client, the master and the slaves —

# tcpdump -i eth0 -n -x

# tcpdump -i eth0 tcp port 22 -n

This shows which node is requesting and which one is replying. While running the above command on the NameNode, you will see that the Client requests the Master (or NameNode) for the IP-ADDRESSES of the DataNodes, since the Client is the one who uploads the data directly to the DataNodes, and the Master replies by sending network packets to the Client that contain the IP addresses of the DataNodes.
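>>For example, this client ↔ NameNode metadata exchange can be isolated by filtering on the NameNode's RPC port (9001 is assumed here, matching the core-site.xml sketch above; use whatever port your fs.default.name specifies) —

# tcpdump -i eth0 port 9001 -n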

>>To trace the DATA PACKETS (or data flow) → use port no. 50010 and run the command —

# tcpdump -i eth0 port 50010 -n -x

>>While running this command on both DATANODES and on the NAMENODE, you will see data packets arriving at the DataNodes. They arrive in the following manner: first, some packets are received by DataNode1, then the stream stops; after this, some packets are received by DataNode2, then it stops; then packets arrive at DN1 again, and when that stops, at DN2, and so on. This process continues until the whole file has been uploaded to the Hadoop cluster. This helps the data upload quickly. You can also check the timestamps on both slaves while the file is uploading; they will clearly differ, because the data is not being transferred in parallel.

DataNode-1

DataNode-2
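>>One simple way to compare the timings on the two slaves is to make tcpdump print a full date-and-time stamp for every packet (the -tttt option) and run it simultaneously on DataNode1 and DataNode2 —

# tcpdump -i eth0 port 50010 -n -tttt

Comparing the two captures side by side makes it easy to see whether the bursts of packets overlap (parallel) or alternate (serial).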

Thus, the flow of data packets from the CLIENT to the DATANODES is completely visible in serial order. So we can say that Hadoop uses the concept of “serialism”, not parallelism, to upload the split data while solving the Velocity problem.

HENCE, Proved.

Thank You!!!!!!
