Hadoop Cluster Automation with Ansible!!

Himanshi Kabra
Dec 10, 2020

What is Ansible?

Ansible is a configuration management tool built on top of Python. It provides a way to automate routine work: in simple terms, you can run the tasks you need on remote machines with a single command. You describe those tasks in a playbook, and running the playbook carries them out.

What is Hadoop?

Apache Hadoop is software that works on a master-slave topology. It addresses the problems of big data by distributing storage across its slave nodes. The master is known as the NameNode and the slaves are known as DataNodes.

1. NameNode❕❕

The NameNode is the centerpiece of an HDFS file system in Hadoop. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself. The NameNode is a single point of failure for the HDFS cluster.

2. DataNode❕❕

DataNodes store the data in a Hadoop cluster; DataNode is also the name of the daemon that manages that data. File data is replicated on multiple DataNodes for reliability and so that localized computation can be executed near the data. Within a cluster, DataNodes should be uniform.

3. Client❕❕

A client in Hadoop is the interface used to communicate with the Hadoop filesystem. Different types of clients are available for different tasks. The basic filesystem client, hdfs dfs, is used to connect to a Hadoop filesystem and perform basic file-related tasks.

How Ansible Works❓❓

Ansible works by connecting to your nodes and pushing out small programs, called “Ansible modules”, to them. … Ansible then executes these modules and removes them when finished. Your library of modules can reside on any machine, and there are no servers, daemons, or databases required. Ansible uses playbooks to carry out multiple management tasks. An Ansible playbook contains one or more plays, each of which defines the work to be done for a configuration on a managed server. Ansible playbooks are written in YAML.
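To make the play and task structure concrete, here is a minimal sketch of a playbook with a single play and two tasks. It is only an illustration and not part of the Hadoop setup; it assumes nothing more than an inventory with at least one reachable host.

```yaml
# Minimal playbook sketch: one play containing two tasks.
- name: Check connectivity to the managed nodes
  hosts: all
  tasks:
    # ping confirms that Ansible can log in and run a module on the node
    - name: Verify that Ansible can reach the node
      ansible.builtin.ping:

    # Facts are gathered by default at the start of a play
    - name: Show the distribution each node reports
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} runs {{ ansible_facts['distribution'] }}"
```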

############################################################

Task Description: Configure Hadoop and start the cluster services using an Ansible playbook.

In this task we are going to configure the Hadoop NameNode and DataNode on our managed nodes using an Ansible playbook that runs on our controller node.

This is our inventory file, where we have grouped our managed nodes into three groups: namenode, datanode and Client.

Here is one of our managed nodes, where we can see that Hadoop is not installed yet.

In our playbook we can see that the JDK and Hadoop packages are copied to all managed nodes and installed there. Here we are using an OS-specific command for Red Hat systems.
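An inventory with these three groups could look roughly like the sketch below. It is written in Ansible's YAML inventory format (the INI format works just as well), and the file name, the IP addresses and the login user are placeholders assumed for illustration, not the values used in this setup.

```yaml
# inventory.yml (hypothetical file name)
# The addresses are documentation-range placeholders; use your own nodes.
all:
  children:
    namenode:
      hosts:
        192.0.2.10:
    datanode:
      hosts:
        192.0.2.11:
        192.0.2.12:
    Client:
      hosts:
        192.0.2.20:
  vars:
    ansible_user: root   # assumption: the nodes are managed as root
```

Running ansible-inventory -i inventory.yml --graph prints the groups and their hosts, which is a quick way to check the grouping before running any playbook.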
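As a rough sketch, the copy and install step might be written as follows. The RPM file names, the /root destination and the paths given to creates are assumptions made for illustration; replace them with the packages and paths you actually use.

```yaml
- name: Install the JDK and Hadoop on every managed node
  hosts: all
  become: true
  vars:
    # Hypothetical file names, expected next to the playbook on the controller
    jdk_rpm: jdk-example.rpm
    hadoop_rpm: hadoop-example.rpm
  tasks:
    - name: Copy both packages from the controller node
      ansible.builtin.copy:
        src: "{{ item }}"
        dest: "/root/{{ item }}"
        mode: "0644"
      loop:
        - "{{ jdk_rpm }}"
        - "{{ hadoop_rpm }}"

    # rpm is the OS-specific part: it only exists on Red Hat family systems
    - name: Install the JDK
      ansible.builtin.command:
        cmd: "rpm -i /root/{{ jdk_rpm }}"
        creates: /usr/java        # assumed install path; skips the task on reruns

    - name: Install Hadoop
      ansible.builtin.command:
        cmd: "rpm -i /root/{{ hadoop_rpm }}"
        creates: /usr/bin/hadoop  # assumed install path; skips the task on reruns
```

The creates argument is there so that a second run of the playbook does not try to install packages that are already present.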

NAMENODE,DATANODE,CLIENT

• Our next step is to set up the NameNode on one of our managed nodes. For this we have to create an hdfs-site.xml file, with the NameNode directory being created dynamically.

Similarly, we create an hdfs-site.xml file for the DataNode, and then copy both files to their respective managed nodes using the template module.

Next, we have to create the core-site.xml file, which gets the IP of the NameNode dynamically from the inventory groups.

• Ansible playbook for configuring the files:

NAMENODE

DATANODE

CLIENT
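The configuration plays could be sketched along these lines. To keep the example self-contained it writes the XML inline with the copy module's content parameter instead of using the template module with separate template files; both render the Jinja2 variables. The directories, the port and the configuration path are assumptions, and the property names are those of Hadoop 2 and later (Hadoop 1.x uses dfs.name.dir, dfs.data.dir and fs.default.name).

```yaml
- name: Configure the NameNode
  hosts: namenode
  become: true
  vars:
    hadoop_conf_dir: /etc/hadoop   # assumption: depends on how Hadoop was installed
    namenode_dir: /nn              # hypothetical metadata directory
  tasks:
    - name: Create the NameNode directory
      ansible.builtin.file:
        path: "{{ namenode_dir }}"
        state: directory
        mode: "0755"

    - name: Write hdfs-site.xml for the NameNode
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/hdfs-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.namenode.name.dir</name>
              <value>{{ namenode_dir }}</value>
            </property>
          </configuration>

- name: Configure the DataNodes
  hosts: datanode
  become: true
  vars:
    hadoop_conf_dir: /etc/hadoop   # assumption
    datanode_dir: /dn              # hypothetical storage directory
  tasks:
    - name: Create the DataNode directory
      ansible.builtin.file:
        path: "{{ datanode_dir }}"
        state: directory
        mode: "0755"

    - name: Write hdfs-site.xml for the DataNode
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/hdfs-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.datanode.data.dir</name>
              <value>{{ datanode_dir }}</value>
            </property>
          </configuration>

- name: Point every node at the NameNode
  hosts: all
  become: true
  vars:
    hadoop_conf_dir: /etc/hadoop   # assumption
    namenode_port: 9001            # hypothetical port
  tasks:
    # groups['namenode'][0] is the first host of the namenode group; this
    # assumes the inventory lists hosts by IP address or resolvable name
    - name: Write core-site.xml
      ansible.builtin.copy:
        dest: "{{ hadoop_conf_dir }}/core-site.xml"
        mode: "0644"
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.defaultFS</name>
              <value>hdfs://{{ groups['namenode'][0] }}:{{ namenode_port }}</value>
            </property>
          </configuration>
```

Because the NameNode address comes from the inventory group, moving the NameNode to another machine only requires changing the inventory, not the configuration files.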

In our next tasks we format the NameNode and then start the NameNode and DataNode services.

• Run the playbook with the following command:
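A sketch of these tasks is shown below. It assumes Hadoop 3 command syntax with hdfs available on the PATH (on Hadoop 1.x or 2.x the daemons are started with hadoop-daemon.sh start namenode and hadoop-daemon.sh start datanode instead), and /nn is a hypothetical directory that must match the one configured in hdfs-site.xml.

```yaml
- name: Format and start the NameNode
  hosts: namenode
  become: true
  vars:
    namenode_dir: /nn   # hypothetical; must match dfs.namenode.name.dir
  tasks:
    # Formatting wipes the NameNode metadata, so skip it when the
    # directory already contains a formatted file system
    - name: Format the NameNode on the first run only
      ansible.builtin.command:
        cmd: hdfs namenode -format -nonInteractive
        creates: "{{ namenode_dir }}/current/VERSION"

    # Note: this task fails if the daemon is already running
    - name: Start the NameNode daemon
      ansible.builtin.command:
        cmd: hdfs --daemon start namenode

- name: Start the DataNodes
  hosts: datanode
  become: true
  tasks:
    # Note: this task fails if the daemon is already running
    - name: Start the DataNode daemon
      ansible.builtin.command:
        cmd: hdfs --daemon start datanode
```

The NameNode play comes first so that the DataNodes find a running NameNode to register with when they start.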

# ansible-playbook hadoop.yml

• After running this playbook, we see that both packages have been copied to the managed nodes.

• With the # hadoop version command we can see that Hadoop is now installed.
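The same check can also be run from the controller node instead of logging in to each machine. This is a small sketch that assumes the hadoop command is on the PATH of the managed nodes.

```yaml
- name: Verify the Hadoop installation
  hosts: all
  tasks:
    - name: Run hadoop version
      ansible.builtin.command:
        cmd: hadoop version
      register: hadoop_version
      changed_when: false   # a read-only check never changes the node

    - name: Print the first line of the output
      ansible.builtin.debug:
        msg: "{{ hadoop_version.stdout_lines | first }}"
```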
