Hadoop Core Components: HDFS and MapReduce

Let’s talk about the components of Hadoop. The Hadoop distribution provides two core components: HDFS (Hadoop Distributed File System, the storage component) and MapReduce (a distributed batch-processing framework, the processing component). A group of machines running HDFS and MapReduce is known as a Hadoop cluster.

As you add more nodes to a Hadoop cluster, the capacity and performance of the cluster increase, which means that Hadoop is horizontally scalable.

HDFS – Hadoop Distributed File System (Storage Component)

HDFS is a distributed file system which stores data in a distributed manner. Rather than storing a complete file, it divides the file into small blocks (of 64 MB or 128 MB, depending on the configuration) and distributes them across the cluster. Each block is replicated multiple times (3 times as per the default replication factor) and stored on different nodes to ensure data availability in case of node failure. HDFS is normally installed on top of a native file system such as xfs, ext3 or ext4.
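If you want to verify these settings on a running cluster, the commands below are a minimal sketch: hdfs getconf prints the configured block size (the key dfs.blocksize assumes Hadoop 2.x; older releases used dfs.block.size), and hdfs dfs -setrep changes the replication factor of an existing file. The path /data/sample.txt is only a hypothetical example.

$ hdfs getconf -confKey dfs.blocksize
$ hdfs dfs -setrep -w 2 /data/sample.txt    # /data/sample.txt is a hypothetical file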

You can write files to and read files from HDFS, but you cannot update a file in place. Recent Hadoop releases added support for appending content to an existing file, which was not available in earlier releases.
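If your Hadoop version supports append, content can be added to the end of an existing file with appendToFile; a hedged example, where both the local file notes.txt and the HDFS path are hypothetical:

$ hdfs dfs -appendToFile ./notes.txt /data/sample.txt    # notes.txt and the HDFS path are hypothetical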

Here are some examples of HDFS commands.

List the contents of the HDFS directory /data/

$ hdfs dfs -ls /data/

Create a directory on HDFS under the /data directory

$ hdfs dfs -mkdir /data/hadoopdevopsconsulting

Copy a file from the current local directory to the HDFS directory /data/hadoopdevopsconsulting

$ hdfs dfs -copyFromLocal ./readme.txt /data/hadoopdevopsconsulting

View the content of a file in the HDFS directory /data/hadoopdevopsconsulting

$ hdfs dfs -cat /data/hadoopdevopsconsulting/readme.txt

Delete a file or directory under the HDFS directory /data/hadoopdevopsconsulting

$ hdfs dfs -rm /data/hadoopdevopsconsulting/readme.txt

Examples:

[admin@hadoopdevopsconsulting ~]# hdfs dfs -ls /data/
Found 1 items
drwxr-xr-x - admin supergroup 0 2016-08-29 11:46 /data/ABC
[admin@hadoopdevopsconsulting ~]# hdfs dfs -mkdir /data/hadoopdevopsconsulting
[admin@hadoopdevopsconsulting ~]# hdfs dfs -ls /data/
Found 2 items
drwxr-xr-x - admin supergroup 0 2016-08-29 11:46 /data/ABC
drwxr-xr-x - admin supergroup 0 2016-08-29 11:54 /data/hadoopdevopsconsulting
[admin@hadoopdevopsconsulting ~]# hdfs dfs -copyFromLocal readme.txt /data/hadoopdevopsconsulting/
[admin@hadoopdevopsconsulting ~]# hdfs dfs -ls /data/hadoopdevopsconsulting/
Found 1 items
-rw-r--r-- 2 admin supergroup 57 2016-08-29 11:54 /data/hadoopdevopsconsulting/readme.txt
[admin@hadoopdevopsconsulting ~]# hdfs dfs -cat /data/hadoopdevopsconsulting/readme.txt
Hi This is hadoop command demo by hadoopdevopsconsulting
[admin@hadoopdevopsconsulting ~]# hdfs dfs -rm /data/hadoopdevopsconsulting/readme.txt
16/08/29 11:55:25 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /data/hadoopdevopsconsulting/readme.txt
[admin@hadoopdevopsconsulting ~]# hdfs dfs -ls /data/hadoopdevopsconsulting/
[admin@hadoopdevopsconsulting ~]#

(Figure: data block distribution in HDFS)

MapReduce – Distributed Batch Processing Framework (Processing Component)

MapReduce is a programming model for executing tasks on a distributed system. Using MapReduce you can process a large file in parallel: the framework runs tasks on different nodes (slaves), since the full file is already distributed across the cluster in the form of blocks.

It has two phases, Map (mapper task) and Reduce (reducer task); a runnable example follows the list below.

  • Each of these tasks runs on individual portions (blocks) of the data
  • First, a mapper task takes each input record (typically a line of the file) as input and generates intermediate key-value pairs
  • Each mapper task is executed on a single block of data
  • Then a reducer task takes the list of values for each key, processes the data and generates the final output
  • A phase called shuffle and sort takes place between the mapper and reducer tasks; it sends each mapper’s output to the proper reducer task
  • The shuffle process groups the mapper outputs that share a key into a single collection of values
    • For example, (key1, val1) and (key1, val2) will be converted to (key1, [val1, val2])
  • The mapper tasks run in parallel, and so do the reducer tasks
  • The reducer tasks can start their work as soon as the mapper tasks are completed
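To try MapReduce without writing any code, you can run the word-count job that ships with Hadoop’s examples jar. This is a minimal sketch: the jar location below is an assumption and varies by distribution and version, and the output directory must not already exist.

$ hdfs dfs -mkdir -p /data/wordcount/input
$ hdfs dfs -copyFromLocal ./readme.txt /data/wordcount/input/
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /data/wordcount/input /data/wordcount/output    # jar path varies by distribution
$ hdfs dfs -cat /data/wordcount/output/part-r-00000

Each mapper emits (word, 1) pairs for its block of the input, the shuffle groups the pairs by word, and each reducer sums the counts for the words assigned to it.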

(Figure: MapReduce)

Hadoop Core Concepts

Let’s talk about Hadoop core concepts.

  • Distributed system design
  • How is data distributed across multiple systems?
  • Which components are involved and how do they communicate with each other?

Let’s dive deep and learn about the core concepts of Hadoop and its architecture.

  • The Hadoop architecture distributes data across the cluster nodes by splitting it into small blocks (64 MB or 128 MB, depending on the configuration). Every node tries to work on the blocks stored locally (data locality), so most of the time no data transfer is required during processing.
  • Whenever Hadoop processes data, each node communicates with other nodes as little as possible; most of the time a node only deals with the data stored locally. This concept is known as “Data Locality”.
  • Since the data is distributed across the nodes, each block is replicated over different nodes (as per the configured replication factor) to increase data availability. This helps Hadoop handle partial failures in the cluster (see the fsck example after this list).
  • Whenever a MapReduce job (typically consisting of map tasks and reduce tasks) is executed, the map tasks are executed on individual data blocks on each node (in most cases) and leverage “Data Locality”. This is how multiple nodes process data in parallel, which makes processing faster than traditional distributed systems (because of parallel processing).
  • If any node fails in between, the master detects the failure and assigns the same task to another node where a replica of the same data block is available.
  • If a failed node restarts, it automatically rejoins the cluster, and the master can assign it new tasks whenever required.
  • During the execution of a job, if the master detects a task running slower on one node compared to the others, it launches a redundant copy of the same task on another node to make the overall execution faster. This process is known as “Speculative Execution”.
  • Hadoop jobs are written in a high-level language such as Java or Python, and the developer does not have to deal with network programming or component-level programming. You just focus on the core logic; everything else is taken care of by the Hadoop framework itself.
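You can see block placement and replication for yourself with hdfs fsck; a minimal sketch, where /data/sample.txt is again a hypothetical file:

$ hdfs fsck /data/sample.txt -files -blocks -locations    # /data/sample.txt is a hypothetical file

Speculative execution is controlled by the job properties mapreduce.map.speculative and mapreduce.reduce.speculative (these names assume Hadoop 2.x, where they typically default to true).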

 

Why Hadoop?

Apache Hadoop solves several kinds of problems in the Big Data world. Before we get into Hadoop itself, it is necessary to understand the core problems of large-scale computation; after that, we will see how Hadoop solves these problems.

Pain points:

What do distributed systems require?

The following are some of the pain points in distributed computing.

Growing data size

Today many organizations are generating far more data than ever before. Organizations like Twitter and Facebook generate data at a rate of terabytes or even petabytes per day.

Pain point: How do we process this data?

Processing huge amounts of data

  • Our traditional model of programming and computation is designed to process relatively small amounts of data
  • Computation time is also highly dependent on CPU processing power and RAM
  • Whenever we require more processing power, we have to upgrade the CPU and RAM (which is not cost effective) to improve scalability (vertical scaling)

Pain point: How do we deal with the limits of vertical scalability?

Complexity in distributed processing

  • Data synchronization between machines
  • Data needs to be transferred to the compute nodes for computation, so more bandwidth is required (high network utilization)
  • Failure of one or more components

Pain point: How do we overcome the above limitations?

Hadoop provides a new model of distributed computing which addresses all of the above pain points.

Hadoop is an open source project which addresses the issues faced in traditional distributed computing. It is based on the concepts of the Google File System and MapReduce. The core idea of Hadoop is to divide data into small chunks and distribute them across all cluster nodes; when a computation is executed on the cluster, each node processes the chunks that reside locally on it. This approach is called “Data Locality”, and it reduces cluster bandwidth consumption by avoiding the data transfers across nodes that the traditional distributed processing model required.
The following are some of these concepts:

Horizontal Scalability:

We can dynamically add or remove compute nodes to scale the system for more computation and/or better performance, without causing a system failure.

Partial Failure/Node Failure:

Hadoop is capable of handling the failure of one or more nodes in the cluster. Such a failure decreases the compute capacity of the overall system but does not result in a full system failure.

Consistency:

The output of a computation is consistent whether all nodes are available or some nodes are down.

Recovery of compute nodes:

If a node fails and recovers after some time, Hadoop is able to add it back to the cluster without a full system restart.

 

Deploying OpenStack in a Lab

OpenStack is open source software for creating private and public clouds. It controls large pools of compute, storage, and networking resources throughout a datacenter, managed through a dashboard or via the OpenStack API.
This post describes installing the Liberty release on CentOS 7.2. The following are the steps.

1. If you are using a non-English locale, make sure your /etc/environment is populated:

LANG=en_US.utf-8
LC_ALL=en_US.utf-8

2. Network Settings

To have external network access for the server and the instances, this is a good moment to properly configure your network settings. Assigning a static IP address to your network card and disabling NetworkManager are good ideas (a sample static configuration follows the commands below).

$ sudo systemctl disable firewalld
$ sudo systemctl stop firewalld
$ sudo systemctl disable NetworkManager
$ sudo systemctl stop NetworkManager
$ sudo systemctl enable network
$ sudo systemctl start network
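If the server currently gets its address via DHCP, the sketch below shows one way to set a static address through an ifcfg file. The interface name eth0, the gateway and the DNS server are assumptions for this lab and must match your environment; the IP address is the one used later in /etc/hosts.

$ sudo tee /etc/sysconfig/network-scripts/ifcfg-eth0 <<'EOF'
# NOTE: eth0, GATEWAY and DNS1 are assumptions; adjust for your environment
TYPE=Ethernet
DEVICE=eth0
NAME=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.168.85
PREFIX=24
GATEWAY=192.168.168.1
DNS1=8.8.8.8
EOF
$ sudo systemctl restart network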

3. Check the CentOS release; it should be CentOS 7.2.

[root@openstack-lab ~]# cat /etc/redhat-release

CentOS Linux release 7.2.1511 (Core)

[root@openstack-lab ~]#

4. SELinux should be disabled or set to permissive mode.


You can switch SELinux to permissive mode with the following command:

$ setenforce 0
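Note that setenforce 0 only lasts until the next reboot; to make the change persistent you can edit /etc/selinux/config (the sed one-liner below is just one way to do it):

$ sudo sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config    # persists across reboots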

You can verify it with the following command:

[root@openstack-lab ~]# getenforce 
Permissive
[root@openstack-lab ~]# 

5. The host name should resolve, either through DNS or by adding an entry in /etc/hosts.

In my case I have added an entry in /etc/hosts:


[root@openstack-lab ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.168.85  openstack-lab
[root@openstack-lab ~]# ping openstack-lab
PING openstack-lab (192.168.168.85) 56(84) bytes of data.
64 bytes from openstack-lab (192.168.168.85): icmp_seq=1 ttl=64 time=0.046 ms
64 bytes from openstack-lab (192.168.168.85): icmp_seq=2 ttl=64 time=0.061 ms
^C
--- openstack-lab ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.046/0.053/0.061/0.010 ms
[root@openstack-lab ~]# 

Now you can run the following commands to install OpenStack:

$ sudo yum install -y centos-release-openstack-liberty
$ sudo yum update -y
$ sudo yum install -y openstack-packstack
$ packstack --allinone
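When packstack finishes, it typically writes the admin credentials to /root/keystonerc_admin (this path is the packstack default and may differ in your setup); you can read the admin password from there before logging in to the dashboard:

$ sudo cat /root/keystonerc_admin    # packstack default path; may differ in your setup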

If everything goes fine, you will be able to access your OpenStack dashboard.