Dell EMC Hadoop Big Data Solution


  Hadoop Architecture and Cluster Deployment


 

Apache Hadoop is an open-source software framework for distributed storage and processing of very large data sets. It runs on computer clusters built from commodity hardware. All Hadoop modules are designed on the fundamental assumption that hardware failures are common and should be handled automatically by the framework.

 

The core of Apache Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS), and a processing part, MapReduce. Hadoop splits files into large blocks and distributes them across the nodes in a cluster. It then ships packaged code to those nodes so that they process the data in parallel. This approach takes advantage of data locality (each node works on the data it holds locally), which lets the dataset be processed faster and more efficiently than in a more conventional supercomputer architecture that relies on a parallel file system, where computation and data are distributed via high-speed networking.
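To make the block-splitting step concrete, here is a small sketch of the arithmetic. The 128 MiB block size and the replication factor of 3 are assumptions (they are the usual HDFS defaults in recent releases, but both are configurable per cluster), and the function name is purely illustrative.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size (128 MiB); configurable
REPLICATION = 3                 # assumed replication factor; also configurable


def block_layout(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, total block replicas stored) for one file."""
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    # Integer ceiling division: a partial final block still counts as a block.
    blocks = (file_size_bytes + block_size - 1) // block_size
    return blocks, blocks * replication


if __name__ == "__main__":
    blocks, replicas = block_layout(1024 ** 3)  # a 1 GiB file
    print(f"1 GiB file: {blocks} blocks, {replicas} block replicas in the cluster")
```

Under these assumptions a 1 GiB file becomes 8 blocks, so up to 8 nodes can work on it in parallel, each reading a block it stores locally.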

 

The base Apache Hadoop framework is composed of the following modules:

 

  • Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
  • Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  • Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications; and
  • Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
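The MapReduce programming model mentioned in the last item can be illustrated with a word count. The following is a single-process sketch in plain Python, not Hadoop's Java API: the function names are made up for illustration, and a real job would run the map and reduce phases on many nodes with the framework doing the grouping in between.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data on Hadoop", "Hadoop stores big files"]))
```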

 

 

 

Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster or installing it via a packaging system as appropriate for your operating system. It is important to divide the hardware up by function, deciding which role each machine will play.

 

Typically, one machine in the cluster is designated as the NameNode and another machine as the ResourceManager, exclusively. These are the masters. Other services (such as the Web App Proxy Server and the MapReduce Job History Server) usually run either on dedicated hardware or on shared infrastructure, depending on the load. The rest of the machines in the cluster act as both DataNode and NodeManager. These are the slaves.

 

To configure the Hadoop cluster you will need to configure the environment in which the Hadoop daemons execute as well as the configuration parameters for the Hadoop daemons.

 

The HDFS daemons are NameNode, SecondaryNameNode, and DataNode. The YARN daemons are ResourceManager, NodeManager, and WebAppProxy. If MapReduce is to be used, the MapReduce Job History Server will also be running. In large installations, these generally run on separate hosts.

 

Configuring Environment of Hadoop Daemons

 

Administrators should use the etc/hadoop/hadoop-env.sh and optionally the etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh scripts to do site-specific customization of the Hadoop daemons’ process environment.

 

At the very least, you must specify the JAVA_HOME so that it is correctly defined on each remote node.
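One way to confirm that JAVA_HOME is actually set in a node's copy of hadoop-env.sh is a simple line-based check, sketched below. The function name and the sample JDK path are hypothetical, and the check only looks for a plain assignment; it does not evaluate shell expansions or trailing comments.

```python
import re

# Matches an uncommented "export JAVA_HOME=..." or "JAVA_HOME=..." line.
_JAVA_HOME_LINE = re.compile(r"^\s*(?:export\s+)?JAVA_HOME=(.+?)\s*$")


def java_home_from_env_script(script_text):
    """Return the last JAVA_HOME value assigned in the script text, or None."""
    value = None
    for line in script_text.splitlines():
        match = _JAVA_HOME_LINE.match(line)
        if match:
            # Drop surrounding quotes, if any.
            value = match.group(1).strip("\"'")
    return value


if __name__ == "__main__":
    # Placeholder content; on a real node, read etc/hadoop/hadoop-env.sh instead.
    sample = "# export JAVA_HOME=/old/jdk\nexport JAVA_HOME=/usr/lib/jvm/example-jdk\n"
    print(java_home_from_env_script(sample))
```

Running the same check against every node's copy of the file is a cheap way to catch a host where the variable was left commented out.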

 

Administrators can also configure individual daemons through additional options in these scripts.

 

Configuring the Hadoop Daemons

 

The following configuration should be performed by administrators:

  • Configuring etc/hadoop/core-site.xml and etc/hadoop/hdfs-site.xml for some important parameters.
  • Configuring NameNode and DataNode
  • Configuring ResourceManager and NodeManager
  • Configuring History Server
  • Configuring MapReduce Applications
  • Configuring MapReduce JobHistory Server
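Hadoop's site files share one simple layout: a configuration element containing property entries, each with a name and a value. As an illustration, the sketch below renders such a file from a Python dictionary. The helper name is made up, and the host name and port in the example are placeholders rather than values from this article; fs.defaultFS is the core-site.xml property that points clients at the NameNode.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of name -> value as a Hadoop *-site.xml document string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode") + "\n"


if __name__ == "__main__":
    # Placeholder NameNode address; substitute the real host and port.
    print(render_site_xml({"fs.defaultFS": "hdfs://namenode.example.com:8020"}))
```

Generating the files from one source of truth makes it easier to keep every node's configuration identical, which matters once the files are distributed across the cluster.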

 

Monitoring Health of NodeManagers

 

Hadoop provides a mechanism by which administrators can configure the NodeManager to run an administrator-supplied script periodically to determine whether a node is healthy.

 

Administrators can determine whether the node is in a healthy state by performing any checks of their choice in the script. If the script detects that the node is unhealthy, it must print a line to standard output beginning with the string ERROR. The NodeManager spawns the script periodically and checks its output. If the script’s output contains the string ERROR, as described above, the node’s status is reported as unhealthy and the node is blacklisted by the ResourceManager. No further tasks will be assigned to this node. However, the NodeManager continues to run the script, so if the node becomes healthy again, it is removed from the ResourceManager’s blacklist automatically. The node’s health, along with the output of the script if it is unhealthy, is available to the administrator in the ResourceManager web interface. The time since the node was last healthy is also displayed on the web interface.
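A health script following the contract above could look like the sketch below. The specific check (how full the root filesystem is), the 90% threshold, and the function name are all arbitrary choices for illustration; the only part taken from the text is that an unhealthy result is reported as a line beginning with ERROR. Such a script is typically registered through the yarn.nodemanager.health-checker.script.path property in yarn-site.xml.

```python
#!/usr/bin/env python3
import shutil


def disk_health_lines(path, used, total, max_used_fraction=0.90):
    """Return the lines to print for one filesystem; empty when it looks healthy."""
    if total <= 0:
        return [f"ERROR cannot determine size of {path}"]
    used_fraction = used / total
    if used_fraction > max_used_fraction:
        # A line beginning with ERROR marks the node as unhealthy.
        return [f"ERROR disk {path} is {used_fraction:.0%} full"]
    return []


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    for line in disk_health_lines("/", usage.used, usage.total):
        print(line)
```

Keeping the decision in a pure function makes the script easy to test off-cluster before the NodeManager starts acting on its output.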

 

Hadoop Rack Awareness

 

Many Hadoop components are rack-aware and take advantage of the network topology for performance and safety. Hadoop daemons obtain the rack information of the slaves in the cluster by invoking an administrator-configured module. See the Rack Awareness documentation for more specific information.

 

Configuring rack awareness before starting HDFS is highly recommended.
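One common form of the administrator-configured module is an external topology script, usually pointed to by the net.topology.script.file.name property in core-site.xml. Hadoop passes it one or more node addresses as arguments and reads back one rack path per address. The sketch below assumes that contract; the IP prefixes and rack names are invented for illustration and must be replaced with the site's real layout.

```python
#!/usr/bin/env python3
import sys

DEFAULT_RACK = "/default-rack"  # fallback for addresses not in the table

# Hypothetical mapping from IP prefix to rack path.
RACK_BY_PREFIX = {
    "10.1.1.": "/dc1/rack1",
    "10.1.2.": "/dc1/rack2",
}


def rack_for(host):
    """Return the rack path for one address, or the default rack if unknown."""
    for prefix, rack in RACK_BY_PREFIX.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK


def resolve(hosts):
    """Return one rack path per input address, in the same order."""
    return [rack_for(host) for host in hosts]


if __name__ == "__main__":
    # One rack path per argument, one per line.
    for rack in resolve(sys.argv[1:]):
        print(rack)
```

Returning a default rack for unknown addresses keeps a newly added node usable even before the table is updated, at the cost of less accurate block placement for that node.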

 

Logging

 

Hadoop uses Apache log4j via the Apache Commons Logging framework for logging. Edit the etc/hadoop/log4j.properties file to customize the Hadoop daemons’ logging configuration (log formats and so on).

 

Operating the Hadoop Cluster

 

Once all the necessary configuration is complete, distribute the files to the HADOOP_CONF_DIR directory on all the machines. This should be the same directory on all machines.

 

In general, it is recommended that HDFS and YARN run as separate users. In the majority of installations, HDFS processes execute as ‘hdfs’, and YARN typically uses the ‘yarn’ account.

 

Hadoop Startup


Start the HDFS NameNode with the following command on the designated node as hdfs:



Start an HDFS DataNode with the following command on each designated node as hdfs:



Start YARN with the following command, run on the designated ResourceManager as yarn:
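The exact start and stop commands depend on the Hadoop release. As a rough sketch, assuming the Hadoop 3.x launchers (bin/hdfs and bin/yarn with the --daemon option; Hadoop 2.x uses the sbin/hadoop-daemon.sh and sbin/yarn-daemon.sh scripts instead), the command lines for the startup steps, and for the matching shutdown steps, can be built as follows. The /opt/hadoop install path and the helper name are hypothetical, and the sketch only builds the command; running it under the hdfs or yarn account is left to the operator.

```python
# Assumed Hadoop 3.x layout: "hdfs" and "yarn" launchers under <hadoop_home>/bin.
LAUNCHER_BY_DAEMON = {
    "namenode": "hdfs",
    "datanode": "hdfs",
    "resourcemanager": "yarn",
    "nodemanager": "yarn",
}


def daemon_command(action, daemon, hadoop_home="/opt/hadoop"):
    """Build the argument list that starts or stops one daemon on the local node."""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action}")
    if daemon not in LAUNCHER_BY_DAEMON:
        raise ValueError(f"unknown daemon: {daemon}")
    launcher = LAUNCHER_BY_DAEMON[daemon]
    return [f"{hadoop_home}/bin/{launcher}", "--daemon", action, daemon]


if __name__ == "__main__":
    for daemon in ("namenode", "datanode", "resourcemanager"):
        print(" ".join(daemon_command("start", daemon)))
```

Passing "stop" instead of "start" yields the corresponding shutdown command for the same daemon.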



Hadoop Shutdown

 

Stop the NameNode with the following command, run on the designated NameNode as hdfs:



Stop the ResourceManager with the following command, run on the designated ResourceManager as yarn:



Run a script to stop a DataNode as hdfs:

 

 

For more detailed help, please contact a Dell EMC expert.

 

 

Resources

Welcome to The Apache Software Foundation!

 

 

 
