Introduction

 

Monitoring the resources in a Hadoop cluster is first and foremost on the Hadoop admin's mind. This post will go into how we can achieve that without having to rely on the storage team. The basis of this solution is the widespread use of Grafana for monitoring large-scale architectures. Grafana was chosen by Hortonworks to augment Ambari's capabilities, and the majority of this blog is focused on Ambari. However, this setup will work just fine for Cloudera distributions; just skip the Ambari pieces below. This project is based on the original work of the Isilon Data Insights Connector package released last year and available at https://github.com/Isilon/isilon_data_insights_connector. Please review this package and its installation requirements, notably Grafana v3.1.1 or higher (I finished my testing at v4.2, the latest at the time) and InfluxDB v0.9+. I used OneFS version 8.0.1.1 in my testing because this is the version where we introduced support for Ambari Metrics, but this should work with all Isilon clusters at 8.0+.


Ambari Metrics

 

Begin with a proper configuration of Ambari Metrics for Isilon. I did a short video to walk through it, and our Engineering team wrote a very good blog post that is more comprehensive; that is a good place to start. It includes a good description of the differences in how Isilon works as NameNode/DataNode and what makes sense to monitor. In the following sections we are going to extend this from the Grafana side of things. Since Ambari Metrics is already configured to work with Grafana, let's build on that model and make Isilon statistics available to Grafana as well.

 

So if you like the single-pane-of-glass, high-level view provided by the Ambari Metrics service, you can stop reading now. If however you are interested in things like NameNode atomic operations and Isilon cache performance, then let's get started!

 

Installation

 

All we're going to need is a CentOS VM with network access to the Isilon System zone. Installation follows this high-level plan:

  1. Create a role and user on Isilon to read the statistics.
  2. Upgrade the Grafana that is installed with Ambari.
  3. Import new Ambari dashboards.
  4. Install InfluxDB on the same node as the Ambari Metrics Collector.
  5. Install the Isilon Data Insights Connector package.
  6. Import the dashboards for Isilon.
  7. Review what these dashboards say.

RBAC

 

The first important piece of information you will need is a user account for the Isilon Platform API (PAPI), our RESTful programming interface, which the connector uses to pull stats off the Isilon cluster. In most of my travels I come across a clear division between the Hadoop admins and the Isilon admins, so we need to bridge that gap by leveraging Isilon's Role Based Access Control (RBAC). We will create a READ ONLY role on the Isilon cluster that has access to PAPI and the statistics subsystem, then assign this role to a particular user. I will be using a local Isilon user from the System zone; you may want to choose an existing user from a different authentication provider (LDAP or AD...). I will leave that decision to you and your corporate policies.

 

You will need to bring the following commands to the Isilon admin to have them create the account for you.

 

isi auth roles create perfmon

isi auth roles modify perfmon --add-priv-ro ISI_PRIV_LOGIN_PAPI

isi auth roles modify perfmon --add-priv-ro ISI_PRIV_STATISTICS
isi auth users create pmon --home-directory=/ifs/home/pmon --enabled=true --set-password

(The --set-password flag prompts for the password during user creation, so it is never visible on the command line or in the shell history. You will need this password later, so remember it.)

isi auth users modify pmon --password-expires false

isi auth roles modify perfmon --add-user pmon


Review:

isi auth roles view perfmon                                                            
       Name: perfmon
Description: -
    Members: pmon
Privileges
             ID: ISI_PRIV_LOGIN_PAPI
      Read Only: True

             ID: ISI_PRIV_STATISTICS
      Read Only: True


This user is the one we will use in the setup of the Isilon Data Insights Connector. You only need this one instance of the connector, as it will update the InfluxDB database; other users will only need Grafana logins, which can be set up separately.
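Before handing the credentials over, you can sanity-check the account straight against PAPI. A minimal sketch, assuming your cluster management address is 10.111.158.70, PAPI on its default port 8080, and ifs.bytes.used as an example statistics key:

# Prompts for the pmon password; -k skips certificate checking,
# which is fine for a quick test against a self-signed cluster cert.
curl -k -u pmon 'https://10.111.158.70:8080/platform/1/statistics/current?key=ifs.bytes.used'

A JSON blob with the current value means the role and privileges are working.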


Upgrading Grafana

 

Grafana 2.6 is installed by default when you deploy HDP 2.5, so we will have to replace it with a current Grafana. We are going to do this on the node where the Ambari Metrics Collector is installed. It should be noted that this isn't strictly required; you could install it on a standalone server and run two Grafana instances, but that's too confusing for me, so I'm going the upgrade route. Notice that we do not remove the Grafana that was installed with HDP... feel free to explore this yourself, but I am always wary of messing with the distro *too* much. Just make sure this new Grafana server starts in place of the one installed with HDP.

 

The easiest way to do it is to stop the Grafana service in Ambari and install the new Grafana RPM. First check the local installation and make sure that you back up the config files, etc... just in case this all goes south ;-)

 

[root@nr-hdp-c3 ambari-metrics-grafana]# pwd
/var/log/ambari-metrics-grafana
[root@nr-hdp-c3 ambari-metrics-grafana]# tail grafana.log     
  [1]: default.paths.logs=/var/log/ambari-metrics-grafana
Paths:
  home: /usr/lib/ambari-metrics-grafana
  data: /var/lib/ambari-metrics-grafana
  logs: /var/log/ambari-metrics-grafana
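Based on those paths, a minimal backup might look like this (the /etc path below is the usual Ambari Metrics Grafana config location; verify it on your node before trusting it):

# Archive the Ambari-managed Grafana config and data before the upgrade.
tar czf /root/ams-grafana-backup.tgz \
    /etc/ambari-metrics-grafana \
    /var/lib/ambari-metrics-grafana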

 


You might have to install a few other packages.

yum install initscripts fontconfig


The RPM can be obtained and installed directly:

wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-4.2.0-1.x86_64.rpm

yum localinstall grafana-4.2.0-1.x86_64.rpm

 

/bin/systemctl daemon-reload
/bin/systemctl enable grafana-server.service

/bin/systemctl start grafana-server.service
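A quick way to confirm the upgrade took (3000 is Grafana's out-of-the-box port; Ambari may have configured a different one, so adjust as needed):

rpm -q grafana
curl -sI http://localhost:3000/login | head -1
# Expect the new version from rpm and an HTTP 200 from the login page.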


Ambari Dashboards


Since we have a new version of Grafana, we need new versions of the canned dashboards that are compatible. Fortunately for us they have already been ported and can be installed simply enough. See this post: https://grafana.com/plugins/praj-ams-datasource



InfluxDB


This is a time-series DB and quite useful in its own right. We are going to use it to house the Isilon statistics that we extract with the Data Insights package. The installation is pretty simple: create the repo file, install the package, check the config file, and fire it up.


[root@nr-hdp-c3]#  cat > /etc/yum.repos.d/influxdb.repo <<'EOF'
[influxdb]
name = InfluxDB Repository - RHEL $releasever
baseurl = https://repos.influxdata.com/rhel/$releasever/$basearch/stable
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdb.key
EOF

(Quoting the EOF delimiter keeps $releasever and $basearch literal in the written file, so no backslash escaping is needed.)

yum update

yum install influxdb


Influxdb uses port 8086 by default, so check it:


netstat -a -n | grep 8086

 

If it's in use, it will show up here. You just need to pick another port and update the config file, /etc/influxdb/influxdb.conf.
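For example, to move InfluxDB to 8087 you would change the bind-address in the [http] section (section and setting names here are from the stock InfluxDB config file; double-check them in your version):

# /etc/influxdb/influxdb.conf
[http]
  bind-address = ":8087"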


Once that's all good, fire it up:

service influxdb start
service influxdb status
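InfluxDB also answers a simple /ping health check over its HTTP port, so you can confirm it is up (adjust the port if you changed it above):

curl -sI http://localhost:8086/ping
# A healthy instance returns "HTTP/1.1 204 No Content".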


Installing the Data Insights Connector

 

Please review and follow as closely as possible the installation of the Data Insights package from the GitHub repository: https://github.com/Isilon/isilon_data_insights_connector

 

There are two new files there for Hadoop, and an update to the original example_isi_data_insights_d.cfg file to include the node-based network and disk statistics.

  1. grafana_hadoop_home.json - the dashboard you will use as a replacement for the HDFS Home dashboard from Ambari Metrics.
  2. grafana_hadoop_datanodes.json - the dashboard you will use for the DataNode dashboard from Ambari Metrics.

 

Let me point out that for the user name in the config file you will use the RBAC user we created above, pmon. So in the clusters section you would put that user here:

 

clusters: 

        pmon:emc@10.111.158.70:False
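For reference, my reading of the entry format from the comments in example_isi_data_insights_d.cfg is:

# clusters: [username:password@]address[:verify-ssl]
# So the line above logs in as pmon (password emc) to 10.111.158.70 and,
# with the trailing False, skips SSL certificate verification.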

 

Then check the next section:

 

active_stat_groups: cluster_cpu_stats
    cluster_network_traffic_stats
    cluster_client_activity_stats
    cluster_health_stats
    ifs_space_stats
    ifs_rate_stats
    node_load_stats
    node_disk_stats
    node_net_stats
    cluster_disk_rate_stats
    cluster_proto_stats
    cache_stats
    heat_total_stats

 

If those aren't already there, include them, and then add these two sections below:

[node_disk_stats]
update_interval: *
stats: node.disk.bytes.out.rate.avg
  node.disk.bytes.in.rate.avg
  node.disk.busy.avg
  node.disk.xfers.out.rate.avg
  node.disk.xfers.in.rate.avg
  node.disk.xfer.size.out.avg
  node.disk.xfer.size.in.avg
  node.disk.access.latency.avg
  node.disk.access.slow.avg
  node.disk.iosched.queue.avg
  node.disk.iosched.latency.avg

 

[node_net_stats]
update_interval: *
stats: node.net.int.bytes.in.rate
  node.net.int.bytes.out.rate
  node.net.ext.bytes.in.rate
  node.net.ext.bytes.out.rate
  node.net.int.errors.in.rate
  node.net.int.errors.out.rate
  node.net.ext.errors.in.rate
  node.net.ext.errors.out.rate

 

 

 

Then you are ready to run it:


     ./isi_data_insights_d.py start -c example_isi_data_insights_d.cfg
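The script runs as a daemon, so if you need to bounce it after a config change the same entry point should handle it (stop and restart are part of the daemon interface described in the GitHub README):

./isi_data_insights_d.py restart -c example_isi_data_insights_d.cfg
./isi_data_insights_d.py stop -c example_isi_data_insights_d.cfg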


Hopefully everything started up nicely. We can hop over to Grafana and customize things there. Follow the documentation in the GitHub repo above, and during the import make sure to add the two additional .json files.
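Before importing, it's worth confirming stats are actually landing in InfluxDB, and you can even wire up the Grafana data source from the command line instead of the UI. Both of these are sketches under assumptions: isi_data_insights is the database name used by the example config (check the InfluxDB settings in your .cfg), and admin:admin is Grafana's default login.

influx -execute 'SHOW DATABASES'
influx -database isi_data_insights -execute 'SHOW MEASUREMENTS' | head

# Create the InfluxDB data source the dashboards will query, the same
# thing the GitHub walkthrough does through the Grafana UI:
curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:3000/api/datasources -d '{
    "name": "influxdb",
    "type": "influxdb",
    "url": "http://localhost:8086",
    "access": "proxy",
    "database": "isi_data_insights"
  }'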



dashboards.png


After you import the dashboards they will show up along with the others. The two I've made are annotated with (Isilon). The four original dashboards are extremely well done, provide a VERY broad range of capabilities, and are targeted at the Isilon admin keeping up with the whole Isilon cluster and ALL protocols, true data lake telemetry.


HDFS - Home (Isilon)


The top part shows the network throughput (read and write) and the HDFS throughput, which will be a subset of the overall Isilon throughput. There are two filters: the Isilon cluster name, for customers lucky enough to have more than one, and the node number. Not all statistics change when filtering by node number (i.e. capacity), but the ones that do can be interesting, like load and cache. There are also some links to the Isilon documentation on the right side.


HDFS-home-1.png

This section has the open files and HDFS-only connections per node. Below that is the OneFS filesystem throughput, which is across *all* protocols, and the cluster capacity, which is across all Access Zones.

HDFS-home-2.png

These two graphs show the HDFS protocol broken out: by class as a percentage of the whole, and by the atomic operations as a number of ops.

HDFS-home-3.png

These are the L1/L2/L3 cache statistics for the cluster, very important to your overall performance. L3 cache sits on SSD, so only nodes with SSDs can use it; my virtual nodes do not, so you see it at 0. The second graph is the metadata version of the same statistics, broken out by node so I can see how well balanced my workflow is.

HDFS-home-5.png

HDFS-home-7.png


HDFS Datanodes (Isilon)


I made this dashboard to better represent the behavior on Isilon that you would want to review about your DataNodes, like disks and network.


HDFS-datanode-1.png

This section has some disk specifics showing the basics (reads/writes, latency); these should help you characterize what's going on at the disk level.

HDFS-datanode-2.png

This section might be the most useful... don't ask me why it's last... but network traffic is important in looking at performance. I can't tell you how many times Isilon performance issues turned out to be network issues. So a quick look at this every day should give you a warm, cozy feeling that everything is running as planned.


HDFS-datanode-3.png



Wrap-up


I hope this helps in your day-to-day activities for reviewing and monitoring your Isilon cluster. The magic of Grafana brings it all together nicely. Stop back periodically; I will be posting updates and any new dashboards I create or come across. Use the space below to share your experiences and any bugs or inconsistencies you find.