Monitoring the resources in a Hadoop cluster is first and foremost on the Hadoop admin's mind. This post covers how we can achieve that without having to rely on the storage team. The basis of this solution is the widespread use of Grafana for monitoring large-scale architectures. Grafana was chosen by Hortonworks to augment Ambari's capabilities, and the majority of this blog is focused on Ambari; however, this setup will work just fine for Cloudera distributions, just skip the Ambari pieces below. This project is based on the original work of the Isilon Data Insights Connector package released last year and available at https://github.com/Isilon/isilon_data_insights_connector. Please review this package and its installation requirements, notably Grafana v3.1.1 or higher (4.2 is the latest, and that is the version I finished my testing on) and InfluxDB v0.9+. I did my testing on OneFS 8.0.1, since that is the version where we introduced support for Ambari metrics, but this should work with all Isilon clusters at 8.0+.
First and foremost, begin with a proper configuration of Ambari metrics for Isilon. I did a short video to walk through it, and our Engineering team wrote a very good blog that is more comprehensive; that is a good place to start. It has a good description of the differences in how Isilon works as NameNode/DataNode and what makes sense to monitor. In the following sections we are going to extend this from the Grafana side of things. Since Ambari Metrics is already configured to work with Grafana, let's build on that model and make Isilon statistics available to Grafana as well.
So if the single pane of glass and the high-level view provided by the Ambari Metrics service are all you need, you can stop reading now. If, however, you are interested in things like NameNode atomic operations and Isilon cache performance, then let's get started!
All we're going to need is a CentOS VM with network access to the Isilon System Zone. Installation follows this high-level plan:
- Create a role and user on Isilon to read the statistics.
- Upgrade the Grafana instance that is installed with Ambari.
- Import new Ambari dashboards.
- Install InfluxDB on the same node as the Ambari Metrics Collector.
- Install the Isilon Data Insights Connector package.
- Import the dashboards for Isilon.
- Review what these dashboards say.
The first important piece of information you will need is a user account for the Isilon Platform API (PAPI), our RESTful programming interface, which is responsible for pulling stats off the Isilon cluster. In most of my travels I come across a clear division between the Hadoop admins and the Isilon admins, so we need to bridge that gap by leveraging Isilon's Role-Based Access Control (RBAC). We will create a READ ONLY role on the Isilon cluster that has access to PAPI and the statistics subsystem, then assign this role to a particular user. I will be using a local Isilon user from the System zone; you may want to choose an existing user from a different authentication provider (LDAP or AD...). I will leave that decision to you and your corporate policies.
You will need to bring the following commands to the Isilon admin to have them create the account for you.
isi auth roles create perfmon
isi auth roles modify perfmon --add-priv-ro ISI_PRIV_LOGIN_PAPI
isi auth roles modify perfmon --add-priv-ro ISI_PRIV_STATISTICS
isi auth users create pmon --home-directory=/ifs/home/pmon --enabled=true --set-password
(The --set-password flag prompts for a password during user creation, so it is never visible on the command line or in the shell history. You will need this password later, so remember it.)
isi auth users modify pmon --password-expires false
isi auth roles modify perfmon --add-user pmon
isi auth roles view perfmon
             ID: ISI_PRIV_LOGIN_PAPI
      Read Only: True

             ID: ISI_PRIV_STATISTICS
      Read Only: True
This is the user we will use in the setup of the Isilon Data Insights Connector. You only need one instance of the connector, since it updates the InfluxDB database; other users will only need Grafana logins, which can be set up separately.
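Before handing those credentials to the connector, it's worth sanity-checking that the pmon account can actually reach PAPI. Here is a minimal Python sketch that builds an authenticated request for a single statistics key. The hostname, the default PAPI port 8080, the API version in the path, and the stat key cluster.protostats.hdfs are my assumptions for illustration; check them against your own cluster before relying on this.

```python
import base64
import urllib.request

def papi_stats_request(host, user, password, key):
    """Build a basic-auth request for one OneFS statistics key.
    Port 8080 and the /platform/1/statistics/current path are
    assumed defaults; adjust for your cluster."""
    url = f"https://{host}:8080/platform/1/statistics/current?key={key}"
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req

req = papi_stats_request("isilon.example.com", "pmon", "secret",
                         "cluster.protostats.hdfs")
print(req.get_full_url())
```

Open that request with urllib.request.urlopen() (or paste the URL into curl with -u pmon); a 200 with JSON back means the RBAC role is working, while a 401 means the role or password is wrong.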
Grafana 2.6 is installed by default when you deploy HDP 2.5, so we will have to replace it with a current Grafana. We are going to do this on the node where the Ambari Metrics Collector is installed. It should be noted that this isn't strictly required; you could install it on a standalone server and run two Grafana instances, but that's too confusing for me, so I'm going the upgrade route. Notice that we do not remove the Grafana that was installed with HDP. Feel free to explore this yourself, but I am always wary of messing with the distro *too* much; just make sure this Grafana server starts ahead of the one installed with HDP.
The easiest way to do it is to stop the Grafana service in Ambari and install the new Grafana RPM. First check the local installation and make sure that you back up the config files, etc., just in case this all goes south ;-)
[root@nr-hdp-c3 ambari-metrics-grafana]# pwd
[root@nr-hdp-c3 ambari-metrics-grafana]# tail grafana.log
You might have to install a few other packages.
yum install initscripts fontconfig
The RPM can be obtained from the Grafana download site, then installed locally:
yum localinstall grafana-4.2.0-1.x86_64.rpm
/bin/systemctl enable grafana-server.service
/bin/systemctl start grafana-server.service
Since we have a new version of Grafana, we need new versions of the canned dashboards that are compatible. Fortunately for us they are already ported and can be installed simply enough; see this post: https://grafana.com/plugins/praj-ams-datasource
Installing InfluxDB
InfluxDB is a time-series DB and quite useful in its own right. We are going to use it to house the Isilon statistics we extract with the Data Insights package. The installation is pretty simple: create the repo file, install the package, check the config file, and fire it up.
[root@nr-hdp-c3]# cat > /etc/yum.repos.d/influxdb.repo <<EOF
[influxdb]
name = InfluxDB Repository - RHEL \$releasever
baseurl = https://repos.influxdata.com/rhel/\$releasever/\$basearch/stable
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdb.key
EOF
yum install influxdb
InfluxDB uses port 8086 by default, so check it:
netstat -a -n | grep 8086
If it's in use it will show up here; you just need to pick another port and update the config file, /etc/influxdb/influxdb.conf
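If you'd rather script the port check than eyeball netstat output, this quick Python sketch (standard library only) reports whether anything is already listening on 8086:

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex((host, port)) == 0

# If this prints True, pick another port and update influxdb.conf
# before starting InfluxDB.
print(port_in_use(8086))
```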
Once that's all good, fire it up:
service influxdb start
service influxdb status
Installing Data Insights Connector
Please review and follow as closely as possible the installation of the Data Insights package from the GitHub https://github.com/Isilon/isilon_data_insights_connector
There are two new files there for Hadoop, and an update to the original example_isi_data_insights_d.cfg file to include the node-based network and disk statistics.
- grafana_hadoop_home.json - the dashboard you will use as a replacement for the HDFS Home dashboard from Ambari Metrics
- grafana_hadoop_datanodes.json - the dashboard you will use for the DataNode dashboard from Ambari Metrics.
Let me point out that for the username in the config file you will use the RBAC user we created above, pmon. So in the clusters section of the config file, that is the user you would put.
Then check the stats sections: if the node-level network and disk statistics aren't already there, include them, and then add the two new sections from the updated example config.
Then you are ready to run it:
./isi_data_insights_d.py start -c example_isi_data_insights_d.cfg
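At this point you can verify that points are actually landing in InfluxDB. A minimal sketch, assuming the connector's default database name isi_data_insights and InfluxDB's default HTTP port (adjust both if you changed them in the configs); it just builds the /query URL, which you can then fetch with curl or a browser:

```python
from urllib.parse import urlencode

def influx_query_url(db, q, host="localhost", port=8086):
    """Build an InfluxDB 0.9+ /query URL for a given database
    and InfluxQL statement."""
    return f"http://{host}:{port}/query?" + urlencode({"db": db, "q": q})

url = influx_query_url("isi_data_insights", "SHOW MEASUREMENTS")
print(url)
# Fetching this URL should return a JSON list of measurement
# names; an empty list means no stats have arrived yet.
```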
Hopefully everything started up nicely. Now we can hop over to Grafana and customize things there. Follow the documentation above at GitHub, and during the import make sure to add the additional two .json files.
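When you wire InfluxDB up as a Grafana datasource you can do it in the UI, or script it against Grafana's HTTP API (POST /api/datasources in Grafana 4.x). Here is a sketch of the payload; the datasource name and the database name are my examples, not required values:

```python
import json

def influx_datasource(name, influx_url, database):
    """Payload for Grafana's POST /api/datasources endpoint
    (field names per the Grafana 4.x HTTP API)."""
    return {
        "name": name,
        "type": "influxdb",
        "url": influx_url,
        "access": "proxy",   # Grafana server proxies the queries
        "database": database,
        "isDefault": False,
    }

payload = influx_datasource("isilon-influx",
                            "http://localhost:8086",
                            "isi_data_insights")
print(json.dumps(payload, indent=2))
```

POST that JSON to http://your-grafana:3000/api/datasources with admin credentials and the Isilon dashboards you import next will find their data.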
After you import the dashboards they will show up along with the others. The two I've made are annotated with (Isilon). The four original dashboards are extremely well done; they provide a VERY broad range of capabilities and are targeted at the Isilon admin keeping up with the whole Isilon cluster and ALL protocols: true data lake telemetry.
HDFS - Home (Isilon)
The top part shows the network throughput (read and write) and the HDFS throughput, which will be a subset of the overall Isilon throughput. There are two filters: the Isilon cluster name, for customers lucky enough to have more than one, and the node number. Not all statistics change when filtering by node number (e.g. capacity), but the ones that do can be interesting, like load and cache. There are also some links to the Isilon documentation on the right side.
This section has the open files and the HDFS-only connections per node. Below that is the OneFS filesystem throughput, which is across *all* protocols, and the cluster capacity, which is across all Access Zones.
These two graphs show the HDFS protocol broken out: by class as a percentage of the whole, and by the atomic operations as a number of ops.
These are the L1/L2/L3 cache statistics for the cluster, which are very important to your overall performance. L3 cache sits on SSD, so only nodes with SSDs can use it; my virtual nodes do not have SSDs, so you see it at 0. The second graph is the metadata version, broken out by node, so I can see how well balanced my workflow is.
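If you want to boil those cache graphs down to a single health number, the math is just hits over total lookups. A tiny sketch; the idea of feeding it per-level cache counters (e.g. hypothetical node.ifs.cache.l2.* values) is my assumption about how you'd use it:

```python
def hit_rate(hits, misses):
    """Cache hit ratio as a percentage of all lookups."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0

# e.g. an L2 sample of 9200 hits vs 800 misses
print(round(hit_rate(9200, 800), 1))  # -> 92.0
```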
HDFS Datanodes (Isilon)
I made this dashboard to better represent the behavior on Isilon that you would want to review about your DataNodes, like disks and network.
This section has some disk specifics showing the basics (reads/writes, latency); these should help you characterize what's going on at the disk level.
This section might be the most useful... don't ask me why it's last... But network traffic is important when looking at performance. I can't tell you how many times Isilon performance issues turned out to be network issues. So a quick look at this every day should give you a warm cozy feeling that everything is running as planned.
I hope this helps in your day-to-day activities for reviewing and monitoring your Isilon cluster. The magic of Grafana brings it all together nicely. Stop back periodically and look for updates; I will be posting any new dashboards I create or come across. Use the space below to share your experiences and any bugs or inconsistencies you come across.