Using Kapacitor to generate alerts for an Isilon OneFS cluster

NOTE: This topic is part of the Uptime Information Hub.

 

Kapacitor real-time streaming data processing

 

This document details how to add the Kapacitor real-time streaming data processing engine to the Isilon data insights connector, InfluxDB, and Grafana stack to provide flexible, configurable, real-time notification of alert conditions based on the data being collected.

 

Kapacitor subscribes to updates to the InfluxDB time-series database, which enables real-time responses to the data as it is received.
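 

For reference, Kapacitor connects to InfluxDB (and sets up its subscription) through the [[influxdb]] section of /etc/kapacitor/kapacitor.conf. The fragment below is only an illustrative sketch that assumes a default local InfluxDB install; adjust the URL and credentials to match your environment:

[[influxdb]]
  enabled = true
  default = true
  name = "localhost"
  urls = ["http://localhost:8086"]
  username = ""
  password = ""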

 

How these components work together

 

Isilon OneFS incorporates a powerful application program interface (API) for management, monitoring, and control of the OneFS operating system. Isilon has recently published a software development kit (SDK) providing language bindings for the OneFS API, and a data insights statistics connector to enable easier programmatic access to the OneFS API. You can browse these tools at the following GitHub sites:

 

Isilon Data Insights Connector

 

The Isilon data insights statistics connector enables the administrator of a OneFS cluster to specify statistics groups to gather, store, and visualize data using a combination of the connector, an InfluxDB time-series database, and the Grafana visualization platform. Browse information about those tools at the following links:

 

Initial setup

 

This article assumes that you have already installed and configured the Isilon data insights connector, and that you have a functioning Grafana/InfluxDB instance monitoring one or more OneFS clusters. Use the following instructions for the initial Kapacitor setup:

  1. Install Kapacitor from:  https://www.influxdata.com/downloads/#kapacitor.
  2. The InfluxData getting-started page contains useful examples, but it is not entirely relevant here because it uses the Telegraf agent from InfluxData to generate statistics. In our case, the Isilon data insights connector is already feeding sets of statistics (measurements) into InfluxDB.

    Note that the RPM package already includes a configuration file (/etc/kapacitor/kapacitor.conf), so you do not need to generate one.
  3. Edit the /etc/kapacitor/kapacitor.conf file to change the alert provider configurations as necessary. For example, to enable email alerts, find the section beginning “[smtp]” and modify the configuration to use an available SMTP provider. An illustrative fragment is shown below.
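
As a sketch only (the host names and addresses below are placeholders, not values from a real deployment), an “[smtp]” section that enables email alerts might look like this:

[smtp]
  enabled = true
  host = "smtp.example.com"
  port = 25
  from = "kapacitor@example.com"
  to = ["storage-admins@example.com"]
  # Only send email when the alert changes state
  state-changes-only = true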

 

Kapacitor Scripting

 

Kapacitor uses one or more tasks, defined using “TICK” scripts, to control what data should be filtered, how it should be processed, and what criteria to use for alerting on that data. TICK scripts are written in a domain-specific language (DSL) and are somewhat tersely documented on the Kapacitor documentation site. This article presents some example scripts and some patterns that enable more sophisticated alerting criteria (for example, moving-average statistics).

 

How to create and enable a TICK task

 

Edit the script using your favorite text editor. We suggest naming these scripts with the “.tick” extension, for example, “nfs_avg_lat_alert.tick”.

 

Install the script into Kapacitor using the command-line interface (CLI). The generic form of the command is:

kapacitor define <internal_kapacitor_name> -type stream -tick <path_to_tick_script> -dbrp isi_data_insights.autogen

 

The internal name should be something descriptive. We use only stream scripts in our examples, but note that Kapacitor can also perform batch processing.

 

The “-tick” argument is the full path to the TICK script file. The “-dbrp” argument specifies the InfluxDB database and retention policy. Since we are using the Isilon data insights connector database, the correct value for our examples is “isi_data_insights.autogen” (this value would differ if a different source database were in use). Using “nfs_avg_lat_alert.tick” as our example script, the command to define the task is:

kapacitor define nfs_lat_alert -type stream -tick /root/nfs_avg_lat_alert.tick -dbrp isi_data_insights.autogen

 

Here is the “nfs_avg_lat_alert.tick” script:

stream
    // Select avg NFS3 proto response time
    |from()
        .database('isi_data_insights')
        .measurement('cluster.protostats.nfs.total')
    |eval(lambda: float("time_avg") / 1000.0)
        .as('time_ms')
    |groupBy('cluster')
    |alert()
        .id('{{ index .Tags "cluster" }}/{{ .Name }}')
        .message('Average value of {{ .ID }} is {{ .Level}} value: {{ index .Fields "time_ms" }}ms')
        .crit(lambda: "time_ms" > 50.0)
        .warn(lambda: "time_ms" > 20.0)
        // Only warn every 15 mins if we haven't changed state
        .stateChangesOnly(15m)
        // Whenever we get an alert write it to a file.
        .log('/tmp/alerts.log')
        .slack()

 

Now let’s break this down further:

  • This is a stream task, so it starts with “stream”.
  • Next, we choose where the data comes from: the “isi_data_insights” database, which is populated by the already-installed Isilon data insights connector. We choose a single measurement, “cluster.protostats.nfs.total”, which contains the totaled (clusterwide, as opposed to node-specific) NFS3 protocol statistics.
  • Next, an “eval” node takes the “time_avg” field for the operations and divides it by 1000. The statistic values are reported in microseconds, so this node converts them to milliseconds.
  • Next, a “groupBy” node groups the data by the measurement tag “cluster”, because the statistics for each cluster are distinct (for example, we don’t want a low value from one cluster resetting the alert threshold of another cluster).
  • Finally, the “alert” node. This is quite detailed:
    • We define the alert ID that appears in the messages. In this case it will be of the form <clustername>/<measurement name>.
    • We define the format of the message that appears in the alert. “.Level” is the alert level (crit, warn, info, ok). We index into the fields of the measurement to extract the “time_ms” field we generated, to show the actual time value.
    • The “.crit” and “.warn” properties each define a Boolean lambda function that determines whether that alert level has been reached. In this case, we define the critical level to be a latency greater than 50ms, and the warning level to be a latency greater than 20ms.
    • We add a “squelch” so that if the alert level hasn’t changed we only see one alert every 15 minutes, rather than being spammed with messages every 30 seconds (or whatever the data insights connector interval is set to).
    • The “.log” property simply logs these alerts to a local file (useful for testing).
    • In this case, the alert is configured to use the Slack channel. This can be changed to “.email” if that has been configured in the /etc/kapacitor/kapacitor.conf file, or “.post” to use an HTTP POST to a given URL; a sketch follows this list. Numerous other alert channels are available. See the Kapacitor documentation for details.
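
For example, assuming SMTP has been configured as described earlier, the final handler line of the alert node could be replaced (or supplemented) with something like the following sketch; the address and URL are placeholders:

        // Send the alert by email instead of (or as well as) Slack
        .email('storage-admins@example.com')
        // Or POST the alert as JSON to an HTTP endpoint
        .post('http://alert-handler.example.com/kapacitor')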

 

Provided the syntax is correct, the task should now be defined in Kapacitor. However, it is not yet enabled:

kapacitor list tasks

Output similar to the following displays:

ID                            Type      Status Executing Databases and Retention Policies

nfs_lat_alert                stream    disabled false ["isi_data_insights"."autogen"]

 

To enable the task, simply type:

kapacitor enable nfs_lat_alert

The task should now be enabled:

kapacitor list tasks

Output similar to the following displays:

ID                            Type      Status Executing Databases and Retention Policies

nfs_lat_alert                stream    enabled true ["isi_data_insights"."autogen"]
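
When you later modify a TICK script, re-run the same “kapacitor define” command so that Kapacitor picks up the changes; if in doubt, disable and re-enable the task. For example:

kapacitor define nfs_lat_alert -type stream -tick /root/nfs_avg_lat_alert.tick -dbrp isi_data_insights.autogen
kapacitor disable nfs_lat_alert
kapacitor enable nfs_lat_alert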

 

Check the status of the task and see the results at each node in the script:

kapacitor show nfs_lat_alert

Output similar to the following displays:

ID: nfs_lat_alert
Error:
Template:
Type: stream
Status: enabled
Executing: true
Created: 10 Aug 16 12:10 PDT
Modified: 16 Aug 16 06:40 PDT
LastEnabled: 16 Aug 16 06:40 PDT
Databases Retention Policies: ["isi_data_insights"."autogen"]

TICKscript:

stream
    // Select avg NFS3 proto response time
    |from()
        .database('isi_data_insights')
        .measurement('cluster.protostats.nfs.total')
    |eval(lambda: float("time_avg") / 1000.0)
        .as('time_ms')
    |groupBy('cluster')
    |alert()
        .id('{{ index .Tags "cluster" }}/{{ .Name }}')
        .message('Average value of {{ .ID }} is {{ .Level}} value: {{ index .Fields "time_ms" }}ms')
        .crit(lambda: "time_ms" > 50.0)
        .warn(lambda: "time_ms" > 20.0)
        // Only warn every 15 mins if we haven't changed state
        .stateChangesOnly(15m)
        // Whenever we get an alert write it to a file.
        .log('/tmp/alerts.log')
        .slack()

 

DOT:

digraph nfs_lat_alert {
graph [throughput="0.00 points/s"];

stream0 [avg_exec_time_ns="0" ];
stream0 -> from1 [processed="58279"];

from1 [avg_exec_time_ns="1.215s" ];
from1 -> eval2 [processed="58279"];

eval2 [avg_exec_time_ns="208.86s" eval_errors="0" ];
eval2 -> groupby3 [processed="58279"];

groupby3 [avg_exec_time_ns="28.392s" ];
groupby3 -> alert4 [processed="58279"];

alert4 [alerts_triggered="2457" avg_exec_time_ns="87.22134ms" crits_triggered="836" infos_triggered="0" oks_triggered="1008" warns_triggered="613" ];
}

 

This output shows that the script is working and triggering on events. The “DOT:” section can be rendered as a graph using the “GraphViz” package.
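
For example, if you copy the “digraph … { … }” portion of the output into a file, it can be rendered to an image with the Graphviz “dot” tool (the file names here are arbitrary):

# Save the DOT section of "kapacitor show nfs_lat_alert" into nfs_lat_alert.dot, then:
dot -Tpng nfs_lat_alert.dot -o nfs_lat_alert.png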

 

Sample output generated in our Slack channel


[Image: Kapacitor_slack.jpg]


This initial script works well, but it is rather simplistic; in particular, it will alert on momentary spikes in load, which may not be desirable. The following sections present example TICK script patterns that address this:

 

Moving average of measurement

 

This is an example of a script that uses a moving window to average the statistic value over a recent window:

stream
    // Select avg NFS3 proto response time
    |from()
        .database('isi_data_insights')
        .measurement('cluster.protostats.nfs.total')
    |groupBy('cluster')
    |window()
        .period(10m)
        .every(1m)
    |mean('time_avg')
        .as('time_avg')
    |eval(lambda: float("time_avg") / 1000.0)
        .as('mean_ms')
        .keep('mean_ms', 'time_avg')
    |alert()
        .id('{{ index .Tags "cluster" }}/{{ .Name }}')
        .message('Windowed average of avg value of {{ .ID }} is {{ .Level}} value: {{ index .Fields "mean_ms" }}ms')
        .crit(lambda: "mean_ms" > 50.0)
        .warn(lambda: "mean_ms" > 25.0)
        // Only warn every 15 mins if we haven't changed state
        .stateChangesOnly(15m)
        // Whenever we get an alert write it to a file.
        .log('/tmp/alerts.log')
        .slack()

 

This script is similar to the previous script, but there are a few important differences:

  • The “window” node generates a window of data. With the values specified, we keep the last 10 minutes of data and emit it every minute.
  • The window output is fed into a “mean” node that calculates the mean of the windowed data (in this case the “time_avg” field) and stores the result back as the “time_avg” field, to be fed further down the pipeline.
  • The “eval” node converts the microsecond average into a new millisecond field, “mean_ms” (the “.keep” property retains both fields in the pipeline).
  • The rest of the alert is similar to the previous example.

 

Joining/alerting based on two different measurements

 

This script alerts based on a moving average, but only if the operation count is above a given threshold. It’s probably not safe to use this as the sole alerting mechanism, because a deadlock (which will reduce the operation count to zero) won’t generate an alert.

 

Additional scripts are provided below to look for deadlock events (“node.ifs.heat.deadlocked.total” measurement) and to alert if no data points have been collected in a configurable period.

// Alert based off mean NFS3 proto response time if work is actually happening

var timestream = stream
    |from()
        .database('isi_data_insights')
        .measurement('cluster.protostats.nfs.total')
    |groupBy('cluster')
    |window()
        .period(10m)
        .every(1m)
    |mean('time_avg')
        .as('time_avg')
    |eval(lambda: float("time_avg") / 1000.0)
        .as('mean_ms')

var opstream = stream
    |from()
        .database('isi_data_insights')
        .measurement('cluster.protostats.nfs.total')
    |groupBy('cluster')
    |window()
        .period(10m)
        .every(1m)
    |mean('op_rate')
        .as('op_rate')

timestream
    |join(opstream)
        .as('times', 'ops')
    |alert()
        .id('{{ index .Tags "cluster" }}/{{ .Name }}')
        .message('Cluster {{ index .Tags "cluster" }} is executing {{ index .Fields "ops.op_rate" }} NFSv3 operations per second and windowed average of avg value of {{ .Name }} is {{ .Level }} value: {{ index .Fields "times.mean_ms" }}ms')
        .crit(lambda: "ops.op_rate" > 1000 AND "times.mean_ms" > 25.0)
        .warn(lambda: "ops.op_rate" > 1000 AND "times.mean_ms" > 10.0)
        // .info(lambda: TRUE)
        // Only warn every 15 mins if we haven't changed state
        .stateChangesOnly(15m)
        // Whenever we get an alert write it to a file.
        .log('/tmp/alerts.log')
        .slack()

 

This script is significantly different from our previous examples. It uses variables to store the two different streams that we sample, and then uses a “join” operation to create a single stream containing both sets of data to alert from.
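
If the points from the two streams are not timestamped identically, the join node can also be given a tolerance so that nearby points are still paired. This is a sketch of an optional addition, not part of the script above:

timestream
    |join(opstream)
        .as('times', 'ops')
        // Treat points within 5 seconds of each other as the same sample
        .tolerance(5s)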

 

Deadman alert to warn if data collection fails

 

This script uses the Kapacitor “deadman” node to warn when the number of collected/emitted points falls below a defined threshold in a given period. Many of the statistics collected by the data insights connector are updated as frequently as every 30 seconds, but the overall collection period can be longer if many clusters are being monitored, if they are large, or if they are under heavy load.

 

We will arbitrarily choose 5 minutes as the interval for this example.

// Deadman alert for cluster data collection
var data = stream
    |from()
        .database('isi_data_insights')
        .measurement('cluster.health')
        .groupBy('cluster')

data
    |deadman(1.0, 5m)
        .id('Statistics data collection for cluster {{ index .Tags "cluster" }}')
        .slack()

 

This script will output alerts of the form:

 

    Statistics collection for cluster logserver is dead: 0.0

or

    Statistics collection for cluster logserver is alive: 1.0

 

Deadlock event count alert

 

This script uses one of the OneFS filesystem “heat” statistics to look for high rates of deadlocks within the filesystem.

stream
    // Alert based off node heat stats
    |from()
        .database('isi_data_insights')
        .measurement('node.ifs.heat.deadlocked.total')
    |groupBy('cluster')
    |alert()
        .id('Deadlock event count')
        .message('Value of {{ .ID }} on cluster {{ index .Tags "cluster" }}, node {{ index .Tags "node" }} is {{ .Level }} value: {{ index .Fields "value" }}')
        .crit(lambda: "value" > 50.0)
        .warn(lambda: "value" > 10.0)
        // .info(lambda: TRUE)
        // Only warn every 15 mins if we haven't changed state
        .stateChangesOnly(15m)
        // Whenever we get an alert write it to a file.
        .log('/tmp/alerts.log')
        .slack()

 

Other useful node types

 

Kapacitor offers a number of useful processing nodes to filter the data. Examples that are of particular interest are:

 

  • Mean/median/mode: computes the various average types.
  • Max/min: selects the largest/smallest point.
  • MovingAverage: a relatively new node that would simplify our earlier windowed-average example (see the sketch below).
  • Stddev: computes the standard deviation of points. Useful for detecting anomalies.
  • Sum: sums the points.
  • Deadman: alerts if the points per interval drops below a given threshold. Useful to alert if the collector fails for some reason.
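
For example, assuming a Kapacitor version that includes the movingAverage node (which averages over a number of points rather than a time window), the window/mean pair in the moving-average example could be replaced with something like the following sketch. The choice of 20 points corresponds roughly to 10 minutes at a 30-second collection interval:

stream
    |from()
        .database('isi_data_insights')
        .measurement('cluster.protostats.nfs.total')
    |groupBy('cluster')
    // Average the last 20 points of time_avg per cluster
    |movingAverage('time_avg', 20)
        .as('time_avg')
    |eval(lambda: float("time_avg") / 1000.0)
        .as('mean_ms')
    |alert()
        // ... same alert configuration as in the earlier example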

 

Conclusion

 

Kapacitor enables powerful real-time alerts when used in conjunction with an InfluxDB instance. This article demonstrates how to leverage Kapacitor with the Isilon data insights open-source tool chain available at https://github.com/isilon.