Another area of OneFS that was recently redesigned and streamlined is Healthchecks. Previously, system health checks on Isilon faced several challenges. The available resources were a mixture of on-cluster and off-cluster tools, often with separate user interfaces. They were also typically reactive in nature and spread across Isilon Advisor, IOCA, self-service tools, and so on. To address these concerns, the new OneFS Healthcheck feature creates a single, common framework for system health check tools, considerably simplifying both the user experience and the development and deployment of checks. This affords the benefits of proactive risk management and reduced resolution time, resulting in improved overall cluster uptime.


OneFS Healthchecks make no changes to the cluster and are complementary to other monitoring services such as CELOG. On detection of an issue, a healthcheck displays an actionable message detailing the problem and the recommended corrective activity. If the action is complicated or involves decisions, a knowledge-base (KB) article will often be referenced. Alternatively, if no user action is possible or the remediation path is unclear, the recommendation will typically be to contact Dell EMC Isilon support.

Healthcheck functions include warning about non-recommended configurations, automatically detecting known issues with current usage and configuration, and identifying problems and anomalies in the environment where the cluster is deployed (network, AD, etc.).

OneFS currently provides sixteen checklist categories, plus an ‘All’ checklist, containing more than two hundred items, including eighty-three IOCA (Isilon On-Cluster Analysis) checks. These are:


Category          Description
All               All available checks
Avscan            Checklist to determine the overall health of AVScan
Cluster_capacity  Checklist to determine the overall capacity health for a pool or cluster
Infiniband        Checklist to determine the overall health of the Infiniband backend
IOCA              Pre-existing perl script that assesses the overall health of a cluster. Checklist contains all integrated IOCA items.
Job_engine        Job Engine-related health checks
Log_level         Checklist to determine the overall health of log-level
NDMP              Checklist to determine the overall health of NDMP
NFS               Checklist to determine the overall health of nfs
NTP               Checklist to determine the overall health of time synchronization
Post-upgrade      Checklist to determine post-upgrade cluster health
Pre-upgrade       Checklist to determine pre-upgrade cluster health
SmartConnect      Checklist to determine the overall health of SmartConnect
SmartPools        Checklist to determine the overall health of SmartPools
SMB               Checklist to determine the overall health of smb
Snapshot          Checklist to determine the overall health of snapshots.
Synciq            Checklist to determine the overall health of SyncIQ


Under the hood, a OneFS health check is a small script that assesses the health of a particular aspect of an Isilon cluster. It’s run on-cluster via the new healthcheck framework (HCF) and returns both a status and a value:

 

Health Attribute  Description
Status            OK, WARNING, CRITICAL, EMERGENCY, UNSUPPORTED
Value             100 is healthy; 0 is not.
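When several items are evaluated together, it can be handy to boil their statuses down to the single most severe one. The following is a minimal sketch of that idea, not part of OneFS: the function names status_rank and worst_status are hypothetical, and the severity ordering (OK, then WARNING, then CRITICAL, then EMERGENCY) is an assumption based on the usual meaning of those words. UNSUPPORTED is assumed to mean the check does not apply, so it never outranks OK here.

```shell
#!/bin/sh
# Hypothetical helper: map a healthcheck status string to a numeric rank.
# The ordering is an assumption, not something documented by OneFS.
status_rank() {
    case "$1" in
        OK)          echo 0 ;;
        WARNING)     echo 1 ;;
        CRITICAL)    echo 2 ;;
        EMERGENCY)   echo 3 ;;
        UNSUPPORTED) echo -1 ;;   # assumption: check not applicable
        *)           echo "unknown status: $1" >&2; return 1 ;;
    esac
}

# Print the most severe status among the arguments (OK if none are given).
worst_status() {
    worst=OK
    for s in "$@"; do
        r=$(status_rank "$s") || return 1
        if [ "$r" -gt "$(status_rank "$worst")" ]; then
            worst=$s
        fi
    done
    echo "$worst"
}

worst_status OK WARNING OK
```

With the three sample statuses above, the script prints WARNING.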


The following terms are helpful in understanding the Healthcheck framework:


Type        Description
Item        Script that checks one specific aspect of the cluster
Checklist   Group of related Items, for ease of use
Evaluation  One instance of running an Item or Checklist
Freshness   Each Item has a ‘freshness’ value, which defines whether its result is new or cached from a previous evaluation
Parameter   Additional information provided to the Item(s)
Result      Output of one Evaluation
RUP         Roll-Up Patch: the delivery vehicle for new OneFS Healthchecks and patches

 

CLI commands:


The healthchecks themselves run automatically each day. They can also be managed via the OneFS CLI using a dedicated set of ‘isi healthcheck’ commands. For example, the following syntax will display all the checklist categories available:


# isi healthcheck checklists list


To list or view details of the various individual checks available within each category, use the ‘items’ argument and grep to filter by category. For example, the following command will list all the snapshot checks:


# isi healthcheck items list | grep -i snapshot

fsa_abandoned_snapshots        Per cluster   Warns if the FSAnalyze job has failed or has left excess snapshots on the cluster after a failure
ioca_checkSnapshot             Per cluster   Checks if the Snapshot count is approaching cluster limit of 20,000, whether Autodelete is set to yes, and checks snapshot logs. Checks snapshot logs for EIN/EIO/EDEADLK/Failed to create snapshot
old_snapshots                  Per cluster   Checks for the presence of snapshots older than 1 month
snapshot_count                 Per cluster   Verify the snapshot counts on the cluster conform to the limits.
                                             1. Active snapshot count - Number of active snapshots in the system.
                                             2. In-delete snapshot count - Number of snapshots pending delete.
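For scripting, it is often more useful to have just the item names than the full listing. The sketch below filters a listing by keyword and prints only the first column. It is an illustration under stated assumptions: the function name items_matching is hypothetical, the sample text is taken from the listing above with descriptions shortened, and the last sample row (example_other_check) is invented purely to show a non-match. On a cluster, the input would come from ‘isi healthcheck items list’ instead.

```shell
#!/bin/sh
# Sample rows from the listing above (descriptions shortened).
# The final row is hypothetical and exists only to demonstrate a non-match.
sample_items='fsa_abandoned_snapshots        Per cluster   Warns if the FSAnalyze job has failed or has left excess snapshots
ioca_checkSnapshot             Per cluster   Checks if the Snapshot count is approaching cluster limit of 20,000
old_snapshots                  Per cluster   Checks for the presence of snapshots older than 1 month
snapshot_count                 Per cluster   Verify the snapshot counts on the cluster conform to the limits.
                                             1. Active snapshot count - Number of active snapshots in the system.
example_other_check            Per cluster   Hypothetical row that should not match'

# Print the names of items whose row mentions the keyword (case-insensitive).
# Reads the listing on stdin; $1 is the keyword.
items_matching() {
    awk -v kw="$1" '
        # Only rows whose first field looks like an item name are considered,
        # which skips wrapped description lines such as "1. Active snapshot count".
        $1 ~ /^[A-Za-z_][A-Za-z0-9_]*$/ && index(tolower($0), tolower(kw)) { print $1 }
    '
}

printf '%s\n' "$sample_items" | items_matching snapshot
```

On a cluster, the equivalent call would be ‘isi healthcheck items list | items_matching snapshot’, assuming the real listing uses the same column layout as the sample.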


The details of an individual check, in this case ‘old_snapshots’, can be displayed using the following syntax:


# isi healthcheck items view old_snapshots

Name: old_snapshots
Summary: Checks for the presence of snapshots older than 1 month
Scope: Per cluster
Freshness: Now
Parameters:
freshness_days(38)  *
Description: * OK: There are no unusually old snapshots stored on the cluster
* WARNING: At least one snapshot stored on the cluster is over one month old.
This does not necessarily constitute a problem and may be intentional, but such
snapshots may consume a considerable amount of storage. Snapshots may be viewed
with 'isi snapshot snapshots list', and marked for automatic removal with 'isi
snapshot snapshots delete <snapshot name>'

 

The full suite of checks for a particular category (or ‘all’) can be run with the ‘run’ argument. For example, to kick off the snapshot checks:


# isi healthcheck run snapshot
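Running a checklist and then looking up its evaluation are natural steps to chain together. The following is a minimal sketch that uses only the two commands shown in this article. The function name run_and_list is hypothetical, and whether ‘run’ returns before the evaluation has finished is not stated here, so the newest entry may still be in progress when it is listed.

```shell
#!/bin/sh
# Run one checklist, then list the evaluations recorded for it.
# $1 is the checklist name, for example "snapshot".
run_and_list() {
    checklist=$1
    isi healthcheck run "$checklist" || return 1
    isi healthcheck evaluations list | grep -i -- "$checklist"
}

# Only attempt the call where the isi CLI is present (that is, on a cluster node).
if command -v isi >/dev/null 2>&1; then
    run_and_list snapshot
fi
```

The guard at the end means the script does nothing when it is run somewhere other than on a cluster node.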


The ‘evaluations’ argument can be used to display when each set of healthchecks was run. In this case, listing and grep’ing for snapshots will show when the test suite was executed, whether it completed, and whether it passed:


# isi healthcheck evaluations list | grep -i snapshot

snapshot20191014T2046 Completed - Pass - /ifs/.ifsvar/modules/health-check/results/evaluations/snapshot20191014T2046
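The evaluation ID appears to encode both the checklist name and the time of the run. The sketch below splits an ID into those parts. Treat the format as an assumption inferred from the sample output (checklist name, then YYYYMMDD, a literal T, then HHMM) rather than a documented contract; the function name split_eval_id is hypothetical.

```shell
#!/bin/sh
# Split an evaluation ID such as "snapshot20191014T2046" into
# "<checklist> <YYYY-MM-DD> <HH:MM>".
# Assumption: the ID is <checklist><YYYYMMDD>T<HHMM>, inferred from the
# sample output and not from OneFS documentation.
# Prints nothing if the ID does not match that shape.
split_eval_id() {
    printf '%s\n' "$1" | sed -n \
        's/^\(.*[^0-9]\)\([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)T\([0-9]\{2\}\)\([0-9]\{2\}\)$/\1 \2-\3-\4 \5:\6/p'
}

split_eval_id snapshot20191014T2046
```

For the ID above, this prints "snapshot 2019-10-14 20:46".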

 

The ‘evaluations view’ argument can be used to display the details of a particular healthcheck run, including whether it completed, whether it passed, specifics of any failures, and the location of the pertinent logfile:

 

# isi healthcheck evaluations view snapshot20191014T2046

ID: snapshot20191014T2046
Checklist: snapshot
Overrides: -
Parameters: {}
Run Status: Completed
Result: Pass
Failure: -
Logs: /ifs/.ifsvar/modules/health-check/results/evaluations/snapshot20191014T2046
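Because the view output is a simple list of "Key: Value" lines, it is straightforward to check from a script. The following sketch reads that output and reports whether the evaluation completed and passed. The function names view_field and evaluation_passed are hypothetical, the sample text is copied from the output above, and the assumption that any Result other than "Pass" needs attention is mine rather than something OneFS documents.

```shell
#!/bin/sh
# Sample text copied from the output above; on a cluster, pipe the output of
# `isi healthcheck evaluations view <id>` into the functions instead.
sample_view='ID: snapshot20191014T2046
Checklist: snapshot
Overrides: -
Parameters: {}
Run Status: Completed
Result: Pass
Failure: -
Logs: /ifs/.ifsvar/modules/health-check/results/evaluations/snapshot20191014T2046'

# Print the value of one "Key: Value" field read from stdin ($1 is the key).
view_field() {
    awk -v key="$1" '
        { sub(/^[ \t]+/, "") }    # tolerate indented or right-aligned keys
        index($0, key ": ") == 1 { print substr($0, length(key) + 3); exit }
    '
}

# Succeed only if the evaluation completed and its result is "Pass".
evaluation_passed() {
    view=$(cat)
    [ "$(printf '%s\n' "$view" | view_field 'Run Status')" = "Completed" ] &&
    [ "$(printf '%s\n' "$view" | view_field 'Result')" = "Pass" ]
}

if printf '%s\n' "$sample_view" | evaluation_passed; then
    echo "evaluation passed"
else
    echo "evaluation needs attention; see: $(printf '%s\n' "$sample_view" | view_field Logs)"
fi
```

Since evaluation_passed communicates through its exit status, it can be dropped into a cron job or a larger monitoring script without further parsing.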

 

New health checks are included in Roll-Up Patches, or RUPs (previously known as Service Packs), for common versions of OneFS, specifically 8.0.0.7, 8.1.0.2, 8.1.0.4, 8.1.2, 8.1.3, 8.2.0, and 8.2.1. The RUPs for these releases are typically delivered monthly, and new checks are added to subsequent RUPs.

 

With the delivery of each new RUP for a particular release, the core OneFS release is also rebuilt to include the latest health checks and patches. This means that the customer download URL for a OneFS release will automatically include the latest pre-installed RUP, thereby removing an additional patching/reboot requirement from the cluster’s maintenance cycle. The checks run across all nodes, typically daily. The results are also automatically incorporated into ‘isi_phone_home’ data.