Another area of OneFS that was recently redesigned and streamlined is Healthchecks. Previously, system health checks on Isilon suffered from several challenges. The available resources were a mixture of on-cluster and off-cluster tools, often with separate user interfaces. They were also typically reactive in nature and spread across Isilon Advisor, IOCA, self-service tools, etc. To address these concerns, the new OneFS Healthcheck feature creates a single, common framework for system health check tools, considerably simplifying both the user experience and ease of development and deployment. This affords the benefits of proactive risk management and reduced resolution time, resulting in improved overall cluster uptime.
OneFS Healthchecks make no changes to the cluster and are complementary to other monitoring services such as CELOG. On detection of an issue, a healthcheck displays an actionable message detailing the problem and the recommended corrective activity. If the action is complicated or involves decisions, a knowledge-base (KB) article will often be referenced. Alternatively, if no user action is possible or the remediation path is unclear, the recommendation will typically be to contact Dell EMC Isilon support.
Healthcheck functions include warning about a non-recommended configuration, automatically detecting known issues with current usage and configuration, and identifying problems and anomalies in the environment where the cluster is deployed (network, AD, etc.).
OneFS currently provides sixteen checklist categories containing more than two hundred items, including eighty-three IOCA (Isilon On-Cluster Analysis) checks. These are:
All available checks
Checklist to determine the overall health of AVScan
Checklist to determine the overall capacity health for a pool or cluster
Checklist to determine the overall health of the InfiniBand backend
Pre-existing Perl script that assesses the overall health of a cluster. Checklist contains all integrated IOCA items.
Job Engine-related health checks
Checklist to determine the overall health of log-level
Checklist to determine the overall health of NDMP
Checklist to determine the overall health of NFS
Checklist to determine the overall health of time synchronization
Checklist to determine post-upgrade cluster health
Checklist to determine pre-upgrade cluster health
Checklist to determine the overall health of SmartConnect
Checklist to determine the overall health of SmartPools
Checklist to determine the overall health of SMB
Checklist to determine the overall health of snapshots
Checklist to determine the overall health of SyncIQ
Under the hood, a OneFS health check is a small script that assesses the vitality of a particular aspect of an Isilon cluster. It’s run on-cluster via the new healthcheck framework (HCF) and returns both a value and one of the following statuses:
OK, WARNING, CRITICAL, EMERGENCY, UNSUPPORTED
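As an illustration of how per-item statuses could be rolled up into a single result for a checklist, the following Python sketch ranks the five statuses and returns the most severe one. The severity ordering (in particular where UNSUPPORTED sits) and the `worst_status` helper are assumptions made for this example, not documented behavior of the healthcheck framework.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordering is an assumption for illustration only;
    # the framework may rank these statuses differently.
    OK = 0
    UNSUPPORTED = 1
    WARNING = 2
    CRITICAL = 3
    EMERGENCY = 4


def worst_status(statuses):
    """Return the most severe status, or OK if no statuses are given."""
    return max(statuses, default=Status.OK)


if __name__ == "__main__":
    item_results = [Status.OK, Status.WARNING, Status.OK]
    print(worst_status(item_results).name)
```

Using an `IntEnum` keeps the comparison logic in one place, so changing the assumed ordering only requires renumbering the members.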
The following terminology is defined and helpful in understanding the Healthcheck framework:
Script that checks a specific thing
Group related Items for easy use
One instance of running an Item or Checklist
Each item has a ‘freshness’ value which defines whether its result is new or cached from a previous evaluation
Additional information provided to the item(s)
Output of one Evaluation
Roll-up Patch: The delivery vehicle for new OneFS Healthchecks and patches.
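The relationship between these terms can be sketched as a small data model. This is a hypothetical Python illustration only: the class names mirror the terminology above, but the field names, the `run` method, and the `check_old_snapshots` stand-in (with its `oldest_snapshot_days` parameter) are all assumptions and do not reflect how the framework is actually implemented. Freshness is not modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Item:
    """A script that checks one specific thing."""
    name: str
    # Hypothetical signature: takes parameters, returns a status string.
    check: Callable[[Dict[str, str]], str]


@dataclass
class Checklist:
    """A named group of related items."""
    name: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Evaluation:
    """One instance of running a checklist with a set of parameters."""
    checklist: Checklist
    parameters: Dict[str, str] = field(default_factory=dict)
    results: Dict[str, str] = field(default_factory=dict)

    def run(self) -> Dict[str, str]:
        # Run every item and record its status under the item name.
        for item in self.checklist.items:
            self.results[item.name] = item.check(self.parameters)
        return self.results


def check_old_snapshots(params: Dict[str, str]) -> str:
    # Hypothetical stand-in: a real check would inspect the cluster.
    oldest_days = int(params.get("oldest_snapshot_days", "0"))
    return "WARNING" if oldest_days > 30 else "OK"


if __name__ == "__main__":
    checklist = Checklist("snapshot", [Item("old_snapshots", check_old_snapshots)])
    evaluation = Evaluation(checklist, {"oldest_snapshot_days": "45"})
    print(evaluation.run())
```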
The healthchecks themselves automatically run daily. They can also be managed via the OneFS CLI using a dedicated set of ‘isi healthcheck’ commands. For example, the following syntax will display all the checklist categories available:
# isi healthcheck checklists list
To list or view details of the various individual checks available within each category, use the ‘items’ argument and grep to filter by category. For example, the following command will list all the snapshot checks:
# isi healthcheck items list | grep -i snapshot
fsa_abandoned_snapshots Per cluster Warns if the FSAnalyze job has failed or has left excess snapshots on the cluster after a failure
ioca_checkSnapshot Per cluster Checks if the Snapshot count is approaching cluster limit of 20,000, whether Autodelete is set to yes, and checks snapshot logs. Checks snapshot logs for EIN/EIO/EDEADLK/Failed to create snapshot
old_snapshots Per cluster Checks for the presence of snapshots older than 1 month
snapshot_count Per cluster Verify the snapshot counts on the cluster conform to the limits.
- 1. Active snapshot count - Number of active snapshots in the system.
- 2. In-delete snapshot count - Number of snapshots pending delete.
The details of an individual check, in this case ‘old_snapshots’, can be displayed using the following syntax:
# isi healthcheck items view old_snapshots
Summary: Checks for the presence of snapshots older than 1 month
Scope: Per cluster
Description: * OK: There are no unusually old snapshots stored on the cluster
* WARNING: At least one snapshot stored on the cluster is over one month old.
This does not necessarily constitute a problem and may be intentional, but such
snapshots may consume a considerable amount of storage. Snapshots may be viewed
with 'isi snapshot snapshots list', and marked for automatic removal with 'isi
snapshot snapshots delete <snapshot name>'
The full suite of checks for a particular category (or ‘all’) can be run as follows. For example, to kick off the snapshot checks:
# isi healthcheck run snapshot
The ‘evaluations’ argument can be used to display when each set of healthchecks was run. In this case, listing the evaluations and filtering for snapshots with grep shows when the test suite was executed, whether it completed, and whether it passed:
# isi healthcheck evaluations list | grep -i snapshot
snapshot20191014T2046 Completed - Pass - /ifs/.ifsvar/modules/health-check/results/evaluations/snapshot20191014T2046
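When this output is collected by a script, each line can be split into its parts. The Python sketch below is a minimal example that assumes the line layout shown above (evaluation ID first, log path last, state text in between) and assumes that the ID is the checklist name followed by a `YYYYMMDDTHHMM` timestamp. Both the `parse_evaluation_line` helper and these layout assumptions are illustrative and not part of the OneFS CLI.

```python
import re
from datetime import datetime

# Assumed ID layout: <checklist name><YYYYMMDD>T<HHMM>
ID_PATTERN = re.compile(r"^(?P<checklist>.+?)(?P<stamp>\d{8}T\d{4})$")


def parse_evaluation_line(line):
    """Split one line of 'evaluations list' output into named fields."""
    tokens = line.split()
    if len(tokens) < 3:
        raise ValueError(f"unexpected line format: {line!r}")
    eval_id, log_path = tokens[0], tokens[-1]
    match = ID_PATTERN.match(eval_id)
    if match is None:
        raise ValueError(f"unexpected evaluation ID: {eval_id!r}")
    return {
        "id": eval_id,
        "checklist": match.group("checklist"),
        # Timestamp embedded in the ID, assumed to be the run time.
        "stamp": datetime.strptime(match.group("stamp"), "%Y%m%dT%H%M"),
        # Everything between the ID and the path, kept as raw text.
        "state": " ".join(tokens[1:-1]),
        "log_path": log_path,
    }


if __name__ == "__main__":
    sample = (
        "snapshot20191014T2046 Completed - Pass - "
        "/ifs/.ifsvar/modules/health-check/results/evaluations/snapshot20191014T2046"
    )
    for key, value in parse_evaluation_line(sample).items():
        print(f"{key}: {value}")
```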
The ‘evaluations view’ argument can be used to display the details of a particular healthcheck run, including whether it completed, whether it passed, specifics of any failures, and the location of the pertinent logfile:
# isi healthcheck evaluations view snapshot20191014T2046
Run Status: Completed
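For automation, output of this form can be read as simple "Key: Value" pairs. The following Python sketch assumes that layout and relies only on the ‘Run Status’ field shown above. The `parse_view_output` and `run_completed` helpers are hypothetical names for this example, and any other fields that the command prints are not assumed here.

```python
def parse_view_output(text):
    """Collect 'Key: Value' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip lines without a colon or without a key before it.
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def run_completed(text):
    """Return True if the output reports a run status of Completed."""
    return parse_view_output(text).get("Run Status") == "Completed"


if __name__ == "__main__":
    print(run_completed("Run Status: Completed"))
```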
New health checks are included in Roll-Up Patches, or RUPs (previously known as Service Packs), for common versions of OneFS, specifically 220.127.116.11, 18.104.22.168, 22.214.171.124, 8.1.2, 8.1.3, 8.2.0, 8.2.1. The RUPs for these releases are typically delivered monthly and new checks are added to subsequent RUPs.
With the delivery of each new RUP for a particular release, the core OneFS release is also rebuilt to include the latest health checks and patches. This means that the customer download URL for a OneFS release will automatically include the latest pre-installed RUP, thereby removing an additional patching/reboot requirement from the cluster’s maintenance cycle. The checks run across all nodes and are typically run daily. The results are also automatically incorporated into ‘isi_phone_home’ data.