Advanced Troubleshooting of an Isilon Cluster Part 1

NOTE: This topic is part of the Uptime Information Hub.

 


 

 

Introduction

There is a saying, "an ounce of prevention is worth a pound of cure." This is certainly true of managing an EMC Isilon cluster. This article describes specific tasks that can help you keep your cluster healthy.

 

When you experience technical difficulties with your EMC Isilon cluster, it is important to quickly find the source of the issue and resolve it. Some issues, such as Isilon data integrity (IDI) errors, require immediate attention from EMC Online Support. However, there are issues that you can effectively troubleshoot yourself. Helping yourself is a worthy goal that benefits everyone, but if you're not careful, it is all too easy to turn a minor problem into a major one. If you are not sure how to resolve an issue, contact EMC Online Support. Protecting the integrity of your data is our primary concern, and we are always available to help you.

OneFS: What is it?

OneFS is a:

  • Unix-based file system
  • Single namespace file system
  • Clustered, distributed file system

OneFS is largely written in C and Python and includes standard Unix tools such as truss, tcpdump, and so on.

 

Although the cluster is treated as a "black box" in some ways, it is not locked down, and there are useful debugging options on the cluster. However, because root access enables you to do many things, we recommend proceeding with caution when you are logged in as root.

 

If you are investigating protocol issues, you can take packet traces on the cluster by using tcpdump (or its friendly wrapper, isi_netlogger). You can then use Wireshark or Microsoft NetMon to examine them.
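
For example, the following tcpdump command, run as root on a node, captures all traffic between that node and a single client into a file that you can copy off the cluster and open in Wireshark or NetMon. The interface name (em0), the client address, and the output path are placeholders only; substitute values from your own environment:

tcpdump -i em0 -s 0 -w /ifs/data/client_trace.pcap host 10.1.1.50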

 

Similarly, if you are having issues with a command on the cluster and want to see more of what it is doing, you can use the truss command to monitor the system calls made by the command and view the results of these calls. The truss command is essentially the same as the strace command on Linux.
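
For example, to see the system calls a command makes as it runs, you can prefix it with truss and write the trace to a file, or attach to a process that is already running by its process ID. The output path and PID below are placeholders:

truss -o /ifs/data/ls_trace.out ls /ifs/data
truss -p 1234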

 

At a much more basic level, having shell access allows you to examine files and directories, read logs, perform simple tests, and work with many other command-line tools.

 

Unlike classic Unix, where init is responsible for starting most or all of the daemons, many of the services on OneFS start later because the file system must be up and running first. The isi_mcp daemon is the master control program (MCP) that manages these services, which you can view by using the isi services -a command.
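
For example, you can list the services that MCP manages and confirm that the isi_mcp daemon itself is running; the grep pattern below is simply one way to find the process:

isi services -a
ps auxww | grep isi_mcp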

Understanding the types of problems you may encounter

The ability to perform advanced troubleshooting of Isilon clusters is, to a great extent, predicated on fully understanding how the cluster should function. That understanding helps you anticipate the types of issues you might encounter, and it drives the troubleshooting process by giving you a logical path to follow when diagnosing issues. To learn about OneFS, read as much product documentation as you can. There is a large body of material available:

  • The core OneFS Administration Guides, Best Practices guides, and OneFS release notes on EMC Online Support
  • Knowledgebase articles, which you can find at https://support.emc.com
  • Customer-facing troubleshooting guides for OneFS, which EMC Isilon will publish in the near future
  • A wide variety of EMC Isilon training courses on the Isilon product
  • The EMC Isilon Community site, which connects you to a central hub of information and experts to help you optimize your current storage solution; the Isilon Info Hubs on the same site consolidate additional helpful product documentation

 

Isilon recently started offering a free health check for supported clusters—a new, evolving service that you might want to explore. The Isilon health check evaluates the status of your cluster's hardware, software, firmware, events, and fundamental settings. If you're running OneFS 6.5.5 or later, and you have an active maintenance agreement, you can request a health check by creating a Service Request at: https://support.emc.com/servicecenter/CreateSR.


A OneFS cluster is usually part of a complex infrastructure. Determining where the problem lies—and if it is an issue with the cluster, or with the environment—is a key part of troubleshooting. Examples of some external dependencies include:

  • Network (routers, firewalls, and so on)
  • Directory services (Active Directory, LDAP, NIS, and so on)
  • Domain Name System (DNS)

When troubleshooting, it is important to remember that issues can originate in these areas. Consider client access issues, for example: if a single client is affected, it is unlikely to be a network issue, at least if that client can access other hosts. But if a set of clients, or all clients, are affected, network connectivity is clearly one of the first things you would need to check.

 

Similarly, if a single client is having authentication issues, then it may be a client issue, or an issue with their account. If all clients are having authentication issues, then it may be a problem with the Active Directory domain controllers, or the network between the cluster and the directory services, or with the software on the cluster itself.
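
A few basic checks run from a node can help narrow this down quickly. For example, the following commands (with placeholder host names) test raw connectivity to a domain controller and confirm that DNS resolves its name; running similar checks from an affected client is equally useful:

ping dc01.example.com
nslookup dc01.example.com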

Clusters, groups, and quorum

One of the key elements of OneFS as a clustered file system is group communication. A group represents the nodes and drives that are part of the cluster and their state. Here is output about a group from a five-node lab cluster:

isi_group_info

 

Output:

efs.gmp.group: <4,1654>: { 1-3:0-11, 4:0-1,3-11,13, 5:0-11 }

 

<4,1654> indicates there have been 1654 changes in membership since this cluster was installed, and the last node to initiate a membership change was node 4.

 

The information between the curly brackets { } indicates the membership. In this case, it is a five-node cluster, and all nodes are up. If no drives had ever been replaced, it would simply show 1-5:0-11. As it is, two drives have been replaced in node 4, and so the drive numbers are nonsequential, because drive numbers are never reused.

 

To be able to do anything on the cluster, a strict majority of the nodes must be up and able to communicate with each other; this majority is known as the quorum. Even when quorum is maintained, there can be limitations on reading and writing, depending on how many nodes are down.

 

You can see a history of nodes (up/down) and drive stalls by searching for "group change" in the /var/log/messages file.
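
For example, the following command prints the group change entries recorded on the local node; the exact wording of the log messages can vary between OneFS releases:

grep -i "group change" /var/log/messages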

 

For more information about group changes, see Understanding OneFS Group Changes on the EMC Online Support site. For more information about quorum, see Isilon OneFS, Cluster Quorum, and Data Availability.

 

The divide between software and hardware

Although EMC Isilon hardware leverages industry-standard components, there is no support for using hardware that is not sourced from EMC Isilon. Customers can replace failed hard drives with replacement drives from EMC Isilon. When doing so, be very careful to ensure that the FlexProtect Job Engine job has completed and that the drive is marked as gone before you remove and replace it. EMC Isilon Support has seen far too many cases of potential data loss due to user error here.
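
Before you physically pull a drive, it is worth confirming its state from the command line first. The commands below are a sketch of that check: one lists the drives and their states as the node sees them, and the other shows whether FlexProtect is still running. Exact command syntax and output vary between OneFS releases, so treat these as illustrative rather than exact:

isi devices
isi job status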

Understanding the OneFS file system

Although the file system is the core of OneFS and is extremely complex below the surface, it is not directly responsible for many issues. Nevertheless, there are a number of basic concepts to keep in mind.

Managing the cluster reserve

First of all, space management is crucial. The point of view that "I bought 96 drives, and I want to use them all to 100 percent" is not realistic in an unconstrained environment where files of varying sizes can be created and deleted at will. The initial risk of running too full is performance degradation. The much more serious risk is filling the cluster to the point that it is unusable. This is even more critical if your cluster has multiple pools: if one of the pools is 98 percent full, it doesn't matter that the cluster is only 60 percent full overall. Don't ignore the alerts!
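
To keep an eye on capacity before it becomes a problem, you can check usage from the command line as well as in the web administration interface. For example, the following command summarizes cluster health and how full each node is; the level of per-pool detail in the output varies by OneFS version:

isi status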

Setting the protection level

Most clusters will be at +2:1, and in OneFS 7.x, EMC took important steps to ensure that no clusters are set at a dangerously inadequate protection level. Nevertheless, the protection level is configurable. Please do not set protection on any cluster to "+1" without first contacting EMC Online Support and understanding the ramifications.
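
If you are unsure what protection a given file or directory actually carries, you can inspect it directly. As an example, the following command reports the protection settings for a path; the path shown is only a placeholder:

isi get /ifs/data/somefile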

Managing snapshots

OneFS supports path-based snapshots that can be taken on an automated schedule. Generally, this can be done with no issues. However, caution is absolutely necessary: if you create a vast number of schedules with overlapping delete times, it is possible to create a configuration where the SnapshotDelete Job Engine job cannot keep up with the creation rate. Check your job history and check your snapshot count.
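
A quick way to keep track is to count the snapshots currently on the cluster. The command below is illustrative; the snapshot subcommands were reorganized across OneFS releases, so the exact syntax on your cluster may differ:

isi snapshot list | wc -l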

Understanding locking and hangdumps

As part of the clustered design of OneFS, there is a cluster-coherent lock manager that the file system uses to serialize operations and maintain consistency.

 

The lock manager implements multiple domains that provide locking services for different components of the file system, such as the LIN domain, the advlock domain (file locking), and the mirrored data structure (MDS) domain.

 

Each lock domain is monitored, and if a thread waits for a lock for longer than a domain-specific timeout, a hangdump is triggered to collect a large amount of diagnostic information in case there is an underlying issue.

 

A hangdump is not necessarily a serious problem. There are certain normal operations, especially on very large files, that have the potential to trigger a hangdump with no long-term ill effects. However, there are also situations where the waiter never gets the lock on the file and users are then impacted.

 

Isilon has tools to analyze the hangdump files and graph out the lock interactions, which are often complex. The hangdump files within /var/crash are compressed text files that can be examined. They are per-node and include a full dump of the lock state as seen by the local node, a dump of every stack of every thread in the system, and various other diagnostics—for example, memory usage.
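
If you want to take a first look yourself before engaging support, you can list the hangdump files and page through one of them. The file name below is only a placeholder; substitute the name of an actual file on your node, and adjust the decompression command to match its format:

ls -lh /var/crash
zcat /var/crash/hangdump_example.gz | less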

 
