Reducing the Risk of Data Loss Due to an Improper Cluster Shutdown

NOTE: This topic is part of the Uptime Information Hub.

 

Sophisticated electronics generally have shutdown procedures. For example, when we want to shut down a laptop, we select a shutdown option from a menu. We avoid pressing and holding the power button until the computer abruptly turns off, because the risk of something going wrong increases when the computer is shut down improperly.

 

This is also true for your EMC Isilon cluster. Improperly shutting down your cluster leads to serious risks. The most severe risk is data loss, which increases significantly if the cluster is improperly shut down and is without power for longer than the life of a node’s non-volatile RAM (NVRAM) battery. This article highlights how to minimize this risk by confirming that:

 

  • Data stored in the node’s NVRAM is saved to disk
  • The power indicator light-emitting diode (LED) on the back of the node is off

 

To review the complete shutdown procedure, see the knowledge base (KB) article, “OneFS 6.5 and later: How to safely shut down an Isilon cluster prior to a scheduled power outage (16529),” which is available on the EMC Online Support site (login is required).

 

Saving Data from Journal to Disk

Isilon nodes use NVRAM to protect data in the node’s journal during a power outage. When a client sends writes to a file over a node’s external network or from the backend InfiniBand network, these writes are first stored in the journal, which is stored in NVRAM. The NVRAM is battery-protected. Without a power source, the battery can last 3 or 5 days depending on the type of node.

 

During normal operation, data is moved from the node’s journal to disk for permanent storage. During a proper cluster shutdown, the data in the node’s journal is moved, or “flushed,” to the file system on disk before the node is powered off. When there is an unexpected power outage, or the node is improperly powered off while a write to a file is occurring, the requested data writes are protected within the NVRAM. When the node is disconnected from a power supply, the NVRAM batteries begin to drain.

 

The risk of data loss significantly increases if the NVRAM batteries are allowed to completely drain. If data is backed up, recovering it can require a lengthy procedure or a complete cluster rebuild. If data is not backed up, it can be lost.
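
The battery arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not an Isilon tool: the function and constant names are hypothetical, and the only figures used are the 3-day and 5-day battery lifetimes quoted in this article. Because the article does not say which node type has which lifetime, the sketch assumes the shorter value when no lifetime is supplied.

```python
from datetime import timedelta
from typing import Optional

# Battery lifetimes quoted in this article. Which node type has which
# lifetime is not stated here, so these are listed without node names.
BATTERY_LIFE_OPTIONS = (timedelta(days=3), timedelta(days=5))


def outage_exceeds_battery(outage: timedelta,
                           battery_life: Optional[timedelta] = None) -> bool:
    """Return True if the outage lasts longer than the NVRAM battery.

    Hypothetical helper: when battery_life is not given, the shorter
    of the quoted lifetimes is assumed, which is the cautious choice.
    """
    if battery_life is None:
        battery_life = min(BATTERY_LIFE_OPTIONS)
    return outage > battery_life


if __name__ == "__main__":
    planned = timedelta(days=4)
    if outage_exceeds_battery(planned):
        print("Outage may outlast the NVRAM battery; flush the journal first.")
    else:
        print("Outage is within the assumed battery lifetime.")
```

A result of True does not mean data will be lost. It only means that an unflushed journal could not rely on its battery for the full outage, which is the situation the proper shutdown procedure is meant to avoid.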

 

The procedure for safely shutting down your cluster before a scheduled power outage ensures that data is flushed from the journal to permanent storage on disk, so you don’t have to rely on the journal battery. For a complete written description of how to properly save data in node journals during a cluster shutdown, see KB 16529.

 

Node Power Indicator LED

[Image: node power indicator LED (image015.jpg)]

Sometimes the unexpected happens. A hardware or software issue occurs while you’re shutting down your cluster, preventing the node from powering off. Or too many writes need to be completed before the node can be shut down, and the deadline for the scheduled power outage at your data center is quickly approaching. If a node is shut down improperly after running the shutdown command, data could still be stored in NVRAM, and disconnecting power from this node for longer than 3 to 5 days increases the risk of data loss.

One of the most important steps to follow during a cluster shutdown is to look at the power indicator (ON/OFF) LED on the back of the node to confirm that the node is off before disconnecting power.

 

When the node is on, the power indicator LED is illuminated green, as shown in the image to the right. After shutting down the node—through a serial console, the OneFS command-line interface, or the OneFS web administration interface—the power indicator LED is not illuminated. If you followed the shutdown procedure described in KB 16529 and a node’s LED is still lit green, contact Isilon Technical Support.
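
The LED check above amounts to a simple rule: disconnect power only when every node’s power indicator LED has been observed to be off. The following Python sketch records that rule for a checklist. It is an illustration under stated assumptions, not part of OneFS: the function names and node labels are hypothetical, and the LED states are values a person enters after looking at the back of each node.

```python
from typing import Dict, List


def nodes_still_lit(led_lit_by_node: Dict[str, bool]) -> List[str]:
    """Return the (hypothetical) node labels whose LED is still lit green."""
    return sorted(name for name, lit in led_lit_by_node.items() if lit)


def safe_to_disconnect(led_lit_by_node: Dict[str, bool]) -> bool:
    """True only if at least one node was checked and no LED is lit."""
    return bool(led_lit_by_node) and not nodes_still_lit(led_lit_by_node)


if __name__ == "__main__":
    # Manually recorded observations; True means the LED is lit green.
    observed = {"node-1": False, "node-2": True, "node-3": False}
    if safe_to_disconnect(observed):
        print("All LEDs are off; power can be disconnected.")
    else:
        print("Do not disconnect power. LED still lit on:",
              ", ".join(nodes_still_lit(observed)))
```

An empty set of observations is treated as unsafe, so that skipping the check can never be mistaken for passing it.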

 

How to Safely Shut Down an EMC Isilon Cluster

The shutdown procedure described in KB 16529 is divided into five phases:

  • Phase 1: Perform preventative maintenance (4-8 weeks before a scheduled shutdown)
  • Phase 2: Shut down each node in the cluster
  • Phase 3: Verify that the nodes have successfully shut down
  • Phase 4: Disconnect the power source
  • Phase 5: Power on each node in your cluster
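
The ordering of these phases can be expressed as a small Python sketch. It is only a model of the list above, with hypothetical names, and it does not run any part of the procedure in KB 16529. It shows one way to make sure a phase, such as disconnecting power, is not started before the earlier phases are marked complete.

```python
from enum import IntEnum
from typing import Optional, Set


class Phase(IntEnum):
    """The five phases listed above, numbered as in this article."""
    PREVENTATIVE_MAINTENANCE = 1
    SHUT_DOWN_NODES = 2
    VERIFY_SHUTDOWN = 3
    DISCONNECT_POWER = 4
    POWER_ON_NODES = 5


def can_start(phase: Phase, completed: Set[Phase]) -> bool:
    """A phase may start only when every earlier phase is complete."""
    return all(p in completed for p in Phase if p < phase)


def next_phase(completed: Set[Phase]) -> Optional[Phase]:
    """Return the first phase not yet completed, or None when all are done."""
    for phase in Phase:
        if phase not in completed:
            return phase
    return None


if __name__ == "__main__":
    done = {Phase.PREVENTATIVE_MAINTENANCE, Phase.SHUT_DOWN_NODES}
    print("Next phase:", next_phase(done).name)
    print("May disconnect power:", can_start(Phase.DISCONNECT_POWER, done))
```

With only the first two phases marked complete, the sketch reports that verification comes next and that power must not yet be disconnected.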

 

Each phase includes steps for ensuring that this process goes smoothly. The procedure is designed to enable you to respond to issues or make adjustments as needed.

 

For example, Phase 1 can help you identify latent hardware or firmware issues that would otherwise go undetected until you shut down your cluster. These types of issues can lead to complications that force an improper shutdown of your cluster. Phase 2 contains steps to help ensure that data writes are not occurring before the shutdown procedure and that data is properly flushed from the journal to disk. This phase also provides steps for two shutdown methods: shutting down each node in your cluster sequentially or shutting them down simultaneously. Isilon recommends attaching a serial console to each node to monitor the shutdown procedure and verify that no issues cause the node to shut down improperly.

 

If you have concerns about data loss, or need assistance with the proper cluster shutdown procedure at any time, contact Isilon Technical Support.