The demand for storage continues to grow exponentially, and all predictions suggest it will keep expanding at a very aggressive rate for the foreseeable future. Effectively protecting a file system in the multi-petabyte range requires extensive use of multiple data availability and data protection technologies.


In tandem with this trend, the demand for ways to protect and manage that storage also increases. Several strategies for data protection are available and in use today. If data protection is viewed as a continuum, high availability lies at its beginning: without high availability technologies such as drive, network, and power redundancy, data loss and its subsequent recovery would be considerably more prevalent.


Technologies like replication, synchronization and snapshots, in addition to traditional NDMP-based backup, are mainstream and established within the data protection realm. Snapshots offer rapid, user-driven restores without the need for administrative assistance, while synchronization and replication provide valuable tools for business continuance and offsite disaster recovery.


Some of these methods are biased towards cost efficiency but have a higher risk associated with them, and others represent a higher cost but also offer an increased level of protection. Two ways to measure cost versus risk from a data protection point of view are:


  • Recovery Time Objective (RTO): RTO is the allotted amount of time within a Service Level Agreement (SLA) to recover data.


For example, an RTO of four hours means data must be restored and made available within four hours of an outage.


  • Recovery Point Objective (RPO): RPO is the acceptable amount of data loss that can be tolerated per an SLA.


For example, with an RPO of 30 minutes, no more than 30 minutes may elapse since the last backup or snapshot was taken.


The following chart illustrates how the core components of the Isilon data protection portfolio align with the notion of an availability and protection continuum and associated recovery objectives.


[Figure ndmp_1.png: the Isilon data protection portfolio mapped against the availability and protection continuum]


OneFS’ NDMP solution, at the high end of the recovery objective continuum, receives a number of feature and functionality enhancements in OneFS 8.2. These include:


  • NDMP Redirector and Throttler: CPU usage management for NDMP backup and restore operations.

  • ComboCopy for CloudPools: More options for backing up CloudPools files.

  • Fibre Channel/Ethernet controller: 2-way NDMP solution for Gen 6 Isilon nodes.

NDMP is an open-standard protocol that provides interoperability with leading data-backup products and Isilon supports both NDMP versions 3 and 4. OneFS also provides support for both direct NDMP (referred to as 2-way NDMP), and remote NDMP (referred to as 3-way NDMP) topologies.
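As context for both topologies, basic NDMP service configuration on the cluster is also performed from the CLI. The following is a minimal sketch, assuming OneFS 8.2's 'isi ndmp' command namespace: the --service flag is assumed to correspond to the 'Service' field of the global settings output shown later in this article, 'ndmpadmin' is a hypothetical account name the DMA would use to authenticate to the cluster, and exact flags may vary by release:

# isi ndmp settings global modify --service true

# isi ndmp users create ndmpadmin --password <password>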


In the remote, 3-way NDMP scenario, there are no Fibre Channel connections on the Isilon cluster itself. Instead, the DMA uses NDMP over the LAN to instruct the cluster to start backing up data to the tape server, which is either connected via Ethernet or directly attached to the DMA host. In this model, the DMA also acts as the Backup/Media Server.


[Figure ndmp_2.png: remote, 3-way NDMP topology]


During the backup, file history is transferred from the cluster via NDMP over the LAN to the backup server, where it is maintained in a catalog. In some cases, the backup application and the tape server software both reside on the same physical machine.


Direct, 2-way NDMP is typically the more efficient of the two models and results in the fastest transfer rates. Here, the data management application (DMA) uses NDMP over the Ethernet front-end network to communicate with the Isilon cluster.


[Figure ndmp_3.png: direct, 2-way NDMP topology]


On instruction, the cluster, which is also the NDMP tape server, begins backing up data to one or more tape devices which are attached to it via Fibre Channel.


The DMA, a separate server, controls the tape library’s media management. File History, the information about files and directories, is transferred from the cluster via NDMP to the DMA, where it is maintained in a catalog.

Prior to OneFS 8.2, 2-way NDMP typically involved running NDMP sessions on dedicated Backup Accelerator (BA) nodes within a cluster. However, BA nodes required that the cluster use an InfiniBand, rather than Ethernet, backend.


OneFS 8.2 now enables a Fibre Channel HBA (host bus adapter) to be installed in Isilon Gen 6 storage nodes with an Ethernet backend to support 2-way NDMP to tape devices and VTLs. The FC card itself is a hybrid 4-port (2 x 10GbE and 2 x 8Gb FC) HBA. These HBAs are installed in a paired configuration, meaning a Gen 6 chassis will contain either two or four cards. Once installed, a hybrid card replaces a node's front-end Ethernet NIC, and is supported on new Gen 6 hardware as well as on legacy Gen 6 nodes via an upgrade process.


There are several tools in OneFS to facilitate information gathering and troubleshooting of the hybrid HBAs:


  • camcontrol: Provides general system status.

  • mt: Standard UNIX command for controlling tape devices.

  • chio: Standard UNIX command for controlling media changers.

  • sysctl dev.ocs_fc: Used to gather configuration information.

For example, the following syntax can be used to quickly verify whether tape devices are present on a cluster:


# camcontrol devlist
<STK L180 0306>                    at scbus1 target 5 lun 0 (pass23,ch0)
<IBM ULTRIUM-TD5 8711>             at scbus1 target 5 lun 1 (sa0,pass24)
<IBM ULTRIUM-TD5 8711>             at scbus1 target 5 lun 2 (sa1,pass25)
<IBM ULTRIUM-TD5 8711>             at scbus1 target 5 lun 3 (sa2,pass26)
<IBM ULTRIUM-TD5 8711>             at scbus1 target 5 lun 4 (sa3,pass27)
<IBM ULTRIUM-TD5 8711>             at scbus1 target 5 lun 5 (sa4,pass28)
<IBM ULTRIUM-TD5 8711>             at scbus1 target 5 lun 6 (sa5,pass29)
<IBM ULTRIUM-TD5 8711>             at scbus1 target 5 lun 7 (sa6,pass30)
<IBM ULTRIUM-TD5 8711>             at scbus1 target 5 lun 8 (sa7,pass31)
<IBM ULTRIUM-TD5 8711>             at scbus1 target 5 lun 9 (sa8,pass32)
<IBM ULTRIUM-TD5 8711>             at scbus1 target 5 lun a (sa9,pass33)
<STK L180 0306>                    at scbus1 target 6 lun 4 (pass34,ch1)
<IBM ULTRIUM-TD3 8711>             at scbus1 target 6 lun 5 (sa10,pass35)
<IBM ULTRIUM-TD3 8711>             at scbus1 target 6 lun 6 (sa11,pass36)
<IBM ULTRIUM-TD3 8711>             at scbus1 target 6 lun 7 (sa12,pass37)
<IBM ULTRIUM-TD3 8711>             at scbus1 target 6 lun 8 (sa13,pass38)
<IBM ULTRIUM-TD3 8711>             at scbus1 target 6 lun 9 (sa14,pass39)
<IBM ULTRIUM-TD3 8711>             at scbus1 target 6 lun a (sa15,pass40)
<IBM ULTRIUM-TD3 8711>             at scbus1 target 6 lun b (sa16,pass41)
<IBM ULTRIUM-TD3 8711>             at scbus1 target 6 lun c (sa17,pass42)
<IBM ULTRIUM-TD3 8711>             at scbus1 target 6 lun d (sa18,pass43)
<IBM ULTRIUM-TD3 8711>             at scbus1 target 6 lun e (sa19,pass44)
<HP Ultrium 5-SCSI I6RZ>           at scbus1 target 12 lun 0 (sa20,pass45)
<HP Ultrium 5-SCSI I6RZ>           at scbus1 target 13 lun 0 (sa21,pass46)
<HP Ultrium 5-SCSI I6RZ>           at scbus1 target 14 lun 0 (sa22,pass47)
<HP Ultrium 5-SCSI I6RZ>           at scbus1 target 15 lun 0 (sa23,pass48)
<QUANTUM Scalar i500 681G>         at scbus1 target 15 lun 1 (pass49,ch2)

In this case, the presence of ‘sa’ (tape drive) devices in the output confirms that tapes are present, while the ‘ch’ entries indicate the media changers.
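Beyond device discovery, the mt and chio utilities from the table above can interrogate individual devices, and the dev.ocs_fc sysctl tree exposes the hybrid HBA's configuration. For example, the following standard FreeBSD syntax could be used, assuming the device numbering from the output above, where /dev/nsa0 is the non-rewind node of tape drive sa0 and /dev/ch0 is the first media changer:

# mt -f /dev/nsa0 status

# chio -f /dev/ch0 status

# sysctl dev.ocs_fc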


Also introduced in OneFS 8.2 is an NDMP Redirector for 2-way Backup, which automatically redistributes 2-Way NDMP backup or restore operations.


Each NDMP session runs with multiple threads of execution involving data movement. When several such NDMP sessions run on a single node, resource constraints come into play, which can slow down the NDMP processes or other processes running on the same node. The NDMP architecture does not provide a way to fan sessions out to multiple nodes because of DMA and tape infrastructure constraints. Furthermore, with the introduction of the Gen 6 'Infinity' platform, dedicated BA nodes are no longer an option, which leaves NDMP operations running on storage nodes with limited memory.


Running multiple NDMP operations on a single node can therefore cause performance issues. The redirector feature allows an NDMP operation to be redirected to a different node in the cluster, which reduces memory and CPU contention on the node where the session was initiated and results in better overall load balancing and performance.


By having NDMP workloads redirected to different nodes, OneFS avoids the high memory utilization that NDMP workflows could previously impose on a single node, such as the Backup Accelerator. The DMA can also view the cluster as a whole, and communicate with individual nodes to initiate a backup job with the desired load balancing criteria.
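To see how sessions are actually being distributed, the active NDMP sessions across the cluster can be listed from the CLI. This sketch assumes the OneFS 'isi ndmp sessions' command set; the output columns vary by release:

# isi ndmp sessions list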


When a DMA initiates a backup session, it communicates with the data server and tape server via a series of messages sent through the cluster. In order for local backup sessions to have their data server or tape server migrate to other nodes, the following four aspects are required:


  • Resource discovery: Nodes with available tape devices in the cluster need to be discovered.

  • Resource allocation: Resources are assigned to operations dynamically, based on resource location and system load.

  • NDMP session load: Statistics on active NDMP sessions and the load on each node are required as load balancing criteria.

  • Agent: Runs on a node to redirect the data server or tape server to the appropriate nodes.

The operations above involve intercepting a handful of protocol messages and modifying their internal functional responses, and are built on top of the existing NDMP implementation.


Note that the NDMP redirector is only configurable from the CLI, via the following syntax:


# isi ndmp settings global modify --enable-redirector true
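Once enabled, the change can be verified in the global NDMP settings. For example, the following (assuming the standard shell grep available on OneFS) filters the view output down to the redirector field:

# isi ndmp settings global view | grep -i redirector
              Enable Redirector: True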


Also included in OneFS 8.2 is an NDMP Throttler, which manages the CPU usage of NDMP backup and restore operations. This feature is designed specifically for Gen 6 hardware, which allows local NDMP operations on storage nodes. It operates by limiting NDMP's CPU usage so that it does not overwhelm the nodes and impact client I/O or other cluster services. Like the redirector, the NDMP throttler is only configurable from the CLI. For example:


# isi ndmp settings global modify --enable-throttler true

# isi ndmp settings global view
                        Service: False
                           Port: 10000
                            DMA: generic
           Bre Max Num Contexts: 64
 MSB Context Retention Duration: 300
 MSR Context Retention Duration: 600
         Stub File Open Timeout: 10
              Enable Redirector: False
               Enable Throttler: False
        Throttler CPU Threshold: 50

In addition, CPU usage is controlled by a CPU threshold value. The default value is ‘50’, which means NDMP sessions should use less than 50% of the CPU resources on each node. This threshold value is configurable, for example:


# isi ndmp settings global modify --throttler-cpu-threshold 80
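Again, the new values can be confirmed from the global settings view. For example, assuming the throttler has been enabled and the threshold change above applied:

# isi ndmp settings global view | grep -i throttler
               Enable Throttler: True
        Throttler CPU Threshold: 80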


The throttler influences both 2-Way and 3-Way data server operations and its settings are global to all the nodes in the cluster.


Finally, OneFS 8.2 also introduces NDMP Combo Copy for CloudPools. Stub files can contain sparse data, and CloudPools maintains a map cataloging each stub file's non-sparse regions. Recalling a stub file from CloudPools can recover the file's sparseness. However, during a deep copy backup, NDMP does not recognize the sparse map and treats all CloudPools stub files as fully populated files. This means all stub files are expanded to their full size, with sparse regions filled with zeros, which both prolongs the backup and enlarges the backup stream. In addition, the sparseness of the original files cannot be restored during a recovery. To address this, NDMP Combo Copy maintains the sparseness of CloudPools sparse files during a deep copy backup.

Here are the three CloudPools copy options available in OneFS 8.2:


  • Deep Copy (backup, value 0x100): Backs up files as regular, unarchived files; files can only be restored as regular files.

  • Shallow Copy (backup, value 0x200): Backs up files as SmartLink files without file data; files can only be restored as SmartLink files.

  • Combo Copy (backup, value 0x400): Backs up files as SmartLink files with file data; files can be restored as regular files or SmartLink files.

  • Deep Copy (restore, value 0x100): Restores files as regular files.

  • Shallow Copy (restore, value 0x200): Restores files as SmartLink files.

Note that both DeepCopy and ComboCopy backups recall file data from the cloud, but the data is used purely for backup purposes and is not stored on disk. Be aware, however, that recalling file data may incur charges from cloud vendors.


Under the hood, CloudPools divides file data into 20MB chunks, each described by a CDO (cloud data object) that maps the chunk to cloud objects. Each CDO has a bitmap representing the non-sparse regions of its data chunk. NDMP can read the CDOs of a stub file to construct a sparse map for the file, and then back up just the non-sparse regions. On tape, the file looks like a regular sparse file, and can therefore be restored appropriately during a recovery.

 

Configuration is via the CLI, using the following syntax:

# isi ndmp settings variables create <backup path> <variable name> <value>


The <backup path> argument specifies a particular backup root, and is the same as the value of the FILESYSTEM environment variable during backup.


The special /BACKUP and /RESTORE paths apply globally to all backups and restores, respectively.


For example, the following CLI syntax configures NDMP for a combo copy backup:


# isi ndmp settings variables create /BACKUP BACKUP_OPTIONS 0x400


And for a deep copy restore:


# isi ndmp settings variables create /RESTORE RESTORE_OPTIONS 0x100
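The variables configured above can then be reviewed from the CLI. This assumes a 'list' action exists alongside 'create' in the 'isi ndmp settings variables' namespace; output formatting may vary by release:

# isi ndmp settings variables list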


Also included in OneFS 8.2 is an NDMP version check, which prevents the recovery of incompatible SmartLink files. An NDMP backup automatically includes information about the features activated by a SmartLinked file. Similarly, an NDMP restore validates the features required by a SmartLinked file and skips the file if the target cluster does not support those features.


For example, CloudPools provides both AWS S3 v2 and v4 authentication support. If v4 authentication is required, then a SmartLinked file cannot be recovered to a cluster which only supports v2 authentication. Similarly, if SmartLinked files are backed up as stubs on a cluster running OneFS 8.2, which debuts CloudPools v2.0, they cannot be restored to a cluster running an earlier version of OneFS and CloudPools v1.0.


Note that there is no configuration option for disabling version checking.