

OneFS & Cluster Quorum

Posted by trimbn Jan 31, 2018

There have been several inquiries from the field of late around node unavailability conditions under which OneFS will place an entire cluster, or an individual node pool, into read-only mode.


If an Isilon cluster does not have quorum, it will automatically be marked as read-only. In this state, it will not accept write requests from any protocol, regardless of any particular node pool membership issues.


However, if a cluster has quorum but a given node pool ends up with only two nodes online due to the unavailability of a node, that node pool will still be writable. (Note that a node pool cannot be provisioned with only two nodes; it must start out with a minimum of three.) Furthermore, if this degraded node pool drops to just one node online, it will no longer accept writes, although you will still be able to read from it.


Depending on the protection policy that was used to write the data, it’s not necessarily the case that all data stored on that node pool will be readable with only one node available. If, for example, the node pool originally comprised three nodes and the protection policy was set to +2n, everything will be readable. But, if the node pool contained more than three nodes originally and only one is up, there will be some data which is unavailable.


In order for OneFS to properly function and accept data writes, a quorum of nodes must be active and responding. As such, quorum is defined as a simple majority: a cluster with N nodes must have ⌊N/2⌋ + 1 nodes online in order to allow writes.
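For example, a 10-node cluster needs six nodes online to accept writes, and a 3-node cluster or node pool needs two. The threshold is just integer arithmetic and can be sanity-checked from any shell (this is plain shell math, not an Isilon command):

# echo $(( 10 / 2 + 1 ))
6
# echo $(( 3 / 2 + 1 ))
2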


This same quorum requirement is also true for individual node pools within a heterogeneous cluster, such that a minimum of three nodes of a specific hardware config are needed in order to create a new node pool.


Isilon clustering is based on the CAP theorem, which states that it is impossible for a distributed system to simultaneously provide more than two out of the following three guarantees:


  • Consistency
  • Availability
  • Partition tolerance


OneFS does not compromise on Consistency and Availability, and uses a simple quorum to prevent partitioning, or ‘split-brain’ conditions that can be introduced if the cluster should temporarily divide into two clusters. The quorum rule guarantees that, regardless of how many nodes fail or come back online, if a write takes place, it can be made consistent with any previous writes that have ever taken place.


So, quorum also dictates the minimum number of nodes required in order to support a given data protection level. For an erasure-code (FEC) based protection level of N+M, the cluster must contain at least 2M+1 nodes. For example, a minimum of seven nodes is required for a +3n protection level. This allows for a simultaneous loss of three nodes while still maintaining a quorum of four nodes, so the cluster remains fully operational.


If a cluster does drop below quorum, the file system will automatically be placed into a protected, read-only state, denying writes but still allowing read access to the available data. In instances where a protection level is set too high for OneFS to achieve with FEC, the default behavior is to protect the data using mirroring instead. Obviously, this has a negative impact on space utilization.


Since OneFS does not compromise on consistency, a mechanism is required to manage a cluster’s transient state and quorum. As such, the primary role of the OneFS Group Management Protocol (GMP) is to help create and maintain a group of synchronized nodes.

 

A group is a given set of nodes which have synchronized state, and a cluster may form multiple groups as connection state changes. Quorum is a property of the GMP group, which helps enforce consistency across node disconnects and other transient events. Having a consistent view of the cluster state is crucial, since initiators need to know which nodes and drives are available to write to, etc.

 

GMP distributes a variety of state information about nodes and drives, from identifiers to usage statistics. The most fundamental of these is the composition of the cluster, or ‘static aspect’ of the group, which is stored in the array.xml file. This file also includes info such as the ID, GUID, and whether the node is diskless or storage, plus attributes not considered part of the static aspect, such as internal IP addresses, etc.

 

Looking at this a layer deeper, since nodes and drives in OneFS may be readable, but not writable, OneFS actually has two quorum properties:

 

  • Read quorum
  • Write quorum

 

Read quorum is indicated by the sysctl ‘efs.gmp.has_quorum’ having a value of ‘1’; similarly, write quorum is indicated by ‘efs.gmp.has_super_block_quorum’ having a value of ‘1’.
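For example, on a healthy cluster both values should read ‘1’. The output below is illustrative only; run the commands from any node’s shell:

# sysctl efs.gmp.has_quorum
efs.gmp.has_quorum: 1
# sysctl efs.gmp.has_super_block_quorum
efs.gmp.has_super_block_quorum: 1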

 

Bear in mind that nodes which are not part of the quorum group may form one or more groups of their own. In OneFS lexicon, the group of nodes with quorum is referred to as the majority side. Conversely, any group without quorum is referred to as a minority side. By definition, there can only be one majority group, but there may be multiple minority groups. A group which has any components in a down or unavailable state is referred to as ‘degraded’.

 

File system operations typically query the GMP group several times before completing. The group may change over the course of the operation, but the operation needs a consistent view. Many functions check for an appropriate group state, which would be impossible if the group state could change between a predicate and an action. This consistent view is provided by the group info, which is the primary interface that modules use to query group state.

 

Under normal operating conditions, every node and its requisite disks are part of the current group, and the group’s status can be viewed by running either ‘sysctl efs.gmp.group’ or ‘isi readonly list’ on any node of the cluster.

 

For example, on a degraded three-node X210 cluster with node 3 down:

# sysctl efs.gmp.group

efs.gmp.group: <2,288>: { 1-2:0-11, down: 3, smb: 1-2, nfs: 1-2, hdfs: 1-2, swift: 1-2, all_enabled_protocols: 1-2 }

 

The above group information comprises three main parts:

 

  • Sequence number: Provides identification for the group (i.e. ‘<2,288>’)
  • Membership list: Describes the group (i.e. ‘1-2:0-11, down: 3’)
  • Protocol list: Shows which nodes are supporting which protocol services (i.e. ‘smb: 1-2, nfs: 1-2, hdfs: 1-2, swift: 1-2, all_enabled_protocols: 1-2’)

 

If even more detail is desired, the ‘sysctl efs.gmp.current_info’ command will report the current GMP information, plus several other pieces of data which need a consistent view across the cluster.

 

Nodes in the process of SmartFailing are listed both separately and in the regular group. For example, node 2 in the following:

{ 1-3:0-23, soft_failed: 2 }

 

However, when the FlexProtect job completes, the node will be removed from the group. A SmartFailed node that’s also unavailable will be noted as both down and soft_failed. For example:

 

{ 1-3:0-23, 5:0-17,19-24, down: 4, soft_failed: 4 …}

 

Similarly, when a node is offline, the other nodes in the cluster will show that node as down:

 

{ 1-2:0-23, 4:0-23, down: 3 }

 

Note that no disks for that node are listed, and that it doesn't show up in the group.

 

If the node is split from the cluster—that is, if it is online but not able to contact other nodes on its back-end network—that node will see the rest of the cluster as down. Its group might look something like this instead:

 

{ 6:0-11, down: 3-5,8-9,12 }

 

Like nodes, drives may be SmartFailed and down, or SmartFailed but available. The group statement looks similar to that for a SmartFailed or down node, only the drive Lnum (logical drive number) is also included. For example:

 

{ 1-4:0-23, 5:0-6,8-23, 6:0-17,19-24, down: 5:7, soft_failed: 5:7 }

 

This indicates that node id 5 drive Lnum 7 is both SmartFailed and unavailable. If the drive was SmartFailed but still available, the group would read:

 

{ 1-4:0-23, 5:0-6,8-23, 6:0-17,19-24, soft_failed: 5:7 }

 

When multiple devices are down, consolidated group statements can be tricky to read. For example, if node 1 and drive 4 of node 3 were both down and SmartFailed, the group statement would read:

 

{ 2:0-11, 3:0-3,5-11, 4-5:0-11, down: 1, 3:4, soft_failed: 1, 3:4 }

 

As mentioned in the previous GMP blog article, OneFS has a read-only mode. Nodes in a read-only state are clearly marked as such in the group:

 

{ 1-6:0-8, soft_failed: 2, read_only: 3 }

 

Node 3 is listed both as a regular group member and called out separately at the end, because it’s still active. It’s worth noting that "read-only" indicates that OneFS will not write to the disks in that node. However, incoming connections to that node are still able to write to other nodes in the cluster.

 

Non-responsive, or dead, nodes appear in groups when a node has been permanently removed from the cluster without SmartFailing the node. For example, node 11 in the following:

 

{ 1-5:0-11, 6:0-7,9-12, 7-10,12-14:0-11, 15:0-10,12, 16-17:0-11, dead: 11 }

 

Drives in a dead state include a drive number as well as a node number. For example:

{ 1:0-11, 2:0-9,11, 3:0-11, 4:0-11, 5:0-11, 6:0-11, dead: 2:10 }

 

In the event of a dead node, the recommended course of action is to immediately start a FlexProtect and contact Isilon Support.
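As a minimal sketch, assuming OneFS 8.x Job Engine CLI syntax (the same ‘isi job jobs’ commands used elsewhere in this blog), FlexProtect can be started manually and then monitored as follows:

# isi job jobs start flexprotect
# isi job jobs list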

OneFS 8.1.1.0 natively integrates support for Cloudera Navigator into the base operating system, with no patch required to enable the Navigator features. https://support.emc.com/docu87518_Isilon-OneFS-8.1.1-Release-Notes.pdf?language=en_US

 

For additional information on Isilon and Navigator integration see the following post: https://community.emc.com/community/products/isilon/blog/2017/10/02/onefs-and-cloudera-navigator-support

This post reviews the steps for enabling Navigator integration with OneFS 8.1.1.0 and CDH 5.13 or later. The OneFS 8.1.1.0 upgrade must be in a committed state before the FSImage and INotify functions can be enabled.

 

 

 

1. If Cloudera Navigator is not enabled within Cloudera Manager, install the components, but do not start the Navigator Metadata Server (if it is already installed, stop the Metadata Server).

 

[Screenshot: Cloudera Manager with the Navigator components installed and the Metadata Server stopped]

 

 

2. Enable FSImage and INotify on the CDH access zone. OneFS 8.1.1.0 adds these settings to the WebUI.

 

[Screenshot: OneFS WebUI HDFS settings showing the FSImage and INotify toggles for the access zone]

 

 

or

 

isilon01-1# isi hdfs inotify settings modify --enabled=true --zone=zone2-cdh --verbose

Updated HDFS INotify settings:

enabled: False -> True


isilon01-1# isi hdfs fsimage settings modify --enabled=true --zone=zone2-cdh --verbose

Updated HDFS FSImage settings:

enabled: False -> True



isilon01-1# isi hdfs inotify settings view --zone=zone2-cdh

      Enabled: Yes

Maximum Delay: 1m

    Retention: 2D


isilon01-1# isi hdfs fsimage settings view --zone=zone2-cdh

Enabled: Yes

 

 

 

3. Review and Modify the Navigator Configuration.

 

Since the Isilon Service is enabled in Cloudera Manager and no HDFS service is present, Navigator is configured for Isilon integration automatically.

[Screenshot: Cloudera Manager showing the Isilon service and the Navigator configuration]

If the Cloudera cluster is Kerberized, review the following procedure:


We need to modify the Kerberos principal that the Metadata Server uses to connect to Isilon. Since the FSImage file and INotify logs are stored outside of the Hadoop root within OneFS, the principal must map to root on the cluster in order to access these log files in a protected part of the /ifs file system.


The default account of hue (as seen here) will throw the following errors in the logs.

[Screenshot: Navigator Metadata Server configuration showing the default hue principal]

 

In the Isilon hdfs.log:

 

2018-01-17T16:34:23-05:00 <30.7> isilon01-1 hdfs[3299]: [hdfs] Initializing Connection context AuthType: 81, EffectiveUser: hue/centos-06.foo.com@FOO.COM, RealUser: hue/centos-06.foo.com@FOO.COM

2018-01-17T16:34:23-05:00 <30.6> isilon01-1 hdfs[3299]: [hdfs] ImageTransfer: code 403 error Access denied: user=hue/centos-06.foo.com@FOO.COM desired=1179785 available=1048704 path="/onefs_hdfs/ifs/.ifsvar/modules/hdfs_d/fsimage/3/42"

2018-01-17T16:34:24-05:00 <30.6> isilon01-1 hdfs[3299]: [hdfs] ImageTransfer: code 403 error Access denied: user=hue/centos-06.foo.com@FOO.COM desired=1179785 available=1048704 path="/onefs_hdfs/ifs/.ifsvar/modules/hdfs_d/fsimage/3/42"

 

In the Navigator Metadata Server log:

 

4:36:37.194 PM        ERROR        HdfsExtractorShim

[CDHExecutor-0-CDHUrlClassLoader@5aafd51a]: Internal Error while extracting

java.lang.RuntimeException: java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: Authentication failed, URL: http://isilon01-cdh.foo.com:8082/imagetransfer?getimage=1&txid=latest&user.name=hue/centos-06.foo.com@FOO.COM, status: 403, message: Forbidden

at com.cloudera.nav.hdfs.extractor.HdfsImageExtractor.doImport(HdfsImageExtractor.java:101)

at com.cloudera.nav.hdfs.extractor.HdfsExtractorShim$1.run(HdfsExtractorShim.java:296)

at com.cloudera.nav.hdfs.extractor.HdfsExtractorShim$1.run(HdfsExtractorShim.java:293)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)

at com.cloudera.cmf.cdh5client.security.UserGroupInformationImpl.doAs(UserGroupInformationImpl.java:44)

at com.cloudera.nav.hdfs.extractor.HdfsExtractorShim.doImport(HdfsExtractorShim.java:293)

at com.cloudera.nav.hdfs.extractor.HdfsExtractorShim.doExtraction(HdfsExtractorShim.java:248)

at com.cloudera.nav.hdfs.extractor.HdfsExtractorShim.run(HdfsExtractorShim.java:144)

at com.cloudera.cmf.cdhclient.CdhExecutor$RunnableWrapper.call(CdhExecutor.java:221)

at com.cloudera.cmf.cdhclient.CdhExecutor$RunnableWrapper.call(CdhExecutor.java:211)

at com.cloudera.cmf.cdhclient.CdhExecutor$CallableWrapper.doWork(CdhExecutor.java:236)

at com.cloudera.cmf.cdhclient.CdhExecutor$SecurityWrapper$1.run(CdhExecutor.java:189)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)

at com.cloudera.cmf.cdh5client.security.UserGroupInformationImpl.doAs(UserGroupInformationImpl.java:44)

at com.cloudera.cmf.cdhclient.CdhExecutor$SecurityWrapper.doWork(CdhExecutor.java:186)

at com.cloudera.cmf.cdhclient.CdhExecutor$1.call(CdhExecutor.java:125)

at java.util.concurrent.FutureTask.run(FutureTask.java:262)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

Caused by: java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: Authentication failed, URL: http://isilon01-cdh.foo.com:8082/imagetransfer?getimage=1&txid=latest&user.name=hue/centos-06.foo.com@FOO.COM, status: 403, message: Forbidden

at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.doGetUrl(TransferFsImage.java:425)

at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:415)

at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.downloadMostRecentImageToDirectory(TransferFsImage.java:98)

at org.apache.hadoop.hdfs.tools.DFSAdmin$1.run(DFSAdmin.java:856)

at org.apache.hadoop.hdfs.tools.DFSAdmin$1.run(DFSAdmin.java:853)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)

at org.apache.hadoop.security.SecurityUtil.doAsUser(SecurityUtil.java:477)

at org.apache.hadoop.security.SecurityUtil.doAsCurrentUser(SecurityUtil.java:471)

at org.apache.hadoop.hdfs.tools.DFSAdmin.fetchImage(DFSAdmin.java:853)

at com.cloudera.cmf.cdh5client.hdfs.DFSAdminImpl.fetchImage(DFSAdminImpl.java:29)

at com.cloudera.nav.hdfs.extractor.HdfsImageFetcherImpl.fetchImage(HdfsImageFetcherImpl.java:15)

at com.cloudera.nav.hdfs.extractor.HdfsImageExtractor.doImport(HdfsImageExtractor.java:81)

... 23 more

Caused by: org.apache.hadoop.security.authentication.client.AuthenticationException: Authentication failed, URL: http://isilon01-cdh.foo.com:8082/imagetransfer?getimage=1&txid=latest&user.name=hue/centos-06.foo.com@FOO.COM, status: 403, message: Forbidden

at org.apache.hadoop.security.authentication.client.AuthenticatedURL.extractToken(AuthenticatedURL.java:286)

at org.apache.hadoop.security.authentication.client.PseudoAuthenticator.authenticate(PseudoAuthenticator.java:77)

at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:212)

at org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:220)

at org.apache.hadoop.hdfs.web.URLConnectionFactory.openConnection(URLConnectionFactory.java:161)

at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.doGetUrl(TransferFsImage.java:422)

... 36 more

 

 

 

In order to provide access to the fsimage files that reside under /ifs/.ifsvar/modules/hdfs_d/fsimage, we need to modify the Kerberos principal that accesses them. The simplest modification is to use the existing hdfs principal. Since an hdfs => root mapping already exists within the Isilon access zone, modifying the Navigator configuration to use the hdfs principal will provide access and allow Navigator to query the fsimage correctly.
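Before restarting the Metadata Server, the change can be sanity-checked from any Hadoop client by authenticating as the hdfs principal and fetching the image manually; this exercises the same DFSAdmin fetchImage path seen in the stack trace above. The keytab path and principal below are examples only and will differ in your environment:

# kinit -kt /etc/security/keytabs/hdfs.keytab hdfs@FOO.COM
# hdfs dfsadmin -fetchImage /tmp
# ls -l /tmp/fsimage_*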

 

 

[Screenshot: Navigator Metadata Server configuration with the principal changed to hdfs]

 

 

4. Having modified the Navigator principal, start the Navigator Metadata Server. Initially, it will take some time for the different entities to show up, as Navigator catalogs and links the various services.

 

Only the cluster is present initially.

[Screenshot: Navigator showing only the cluster entity]

 

Impala added

[Screenshot: Navigator with Impala added]

 

Hive and Yarn added

[Screenshot: Navigator with Hive and YARN added]

 

The HDFS service (Isilon) is often the last to show up in Navigator, depending on the polling intervals set and how frequently the FSImage and INotify logs are accessed. Once it appears, HDFS file system events will be present. Create some test files and run some test jobs to validate functionality.
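For example, a few basic HDFS operations from any gateway host will generate file system events for Navigator to pick up (the paths here are arbitrary test examples):

# hdfs dfs -mkdir -p /tmp/nav_test
# hdfs dfs -put /etc/hosts /tmp/nav_test/
# hdfs dfs -ls /tmp/nav_test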

 

[Screenshot: Navigator showing HDFS (Isilon) file system events]

 

 

It may also take additional time for the lineage to link hdfs files and operations, based on how Navigator is configured.

 

[Screenshot: Navigator lineage view linking HDFS files and operations]

 

 

To recap the best practices for using Navigator with Isilon:

- Enable FSImage and INotify prior to the Cloudera deployment of Navigator

- There is no known need to adjust the duration of the HDFS FSImage and INotify jobs unless instructed to by support

- Do not toggle FSImage and INotify on and off on a zone; once set, leave them enabled

- Do not enable FSImage and INotify on zones that do not host Cloudera and Navigator

 

 

 

 

 

Russ Stevenson

Isilon

Using Hadoop with Isilon - Isilon Info Hub

Had a couple of recent inquiries from the field about estimating OneFS’ protection overhead and usable capacity, so thought it would make an interesting article.


Let’s take, for example, a five node S210 cluster configured with the recommended protection level of +2d:1n and a dataset comprising medium and large files. What sort of usable capacity could be expected?


The protection policy of +2d:1n on this cluster means that it can survive two simultaneous drive failures or one entire node failure without data loss or unavailability.


The chart below answers such storage overhead questions across a range of OneFS protection level options and node counts.


For each field in this chart, the storage overhead is calculated by dividing the number on the right by the sum of the two numbers:


n+m => m/(n+m)


So, for the 5-node @ +2d:1n example above, the chart shows an 8+2 layout (the [+2d:1n] column in the 5-node row below):


8+2 => 2/(8+2) = 20%
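The same arithmetic is easy to check in a shell with bc; this is plain math rather than any Isilon tooling, showing the 20% overhead and the corresponding 80% usable figure:

# echo "scale=2; 2 / (8 + 2) * 100" | bc
20.00
# echo "scale=2; 8 / (8 + 2) * 100" | bc
80.00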


| Number of nodes | [+1n] | [+2d:1n] | [+2n] | [+3d:1n] | [+3d:1n1d] | [+3n] | [+4d:1n] | [+4d:2n] | [+4n] |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 2+1 (67%) | 4+2 (67%) | - | 6+3 (67%) | 3+3 (50%) | - | 8+4 (67%) | - | - |
| 4 | 3+1 (75%) | 6+2 (75%) | - | 9+3 (75%) | 5+3 (62%) | - | 12+4 (75%) | 4+4 (50%) | - |
| 5 | 4+1 (80%) | 8+2 (80%) | 3+2 (60%) | 12+3 (80%) | 7+3 (70%) | - | 16+4 (80%) | 6+4 (60%) | - |
| 6 | 5+1 (83%) | 10+2 (83%) | 4+2 (67%) | 15+3 (83%) | 9+3 (75%) | - | 16+4 (80%) | 8+4 (67%) | - |

 

This translates to 20% protection overhead and 80% usable capacity.


The n+m numbers in each field of the table also represent how files are striped across the cluster for each node count and protection level.


[Diagram: 8+2 stripe layout across a five node cluster]


For example, with +2d:1n protection on a 5-node cluster, OneFS will write a double stripe across all 5 nodes (a total of 10 stripe units), using eight of these stripe units for data (n) and two for FEC parity (m), as illustrated in the diagram above.


The general storage efficiency will look something like the percentages in the table below.


Be aware that the estimated storage usable capacity (% value in brackets) is a very rough guide and will vary considerably across different datasets, depending on the quantity of small files, etc.

 


| Number of nodes | [+1n] | [+2d:1n] | [+2n] | [+3d:1n] | [+3d:1n1d] | [+3n] | [+4d:1n] | [+4d:2n] | [+4n] |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 2+1 (67%) | 4+2 (67%) | - | 6+3 (67%) | 3+3 (50%) | - | 8+4 (67%) | - | - |
| 4 | 3+1 (75%) | 6+2 (75%) | - | 9+3 (75%) | 5+3 (62%) | - | 12+4 (75%) | 4+4 (50%) | - |
| 5 | 4+1 (80%) | 8+2 (80%) | 3+2 (60%) | 12+3 (80%) | 7+3 (70%) | - | 16+4 (80%) | 6+4 (60%) | - |
| 6 | 5+1 (83%) | 10+2 (83%) | 4+2 (67%) | 15+3 (83%) | 9+3 (75%) | - | 16+4 (80%) | 8+4 (67%) | - |
| 7 | 6+1 (86%) | 12+2 (86%) | 5+2 (71%) | 15+3 (83%) | 11+3 (79%) | 4+3 (57%) | 16+4 (80%) | 10+4 (71%) | - |
| 8 | 7+1 (87%) | 14+2 (87.5%) | 6+2 (75%) | 15+3 (83%) | 13+3 (81%) | 5+3 (62%) | 16+4 (80%) | 12+4 (75%) | - |
| 9 | 8+1 (89%) | 16+2 (89%) | 7+2 (78%) | 15+3 (83%) | 15+3 (83%) | 6+3 (67%) | 16+4 (80%) | 14+4 (78%) | 5+4 (56%) |
| 10 | 9+1 (90%) | 16+2 (89%) | 8+2 (80%) | 15+3 (83%) | 15+3 (83%) | 7+3 (70%) | 16+4 (80%) | 16+4 (80%) | 6+4 (60%) |
| 12 | 11+1 (92%) | 16+2 (89%) | 10+2 (83%) | 15+3 (83%) | 15+3 (83%) | 9+3 (75%) | 16+4 (80%) | 16+4 (80%) | 6+4 (60%) |
| 14 | 13+1 (93%) | 16+2 (89%) | 12+2 (86%) | 15+3 (83%) | 15+3 (83%) | 11+3 (79%) | 16+4 (80%) | 16+4 (80%) | 10+4 (71%) |
| 16 | 15+1 (94%) | 16+2 (89%) | 14+2 (87%) | 15+3 (83%) | 15+3 (83%) | 13+3 (81%) | 16+4 (80%) | 16+4 (80%) | 12+4 (75%) |
| 18 | 16+1 (94%) | 16+2 (89%) | 16+2 (89%) | 15+3 (83%) | 15+3 (83%) | 15+3 (83%) | 16+4 (80%) | 16+4 (80%) | 14+4 (78%) |
| 20 | 16+1 (94%) | 16+2 (89%) | 16+2 (89%) | 16+3 (84%) | 16+3 (84%) | 16+3 (84%) | 16+4 (80%) | 16+4 (80%) | 14+4 (78%) |
| 30 | 16+1 (94%) | 16+2 (89%) | 16+2 (89%) | 16+3 (84%) | 16+3 (84%) | 16+3 (84%) | 16+4 (80%) | 16+4 (80%) | 14+4 (78%) |

Take the following scenario:


“A six node X410 cluster with 1TB drives has a recommended protection level of +2d:1n. The cluster is already 87% full, so a new node addition is required to increase the capacity. At seven nodes the recommended protection level changes to +3d:1n1d. What’s the most efficient way to do this?”


In essence, should the protection level be changed before or after adding the new node?


Since the main objective here is efficiency, limiting the amount of protection and layout work that OneFS has to perform is desirable.


In this case, both the node addition and the change in protection level require the cluster’s restriper to run. This entails two long-running operations: first, balancing data evenly across the seven-node cluster; then, increasing the on-disk data protection.


If a new node is added first, and the cluster protection is then changed to the recommended level for the new configuration, the process would look like this:


1)   Add a new node to the cluster

2)   Let rebalance finish

3)   Configure the data protection level to +3d:1n1d

4)   Allow the restriper to complete the re-protection

 

However, by addressing the protection level change first, all the data restriping can be performed more efficiently and in a single step:


1)   Change the protection level setting to +3d:1n1d

2)   Add nodes (immediately after changing the protection level)

3)   Let rebalance finish

 

In addition to reducing the amount of work the cluster has to do, this streamlined process also has the benefit of getting data re-protected at the new recommended level more quickly.


OneFS protects and balances data by writing file blocks across multiple drives on different nodes. This process is known as ‘restriping’ in Isilon jargon. The Job Engine defines a restripe exclusion set that contains those jobs which involve file system management, protection and on-disk layout. The restripe set encompasses the following jobs:


| Job | Description |
|---|---|
| Autobalance(Lin) | Balances free space in a cluster |
| FlexProtect(Lin) | Scans the file system after a device failure to ensure all files remain protected |
| MediaScan | Locates and clears media-level errors from disks |
| MultiScan | Runs the AutoBalance and Collect jobs concurrently |
| SetProtectPlus | Applies the default file policy (unless SmartPools is activated) |
| ShadowStoreProtect | Protects shadow stores |
| SmartPools | Protects and moves data between tiers of nodes within the cluster |
| Upgrade | Manages OneFS version upgrades |
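To see whether any of these restripe jobs are currently active on a cluster, the Job Engine CLI can be queried; the job ID in the second command is just an example taken from the list output:

# isi job jobs list
# isi job jobs view 273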

 

Each of the Job Engine jobs has an associated restripe goal, which can be displayed with the following command:


# isi_gconfig -t job-config | grep restripe_goal


The different restriper functions operate as follows, where each function in the path is a superset of the previous one:


[Diagram: restriper function hierarchy]


The following table describes the action and layout goal of each restriper function:


| Function | Detail | Goal |
|---|---|---|
| Retune | Always restripe, using the retune layout goal. Originally intended to optimize layout for performance, but has instead become a synonym for ‘force restripe’. | LAYOUT_RETUNE |
| Rebalance | Attempt to balance utilization between drives, etc. Also addresses all conditions implied by REPROTECT. | LAYOUT_REBALANCE |
| Reprotect | Change the protection level to more closely match the policy if the current cluster state allows wider striping or more mirrors. Re-evaluate the disk pool policy and SSD strategy. Also addresses all conditions implied by REPAIR. | LAYOUT_REPROTECT |
| Repair | Replace any references to restripe_from (down or SmartFailed) components. Also fix recovered writes. | LAYOUT_REPAIR |

 

Here’s how the various Job Engine jobs (as reported by the isi_gconfig -t job-config command above) align with the four restriper goals:


[Diagram: Job Engine jobs mapped to the four restripe goals]


The retune goal moves the current disk pool to the bottom of the list, increasing the likelihood (but not guaranteeing) that another pool will be selected as the restripe target. This is useful, for example, in the event of a significant drive loss in one of the disk pools that make up the node’s pool (e.g. disk pool 4 suffers the loss of two or more drives and becomes more than 90% full). Using a retune goal forces a rebalance to the other pools more quickly.


So, an efficient approach to the earlier cluster expansion scenario is to change protection and then add the new node. A procedure for this is as follows:


1.     Reconfigure the protection level to the recommended setting for the appropriate node pool(s). This can be done from the WebUI by navigating to File System > Storage Pools > SmartPools and editing the appropriate node pool(s):


[Screenshot: WebUI SmartPools node pool settings showing the requested protection level]

 

2.     Especially for larger clusters (i.e. twenty nodes or more), or if there’s a mix of node hardware generations, it’s helpful to do some prep work before adding node(s):


     a. Image any new node to the same OneFS version that the cluster is running (see the version check sketch after this list).


     b. Ensure that any new nodes have the correct versions of node and drive firmware, plus any patches that may have been added, before joining to the cluster.


     c. If the nodes are from different hardware generations or configurations, ensure that they fit within the node compatibility requirements for the cluster’s OneFS version.
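As a quick sketch of the version check from item (a), the cluster’s running release can be confirmed from any node so that the new node(s) can be imaged to match; firmware and patch levels still need to be verified through the usual support tooling:

# isi version
# uname -a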


3.     Disable the Job Engine service an hour or so prior to adding the new node(s) to help ensure a clean node join. This can be done with the following command:


# isi services -a isi_job_d disable


4.     Add the new node(s) and verify the healthy state of the expanded cluster:


     a. Confirm there are no un-provisioned drives:


# disi -I diskpools ls | grep -i "Unprovisioned drives"


     b. Check that the node(s) joined the existing pools:


# isi storagepool list


5.     Restart the Job Engine:


# isi services -a isi_job_d enable


6.     After adding all nodes, the recommendation for a cluster with SSDs is to run AutoBalanceLin with an impact policy of ‘LOW’ or OFF_HOURS. For example:


# isi job jobs start autobalancelin --policy LOW


7.     To ensure the restripe is going smoothly, monitor the disk IO (‘DiskIn’ and ‘DiskOut’ counters) using the following command:


# isi statistics system -nall --oprates --nohumanize

 

Between 2,500 and 5,000 disk IOPS is pretty healthy for nodes containing SSDs.


8.     Additionally, cancelling MediaScan and/or MultiScan and pausing FSAnalyze will reduce resource contention and allow the AutoBalanceLin job to complete more efficiently.


# isi job jobs cancel mediascan

# isi job jobs cancel multiscan

# isi job jobs pause fsanalyze


Finally, it’s worth periodically running and reviewing the Isilon Advisor health check report, especially before and after configuration changes and adding new nodes to the cluster.


[Screenshot: Isilon Advisor health check report]

 

The Isilon Advisor diagnostic tool will help verify that the OneFS configuration is as expected and that there are no cluster issues.

Got asked the following question from the field recently:

 

“If I configure a nodepool to use L3 cache, will it be overridden by the “Metadata SSD Strategy” setting in a file pool policy?”

 

The SSDs in a OneFS node pool can be used exclusively either for L3 cache or for a SmartPools SSD strategy, not both. The SSDs in the pool are formatted entirely differently for each of these options.

 

If the SSDs are reserved for L3, they will be formatted as a large linear device for use as an LRU (least recently used) read cache. Otherwise, the SSDs will be formatted as a regular storage device for use in the OneFS file system under SmartPools SSD strategies.

 

You can tell how a pool’s SSDs are being utilized from the WebUI, by navigating to Dashboard > Cluster Overview > Cluster Status.

 

Any SSDs reserved for L3 caching will be exclusively reserved and explicitly marked as such, and their capacity will not be included in any SSD usage stats, etc.

 

For example, take a Gen5 cluster with two node pools:

 

  • Nodes 1-3 are X410s using their SSDs for metadata read.

 

  • Nodes 4-6 are S210s with their SSDs reserved exclusively for L3 cache.

 

[Screenshot: WebUI cluster status showing SSD usage for the two node pools]

 

L3 cache is enabled per node pool via a simple on or off configuration setting. Other than this, there are no additional visible configuration settings available. When enabled, L3 consumes all the SSDs in the node pool.

Please note that L3 cache is enabled by default on any new node pool containing SSDs.
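From the CLI, the equivalent per-pool toggle looks something like the following sketch; the pool name is just an example (check the list output for the actual names), and the exact flag may vary by OneFS release:

# isi storagepool nodepools list
# isi storagepool nodepools modify s210_13tb_800gb-ssd_64gb --l3 true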

 

[Screenshot: WebUI L3 cache setting for a node pool]

The WebUI also provides a global setting to enable L3 cache by default for new node pools.

 

[Screenshot: WebUI global setting to enable L3 cache on new node pools]

 

Enabling L3 cache on an existing nodepool with SSDs takes some time, since the data and metadata on the SSDs needs to be evacuated to other drives before the SSDs can be formatted for caching. Conversely, disabling L3 cache is a very fast operation, since no data needs to be moved and drive reformatting can begin right away.

 

Although there’s no percentage completion reporting shown when converting nodepools to use L3 cache, this can be estimated by tracking SSD space usage throughout the job run. The Job impact policy of the FlexProtect_Plus or SmartPools job, responsible for the L3 conversion, can also be re-prioritized to run faster or slower.

 

Unlike HDDs and SSDs that are used for storage, when an SSD used for L3 cache fails, the drive state immediately changes to ‘REPLACE’ without a FlexProtect job running. An SSD drive used for L3 cache contains only cache data that does not have to be protected by FlexProtect. Once the drive status is reported as ‘replace’, the failed SSD can safely be pulled and swapped.
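On OneFS 8.x, the drive state can be confirmed from the CLI before pulling the SSD; this is a sketch, and the exact command form may vary by release:

# isi devices drive list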
