Running Isilon OneFS in "Not All Nodes On Network" (NANON) and "Not All Nodes On All Networks" (NANOAN) configurations

NOTE: This topic is part of the Uptime Information Hub.

 

Isilon’s scale-out architecture takes full advantage of additional resources within the OneFS cluster to provide the necessary performance and load balancing in any environment. In the ideal configuration, every Isilon node would be connected to every network that clients might originate from. However, Dell EMC recognizes that there are circumstances in which this is not possible. These types of network configurations can be classified into two different scenarios: “Not All Nodes On Network” (NANON) and “Not All Nodes On All Networks” (NANOAN). This article outlines the supported configurations and the caveats and limitations to be aware of when using them.

 

Not All Nodes On Network (NANON)

 

An example of a NANON deployment, where not all nodes are connected to a network, is shown in Figure 1.

 

In this example, the deployment consists of five Isilon nodes with only three of them connected to the network. The network is assumed to have full access to all necessary infrastructure services as well as client access. As noted, two of the nodes are not connected to any network.

 

Figure 1 - Not All Nodes On Network (NANON) configuration

 

The NANON configuration is fully supported with the following caveats:

  • With the exception of the “Clone” mode, the OneFS “Permission Repair” job will not function, as credential look-ups against authentication providers are not possible on nodes that are not connected to the network. The job will fail after the maximum number of failures (default 5, configurable) is reached.
  • SyncIQ performance might be impacted, as workers on nodes without connection will be assigned work but cannot perform the work due to lack of connectivity.
    • Workaround: Create a SmartConnect zone with Static IP definitions consisting only of nodes with network connectivity and target the SmartConnect zone as the source network within the SyncIQ policy (see the sketch after this list).
  • Viewing file ownership/permissions from the Command Line Interface (CLI) of nodes that are not connected to any network might produce an inaccurate ownership listing. For example, listing files on a node that cannot reach an authentication provider will display the UNIX ID as the owner instead of a username.
  • CloudPools performance might be impacted. Nodes without network connectivity will be assigned work but will not be able to connect to upload the data. The job will eventually complete after sequential time-outs of each task on the nodes without connectivity.
  • Anti-Virus (AV) scanning will not function correctly, as an AV scan request can originate from any node, including nodes not connected to the network. This can result in clients being able to access files that have not been scanned, or being denied access to files they should be able to access when scanning fails. This behavior can be controlled through the AV scan configuration, but the files remain unscanned.
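
From the OneFS CLI, the SyncIQ workaround above might look like the following minimal sketch. The pool name, IP range, interface name, and SmartConnect zone name (synciq-pool, 10.1.1.20-10.1.1.22, ext-1, sync.example.com) are all hypothetical, and the option syntax should be verified against your OneFS version:

  # Create a static-allocation pool containing only the network-connected nodes (LNNs 1-3)
  isi network pools create groupnet0.subnet0.synciq-pool --ranges=10.1.1.20-10.1.1.22 --ifaces=1-3:ext-1 --alloc-method=static --sc-dns-zone=sync.example.com
  # Restrict the SyncIQ policy so work is assigned only to nodes in that pool
  isi sync policies modify mypolicy --source-subnet=subnet0 --source-pool=synciq-pool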

Split networks

 

Before we discuss NANOAN deployments, it is helpful to describe “Split Network” configurations. Split networks are a common topology in which customers separate client and administrative traffic onto separate networks, referred to here as the “Client” and “Administrative” networks respectively. Such network topologies are not directly related to NANON/NANOAN configurations but serve as a starting point to illustrate the various configurations. The assumption is that all infrastructure services—such as DNS, Antivirus, SMTP, SNMP, Active Directory, LDAP, NIS, ESRS, WebUI/SSH, CEE, Syslog, and other API connections—are available only on the Administrative Network. Typically, the Client Network would have a limited dedicated subnet of IP addresses, which simplifies routing. This style of deployment is illustrated in Figure 2.

 

Figure 2 - Split network configuration

 

Dell EMC's suggested deployment topology is to make Subnet0 the Administrative Network with a gateway priority of 0 (the default gateway) and Subnet1 the Client Network with a gateway priority of 1. This routes all outbound traffic through the Administrative Network. To ensure that all traffic for the Client Network goes out the correct interface, implement static routes pointing to the Client Network gateway.
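
Expressed from the OneFS CLI, the gateway priorities and a client-side static route might be configured as in this minimal sketch. The groupnet, subnet, and pool names and all addresses are hypothetical, and option names should be verified against your OneFS version:

  # Administrative Network: the lowest priority value wins, so this becomes the default gateway
  isi network subnets modify groupnet0.subnet0 --gateway=10.0.0.1 --gateway-priority=0
  # Client Network: secondary gateway
  isi network subnets modify groupnet0.subnet1 --gateway=10.1.1.1 --gateway-priority=1
  # Static route (subnet/prefixlen-gateway format) so client traffic leaves through the client gateway
  isi network pools modify groupnet0.subnet1.pool0 --add-static-routes=192.168.50.0/24-10.1.1.1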

 

The configuration is fully supported with the following caveats:

  • The Administrative and Client networks must not contain overlapping IP addresses or subnets. In other words, the Administrative and Client networks must be a single routed IP space.
  • CloudPools may not function if connectivity to the cloud provider is not routed correctly. For example, if an ECS data store resides in the Client Network but no static route is defined to reach it, CloudPools will attempt to reach it through the Administrative Network, and the job will fail and error out.
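
To verify where traffic to a CloudPools target (or any other service) will actually be routed, you can inspect each node's routing table. OneFS nodes run FreeBSD, so standard netstat syntax applies:

  # Print the routing table on every node in the cluster
  isi_for_array -s 'netstat -rn'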

NANOAN 1

 

The NANOAN 1 configuration is a variation of the “Split Network” configuration where some of the nodes may not have access to the Administrative Network. This type of deployment is illustrated in Figure 3.

 

Figure 3 - Not All Nodes On All Networks (NANOAN) configuration

 

This configuration is fully supported with the following caveats:

  • The Administrative and Client networks must not contain overlapping IP addresses or subnets. In other words, the Administrative and Client networks must be a single routed IP space.
  • With the exception of the “Clone” mode, the “Permission Repair” job will not function correctly, as nodes without connectivity to an Authentication Provider on the Admin network will still be assigned tasks but will not be able to perform them. Without an Authentication Provider to look up the proper credentials, the job will fail.
  • SyncIQ performance might be impacted if the target cluster is not on the “Client network” side, as workers on nodes without connectivity in the Admin network will be assigned work but cannot perform the work due to lack of connectivity.
    • Workaround: Create a SmartConnect zone with Static IP definitions consisting of nodes with connectivity to the Administrative network and target the SmartConnect zone as the source network within the SyncIQ policy.
  • CloudPools might not function as expected due to routing and gateway access issues. For example, an ECS data store resides in the Client network, but a static route is not defined to route traffic to it. CloudPools will attempt to reach it through the Admin network, which will fail; the job will fail and error out.
  • Anti-Virus scanning will not function correctly, as an AV scan request can originate from any node, including nodes not connected to the Admin network where the AV scanners reside. This can result in clients being able to access files that have not been scanned, or being denied access to files they should be able to access when scanning fails. This behavior can be controlled through the AV scan configuration, but the files remain unscanned.
  • Viewing file ownership/permissions from the Command Line Interface (CLI) of nodes that are not connected to the appropriate network might produce an inaccurate ownership listing. For example, listing files on a node that cannot reach an authentication provider on the Admin network will display a UNIX ID as the owner instead of a username, because the lookup fails.
  • Alerts may not be sent if the CELOG master is running on a node without connectivity to the Admin network.
    • Check the CELOG logs to verify that the CELOG master node has Administrative network connectivity.
    • Ensure that the node with the lowest Logical Node Number (LNN), which is usually node 1, is connected to the Admin network, as the CELOG master typically runs on that node.
    • Check the isi_celog_monitor.log to determine which node is the CELOG master (see the sketch after this list). The log output should be similar to:
      2016-08-24T21:50:29-07:00 <3.6> CSE-X200-1-1 isi_celog_monitor[24509:MasterListener:masterinfo:471] INFO: Transitioning to being master
      In this example, node 1 on cluster CSE-X200-1 is the CELOG master.
  • Protocols requiring authentication against an Authentication Provider might fail or return unexpected results when clients connect to nodes without Administrative network connectivity, as no Authentication Provider can be reached to perform the authentication process.
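
To find the CELOG master from the CLI, one approach is to search the log mentioned above on every node, assuming the default log location of /var/log/isi_celog_monitor.log. The node whose log contains the most recent transition message is the current master:

  # Show the last master transition recorded on each node
  isi_for_array -s 'grep "Transitioning to being master" /var/log/isi_celog_monitor.log | tail -1'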

 

NANOAN 2

 

NANOAN 2 is a variation of the split configuration that is common in a shared service provider environment where there are multiple separated populations of clients, and therefore multiple separated Client Networks. This type of deployment is illustrated in Figure 4.

 

In this deployment model, the Administrative Network houses all of the necessary infrastructure connectivity except the end user authentication provider services, such as Active Directory, NIS, LDAP, and so on. Each authentication provider resides on the same Client Network as the clients that it serves.

 

Note (updated 8-28-2018): Currently this configuration does NOT allow Active Directory on more than one Groupnet. See KB 496499 (requires login to Dell EMC Online Support), entitled “When attempting to segregate Isilon cluster nodes for multitenancy with Groupnet configurations limiting access to DNS a domain join will fail timing out.” The error returned is “Failed to reload join state : (null).”

 

We will update this article when this issue has been resolved.

 

 

Figure 4 - NANOAN with separated authentication providers

 

In this configuration, the Administrative Network is relatively simple.

 

Create four Groupnets as follows: three client Groupnets, each with a Subnet0 and the same gateway priority of 0, followed by the Administrative Groupnet with a Subnet0 and a gateway priority of 1. Then define individual static routes to the infrastructure services—like SMTP and SNMP—to go out through the Administrative Groupnet, as follows:

 

  GroupNet0 -> Subnet0 -> Gateway (P0)
  GroupNet1 -> Subnet0 -> Gateway (P0)
  GroupNet2 -> Subnet0 -> Gateway (P0)
  GroupNet3 -> Subnet0 -> Gateway (P1) + Static route to each infrastructure service
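
Expressed as OneFS CLI commands, the layout above might look like the following minimal sketch. All addresses and DNS servers are hypothetical, and option names should be verified against your OneFS version:

  # Client groupnets, each with its own DNS server and a priority-0 gateway
  isi network groupnets create groupnet0 --dns-servers=10.1.1.10
  isi network subnets create groupnet0.subnet0 ipv4 24 --gateway=10.1.1.1 --gateway-priority=0
  # (repeat for groupnet1 and groupnet2 with their own addresses)
  # Administrative groupnet with a priority-1 gateway
  isi network groupnets create groupnet3 --dns-servers=10.0.0.10
  isi network subnets create groupnet3.subnet0 ipv4 24 --gateway=10.0.0.1 --gateway-priority=1
  # Static route to an individual infrastructure service, for example an SMTP server
  isi network pools modify groupnet3.subnet0.pool0 --add-static-routes=10.0.5.25/32-10.0.0.1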

 

Note that some services may require a significant number of static routes, which could lead to a convoluted route table; Anti-Virus is one example.

 

This configuration is fully supported with the following caveats:

  • The Administrative and Client networks must not contain overlapping IP addresses or subnets. In other words, the Administrative and Client networks must be a single routed IP space.
  • With the exception of the “Clone” mode, the “Permission Repair” job will not function and should be disabled, as credential look-ups against authentication providers will not be possible on nodes not connected to the appropriate network. This job assigns workers from all nodes to perform the repair, so workers on nodes not connected to the correct authentication provider may apply incorrect permissions. For example, the on-disk permission states that the file owner is “foo” with a UNIX ID of 3000. In Client 1’s authentication provider, this maps to “Client1\foo”. However, the worker thread on node 3 - which is connected to “Client 2” - might map it to “Client2\bar”, thus applying the wrong SID owner to the file.
  • In order for SyncIQ to function, create a SmartConnect zone with Static IP definitions consisting of nodes with connectivity to the appropriate network and target the SmartConnect zone as the source network within the SyncIQ policy. Without this definition, SyncIQ might assign work to nodes without the necessary connectivity and will stall.
  • Viewing file ownership/permissions from the Command Line Interface (CLI) of the nodes that are not connected to the appropriate network might result in inaccurate ownership listing. For example, requesting a listing of files owned by “Client 2” on nodes connected to the “Client 1” network might result in a username from “Client 1” being displayed as the owner, as the look-up would be done using an Authentication Provider in that network instead of the correct user from the “Client 2” Authentication Provider.
  • As AV scan requests can originate from any node, there are security concerns around potentially sending customer data to an AV Scanner residing in another customer’s network in the event of a misconfigured default gateway or static route. We recommend that you block access to the ICAP port 1344 on the client networks.
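
Blocking the ICAP port is done on the network equipment between the client networks and the AV scanners, not on the cluster itself. Purely as an illustration, on a Linux-based gateway the rule might look like the following; the client subnet is hypothetical, and dedicated firewall hardware will use its own syntax:

  # Drop ICAP (TCP 1344) traffic originating from the client network
  iptables -A FORWARD -s 192.168.50.0/24 -p tcp --dport 1344 -j DROP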