Bold disclaimer: this is a best-effort attempt to consolidate references and research outcomes from several real-world Isilon customer cases. I'm a VMware Certified Advanced Professional 5 - Data Center Design (VCAP5-DCD), but Virtual Desktop Infrastructure is not my daily focus.


Seeking Input: if there is a best practice that is not mentioned in this post, please do not delay: contribute it in the comments and I'll update the blog post.

 

Recommended prerequisite reading:


Let's start with a quick reminder of an outstanding, field-proven, scale-out infrastructure solution for VDI:

00_xtremio_vdi_vsphere_citrix_isilon_home_directory.png

 

Also, I'd recommend reviewing the VCE Technology Extensions | Virtualization & Cloud Computing | VCE page to find out more about VSCALE architecture integration with Isilon, by means of the well-managed and fully supported Isilon Technology Extension.


Isilon is used as the destination for unstructured data only, and all aspects of VDI images and boot storms are handled by the block IO components of the VSCALE stack:

00_vce_vblock_vscale_vdi_performance_review.png


To achieve an even higher Return On Investment, it is important to ensure that the details of VDI are optimally configured and perform in unison. I came across several "sub-optimal" load profiles in SMB workflows with Home Folders (a.k.a. Home Directories). Researching them, I came up with the following list of recommendations.


Each topic below must start with an "If possible...", since it is a suggestion, not a requirement. Since each customer's environment is unique, the recommended approach is to go through the "big picture" of VDI during a joint workshop with the end customer's project management, delivery partner(s), VDI / Virtualization vendor(s), and End-User Computing (EUC) engineering and/or Active Directory / Security teams.


So, if possible...


1) ...double-check the settings of Access Time Tracking


Access Time Tracking is often used for automated tiering to CloudPools or to lower $/TB nodepools of Isilon platforms. It may also be used as input for SyncIQ replication policies. Isilon ships with "atime" tracking disabled by default. If it is enabled, the default grace period is set to 86,400 seconds (1 day), because updating the metadata of "touched" files in real time is a very expensive operation. It is important to cross-check whether "atime" was accidentally left enabled with a low grace period after any pre-production User Acceptance Testing.


What "atime" does is turn any SMB (or other protocol) read request for a file into a metadata write. The impact on an SMB Home Directory workflow takes the form of high CPU utilization due to excessive SMB Change Notification traffic, since every I/O, read or write, would trigger a Change_Notify response to every established SMB connection.


If the environment has to have it enabled, the recommendation is to keep the grace period at 86,400 seconds (1 day) so that the overhead is minimized.
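To illustrate why the grace period matters, here is a minimal sketch of grace-period logic in general terms. It is not OneFS code: the function names and the hourly read pattern are assumptions made up for this example, and the only value taken from the text is the 86,400 second default.

```python
ATIME_GRACE_SECONDS = 86_400  # 1 day, the default grace period cited above


def read_needs_atime_write(stored_atime, now, grace=ATIME_GRACE_SECONDS):
    """Return True when a read should also persist a new access time.

    Hypothetical helper: the stored atime is only refreshed once it is
    older than the grace period.
    """
    return now - stored_atime > grace


def count_metadata_writes(read_times, grace=ATIME_GRACE_SECONDS):
    """Count metadata writes caused by reads of one file at the given timestamps."""
    writes = 0
    stored_atime = 0.0  # assumption: the file was last accessed long ago
    for now in sorted(read_times):
        if read_needs_atime_write(stored_atime, now, grace):
            stored_atime = now
            writes += 1
    return writes


# Hypothetical workload: one file read every hour for three days.
hourly_reads = [1_000_000 + 3600 * i for i in range(72)]
print(count_metadata_writes(hourly_reads))           # 1-day grace period
print(count_metadata_writes(hourly_reads, grace=1))  # 1-second grace period
```

Under these assumptions, the 72 hourly reads cause 3 metadata writes with the 1-day grace period and 72 with a 1-second grace period, which is the kind of multiplication a low grace period left over from testing would introduce.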


2) ...cross-check Active Sessions in Technical Reference and check Security Signatures


The key sizing principle for VDI / Home Directories is the number of concurrent active SMB connections per Isilon node, and thus per cluster. The number is in the "Technical Specifications Guide"; at the moment of writing, for OneFS 7.2: https://support.emc.com/docu56230_Isilon-OneFS-7.2-Technical-Specifications-Guide.pdf


Please, take a look at the following screenshot of the latest version (at the moment of writing):

isilon_datalake_smb_vdi_vmware_nfs_hdfs_keywords_are_cool.png


Security Signatures reduce the guideline by 33%. Checking if the design considers using SMB Security Signatures is crucial. For more information, refer to Require SMB Security Signatures article on Microsoft TechNet.


Note the "Monitoring the number of SMB2 connections to each node is important to make sure that a node does not become overloaded with connections" at the very end of the Technical Specifications Guide's "Description" column? It means that it is not advisable to rely on the aspirational number mentioned as a "guarantee". In the worst-case scenario, this disclaimer advocates adding nodes to the cluster.


There simply cannot be a hard number determined with 100% accuracy in the dynamic world of scale-out NAS; this is why it's a guideline only.
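The arithmetic behind this section can be sketched as follows. This is a rough estimate only: the per-node guideline of 1,000 connections used in the example is a made-up placeholder, not a figure from the Technical Specifications Guide, and the function names are hypothetical. The only value taken from the text is the 33% reduction for Security Signatures.

```python
import math

SIGNING_REDUCTION_PERCENT = 33  # Security Signatures reduce the guideline by 33%


def effective_connections_per_node(guideline, signing):
    """Per-node connection guideline, reduced when SMB signing is required."""
    if signing:
        # Integer arithmetic avoids floating-point rounding surprises.
        return guideline * (100 - SIGNING_REDUCTION_PERCENT) // 100
    return guideline


def nodes_for_connections(active_connections, guideline, signing):
    """Smallest node count that keeps every node at or under the guideline."""
    per_node = effective_connections_per_node(guideline, signing)
    return math.ceil(active_connections / per_node)


# Placeholder guideline of 1,000 connections per node, 10,000 active users.
print(nodes_for_connections(10_000, 1_000, signing=False))
print(nodes_for_connections(10_000, 1_000, signing=True))
```

With these placeholder inputs the estimate moves from 10 nodes to 15 once signing is required. Since the guideline is not a guarantee, a result like this is a starting point for the workshop discussion rather than a final node count.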


3) ...check if VDI image was customized with VDI tuning / optimization


Usually, the Windows Client image used in a VDI environment goes through customization by the End User Computing (EUC) engineering team. It is important to understand which services start, how the default User Profile is configured, and so on. Some common references are:



As an example, almost all of them suggest disabling "Offline Files Technology" on the VDI image itself, which lets Home Directory users in non-VDI environments keep the default Folder Redirection setting; see more details in (5) below.


Another "often overlooked" customization is HKLM\System\CurrentControlSet\Services\LanManWorkstation\Parameters


  • DisableBandwidthThrottling (REG_DWORD). This key does not exist by default and has an assumed value of 0. The SMB 2 client will try to limit its network throughput on links that it perceives as high-latency. Since most VDI deployments should be mapping drives to file servers that are well connected, there is no need to limit the throughput of the SMB sessions. This value should be set to 1.
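One way to hand this setting over to the EUC team is as a .reg file for the image build. The sketch below only generates the file text; the helper name is hypothetical, and how the file is imported into the image is left to the customer's own tooling.

```python
# Key path from the text above, with HKLM written out as .reg files require.
KEY_PATH = (
    r"HKEY_LOCAL_MACHINE\System\CurrentControlSet"
    r"\Services\LanManWorkstation\Parameters"
)


def reg_file_for_dword(key_path, name, value):
    """Build the text of a .reg file that sets a single REG_DWORD value."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        f'"{name}"=dword:{value:08x}',  # DWORDs are written as 8 hex digits
        "",
    ]
    return "\r\n".join(lines)


REG_TEXT = reg_file_for_dword(KEY_PATH, "DisableBandwidthThrottling", 1)
print(REG_TEXT)
```

Review the generated text before importing it, and test the change on a non-production image first.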

 

4) ...balance the customer's perception of OPS/user with industry's best practices


It's worth remembering that Operations Per Second (OPS) apply to block storage sizing, whereas NAS sizing is concerned with "Protocol OPS" versus average latency. In most VDI discussions, however, it's easier to simply avoid this debate. The much more important point is that a realistic number of OPS per user is assumed during sizing.


OPS/user number is subjective. From the individual user perspective, performance of VDI is evaluated by a "2-Rule Simplified Benchmark":

  • "What's VDI? Ah, my Dell WYSE client, - it just works."
  • "That horrible slow VDI? I don't like it and blame IT".

 

This approach isn't practical, as it leaves a lot of guessing to architects and solution designers. It's common to assume anything from 4 to 10 OPS per user. The following sources might be helpful for further research and for backing this 4...10 assumption:



For example, the VMware View's whitepaper presents "IT Industry IOPS users" table:

02_isilon_onefs_smb_vdi_vmware_nfs_openstack_keywords_are_cool.png

While the paper on VNX sizing presents a slightly different perspective:

03_scaleout_datalake_microsoft_vmware_nfs_devops_keywords_are_cool.png

 

When sizing, one should refer to the empirical data for the SPECsfs2008 "Home Directory" CIFS operations mix as the benchmark of what an Isilon cluster is able to achieve in Protocol OPS. A handy starting point is to google the 'isilon result site:spec.org' query.

There are many more results from various tests available for all Isilon platforms (NL410, X210, etc.); reach out to EMC Presales for more insights.
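As a back-of-the-envelope illustration of the 4 to 10 OPS per user assumption, the sketch below turns a user count into an aggregate Protocol OPS range that can then be compared against published benchmark results. The function name, the concurrency factor and the user count are assumptions for this example only.

```python
def protocol_ops_range(users, ops_low=4, ops_high=10, concurrency=1.0):
    """Return (low, high) aggregate Protocol OPS for a user population.

    concurrency is a hypothetical factor for the share of users active
    at the same time; 1.0 means everybody is active.
    """
    active_users = round(users * concurrency)
    return active_users * ops_low, active_users * ops_high


low, high = protocol_ops_range(5_000)
print(f"5,000 users: {low} to {high} Protocol OPS")
```

For 5,000 fully concurrent users this gives 20,000 to 50,000 Protocol OPS. The width of that range is a reminder of why the per-user figure should be agreed with the customer rather than guessed.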


5) ...with Folder Redirection in VDI, disable Offline File Technology (OFT)


Folder Redirection to the Home Directory is the "rule of thumb" for improving logon speed and creating a better user experience. A must-read set of up-to-date articles about Folder Redirection:


 

Offline File Technology (OFT) is enabled by default for all deployed Folder Redirection policies, and this nuance could cause a different performance picture for VDI instances.


Isilon is one of the many NAS vendors that are fully compliant with the SMB2.x and SMB3.x protocol specifications. These specifications do not dictate how the enumeration (listing) of files in a given folder is returned to the client. Some applications (and, paradoxically, Microsoft Sync Center, the native OFT tool built into the Windows client) assume that the enumeration will be returned as an alphabetically sorted list.

Microsoft Sync Center relies on this assumption for OFT file list change tracking, so if any SMB2.x or SMB3.x server (in this case Isilon) does not provide a sorted list, OFT performance is impacted.

More details can be found on the Microsoft side in KB3046857, Sync Center: Slow syncing of offline files on some file servers. If OFT performance is a strict requirement (chances are it is not in a VDI deployment), there are some alternative solutions that can match OFT functionality without Sync Center:


  1. Microsoft SyncToy: Download SyncToy 2.1 from Official Microsoft Download Center
  2. FreeFileSync (open source): FreeFileSync download | SourceForge.net
  3. SugarSync: https://www.sugarsync.com/en
  4. GoodSync: File Sync & Backup Software | GoodSync
  5. Allway Sync: Allway Sync: Free File Synchronization, Backup, Data Replication, PC Sync Software, Freeware, File Sync, Data Synchroniz…
  6. Syncplicity and others

For convenience, there is a comprehensive list and comparison of commercial and free products on Wikipedia: Comparison of file synchronization software.


Getting back to Folder Redirection: this article doesn't cover the details of the workflow when OFT and Folder Redirection work together. Folder Redirection functionality is quite a complex mixture of Windows Explorer, the SMB client and the Client-Side Extension libraries of Group Policy:


04_isilon_apache_hadoop_cloudera_hortonworks_impala_hbase_kudu_hive_nosql_keywords_awesomeness.gif

Source: How Folder Redirection Extension Works: Group Policy

 

 

The relevant Group Policy setting is available under User Configuration\Administrative Templates\Network\Offline Files and under Computer Configuration\Administrative Templates\Network\Offline Files:

Do not automatically make redirected folders available offline

By default, all redirected shell folders, such as My Documents, Desktop, Start Menu, and Application Data, are available offline. This setting allows you to change this behavior so that redirected shell folders are not automatically available for offline use. However, users can still choose to make files and folders available offline.

Do not enable this setting unless you are certain that users do not need access to all of their redirected files in the event that the network or the server holding the redirected files becomes unavailable.

This setting does not prevent files from being automatically cached if the network share is configured for automatic caching, nor does it affect the availability of the Make Available Offline menu option in the user interface.


6) ...for SMB shares check Directory Change Notify setting "norecurse"


This setting, if turned off ("none"), may contribute a significant amount of extra protocol operations with GetInfo requests, particularly with Folder Redirection enabled. SMB clients would no longer be able to subscribe to incremental changes in directories and thus would need to send SMB2 QUERY_DIRECTORY over and over again to enumerate the directory, as illustrated in [MS-FASOD]: Common Task 2: Enumerate a Directory Using the SMB Protocol.

Having the "all" setting configured may contribute to a large amount of Change Notification traffic on the SMB protocol, because updates would be sent for any changes beneath the level where the SMB client last subscribed to the notification. However, some EMC-internal empirical tests showed that the difference between "all" and "norecurse" isn't too concerning.


7) ...check OpLocks setting "Yes" for SMB share


Since each user works in a separate path in a VDI or Home Directory environment, contention on the same file set is very unlikely, and there is hardly any reason why Windows clients would cause OpLock breaks. Hence, client-side data caching with OpLocks is the best practice, and the setting should be left at "Yes".


8) ...reduce "extra IO" from all possible services


There are quite a few areas where the customer's End User Computing (EUC) engineering team could share their expertise and detailed knowledge of their very own "fine-tuned" Windows client build.


Some questions to be asked are, for example:

 

  • What is the strategy on thumbnails and Thumbs.db generation? Consider random small read/write IO impact from every user.
    • User Configuration\Administrative Templates\Windows Components\Windows Explorer (Windows Vista/7) or File Explorer (Windows 8)
      • Turn off the display of thumbnails and only display icons
      • Turn off the display of thumbnails and only display icons on Network Folders
      • Turn off the caching of thumbnails in hidden thumbs.db files
      • Turn off caching of thumbnail pictures
  • Is Internet Explorer's "Favorites" folder redirected into the Home Directory?
  • What is the Anti-Virus / Compliance Crawler / etc. exclusion list for paths? Are paths on Isilon crawled weekly? Daily?
    • Client should not issue extra IO, whenever possible

 

9) ...isolate VDI's "Slow Boot" / "Slow Logon" concerns from Isilon IO concerns


In a Roaming Profiles environment, where the profile data is stored on Isilon, it is very tempting to start "associating" any symptoms of Slow Boot and Slow Logon with Isilon. However, they could be completely unrelated, and could be the result of "as implemented" constraints in the Group Policies applied to VDI instances.

It is advised to do top-down research of Slow Boot / Slow Logon symptoms, with the support of the customer's End User Computing (EUC) team. It is beneficial to align the particular Group Policy implementation with the best practices outlined in, for example, Group Policy and Logon Impact - Group Policy Team Blog - Site Home - TechNet Blogs.


One approach to analysing "Slow Boot" / "Slow Logon" symptoms is to use tools like GPO Exporter ( http://sdmsoftware.com/group-policy-management-products/group-policy-exporter/ ):


05_hbase_apache_spark_isilon_onefs_emc_shared_storage_keywords_work_well.png


It could uncover, for example, GPOs that trigger external scripts to be run, WMI filters to be executed, and so on. Those could be taking up valuable time, even without any throughput to the NAS.


The following article, as of time of writing, has one of the most complete reference lists for Group Policy best practices in VDI Environment: Group Policy – VDI, Best Practices and Tools | Virtually Virtuoso


10) ...ensure Active Directory Domain Controllers are sized appropriately


Usually, cross-checking for a bottleneck in the Domain Controller comes absolutely last in troubleshooting priorities. However, it is always worth a quick inquiry with the Security Team as to whether the DC is sized according to the load expected during the logon time of the VDI environment. Along with "traditional" Windows logins, due to the presence of the Home Directory on an SMB share, there's a second "set" of Isilon-originated requests to the DC that has to be factored in.


If the DC is a virtual machine and runs in a shared resource pool filled with siblings, somewhere in an under-provisioned separate Management Cluster, the CPU Ready and vRAM Balloon size would be good to check. Simply put, the more RAM the DC has, the better.

To maximize the scalability of the server, the minimum amount of RAM should be the sum of the current database size, the total SYSVOL size, the operating system recommended amount, and the vendor recommendations for the agents (antivirus, monitoring, backup, and so on). The most common DC VM size in the field is around 4 vCPU / 8 GB vRAM, with the aspiration of serving around 10,000 users.
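The RAM rule above is a plain sum, so it is easy to sanity-check a DC VM against it. In the sketch below all component sizes are made-up placeholders; the real figures have to come from the customer's Security / Active Directory team and from the vendors of the agents in use.

```python
def minimum_dc_ram_gb(database_gb, sysvol_gb, os_recommended_gb, agents_gb):
    """Minimum RAM: database size + SYSVOL size + OS recommendation + agents."""
    return database_gb + sysvol_gb + os_recommended_gb + agents_gb


def ram_headroom_gb(provisioned_gb, minimum_gb):
    """Positive when the VM has more RAM than the minimum, negative otherwise."""
    return provisioned_gb - minimum_gb


# Placeholder inputs: 3 GB database, 0.5 GB SYSVOL, 2 GB for the OS,
# 1.5 GB for antivirus / monitoring / backup agents.
minimum = minimum_dc_ram_gb(3.0, 0.5, 2.0, 1.5)
print(minimum, ram_headroom_gb(8.0, minimum))
```

With these placeholder values an 8 GB vRAM DC has only 1 GB of headroom over the 7 GB minimum, before accounting for any ballooning in a shared resource pool.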

The two recommended reading whitepapers related to DC sizing are:

...and the latter has a dedicated part that covers the "endorsed" maximum deviations from the Capacity Planning best practices under Monitoring For Compliance With Capacity Planning Goals chapter.


11) ...consider whether paths set at the Active Directory level correlate with paths at the SyncIQ policy level


Home Directories are defined using a UNC path and commonly point to a particular user's sub-folder in a wide share using some variables, e.g. %username%:

07_isilon_vdi_vmware_citrix_view_horizon_keywords.png

Source: http://blogs.technet.com/b/askds/archive/2008/06/30/automatic-creation-of-user-folders-for-home-roaming-profile-and-redirected-folders.aspx


Groups of users in Active Directory are managed in Organizational Units, and a lot of bulk changes are done using PowerShell scripts against the DC. On Isilon's end, the RPO and RTO requirements for different types of users are driven by SyncIQ policies, which are defined at a per-directory path level:

08_isilon_vdi_vmware_citrix_view_horizon_keywords.png

Note: it is not smart to replicate /ifs entirely, just saying.


So, if there are multiple user groups that map to various business continuity levels (RPO/RTO), then the correlation between SyncIQ policies (and the paths they protect) and the paths defined for users in Active Directory should be part of the design.
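A design review can check this correlation mechanically. The sketch below is one possible approach under stated assumptions: the share-to-path mapping, the policy names and the source roots are all hypothetical inputs that would in practice be exported from the cluster and from Active Directory, and no Isilon or Active Directory API is called.

```python
from pathlib import PurePosixPath


def unc_to_ifs(unc, share_roots):
    """Translate \\\\server\\share\\sub\\dir into the /ifs path behind the share."""
    parts = [p for p in unc.split("\\") if p]
    share, rest = parts[1], parts[2:]  # parts[0] is the server name
    return PurePosixPath(share_roots[share.lower()], *rest)


def covering_policy(ifs_path, policies):
    """Name of the policy with the deepest source root containing ifs_path, or None."""
    best_name, best_depth = None, -1
    for name, root in policies.items():
        root_path = PurePosixPath(root)
        if ifs_path == root_path or root_path in ifs_path.parents:
            if len(root_path.parts) > best_depth:
                best_name, best_depth = name, len(root_path.parts)
    return best_name


# Hypothetical inputs for illustration only.
SHARE_ROOTS = {"home": "/ifs/data/home"}
POLICIES = {
    "gold-rpo": "/ifs/data/home/executives",
    "silver-rpo": "/ifs/data/home/staff",
}

for unc in (r"\\isilon\home\staff\jdoe", r"\\isilon\home\contractors\bob"):
    print(unc, "->", covering_policy(unc_to_ifs(unc, SHARE_ROOTS), POLICIES))
```

A result of None for a Home Directory path, as for the hypothetical contractor above, flags a user whose data is not covered by any replication policy.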


12) ...consider changing default number of SyncIQ workers on larger clusters


SyncIQ uses a distributed, multi-worker policy execution engine to take advantage of aggregate CPU and networking resources across the cluster. The "as shipped" default limits of SyncIQ, as of writing, are available in the Technical Specifications Guide for OneFS 7.2 ( https://support.emc.com/docu56230_Isilon-OneFS-7.2-Technical-Specifications-Guide.pdf ). Note that the following parameters could be changed:


  • SyncIQ: Workers (Per Node) - 3 (could be changed)
  • SyncIQ: Workers (Per policy) - 40 (could be changed)


The 40 workers per policy figure was an empirically tested best-practice number. Since the number of SMB connections per node would be one of the prevailing sizing drivers for Home Directories and VDI workloads, wide clusters exceeding 18 or 20 nodes would not be unusual. The median file size, especially in the case of Folder Redirection for AppData in a Roaming Profile environment, could be small, and the file count could be extremely high. Hence, the number of workers per node and per policy may need to be changed.
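To see why wide clusters run into the defaults, here is a small sketch. It assumes that the two limits combine as a simple minimum (workers per node times node count, capped by the per-policy limit); that interaction is an assumption for illustration and should be confirmed against the SyncIQ best practices document for the OneFS version in use.

```python
def effective_workers(nodes, per_node=3, per_policy_cap=40):
    """Workers available to one policy, assuming the two limits combine as a minimum."""
    return min(nodes * per_node, per_policy_cap)


def nodes_where_cap_binds(per_node=3, per_policy_cap=40):
    """Smallest node count at which the per-policy cap becomes the limiting factor."""
    return per_policy_cap // per_node + 1


print(nodes_where_cap_binds())           # with the "as shipped" defaults
print(effective_workers(10))             # narrow cluster
print(effective_workers(20))             # wide cluster, capped
print(effective_workers(20, per_policy_cap=80))  # hypothetical raised cap
```

Under this assumption the per-policy cap starts to bind at 14 nodes, so a 20-node cluster gets 40 workers per policy rather than 60 unless the limit is raised.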

 

Please refer to the latest version of the "Best Practices for Data Replication with SyncIQ" document for more information: http://www.emc.com/collateral/hardware/white-papers/h8224-replication-isilon-synciq-wp.pdf and don't hesitate to involve EMC Presales and Isilon CAE team during approximations and design.

 

13) ...comply with Microsoft's support policy for DFS-N as part of Disaster Recovery Strategy

 

Isilon can be used as a Microsoft Distributed File System Namespace (DFS-N) target without any special considerations, just by following the steps outlined in EMC KB 16495 ( https://support.emc.com/kb/16495 ). Designs could be quite sophisticated and may include DFS-N priorities for multi-site deployments, where DFS would include both Prod and DR sites:

 

  • Configure a DFS namespace to create a unified namespace.
  • Configure one namespace folder to have multiple folder targets. Add the shared folder of the DR file server as a second DFS-N folder target. Enable all namespace folder targets, or only one Prod folder target at a time.
  • Configure target priority of DFS-N so that the client refers to the Prod Isilon cluster first. When Prod is not available, the client will be redirected to the DR server. In case of multi-site deployment, it's possible to specify that roaming users be directed to a file server that contains their user data / user profile and that is closest to their physical location.

 

That is all very reasonable, but it is a completely unsupported scenario. Microsoft has a special statement on DFS-N prioritization between a "Prod" and "DR" pair of DFS-R targets, separated by a WAN, in an attempt to implement a "transparent" fail-over for disaster recovery: https://support.microsoft.com/kb/2533009

 

This statement creates a gray area in end-to-end solution supportability, and the customer should understand that the same data inconsistency circumstances could occur if DFS-N priorities are implemented for transparent fail-over between the targets in a SyncIQ pair.

 

14) ...explore more about Mac OS X clients and their integration in Home Folders strategy

 

Apparently, I just didn't want to wrap up a blogpost on 13 items, so this 14th is just for information.

There's a really nice solution for DFS and Home Folders on Mac OS X clients: http://www.acronis.com/sites/default/public_files/product_documentation/How-DFS-Home-Directories-Work-with-ExtremeZ-IP-0…



Seeking Input: if there is a best practice that is not mentioned in this list, please do not delay: contribute it in the comments and I'll update the blog post.