We are seeing some rather odd behavior in our ScaleIO deployment. First, let me describe the setup:
6 Dell FX2 blocks with two FC630 and FD332 nodes each.
Every node is connected by four 10G links to an MLAG group of two Arista DCS-7150S-52-CL switches. Two of the uplinks are used for VM data and two for ScaleIO.
In every FX2 block, one FC630 node runs ESXi 6.5U1 and the other runs Windows 2012R2 Core with Hyper-V. Both contribute to ScaleIO storage, and both are in the same (and only) Protection Domain (PD).
In the FD332 storage nodes we use 10 Dell 1.64 TB SFF SAS spinning drives, with two 800 GB Mixed Use SSDs for RFcache.
The system runs quite smoothly, with satisfactory performance.
Yet there is a problem, and it's quite bad.
When uploading a file to the datastore or replicating a VM to the VMware part of the cluster, the speed never exceeds 16 MB/s per session; if I copy two files simultaneously, the aggregate speed is about 31 MB/s. As you might guess, when moving VMs with disks over 1 TB, such speed becomes a pain. Restoring from a backup (we use Veeam B&R) also hits this limit. All of this applies only to the VMware part of the cluster: with Hyper-V we easily exceed 120 MB/s for the same activities. Once again, the ScaleIO cluster is the same for VMware and Hyper-V, and the hardware is identical down to every P/N.
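To put those numbers in perspective, here is a rough back-of-the-envelope calculation of how long 1 TB takes at each rate. It assumes decimal units (1 TB = 1,000,000 MB) and a constant transfer rate; the function name is just for illustration.

```python
def transfer_hours(size_tb: float, rate_mb_s: float) -> float:
    """Hours needed to move size_tb terabytes at rate_mb_s megabytes per second.

    Assumes decimal units (1 TB = 1,000,000 MB) and a constant rate.
    """
    size_mb = size_tb * 1_000_000
    return size_mb / rate_mb_s / 3600


# 16 MB/s is the per-session ceiling we see on VMware,
# 120 MB/s is what we get on Hyper-V for the same activity.
for rate in (16, 120):
    print(f"{rate} MB/s: {transfer_hours(1.0, rate):.1f} h per TB")
```

So a single 1 TB disk takes roughly 17 hours per session on the VMware side, versus a bit over 2 hours on Hyper-V.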
The most interesting part, however, is that Storage vMotion to the ScaleIO cluster works fine, even from the same source cluster we have been copying VMs from with Veeam!
Here is what we have checked:
Everything network-related: MLAGs, Arista configs, etc.
ESXi configuration: LUN queue depth, vSwitch configs, SVM configuration, etc.
We tried to keep Veeam traffic off the vmk interface by using two proxy VMs (guest OS is Windows 2016): one in the source ESXi cluster (not ScaleIO) and one in the destination (with ScaleIO). In this setup Veeam attaches the snapshot directly to the proxy VM on both source and destination, so traffic never touches vmk; the destination proxy VM writes directly to the ScaleIO LUN. With this setup we see "Highest active time: 100%" in Windows Resource Monitor.
Any help would be greatly appreciated, as we are really stuck with this situation.