(Please also take a look at the SMB throughput benchmarking with fio.exe article.)

Performance benchmarking of Isilon clusters during Proofs of Concept (POCs) is requested by customers quite often, much more so than "simple" integration and functionality testing:


[Image] Source: Jet Engine Test • 60 Seconds Of Screaming Power - YouTube


This article explains how 'flexible IO' (fio, pronounced "fa-yo") utility benchmark testing is done in EMC Isilon's Technology Marketing Engineering (TME) Performance team lab, whose results act as input for the Isilon Sizing Tool. This post suggests how to build and run the fio benchmark in a way that can come close to the Isilon Sizing Tool results during in-field POCs, with a controllable Client:Node:Thread ratio.


Scripts on GitHub


A few words on POCs

When System Engineers are involved in POCs, the intended workload or the customer's actual application should be tested whenever possible. This provides "real solution" performance results rather than approximations, which may be irrelevant to the customer and may therefore misrepresent the Isilon cluster's performance capabilities.

 

A few words on Sizing

The pre-sales motion of EMC Isilon storage includes working with the Sizing Tool, available to EMC employees and EMC Business Partners at http://isilon-sizing-tool.herokuapp.com/.

 

The accepted best practice of Isilon sizing is to address 70...80% of the initial performance requirements and assumptions during the first iteration of collaboration with the customer, followed by a review session. During the review, System Engineers uncover additional details about workflows, plans for the solution lifecycle, and implementation waves that would cause unusual load (ingest, bulk copy, etc.). For example, half a year after cluster implementation there could be different read/write ratios, average file sizes, directory depths and access patterns due to the growing adoption of Isilon, and those could impact performance if not factored into the performance roadmap upfront.

 

Cluster-side and Client-side settings

 

The EMC Isilon TME Performance Team uses the following default cluster configuration for all tests, on all versions of OneFS and hardware:


  • Isilon nodes have 2 x 10GbE NICs configured in the same subnet (no LAG), with MTU 1500;
  • Clients have 2 x 10GbE Intel X520 NICs, and in CentOS the Large Receive Offload (LRO) is disabled in the driver;
  • The coalescer, marketed as SmartCache, is on;
  • "Read Transfer Size" on the Isilon NFS export side is set to 131072 (the default value);
  • "Write Transfer Size" on the Isilon NFS export side is set to 524288 (the default value);
  • Mounts from the CentOS 6.x clients, however, are done using a 1024k rsize and a 1024k wsize (see the example mount below).
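
For reference, a client mount matching these settings could look like the following (a sketch; the node address and mount point are taken from the example mapping used later in this article):


[root@LinServer1 ~]# mount -o rsize=1048576,wsize=1048576 10.111.158.206:/ifs/data /mnt/isilon
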


Important Recommendations For Performance


Here are the most important general recommendations for laying out the test data and environment:

  • Prior to testing, try to establish the realistic network throughput achievable between all endpoints involved in the POC. Isilon OneFS ships with the iperf client/server utility pair, so test all download and upload paths between the IPs one by one (see the example below).
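
For example (a sketch reusing the example addresses from later in this article), start the server side on an Isilon node and then test from each Linux client in turn:


# on the Isilon node

iperf -s

# on a Linux client, towards that node

[root@LinServer1 ~]# iperf -c 10.111.158.206 -t 30
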
  • One of OneFS's unique features is the ability to set the protection level on a per-file or per-directory basis. Thus, it is important to ensure that the required protection levels (e.g. +2d:1n) or mirroring (e.g. 3x) are a relevant part of the test scenario. They can be set either via the OneFS WebUI, or by executing a command:


isi set -p +2:1 /ifs/data/YOUR_DIRECTORY_OF_TESTS


  • Prefer the "Streaming" setting on the directory of the NFS export on the Isilon side, to leverage greater read-ahead (adaptive prefetch) for reads, and to rotate across different drives between stripes when writing. This can be done either in the GUI, or via execution of a command:


isi set -l streaming -a streaming /ifs/data/YOUR_DIRECTORY_OF_TESTS


...to set the data layout to "Default" and access pattern to "Random", it would be:


isi set -l default -a random /ifs/data/YOUR_DIRECTORY_OF_TESTS


  • In both examples above, simply setting the data layout or protection level does not restripe existing data, hence it is assumed that cleanups are done between tests. If, for some reason, it is required to restripe an existing dataset, "-rR" should be appended to recursively restripe the contents of the selected directory. This operation is impactful and much slower than the cleanup approach:


isi set -rR –l default –a random /ifs/data/YOUR_DIRECTORY_OF_TESTS


  • If the POC terms require testing with multiple parallel Linux clients, prefer to start from a "1:1:1" Client:Node:Thread ratio profile.
  • Isilon delivers its best throughput on Sequential Reads and Sequential Writes with block sizes of 32kB and higher, as these are the known "best fit" workloads for Isilon. Any block size and access pattern from 4KB upwards can, of course, be tested as well. Do not forget that the OneFS block size is 8KB.
  • Prefer tests with file sizes of 12GB and larger for duration and sustainability. EMC Isilon TME runs tests with 50GB files.
  • Prefer physical servers as NFS clients. Otherwise, set up RAM reservations and "High" CPU shares for the Virtual Machines, and isolate them from other workloads in a top-level Resource Pool in the VMware hierarchy.
  • Flush the caches on both the NFS clients and the EMC Isilon cluster, to get unbiased disk throughput numbers between tests.
    • On the Linux client: on the majority of distributions, sync followed by a write to /proc/sys/vm/drop_caches flushes the filesystem buffers and drops the page cache; alternatively, unmount the NFS export and re-mount it afterwards, e.g.:


sync && echo 3 > /proc/sys/vm/drop_caches


    • EMC Isilon cluster service command:


isi_for_array isi_flush


...to purge the L1 and L2 caches. On clusters running OneFS 7.1.1 and newer it is also possible to purge the L3 cache, although this is not advisable, since the metadata and small random-read blocks evicted from L2 to the SSDs would be lost:


isi_for_array isi_flush --l3-full



Isilon FIO Harness for Client:Node ratio NFS benchmarking


For field testing of controllable Client:Node:Thread ratios, a single virtual machine, also referred to as the "harness", is required; it connects to the physical Linux clients participating in the test and distributes the commands to them. Neither DNS infrastructure nor EMC Isilon SmartConnect functionality is required. SmartConnect Advanced testing is assumed to be part of the default POC scenario anyway.


The overview of the 4-node EMC Isilon cluster benchmarking set-up, including the harness server and four (4) NFS clients, is as follows:

Figure 1 - a four-node Isilon cluster example with all control scripts stored on Isilon and mounted as /mnt/isilon on the harness

Figure 1 depicts the following components:


  • Isilon Cluster
    • the root user is advised to be configured with the one-letter password 'a', for simplicity;
    • Disable "root squashing" on the NFS export, so that all commands during the performance benchmark can be executed by the root user, for simplicity.
  • /ifs/data/fiotest - this folder on OneFS will be used as the NFS export on Isilon;
    • It has to be R/W for the NFS 'root' user connecting from the client nodes; per-client folders will be created inside it for the temporary files used during the tests. The advice is to chmod 777 it (a sketch of these commands follows this list) before creating the trusted.key and trusted.key.pub files, which require much narrower POSIX permissions.
    • fiojob_1024k_randread - a fio job description file, defining the IO pattern, the size of the temp file to be created, and so on.
    • fiojob_1024k_seqwrite_5t - a fio job description file defining the read/write pattern, and also the 5 threads per client.
  • control/ - sub-folder storing the control files for the set-up:
    • cleanup_remount_1to1.sh - script that does housekeeping after runs and re-mounts the exports;
    • nfs_hosts.list - list of the client servers participating in the test, in <IP_of_Linux_Server>|<IP_of_Isilon_Node> format;
    • run_nfs_fio_1024k_randread.sh - bash file pointing to the corresponding fio job;
    • run_nfs_fio_1024k_seqwrite_5t.sh - bash file pointing to the thread-controlled fio job;
    • trusted.key and trusted.key.pub - generated private and public keys, used to avoid entering passwords while distributing commands.
  • LinServer0...4 - standard Linux servers with the fio package installed.
    • LinServer0 is the "control" server, also referred to as a "harness server".
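
As a sketch of the cluster-side preparation referenced in the list above (the path matches Figure 1; adjust to your environment), the test directory can be created and opened up from any Isilon node before the key files are placed in it:


# mkdir -p /ifs/data/fiotest

# chmod 777 /ifs/data/fiotest
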


To prepare the Linux servers to act as NFS clients, the fio package must be installed, either from a repository or by downloading, building and installing it from sources. The simplest way to install fio on a modern Linux distribution is from the RepoForge repository. For Red Hat Enterprise Linux (RHEL) or CentOS, one could try:


[root@LinServer0 ~]# yum install fio


If the package is not found, one could follow the steps outlined at http://repoforge.org/use/ and http://wiki.centos.org/AdditionalResources/Repositories/RPMForge to install the latest RepoForge .rpm first, and then install the fio package. If LinServer1...4 are allowed to be Virtual Machines, one could clone them after fio is installed on LinServer0.
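
If neither the distribution repositories nor RepoForge are available, fio can also be built from source. The following is only a sketch for CentOS; the package list and the upstream repository URL are assumptions, not part of the TME setup:


[root@LinServer0 ~]# yum install -y gcc make git libaio-devel zlib-devel

[root@LinServer0 ~]# git clone https://github.com/axboe/fio.git

[root@LinServer0 ~]# cd fio && ./configure && make && make install
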

After LinServer0...4 are prepared and have fio installed, the LinServer0 root user should mount Isilon's NFS export. The other servers will be mounted with a script afterwards.


[root@LinServer0 ~]# mkdir -p /mnt/isilon

[root@LinServer0 ~]# mount fiotest:/ifs/data /mnt/isilon


Check whether the mount operation succeeded by:


[root@LinServer0 ~]# cd /mnt/isilon


and create a control folder:


[root@LinServer0 ~]# mkdir -p /mnt/isilon/fiotest/control


The next step is to generate the key pair used for authentication: a trusted private key (trusted.key) and a public key (trusted.key.pub):


[root@LinServer0 ~]# ssh-keygen -t dsa


Follow the interactive wizard by hitting "Enter" several times to accept the default values, specifying /mnt/isilon/fiotest/control/trusted.key as the destination path for the newly generated keys.
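
Alternatively, the same key pair can be generated non-interactively; this sketch simply pre-answers the wizard's questions with the -f (output file) and -N (empty passphrase) options:


[root@LinServer0 ~]# ssh-keygen -t dsa -N "" -f /mnt/isilon/fiotest/control/trusted.key
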

Next, the keys need to be copied to all NFS clients participating in the test, as well as to one of the nodes in the Isilon cluster. The connection to the cluster will be needed to request cache flushing before each run.

For the Isilon node, run:


[root@LinServer0 ~]# ssh-copy-id -i /mnt/isilon/fiotest/control/trusted.key.pub IP.OF.ISILON.NODE


For the rest of the client nodes, create nfs_hosts.list, which will be used by nfs_copy_trusted.sh and the other scripts.


[root@LinServer0 ~]# cd /mnt/isilon/fiotest/control/

[root@LinServer0 control]# vi nfs_hosts.list


nfs_hosts.list

Enter the IP addresses or hostnames of the LinServer1...4 servers on separate lines, each paired with an Isilon node for a 1:1 mapping. For other mappings, repeat the Isilon nodes (for a Client > Node ratio) or repeat the clients (for a Client < Node ratio) on additional lines; a second example follows the 1:1 listing below.

Two important notes:


1) Do not include the IP of the control ("harness") server here!

2) Do not mix up the order: the Linux client hostname comes first, then the pipe, then the Isilon node.


10.111.158.196|10.111.158.206

10.111.158.197|10.111.158.207

10.111.158.198|10.111.158.208

10.111.158.213|10.111.158.209
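
For illustration, a hypothetical 2:1 Client:Node mapping built from the same example addresses would simply repeat each Isilon node on a second line:


10.111.158.196|10.111.158.206

10.111.158.197|10.111.158.206

10.111.158.198|10.111.158.207

10.111.158.213|10.111.158.207
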



Now that the Client:Node mapping is defined, let's prepare the following scripts:


nfs_copy_trusted.sh

It is only used once, to copy the trusted keys to the Linux clients, just as was done manually for the Isilon node above.


#!/bin/bash

#apart from running this script, copy trusted file to the Isilon node

# of choice that would be used to clear cache when running the fio job

# by the same command as below

## the rest of the file is similar to most of the other scripts

## first go through all lines in nfs_hosts.list

for i in $(cat /mnt/isilon/fiotest/control/nfs_hosts.list) ; do

# then split each line read into an array on the pipe symbol

IFS='|' read -a pairs <<< "${i}";

# do the ssh-copy-id for putting the certificate to remote host

ssh-copy-id -i /mnt/isilon/fiotest/control/trusted.key.pub ${pairs[0]}

done


cleanup_remount_1to1.sh

This script is used after every test run. It connects to the LinServers listed in nfs_hosts.list, does the housekeeping, and then re-mounts them to the Isilon cluster according to the mapping. The activities include deletion of all temp files on Isilon, a remount of the export, and re-creation of the per-client temp folder.


#!/bin/bash

# first go through all lines in nfs_hosts.list

for i in $(cat /mnt/isilon/fiotest/control/nfs_hosts.list) ; do

# then split each line read into an array on the pipe symbol

IFS='|' read -a pairs <<< "${i}";

# show back the mapping

echo "Client host: ${pairs[0]}  Isilon node: ${pairs[1]}";

# connect over ssh with the key and mount hosts, create directories etc. - has to be single line

ssh -i /mnt/isilon/fiotest/control/trusted.key ${pairs[0]} -fqno StrictHostKeyChecking=no "rm -rf /mnt/isilon/fiotest/${pairs[0]}; sleep 1; umount -fl /mnt/isilon; sleep 7; mkdir -p /mnt/isilon; sleep 5; mount -o wsize=1048576,rsize=1048576 ${pairs[1]}:/ifs/data/ /mnt/isilon/; sleep 7; mkdir -p /mnt/isilon/fiotest/${pairs[0]}";

# erase the array pair

unset pairs ;

# go for the next line in nfs_hosts.list;

done

 

Please note that mounting/unmounting requires root privileges, since sudo does not work for ssh-encapsulated instructions, as it requires a prompt for elevation.


run_nfs_fio_1024k_randread.sh

This is the instruction for the Linux nodes to execute a particular fio job, which is specified separately in another file, fiojob_1024k_randread. That job file is kept outside the control directory purely for operational convenience: it is easier to tab-complete the launcher scripts when multiple concurrent jobs need to be chained with '&&' from within the control folder.


A few points to note:

1) The Isilon TME Performance team uses 50GB files.

2) For random workloads, the "Random" data access pattern should be set on the test directory; use "Streaming" for sequential workloads.

3) For mixed random workloads, "randrw" is never used in the Isilon TME Performance lab; instead, a 100% random read test and a 100% random write test are launched concurrently by chaining the jobs with '&&' (see the launch example below).
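
For the concurrent read-plus-write scenario mentioned in point 3, the launcher scripts could be chained from the harness. This is only a sketch: run_nfs_fio_1024k_randwrite.sh is a hypothetical second launcher built exactly like the random-read one below, and the chaining is effectively concurrent because each launcher backgrounds its per-client ssh sessions with -f and returns quickly:


[root@LinServer0 fiotest]# ./control/run_nfs_fio_1024k_randread.sh && ./control/run_nfs_fio_1024k_randwrite.sh
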


The cache-flush commands in the script below (the ssh call running isi_flush, and the sync && echo 3 > /proc/sys/vm/drop_caches part of the per-client command) purge cache on Isilon and on each of the Linux clients:


#!/bin/bash

#first, connect to the first isilon node, and flush cache on array

echo "Purging L1 and L2 cache first";

ssh -i /mnt/isilon/fiotest/control/trusted.key 10.111.158.206 -fqno StrictHostKeyChecking=no "isi_for_array isi_flush";

# wait for cache flushing to finish, normally around 10 seconds is enough

# on larger clusters, sometimes up to a few minutes should be allowed!

sleep 10;

# the L3 cache purge is not recommended, as the metadata and blocks accelerated by SSDs would be evicted; uncomment the lines below if it is still required

#echo "On OneFS 7.1.1 clusters and newer, running L3, purging L3 cache";

#ssh -i /mnt/isilon/fiotest/control/trusted.key 10.63.208.64 -fqno StrictHostKeyChecking=no "isi_for_array isi_flush --l3-full";

#sleep 10;

# the rest is similar to the other scripts

# first go through all lines in nfs_hosts.list

for i in $(cat /mnt/isilon/fiotest/control/nfs_hosts.list) ; do

# then split each line read into an array on the pipe symbol

IFS='|' read -a pairs <<< "${i}";

# connect over ssh with the key, flush the client cache and launch fio - has to be a single line

# sync flushes dirty buffers to disk; echo 3 > /proc/sys/vm/drop_caches drops the page cache

# the fio job file referenced below sits one level above the control directory

ssh -i /mnt/isilon/fiotest/control/trusted.key ${pairs[0]} -fqno StrictHostKeyChecking=no "sync && echo 3 > /proc/sys/vm/drop_caches; FILENAME=\"/mnt/isilon/fiotest/${pairs[0]}\" fio --output=/mnt/isilon/fiotest/fioresult_1024k_randread_${pairs[0]}.txt /mnt/isilon/fiotest/fiojob_1024k_randread";

done


Next, move one folder up, back to /mnt/isilon/fiotest/ and create the corresponding fio job file.


[root@LinServer0 control]# cd /mnt/isilon/fiotest/

[root@LinServer0 fiotest]# vi fiojob_1024k_randread


Please note that for read testing, every thread pre-creates its own "read target" temp file. There is also a way for multiple threads to read from (or write to) a single shared file, by specifying it with the filename= option. Refer to the fio documentation for greater detail.
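
As a sketch of that shared-file variant (the file name here is illustrative), the [global] section of a job file would keep the per-client directory but add an explicit filename=, so every thread operates on the same target:


; shared-file variant - all threads hit the same file inside ${FILENAME}

directory=${FILENAME}

filename=shared_target.tmp
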


fiojob_1024k_randread

The following file would execute a 100% random read with a 1M block size over a 36GB working set.


; --start job file --

[global]

description=-------------THIS IS A JOB DOING ${FILENAME} ---------

directory=${FILENAME}

rw=randread

size=36G

bs=1024k

zero_buffers

direct=0

sync=0

refill_buffers

ioengine=sync

iodepth=1

[1024k_randread]

; -- end job file --



There are many other possible options for fio jobs; they can be seen in the fio manual pages by running man fio, or online at fio(1): flexible I/O tester - Linux man page.


By now, all files are in place, and one may execute the test.


[root@LinServer0 fiotest]# ./control/run_nfs_fio_1024k_randread.sh


It is possible to observe the test using Isilon's WebUI, which updates every 5 seconds.


Adding 'Thread' into the Client:Node:Thread ratio


When one needs to run benchmarking tests with more threads per Isilon node (N:1, where N>1), this can easily be done by specifying "numjobs" in the fio job files. For example, the following job would run 5 sequential-write threads from every Linux client host:


fiojob_1024k_seqwrite_5t


; --start job file --

[global]

description=-------------THIS IS A JOB DOING ${FILENAME} ---------

directory=${FILENAME}

rw=write

size=36G

bs=1024k

zero_buffers

direct=0

sync=0

refill_buffers

ioengine=sync

iodepth=1

numjobs=5

[1024k_seqwrite_5t]

; -- end job file --


Useful: Limiting per-Thread throughput


In some Media & Entertainment workloads, it is often required to provide evidence of stable per-thread throughput at a particular bandwidth.


fiojob_1024k_randread_5t_7500kB

The following file would execute 5 threads per Linux client, 100% read IO, at 7.5 MB/s per thread, with a 1M block size and 12GB per thread.


; --start job file --

[global]

description=-------------THIS IS A JOB DOING ${FILENAME} HOST---------

directory=${FILENAME}

rw=read

size=12G

rate=7500k

numjobs=5

bs=1024k

zero_buffers

direct=0

sync=0

refill_buffers

ioengine=sync

iodepth=1

[1024k_randread_5t_at7500kB]

; -- end job file --


Collecting Isilon NFS Throughput Results


During the test, to collect the NFS protocol total throughput statistics from the Isilon cluster into a comma-separated file for further analysis, log in to any Isilon node and execute the following command, which loops with a 5-second interval:


# isi statistics protocol --protocols=nfs3 --totalby=Proto --csv --noheader --i=5 --r=-1 >> /ifs/data/fiotest/nfs_1024k_describe_test_further.csv


One could use screen or any other tool to detach from the command, come back to it on completion of the test, and then interrupt it, e.g. by hitting Ctrl+C.
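
A minimal screen workflow (assuming the screen utility is available on the node; alternatively, run it on the harness wrapped around an ssh session to the node) could be:


# screen -S fiostats

...start the isi statistics command above inside the session, then detach with Ctrl+A followed by D. When the test completes, re-attach and stop the collection with Ctrl+C:

# screen -r fiostats
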


The comma-separated file can then be imported into, e.g., Microsoft Excel for scatter plotting and for finding the maximum, average and median values of the total cluster throughput. The "missing header" (since --noheader was specified above) is:

Ops (N/s), In (B/s), Out (B/s), TimeAvg (us), TimeStdDev (us), Node, Proto, Class, Op
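
If it is more convenient to have the header inside the file before importing it, it can be prepended from the harness over the NFS mount (a sketch; GNU sed syntax, and the column order assumes the layout shown above):


[root@LinServer0 ~]# sed -i '1i Ops(N/s),In(B/s),Out(B/s),TimeAvg(us),TimeStdDev(us),Node,Proto,Class,Op' /mnt/isilon/fiotest/nfs_1024k_describe_test_further.csv
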


The way the results are collected by the Isilon TME Performance Lab team is as follows.

  • Identical tests of the same block size and read/write pattern are run 3 times, separated by some amount of time. For example, after testing 512kB Random Write, a 128kB Sequential Read test follows, then a few other tests, then 512kB Random Write again. This is done in an attempt to smooth out any systemic problems (e.g. related to network congestion) that could otherwise skew a single run;
  • The results are trimmed of the start-up and ending "tails", where sampling happened but no actual throughput was applied;
  • The median results are selected and entered as official results into the empirical data used by the Isilon Sizing Tool.

 

Good luck with benchmarking!