OneFS Performance Monitoring and Planning

NOTE: This topic is part of the Uptime Information Hub.


Administrators, application stakeholders, and management teams may want to understand how to measure their current EMC Isilon scale-out network-attached storage (NAS) workloads, because basic performance measurement knowledge can assist in understanding how adding new volumes or workloads can modify a performance profile. Customers who apply this knowledge to their environments can be assured that their "go-live" dates will be more successful in meeting operational needs from day one.


Who Needs Performance Profiling?


Administrators need this to:

  • Understand existing performance and capacity envelopes.
  • Review existing or prior performance-impacting events.
  • Provide a qualitative roll-up of needs and requirements to management.

Application stakeholders need this to:

  • Plan for future growth of existing applications.
  • Assertively query software vendors when there are workload changes due to upgrades or replacements.
  • Provide quantitative requests and set performance expectations for storage administration and management.

Management needs this to:

  • Produce a concise summary of what storage workloads exist, when more are needed, and why.
  • Shorten the funding approval process through confidence in the performance metrics.
  • Understand their storage workloads when working with software application vendors.

What Is Workload Analysis?

Workload analysis consists of reviewing the ecosystem of an application and the storage it lives on. You need to understand the configuration of the cluster, how clients see it, where data lives within it, and the application use cases. You need to determine:

  • How an application works.
  • What the user interactions are with the application.
  • What the network topology is.
  • Workload-specific metrics for networking protocols, disk I/O, and CPU usage.

Determine How an Application Works

  • If an application has a unique dataset, determine if it relies on a database (such as Oracle®), flat files (such as VMware® VMDKs), broad and deeply nested directory trees with few files per directory, or shallowly nested directories with large numbers of files per directory.
  • Determine how the application uses the stored data. For example, does it rely heavily on metadata reads and writes to the dataset, or does it have moderate metadata writes and intense metadata reads, as is the case with electronic design automation (EDA)? (For more information on EDA workflows, see the EMC Isilon Storage Best Practices for Electronic Design Automation white paper.) Are heavy or light data reads and/or writes required? Are the data reads and writes more or less random or sequential?
  • If this is a latency-sensitive application, are there expected timeouts for data requests built into it? Can they be changed? Are there external applications working against the same data that might cause latency problems? For example, FindFirstFile or FindNextFile crawls that do repetitive work are not well-suited to NAS use and need to be investigated.
  • If an application can benefit from caching, how much of the unique dataset will be read once and then reread, and over what periods of time (hourly, daily, weekly, and so on)? This can help in cluster sizing, with respect to L2 cache benefits and additional L3 cache opportunities within OneFS.
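
The caching question above can be explored with a small calculation. The following is a minimal sketch, assuming you can export an access log as (timestamp, path) pairs from the application or from an audit trail; the function name, the sample paths, and the log format are hypothetical and are not part of OneFS.

```python
from datetime import datetime, timedelta

def reread_fraction(accesses, window):
    """Return the share of reads that revisit a file read within `window`.

    `accesses` is an iterable of (timestamp, path) tuples sorted by time.
    """
    last_seen = {}  # path -> time of the most recent read
    total = 0
    rereads = 0
    for ts, path in accesses:
        total += 1
        prev = last_seen.get(path)
        if prev is not None and ts - prev <= window:
            rereads += 1
        last_seen[path] = ts
    return rereads / total if total else 0.0

# Hypothetical sample: one reread inside the hour, one after more than a day.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    (t0, "/ifs/data/a"),
    (t0 + timedelta(minutes=5), "/ifs/data/b"),
    (t0 + timedelta(minutes=20), "/ifs/data/a"),
    (t0 + timedelta(hours=26), "/ifs/data/b"),
]
hourly = reread_fraction(sample, timedelta(hours=1))
weekly = reread_fraction(sample, timedelta(days=7))
print(f"hourly reread share: {hourly:.2f}, weekly reread share: {weekly:.2f}")
```

A reread share that is high over short windows suggests the workload may benefit from cache; a share that only rises over long windows suggests the reread data may have aged out of cache before it is needed again.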

Determine How Users Interact with the Application

This is a little more difficult to profile. Understand what performance numbers users are accustomed to and what they are expecting. You need to determine:

  • If users will interact with the application through direct requests and responses from a flat data structure.
  • If there are efficient parallelized databases or flat file requests to derive a result, or if there are inefficient serialized requests. An example of the latter would be a CAD application that needs to load 10,000 objects from storage before a drawing can be rendered on the user's display.
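
To see why serialized requests matter, it can help to do the arithmetic. The following is a rough sketch, assuming a fixed per-request latency and ignoring transfer time; the 2 ms latency and the concurrency of 16 are illustrative assumptions, not measured values.

```python
import math

def load_time_seconds(objects, latency_ms, concurrency=1):
    """Estimate wall-clock time to fetch `objects` items.

    Requests are issued in rounds of `concurrency`; each round costs one
    round trip of `latency_ms`.
    """
    rounds = math.ceil(objects / concurrency)
    return rounds * latency_ms / 1000.0

# The CAD example: 10,000 objects loaded before the drawing renders.
serial = load_time_seconds(10_000, latency_ms=2)
parallel = load_time_seconds(10_000, latency_ms=2, concurrency=16)
print(f"serialized: {serial:.2f} s, 16-way parallel: {parallel:.2f} s")
```

Under these assumptions the serialized load takes 20 seconds and the parallel load takes 1.25 seconds, which is why small changes in per-request latency are so visible to users of serialized applications.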

Determine the Network Topology

Diagram the network topology completely. Leave nothing out. Pictures can resolve many issues.

  • For a LAN, itemize gear models, speeds, feeds, maximum transmission units (MTUs) per link, layer two and three routing, and expected latencies (confirm).
  • Perform a performance study using the iperf tool (distributed with OneFS) for network performance measurement.
  • For a WAN, itemize providers, topologies, rate guarantees, direct versus indirect pathways, and perform an iperf study.
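
When reviewing iperf results against the itemized link speeds and MTUs, it is useful to know the theoretical ceiling. The following is a minimal sketch, assuming TCP over IPv4 on Ethernet with no header options and no VLAN tag; the function name is hypothetical, and real results will be lower because of latency, window sizes, and host limits.

```python
ETHERNET_OVERHEAD = 38  # bytes per frame: preamble 8 + header 14 + FCS 4 + interframe gap 12
IP_TCP_HEADERS = 40     # bytes: IPv4 header 20 + TCP header 20, no options

def tcp_goodput_ceiling(link_gbps, mtu):
    """Return the best-case TCP payload rate in Gb/s for a link and MTU."""
    payload = mtu - IP_TCP_HEADERS       # application bytes per packet
    on_wire = mtu + ETHERNET_OVERHEAD    # bytes the link actually carries
    return link_gbps * payload / on_wire

standard = tcp_goodput_ceiling(10, 1500)
jumbo = tcp_goodput_ceiling(10, 9000)
print(f"10 GbE ceiling: MTU 1500 = {standard:.2f} Gb/s, MTU 9000 = {jumbo:.2f} Gb/s")
```

If a measured rate falls well below the ceiling for the link, the diagram of MTUs and pathways is the first place to look for a mismatch.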

Take a look at your organization's change control process. Network and storage teams may not communicate clearly, especially in a SAN house that is only beginning to use NAS storage. If something went wrong, find out who made the most recent fix, and determine what that fix changed or touched.

Determine the Storage Stack Performance

You can determine storage stack performance by learning over time what your normal performance is, and how to recognize when it is not normal. All clusters have a unique configuration with a unique dataset and workload, and therefore you are observing a unique result from your ecosystem.
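
One simple way to recognize "not normal" is to compare a new measurement with a recorded baseline. The following is a minimal sketch, assuming you have collected samples of some metric (for example, an average latency) during normal operation; the three-sigma threshold and the sample values are illustrative assumptions, and the right threshold depends on your own ecosystem.

```python
from statistics import mean, stdev

def is_abnormal(baseline, sample, sigmas=3.0):
    """Return True if `sample` is more than `sigmas` standard deviations
    away from the mean of the `baseline` measurements."""
    mu = mean(baseline)
    sd = stdev(baseline)  # needs at least two baseline values
    return abs(sample - mu) > sigmas * sd

# Hypothetical baseline collected during normal operation.
normal_latency_ms = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0]
print(is_abnormal(normal_latency_ms, 5.5))   # within the normal range
print(is_abnormal(normal_latency_ms, 12.0))  # well outside the normal range
```

The baseline is only meaningful for the cluster, dataset, and workload it was collected from.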

Storage Fundamentals

Understand the input/output data rates per node, per node pool, and per network pool, as applicable.


File Server Protocol Operational Rates

Understand the input/output protocol read/write rates as a whole in the workload.

  • Use the isi statistics protocol --nodes all --top --orderby=timeavg command to display performance statistics by protocol.
  • Understand the per-protocol breakouts of the client requests—in particular, the read, write, getattr, setattr, open, close, create, and delete operations.
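
Once you have per-operation counts, a quick ratio shows whether the workload leans toward metadata or data. The following is a minimal sketch, assuming the counts have already been collected into a dictionary; the grouping of operations into "metadata" and "data" and the sample numbers are assumptions for illustration, and the code does not parse isi statistics output.

```python
# Assumed grouping: namespace and attribute operations versus data transfer.
METADATA_OPS = {"getattr", "setattr", "open", "close", "create", "delete"}
DATA_OPS = {"read", "write"}

def op_mix(counts):
    """Return the metadata, data, and other shares of all operations."""
    total = sum(counts.values())
    if total == 0:
        return {"metadata": 0.0, "data": 0.0, "other": 0.0}
    meta = sum(n for op, n in counts.items() if op in METADATA_OPS)
    data = sum(n for op, n in counts.items() if op in DATA_OPS)
    return {
        "metadata": meta / total,
        "data": data / total,
        "other": (total - meta - data) / total,
    }

# Hypothetical operation counts for one sampling interval.
sample_counts = {
    "read": 200, "write": 100, "getattr": 500, "setattr": 50,
    "open": 60, "close": 60, "create": 20, "delete": 10,
}
mix = op_mix(sample_counts)
print(mix)
```

A mix dominated by metadata operations, as in this sample, points toward a different sizing discussion than one dominated by large reads and writes.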

Disk Hardware Latency

This is the sloth of your storage stack. Understand the impact of non-cached workflow and transfer rates (xfers) to disks. The transfer rates will lead to an understanding of how a unique dataset—and the unique use of it—will deliver a unique result.

  • Use the isi statistics drive --nodes all --top --long --orderby timeavg command to display performance statistics by drive, ordered by OpsIn and OpsOut values. Note that this command is not measuring physical disk input/output operations per second (IOPS); it measures software transfer commands to storage only. Disks manage their own physical ordering of these requests, which OneFS does not see or measure in the form of physical I/O operations (IOPS). Mentally adjust the OpsIn and OpsOut fields to reflect that reality.
  • It is very important to profile your workload by using at least the isi statistics commands. Use them to understand how an application drives the workload to and from disks.
  • There are no available hard numbers of transfers to disk that can determine how much is too much to cause performance degradation. OneFS simply does not deliver data from that deep in the storage stack to make this an easy operation. Keep an eye on the Busy%, Queued, TimeInAQ, and TimeAvg columns returned from these commands to make judgments on whether your storage layer is being overwhelmed, according to your performance requirements.
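
Because there are no hard numbers, one approach is to compare each drive with your own baseline. The following is a minimal sketch, assuming the drive statistics have already been captured into dictionaries keyed by the column names above; the drive identifiers, the values, and the "twice the baseline" factor are hypothetical, and the code does not parse command output.

```python
def flag_drives(rows, baseline, factor=2.0):
    """Return {drive: [columns]} for drives whose value in any baseline
    column exceeds `factor` times the baseline for that column."""
    flagged = {}
    for row in rows:
        over = [col for col, base in baseline.items() if row[col] > factor * base]
        if over:
            flagged[row["Drive"]] = over
    return flagged

# Hypothetical baseline from a period you consider normal.
baseline = {"Busy%": 25.0, "Queued": 1.0, "TimeInAQ": 2.0, "TimeAvg": 3.0}

# Hypothetical current samples for two drives.
rows = [
    {"Drive": "1:1", "Busy%": 30.0, "Queued": 1.2, "TimeInAQ": 2.5, "TimeAvg": 3.1},
    {"Drive": "1:2", "Busy%": 85.0, "Queued": 6.0, "TimeInAQ": 3.0, "TimeAvg": 9.5},
]
hot = flag_drives(rows, baseline)
print(hot)
```

The factor is a judgment call that should follow from your performance requirements, not a OneFS limit.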


Node CPU Utilization


Use the isi statistics system --nodes --top command to display statistics for CPU performance on individual nodes.


Understand that CPU is not a leading indicator of a workload; it is a result of a workload. CPU is an important consideration in sizing against your existing gear, but it is not useful to otherwise profile your workload.


Note that writes affect CPU much more than reads do. You will need to work out your write plan and calculate the forward error correction (FEC) overhead.
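
The FEC calculation can be approximated with simple ratios. The following is a rough sketch, assuming a stripe of N data units protected by M FEC units; the 8+2 layout is an illustrative assumption, and the actual layout on a cluster depends on the protection policy, the number of nodes, and the file size.

```python
def fec_write_amplification(data_units, fec_units):
    """Return units written to disk per unit of application data."""
    return (data_units + fec_units) / data_units

def fec_capacity_overhead(data_units, fec_units):
    """Return the share of each stripe consumed by FEC units."""
    return fec_units / (data_units + fec_units)

# Hypothetical stripe: 8 data units protected by 2 FEC units.
amplification = fec_write_amplification(8, 2)
overhead = fec_capacity_overhead(8, 2)
print(f"written per unit of data: {amplification:.2f}x, FEC share: {overhead:.0%}")
```

Under these assumptions every write carries 25 percent additional data that must be computed and stored, which is part of why writes cost more CPU than reads.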


When to Analyze Your Workloads


Analyze your workload performance after a disruption or any time performance changes, for example after an application upgrade, when new functionality is enabled, or after a migration to or from a new pool. For help with analyzing your workload results, contact your EMC or Partner account team.