Over the course of the last several articles we’ve looked at some of the new features of the OneFS 8.2 release. In this one we’ll explore the topic of dataset performance monitoring and the new functionality that 8.2 brings to the table in this regard.


As clusters grow and more competing workloads place demands on system resources, greater visibility is needed in order to share cluster resources equitably. OneFS partitioned performance monitoring helps define, monitor, and react to performance-related issues on the cluster. Administrators can pinpoint resource consumers and identify rogue workloads or users that consume excessive system resources.


With the addition of SMB protocol support in OneFS 8.2, partitioned performance monitoring can now be used to define workloads (any combination of directories, shares, users, clients, and access zones) and view the associated performance statistics (protocol and disk ops, read/write bandwidth, CPU, IOPS, etc.). Customized settings and filters can be configured so that a dataset matches only the workloads that meet the required criteria, and reported statistics are refreshed every 30 seconds. Workload monitoring is also key for show-back and charge-back resource accounting.

 

| Category | Description | Example |
|---|---|---|
| Workload | A set of identification metrics and resource consumption metrics. | {username:bob, zone_name:System} consumed {cpu:1.2s, bytes_in:10K, bytes_out:20M, …} |
| Dataset | A specification of identification metrics to aggregate workloads by, and the workloads collected that match that specification. | {username, zone_name} |
| Filter | A method for including only workloads that match specific identification metrics. | {zone_name:System} |
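
To make the relationship between these three terms concrete, here is a minimal Python sketch. It is not how OneFS implements anything; the sample records, field names, and the `aggregate` and `matches` helpers are all hypothetical. It only illustrates the idea that a dataset names the identification metrics to group by, a workload is one resulting group with its summed resource consumption, and a filter restricts which records are considered. For simplicity, records that fail the filter are skipped here rather than being summed into a separate bucket.

```python
from collections import defaultdict

# Hypothetical per-operation samples; field names and values are illustrative only.
RESOURCE_FIELDS = ("cpu_us", "bytes_in", "bytes_out")

samples = [
    {"username": "bob", "zone_name": "System", "cpu_us": 700, "bytes_in": 4000, "bytes_out": 100},
    {"username": "bob", "zone_name": "System", "cpu_us": 500, "bytes_in": 6000, "bytes_out": 300},
    {"username": "mary", "zone_name": "System", "cpu_us": 200, "bytes_in": 1000, "bytes_out": 900},
    {"username": "bob", "zone_name": "Quarantine", "cpu_us": 30, "bytes_in": 15, "bytes_out": 12},
]


def matches(sample, filter_spec):
    """Return True if every identification metric in the filter equals the sample's value."""
    return all(sample.get(metric) == value for metric, value in filter_spec.items())


def aggregate(samples, dataset_metrics, filter_spec=None):
    """Group samples into workloads keyed by the dataset's identification metrics."""
    workloads = defaultdict(lambda: dict.fromkeys(RESOURCE_FIELDS, 0))
    for sample in samples:
        if filter_spec is not None and not matches(sample, filter_spec):
            continue  # simplified: non-matching samples are just skipped in this sketch
        key = tuple(sample[metric] for metric in dataset_metrics)
        for field in RESOURCE_FIELDS:
            workloads[key][field] += sample[field]
    return dict(workloads)


dataset = ("username", "zone_name")
all_workloads = aggregate(samples, dataset)
system_only = aggregate(samples, dataset, filter_spec={"zone_name": "System"})

print(all_workloads[("bob", "System")])
print(sorted(system_only))
```
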

 

Each resource listed below is tracked by certain stages of partitioned performance monitoring to provide statistics within a performance dataset, and for limiting specific workloads.


Resource Name

Definition

First Introduced

CPU Time

Measures CPU utilization. There are two different measures of this at the moment; raw measurements are taken in CPU cycles, but they are normalized to microseconds before aggregation.

OneFS 8.0.1

Reads

A count of blocks read from disk (including SSD). It generally counts 8 KB file blocks, though 512-byte inodes also count as a full block. These are physical blocks, not logical blocks, which doesn’t matter much on the read side but is important when analyzing writes.

OneFS 8.0.1

Writes

A count of blocks written to disk; or more precisely, to the journal. As with reads, 512-byte inode writes are counted as full blocks; for files, 8 KB blocks. Since these are physical blocks, writing to a protected file will count both the logical file data and the protection data.

OneFS 8.0.1

L2 Hits

A count of blocks found in a node’s L2 (Backend RAM) cache on a read attempt, avoiding a read from disk.

OneFS 8.0.1

L3 Hits

A count of blocks found in a node’s L3 (Backend SSD) cache on a read attempt, replacing a read from disk with a read from SSD.

OneFS 8.0.1

Protocol Operations

SMB Only in OneFS 8.2

  • For SMB 1, this is the number of ops (commands) on the wire with the exception of the NEGOTIATE op.
  • For SMB 2/3 this is the number of chained ops (commands) on the wire, with the exception of the NEGOTIATE op.
  • The counted op for chained ops will always be the first op.
  • SMB NEGOTIATE ops will not be associated with a specific user.

OneFS 8.2

Bytes In

A count of the amount of data received by the server from a client, including the application layer headers but not including TCP/IP headers.

OneFS 8.2

Bytes Out

A count of the amount of data sent by the server to a client, including the application layer headers but not including TCP/IP headers.

OneFS 8.2

Read/Write/Other Latency Total

The sum of the times taken from start to finish by ops as they run through the system, identical to the latency provided by isi statistics protocol. Specifically, this is the time between LwSchedWorkCreate and the final LwSchedWorkExecuteStop for the work item. Latencies are split between the three operation types (read, write, and other), with a separate resource for each.

Use Read/Write/Other Latency Count to calculate averages

OneFS 8.2

Read/Write/Other Latency Count

The count of the start-to-finish op times measured as ops run through the system, identical to that provided by isi statistics protocol. Latencies are split between the three operation types (read, write, and other), with a separate resource for each.

Used to calculate the average of Read/Write/Other Latency Total

OneFS 8.2

Workload Type

  • Dynamic (or blank) - Top-N tracked workload
  • Pinned - Pinned workload
  • Overaccounted - The sum of all stats that have been counted twice within the same dataset, used so that a workload usage % can be calculated.
  • Excluded - The sum of all stats that do not match the current dataset configuration. This is for workloads that do not have an element specified that is defined in the category, or for workloads in filtered datasets that do not match the filter conditions.
  • Additional - The amount of resources consumed by identifiable workloads not matching any of the above, principally any workload that has dropped out of the top N.
  • System - The amount of resources consumed by the kernel.
  • Unknown - The amount of resources that could not be attributed to any workload, principally due to falling off of kernel hashes of limited size.

OneFS 8.2
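
As the table notes, the Latency Total and Latency Count resources are meant to be combined to produce an average. The following Python sketch shows that arithmetic for each of the three operation types. The function name, the dictionary layout, and the assumption that totals are expressed in microseconds are illustrative only and are not taken from OneFS.

```python
def average_latency(latency_total, latency_count):
    """Average latency per op; returns 0.0 when no ops were counted."""
    if latency_count == 0:
        return 0.0
    return latency_total / latency_count


# Hypothetical totals (assumed to be in microseconds) and counts for one workload.
workload = {
    "read": {"latency_total": 1509.0, "latency_count": 3},
    "write": {"latency_total": 2555.2, "latency_count": 4},
    "other": {"latency_total": 0.0, "latency_count": 0},
}

averages = {
    op_type: average_latency(stats["latency_total"], stats["latency_count"])
    for op_type, stats in workload.items()
}
print(averages["read"])
```

Keeping the total and the count as separate resources means averages can still be computed correctly after several workloads or intervals have been summed together.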

 

Identification Metrics are the client attributes of a workload interacting with OneFS through Protocol Operations, or System Jobs or Services. They are used to separate each workload into administrator-defined datasets.


Metric Name

Definition

First Introduced

System Name

The system name of a given workload. For services started by isi_mcp/lwsm/isi_daemon this is the service name itself. For protocols this is inherited from the service name. For jobs this is the job id in the form "Job: 123".

OneFS 8.0.1

Job Type + Phase

A short containing the job type as the first n bytes, and the phase as the rest of the bytes. There are translations for job type to name, but not job phase to name.

OneFS 8.0.1

Username

The user as reported by the native token. Translated back to a username, if possible, by InsightIQ (IIQ) or the stat summary view.

OneFS 8.2

Local IP

IP Address, CIDR Subnet or IP Address range of the node serving that workload. CIDR subnet or range will only be output if a pinned workload is configured with that range. There is no overlap between addresses/subnets/ranges for workloads with all other metrics matching.

OneFS 8.2

Remote IP

IP Address, CIDR Subnet or IP Address range of the client causing this workload. CIDR subnet or range will only be output if a pinned workload is configured with that range. There is no overlap between addresses/subnets/ranges for workloads with all other metrics matching.

OneFS 8.2

Protocol

Protocol enumeration index. Translated to string by stat.

OneFS 8.2

Zone

The zone ID of the current workload. If a zone ID is present, all username lookups, etc., should use that zone; otherwise they should use the default "System" zone. Translation to a string is performed by InsightIQ or the summary view.

OneFS 8.0.1

Group

The group that the current workload belongs to. Translated to string name by InsightIQ / summary view. For any dataset with group defined as an element the primary group will be tracked as a dynamic workload (unless there is a matching pinned workload in which case that will be used instead). If there is a pinned workload/filter with a group specified, the additional groups will also be scanned and tracked. If multiple groups match then stats will be double accounted, and any double accounting will be summed in the "Overaccounted" workload within the category.

OneFS 8.2

IFS Domain

The partitioned performance IFS domain and respective path LIN that a particular file belongs to, determined using the inode. Domains are not tracked using dynamic workloads unless a filter is created with the specified domain. Domains are created/deleted automatically by configuring a pinned workload or specifying a domain in a filter. A file can belong to multiple domains in which case there will be double accounting within the category. As with groups any double accounting will be summed in the "Overaccounted" workload within the category. The path must be resolved from the LIN by InsightIQ or the Summary View.

OneFS 8.2

SMB Share Name

The name of the share that the workload is accessing through, provided by the SMB protocol. Also provided at the time of actor loading are the Session ID and Tree ID to improve hashing/dtoken lookup performance within the kernel.

OneFS 8.2

 

So how does this work in practice? From the CLI, the following command syntax can be used to create a standard performance dataset monitor:


# isi performance dataset create --name <name> <metrics>


For example:


# isi performance dataset create --name my_dataset username zone_name


To create a dataset that requires filters, use:


# isi performance dataset create --name <name> <metrics> --filters <filter-metrics>


For example:


# isi performance dataset create --name my_filtered_dataset username zone_name --filters zone_name


Other performance dataset operation commands include:


# isi performance dataset list


# isi performance dataset view <name|id>


# isi performance dataset modify <name|id> --name <new_name>


# isi performance dataset delete <name|id>

 

A dataset will display the top 1024 workloads by default. Any remainder will be aggregated into a single additional workload.


If you want a workload to always be visible, it can be pinned using the following syntax:


# isi performance workload pin <dataset_name|id> <metric>:<value>


For example:


# isi performance workload pin my_dataset username:bob zone_name:System


Other workload operation commands include:


# isi performance workload list <dataset_name|id>


# isi performance workload view <dataset_name|id> <workload_name|id>


# isi performance workload modify <dataset_name|id> <workload_name|id> --name <new_name>


# isi performance workload unpin <dataset_name|id> <workload_name|id>
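
The interaction between the top-N limit, the single aggregated remainder, and pinned workloads can be sketched in Python as follows. This is an illustration only: ranking by CPU time, keeping pinned workloads in addition to the top N, and the `visible_workloads` helper with its field names are all assumptions made for the example, not documented OneFS behavior.

```python
def visible_workloads(workloads, n_top, pinned=()):
    """Return pinned workloads, the top-N remaining workloads, and one 'Additional' row."""
    pinned_rows = [w for w in workloads if w["name"] in pinned]
    dynamic = [w for w in workloads if w["name"] not in pinned]
    # Assumption: workloads are ranked by CPU time, highest first.
    dynamic.sort(key=lambda w: w["cpu_us"], reverse=True)
    top, remainder = dynamic[:n_top], dynamic[n_top:]
    additional = {"name": "Additional", "cpu_us": sum(w["cpu_us"] for w in remainder)}
    return pinned_rows + top + [additional]


# Hypothetical workloads in one dataset.
workloads = [
    {"name": "a", "cpu_us": 900},
    {"name": "b", "cpu_us": 500},
    {"name": "c", "cpu_us": 100},
    {"name": "d", "cpu_us": 50},
]

result = visible_workloads(workloads, n_top=2, pinned={"d"})
print([row["name"] for row in result])
print(result[-1]["cpu_us"])
```

In this example the low-consuming workload "d" stays visible only because it is pinned, while "c" falls out of the top two and is folded into the "Additional" row.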


Multiple filters can also be applied to the same dataset. A workload will be included if it matches any of the filters. Any workload that doesn’t match a filter will be aggregated into an excluded workload.


The following CLI command syntax can be used to apply a filter:


# isi performance filter apply <dataset_name|id> <metric>:<value>


For example:


# isi performance filter apply my_filtered_dataset zone_name:System


Other filter options include:


# isi performance filter list <dataset_name|id>


# isi performance filter view <dataset_name|id> <filter_name|id>


# isi performance filter modify <dataset_name|id> <filter_name|id> --name <new_name>


# isi performance filter remove <dataset_name|id> <filter_name|id>
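
The filter semantics described above (a workload is included if it matches any applied filter, and everything else is summed into an excluded workload) can be sketched in Python as shown below. The `apply_filters` helper, the record layout, and the use of CPU time as the only resource are hypothetical simplifications for illustration.

```python
def apply_filters(workloads, filters):
    """Split workloads into those matching any filter and one aggregated 'Excluded' row."""
    included = []
    excluded = {"workload_type": "Excluded", "cpu_us": 0}
    for workload in workloads:
        # A filter matches when every metric:value pair it names equals the workload's value.
        if any(all(workload.get(m) == v for m, v in f.items()) for f in filters):
            included.append(workload)
        else:
            excluded["cpu_us"] += workload["cpu_us"]
    return included, excluded


# Hypothetical workloads and filters.
workloads = [
    {"username": "bob", "zone_name": "System", "cpu_us": 1100},
    {"username": "mary", "zone_name": "System", "cpu_us": 120},
    {"username": "bob", "zone_name": "Quarantine", "cpu_us": 30},
    {"username": "eve", "zone_name": "Archive", "cpu_us": 70},
]
filters = [{"zone_name": "System"}, {"zone_name": "Quarantine"}]

included, excluded = apply_filters(workloads, filters)
print(len(included), excluded["cpu_us"])
```

Note that with an empty filter list nothing matches, so every workload lands in the excluded row in this sketch.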


To view the statistics for a dataset:


# isi statistics workload --dataset <dataset_name|id>

# isi statistics workload --dataset my_dataset

    CPU  BytesIn  BytesOut    Ops  Reads  Writes    L2    L3  ReadLatency  WriteLatency  OtherLatency  UserName    ZoneName   WorkloadType
------------------------------------------------------------------------------------------------------------------------------------------
 11.0ms     2.8M     887.4    5.5    0.0   393.7   0.3   0.0      503.0us       638.8us         7.4ms       bob      System              -
  1.2ms    10.0K     20.0M   56.0   40.0     0.0   0.0   0.0        0.0us         0.0us         0.0us      mary      System         Pinned
 31.4us     15.1      11.7    0.1    0.0     0.0   0.0   0.0      349.3us         0.0us         0.0us       bob  Quarantine              -
166.3ms      0.0       0.0    0.0    0.0     0.1   0.0   0.0        0.0us         0.0us         0.0us         -           -       Excluded
 31.6ms      0.0       0.0    0.0    0.0     0.0   0.0   0.0        0.0us         0.0us         0.0us         -           -         System
 70.2us      0.0       0.0    0.0    0.0     3.3   0.1   0.0        0.0us         0.0us         0.0us         -           -        Unknown
  0.0us      0.0       0.0    0.0    0.0     0.0   0.0   0.0        0.0us         0.0us         0.0us         -           -     Additional
  0.0us      0.0       0.0    0.0    0.0     0.0   0.0   0.0        0.0us         0.0us         0.0us         -           -  Overaccounted
------------------------------------------------------------------------------------------------------------------------------------------
Total: 8

The command also accepts the standard statistics flags, such as --numeric, --sort, and --totalby.
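
When post-processing this output in a script, the CPU and latency columns need to be converted to a common unit, since the sample above mixes us, ms, and (in other views) s suffixes. The Python sketch below is one hedged way to do that; the `to_microseconds` helper is hypothetical and handles only those three time suffixes as they appear in the example output.

```python
import re

# Multipliers to microseconds for the time suffixes seen in the sample output.
_UNIT_TO_US = {"us": 1.0, "ms": 1_000.0, "s": 1_000_000.0}
_TIME_PATTERN = re.compile(r"([0-9]+(?:\.[0-9]+)?)(us|ms|s)")


def to_microseconds(text):
    """Convert a value such as '11.0ms' or '503.0us' to a float number of microseconds."""
    match = _TIME_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_TO_US[unit]


# CPU column values copied from the sample output above.
cpu_column = ["11.0ms", "1.2ms", "31.4us", "166.3ms"]
cpu_us = [to_microseconds(value) for value in cpu_column]
print(cpu_us)
```
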

 

Other useful commands include the following:


To list all available identification metrics:


# isi performance metrics list

# isi performance metrics view <metric>


To view/modify the quantity of top workloads collected per dataset:


# isi performance settings view

# isi performance settings modify <n_top_workloads>

 

To assist with troubleshooting, the validation of the configuration is thorough and errors are output directly to the CLI. Name lookup failures, for example UID to username mappings, are reported in an additional column in the statistics output. Errors in the kernel are output to /var/log/messages and protocol errors are written to the respective protocol log.


Note that statistics are updated every 30 seconds and, as such, a newly created dataset will not show up in the statistics output until the update has occurred. Similarly, an old dataset may be displayed until the next update occurs.


A dataset with a filtered metric specified but with no filters applied will not output any workloads. Paths and non-primary groups are only reported if they are pinned or have a filter applied. Paths and non-primary groups may result in work being accounted twice within the same dataset, as they can match multiple workloads. The total amount over-accounted within a dataset is aggregated into the Overaccounted workload.
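
One way to read the purpose of the Overaccounted workload is that it lets a consumer of the statistics recover the true dataset total before computing a per-workload usage percentage. The Python sketch below illustrates that reading; the subtraction shown is an interpretation of the description above rather than a documented formula, and the `usage_percent` helper and the sample values are hypothetical.

```python
def usage_percent(rows, name):
    """Percentage of the dataset's actual total consumed by the named workload."""
    overaccounted = sum(r["value"] for r in rows if r["name"] == "Overaccounted")
    counted = sum(r["value"] for r in rows if r["name"] != "Overaccounted")
    # Assumption: work counted twice appears once too often in the counted sum.
    actual_total = counted - overaccounted
    if actual_total <= 0:
        return 0.0
    target = next(r["value"] for r in rows if r["name"] == name)
    return 100.0 * target / actual_total


# Hypothetical dataset: 100 units of work match both groups, so they are counted twice.
rows = [
    {"name": "group:eng", "value": 300},
    {"name": "group:ops", "value": 200},
    {"name": "Overaccounted", "value": 100},
]

print(usage_percent(rows, "group:eng"))
print(usage_percent(rows, "group:ops"))
```

Because of the overlap, the two percentages add up to more than 100, which is expected when the same work matches more than one workload.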

As mentioned previously, the only protocol supported in OneFS 8.2 is SMB. Monitoring support for other primary protocols, such as NFS and HDFS, will be added in future releases.


In addition to protocol stats, OneFS also includes job performance resource monitoring, which provides statistics for the resources used by jobs, both cluster-wide and per-node. Presented in a ‘top’-style format, the output shows the top jobs and processes and is refreshed periodically.


For example, the following syntax shows, and indefinitely refreshes, the top five processes on a cluster:


# isi statistics workload --limit 5 --format=top

 

last update:  2019-06-19T06:45:25 (s)ort: default

CPU   Reads Writes      L2    L3    Node  SystemName        JobType
1.4s  9.1k  0.0         3.5k  497.0 2     Job:  237         IntegrityScan[0]
1.2s  85.7  714.7       4.9k  0.0   1     Job:  238         Dedupe[0]
1.2s  9.5k  0.0         3.5k  48.5  1     Job:  237         IntegrityScan[0]
1.2s  7.4k  541.3       4.9k  0.0   3     Job:  238         Dedupe[0]
1.1s  7.9k  0.0         3.5k  41.6  2     Job:  237         IntegrityScan[0]


The resource statistics tracked per job, per job phase, and per node include CPU, reads, writes, and L2 and L3 cache hits. Unlike the output of the standard ‘top’ command, this view attributes resource usage to individual jobs, which makes it easier to diagnose per-job resource issues.
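
Since the rows are reported per node, a cluster-wide view of a job can be obtained by summing its rows. The Python sketch below does this for the CPU column, using values copied from the sample output above; the tuple layout and the `cpu_by_job` helper are hypothetical and only illustrate the aggregation.

```python
from collections import defaultdict

# (node, system name, job type and phase, CPU seconds) copied from the sample output.
rows = [
    (2, "Job: 237", "IntegrityScan[0]", 1.4),
    (1, "Job: 238", "Dedupe[0]", 1.2),
    (1, "Job: 237", "IntegrityScan[0]", 1.2),
    (3, "Job: 238", "Dedupe[0]", 1.2),
    (2, "Job: 237", "IntegrityScan[0]", 1.1),
]


def cpu_by_job(rows):
    """Sum CPU seconds across all nodes for each (system name, job type) pair."""
    totals = defaultdict(float)
    for _node, system_name, job_type, cpu_seconds in rows:
        totals[(system_name, job_type)] += cpu_seconds
    return dict(totals)


totals = cpu_by_job(rows)
for (system_name, job_type), cpu_seconds in sorted(totals.items()):
    print(f"{system_name} {job_type}: {cpu_seconds:.1f}s")
```
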