NOTE: This topic is part of the Uptime Information Hub.
The EMC Isilon OneFS Job Engine is at the core of the Isilon work distribution system. It is a parallel scheduling and job management framework that enables data protection and storage management tasks to be efficiently distributed across an Isilon cluster. The Job Engine minimizes manual system administration operations. It optimizes system performance and cluster health through automated job scheduling and resource management.
The Job Engine runs maintenance jobs, such as FlexProtect and AutoBalance, across the entire cluster. It breaks each large storage management or protection job into smaller tasks and work items, and then allocates them to worker threads on each node. This helps minimize the impact on cluster performance during job execution. Once the work is allocated, the Job Engine performs it using a shared work distribution model. Each job is identified by a unique job ID.
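The split-and-allocate idea can be illustrated with a minimal Python sketch. This is not the Job Engine's actual algorithm: the function names, the fixed chunk size, the node names, and the round-robin assignment are all assumptions made for illustration.

```python
from collections import defaultdict

def split_into_work_items(items, chunk_size):
    """Cut a job's item list into fixed-size work items (hypothetical chunking)."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def allocate(work_items, nodes):
    """Hand work items to nodes round-robin (an assumed policy, for illustration)."""
    assignment = defaultdict(list)
    for index, item in enumerate(work_items):
        assignment[nodes[index % len(nodes)]].append(item)
    return dict(assignment)

# Ten items stand in for the data a job must process.
work_items = split_into_work_items(list(range(10)), chunk_size=3)
plan = allocate(work_items, ["node-1", "node-2", "node-3"])
# plan["node-1"] holds the first and fourth work items: [[0, 1, 2], [9]]
```

Each node receives roughly the same number of work items, so no single node carries the whole job.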
OneFS monitors node CPU load and drive I/O activity per worker thread every 20 seconds to ensure that maintenance jobs do not cause cluster performance problems. If a job affects overall system performance, the OneFS Job Engine reduces the activity of maintenance jobs and yields resources to clients.
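The following Python sketch shows one way such a feedback loop could behave. Only the 20-second polling interval comes from the description above. The thresholds, the worker limits, and the halve-then-ramp-up rule are hypothetical and do not describe the actual OneFS implementation.

```python
POLL_INTERVAL_SECONDS = 20  # from the text; every other value here is an assumption

def next_worker_count(current, cpu_load, disk_busy, *,
                      cpu_limit=0.8, disk_limit=0.8, max_workers=8):
    """Return the worker count to use until the next poll.

    cpu_load and disk_busy are utilization fractions between 0.0 and 1.0.
    """
    if cpu_load > cpu_limit or disk_busy > disk_limit:
        # The node is under pressure: back off and yield resources to clients.
        return max(1, current // 2)
    # There is headroom: ramp back up one worker at a time.
    return min(max_workers, current + 1)

throttled = next_worker_count(8, cpu_load=0.95, disk_busy=0.20)  # 4
recovered = next_worker_count(4, cpu_load=0.30, disk_busy=0.30)  # 5
```

Halving under load and adding one worker at a time when idle is a common throttling pattern, chosen here only because it is easy to follow.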
OneFS leverages the power and efficiency of parallel processing wherever possible. An example of this is multiple nodes participating in data reconstruction with FlexProtect in the event of a hard drive failure. This article provides an overview of jobs that can run concurrently, job priorities, and job impact policies.
Running Concurrent Jobs
In OneFS 7.0 and earlier, the Job Engine can run only a single job at a time. Beginning with OneFS 7.1, the Job Engine can run up to three concurrent jobs, allowing routine background cluster maintenance jobs to continue on schedule. Exceptions to this are restriping and marking jobs, which are classes of jobs that are grouped into exclusion sets.
Running a concurrent job is governed by the following criteria:
- Job priority and job impact policies.
- Job exclusion sets, which contain jobs that cannot run simultaneously because they perform similar functions (for example, FlexProtect and AutoBalance).
- Cluster health, because most jobs will not run or should be disabled when the cluster is in a degraded state.
- FlexProtect, which is the only job allowed to run by default when a cluster is in degraded mode. Other jobs are automatically paused and do not resume until FlexProtect has completed and the cluster is healthy again.
When more than three jobs—with the same priority level and no exclusion set restrictions—are scheduled to run simultaneously, the three jobs with the lowest job ID value will run, and the remainder will be paused.
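The tie-breaking rule above can be sketched in Python. The Job type, the function name, and the job names and priority value in the example are hypothetical. The sketch deliberately ignores exclusion sets and cluster health, and it assumes that a lower number means a higher priority, with 1 being the highest.

```python
from typing import NamedTuple

MAX_CONCURRENT = 3  # concurrent job limit in OneFS 7.1 and later

class Job(NamedTuple):
    job_id: int
    name: str
    priority: int  # 1 is the highest priority

def pick_running(queue):
    """Split queued jobs into (running, paused).

    Jobs are ordered by priority first, then by lowest job ID.
    """
    ordered = sorted(queue, key=lambda job: (job.priority, job.job_id))
    return ordered[:MAX_CONCURRENT], ordered[MAX_CONCURRENT:]

# Four jobs with the same priority: the three lowest job IDs run.
queue = [Job(14, "job-d", 5), Job(11, "job-a", 5),
         Job(13, "job-c", 5), Job(12, "job-b", 5)]
running, paused = pick_running(queue)
# running IDs: [11, 12, 13]; paused IDs: [14]
```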
Each Job Engine job is assigned a default priority and impact policy. These control the impact of a job on cluster performance while the job is running. Priority takes effect when two or more jobs are queued to run, and the Job Engine uses it to determine when each job can run. The FlexProtect, FlexProtectLin, and IntegrityScan jobs have the highest Job Engine priority level of 1 by default. Of these, FlexProtect is the most important because of its core role in re-protecting data.
Although the priority settings of the OneFS Job Engine jobs can be configured by the cluster administrator, Isilon strongly recommends keeping the default priority settings. This is particularly critical for the highest-priority jobs, where changes to priority settings can adversely affect the Job Engine’s ability to maintain cluster health.
Job Exclusion Sets
Job priority takes effect when two or more queued jobs belong to the same exclusion set. For multiple concurrent job execution, exclusion sets—or classes of similar jobs—determine which jobs can run simultaneously. A job is not required to be part of any exclusion set, and jobs may also belong to multiple exclusion sets.
Currently, there are two exclusion sets that jobs can be part of: restriping and marking.
Restriping Exclusion Set
OneFS protects data by writing file blocks across multiple drives and nodes. This process is known as restriping. The OneFS Job Engine defines a restriping exclusion set as one that contains jobs that involve file system management, protection, and on-disk layout.
The restriping exclusion set includes:
- AutoBalance and AutoBalanceLin
- FlexProtect and FlexProtectLin
If a restriping job is running and another restriping job is queued, the running job continues as long as it has the higher priority. If the queued job has the higher priority, for example after priorities change, the running job is paused and the queued job starts.
Two important jobs that are in the restriping exclusion set are SmartPools and SetProtectPlus. These are meant to restripe data in accordance with protection policy. Two other restriping jobs, FlexProtect and FlexProtectLin, are important as well, because of their core roles in re-protecting data.
Marking Exclusion Set
OneFS marks blocks that are in use by the file system. For example, the IntegrityScan job traverses the live file system and marks every block of every LIN in the cluster to proactively detect and resolve any issues with the structure of data in a cluster. Similar to jobs in the restriping exclusion set, mark jobs—such as Collect, MultiScan, and IntegrityScan—cannot run at the same time, as they perform similar marking functions.
Some jobs may belong to both exclusion sets. An example of this is MultiScan, which includes both the AutoBalance (restriping) and Collect (marking) jobs.
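Exclusion set membership can be modeled as a simple lookup, as in the Python sketch below. The set contents follow the jobs named in this article and may not be exhaustive. The dictionary layout and function names are assumptions for illustration, not part of OneFS.

```python
# Membership as described in this article; the real sets may contain more jobs.
EXCLUSION_SETS = {
    "restriping": {"AutoBalance", "AutoBalanceLin", "FlexProtect",
                   "FlexProtectLin", "SmartPools", "SetProtectPlus", "MultiScan"},
    "marking": {"Collect", "IntegrityScan", "MultiScan"},
}

def sets_for(job_name):
    """Return the names of the exclusion sets that a job belongs to."""
    return {name for name, members in EXCLUSION_SETS.items() if job_name in members}

def can_run_together(job_a, job_b):
    """Two jobs conflict only if they share at least one exclusion set."""
    return not (sets_for(job_a) & sets_for(job_b))

can_run_together("FlexProtect", "AutoBalance")  # False: both are restriping jobs
can_run_together("SmartPools", "Collect")       # True: different exclusion sets
can_run_together("MultiScan", "Collect")        # False: MultiScan is in both sets
```

A job that belongs to no exclusion set, such as SmartDedupe, yields an empty set and therefore never conflicts with another job in this model.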
Feature Support Jobs
A maximum of three feature support jobs, or other non-excluded job combinations, can run concurrently.
The majority of feature support jobs do not belong to an exclusion set. These include:
- SmartDedupe (in OneFS 7.1)
- DedupeAssessment (in OneFS 7.1)
These can co-exist and contend with any of the other jobs.
Job Impact Policies
Impact policies limit the system resources that a job can consume, as well as when a job can run. Most jobs run in the background and are set to low impact by default. This means that the job will consume a minimum amount of cluster resources. Notable exceptions are FlexProtect jobs, which by default are set to medium impact. This allows FlexProtect to quickly and efficiently re-protect data without critically impacting other user activities. The IntegrityScan job, which verifies file system integrity, is also set to medium by default and is started manually.
The four available impact levels are paused, low, medium, and high. This degree of granularity allows impact intervals and levels to be configured per job, in order to ensure smooth cluster operation.
An impact policy can consist of one or many impact intervals, which are blocks of time within a given week. Each impact interval can be configured to use a single predefined impact level that specifies the amount of cluster resources to use for a particular cluster operation. You can associate jobs with impact policies to ensure that certain vital jobs always have access to system resources.
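As an illustration, an impact policy can be modeled as a list of weekly intervals, each carrying one impact level. The Python sketch below is a hypothetical data model, not the OneFS configuration format. The class and function names, the minutes-since-Monday encoding, and the fallback level used when no interval matches are all assumptions.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60
LEVELS = ("paused", "low", "medium", "high")  # the four levels named in the text

def week_minute(weekday, hour, minute=0):
    """Minutes elapsed since Monday 00:00 (weekday 0 is Monday)."""
    return weekday * MINUTES_PER_DAY + hour * 60 + minute

@dataclass(frozen=True)
class ImpactInterval:
    start: int  # inclusive, in minutes since Monday 00:00
    end: int    # exclusive
    level: str

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"unknown impact level: {self.level}")

def level_at(policy, when, default="low"):
    """Return the level of the first interval covering `when`, else `default`."""
    for interval in policy:
        if interval.start <= when < interval.end:
            return interval.level
    return default  # assumed fallback; not taken from OneFS

# Hypothetical policy: paused during Monday business hours, high on Saturday.
policy = [
    ImpactInterval(week_minute(0, 9), week_minute(0, 17), "paused"),
    ImpactInterval(week_minute(5, 0), week_minute(6, 0), "high"),
]
monday_noon = level_at(policy, week_minute(0, 12))     # "paused"
saturday_early = level_at(policy, week_minute(5, 3))   # "high"
wednesday_noon = level_at(policy, week_minute(2, 12))  # "low" (fallback)
```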
Isilon recommends keeping the default impact and priority settings where possible, unless there is a valid reason to change them. An example might be running IntegrityScan with a low impact policy to minimize cluster performance impacts, since file system integrity checks can take several days to run on large clusters.
Using Job Engine for Optimal Cluster Performance
When changing the default priority, schedule, or impact policy settings for a job, consider the following:
- Which resources will I be impacting?
- What would I be gaining or losing if I reprioritized this job?
- What are my impact options and their respective benefits and drawbacks?
- How long will the job run, and what other jobs will it contend with?
For additional technical details regarding the OneFS Job Engine framework, read the EMC Isilon OneFS Job Engine white paper.