The OneFS Job Engine runs across the entire cluster and is responsible for dividing and conquering large storage management and protection tasks. To achieve this, the engine reduces a task into smaller work items and then allocates, or maps, these portions of the overall job to multiple worker threads on each node. Progress is tracked and reported on throughout job execution, and a detailed report and status is presented upon completion or termination.
The Job Engine is based on a delegation hierarchy made up of coordinator, director, manager, and worker processes.
There are other threads that are not illustrated in this diagram, which relate to the Job Engine internal functions, such as communication between the various Job Engine daemons, and collection of statistics. Also, with three jobs running simultaneously, each node would have three manager processes, each with its own number of worker threads.
Once the work is initially allocated, the Job Engine uses a shared work distribution model in order to execute the work, and each job is identified by a unique job ID. When a job is launched, whether it is scheduled, started manually, or responding to a cluster event, the Job Engine spawns a child process from the isi_job_d daemon running on each node. This Job Engine daemon is also known as the parent process.
The entire Job Engine orchestration is handled by the coordinator, which is a process that runs on one of the nodes in a cluster. Any node can act as the coordinator, and the principal responsibilities include:
- Monitoring work load and the constituent nodes' status
- Controlling the number of worker threads on a per-node and cluster-wide basis
- Managing and enforcing job synchronization and checkpoints
While the actual work item allocation is managed by the individual nodes, the coordinator node takes control, divides up the job, and evenly distributes the resulting tasks across the nodes in the cluster. For example, if the coordinator needs to communicate with a manager process running on node five, it first sends a message to node five’s director, which then passes the message on down to the appropriate manager process under its control. The coordinator also periodically sends messages, via the director processes, instructing the managers to increment or decrement the number of worker threads.
The coordinator is also responsible for starting and stopping jobs, as well as processing work results as they are returned during the execution of a job. Should the coordinator process die for any reason, the coordinator responsibility is automatically transferred to another node.
The coordinator node can be identified via the following CLI command:
# isi job status --verbose | grep Coordinator
Each node in the cluster has a Job Engine director process, which runs continuously and independently in the background. The director process is responsible for monitoring, governing, and overseeing all Job Engine activity on a particular node, and is constantly waiting for instruction from the coordinator to start a new job. The director process serves as a central point of contact for all the manager processes running on a node, and as a liaison with the coordinator process across nodes. These responsibilities include:
- Creating a manager process
- Delegating to and requesting work from other peers
- Sending and receiving status messages
The manager process is responsible for arranging the flow of tasks and task results throughout the duration of a job. The manager processes request and exchange work with each other, and supervise the worker threads assigned to them. At any point, each node in a cluster can have up to three manager processes—one for each job currently running. These managers are responsible for overseeing the flow of tasks and task results.
Each manager controls and assigns work items to multiple worker threads working on items for the designated job. Under direction from the coordinator and director, a manager process maintains the appropriate number of active threads for a configured impact level, and for the node’s current activity level. Once a job has completed, the manager processes associated with that job across all the nodes are terminated. New managers are automatically spawned when the next job is moved into execution.
The manager processes on each node regularly send updates to their respective node’s director, which, in turn, informs the coordinator process of the status of the various worker tasks.
Each worker thread is given a task, if available, which it processes item by item until the task is complete or the manager unassigns the task. The status of the nodes’ workers can be queried by running the CLI command “isi job statistics view”. In addition to the number of current worker threads per node, a sleep-to-work (STW) ratio average is also provided, which provides an indication of the worker thread activity level on the node.
Toward the end of a job phase, the number of active threads decreases as workers finish up their allotted work and become idle. Nodes that have completed their work items just remain idle, waiting for the last remaining node to finish its work allocation. When all tasks are finished, the job phase is complete and the worker threads are terminated.
As jobs are processed, the coordinator consolidates the task status from the constituent nodes and periodically writes the results to checkpoint files. These checkpoint files allow jobs to be paused and resumed, either proactively, or in the event of a cluster outage. For example, if the node on which the Job Engine coordinator was running went offline for any reason, a new coordinator would be automatically started on another node. This new coordinator would read the last consistency checkpoint file, job control and task processing would resume across the cluster from where it left off, and no work would be lost.
Resource Utilization Monitoring
The Job Engine resource monitoring and execution framework allows jobs to be throttled based on both CPU and disk I/O metrics. The granularity of the resource utilization monitoring data provides the coordinator process with visibility into exactly what is generating I/Os per second on any particular drive across the cluster. This level of insight allows the coordinator to make very precise determinations about exactly where and how impact control is best applied. As we will see, the coordinator itself does not communicate directly with the worker threads, but rather with the director process, which in turn instructs a node’s manager process for a particular job to cut back threads.
For example, if the Job Engine is running a low-impact job and CPU utilization drops below the threshold, the worker thread count is gradually increased up to the maximum defined by the “low” impact policy threshold. If client load on the cluster suddenly spikes for some reason, then the number of worker threads is gracefully decreased. The same principle applies to disk I/O, where the Job Engine will throttle back in relation to both I/Os per second as well as the number of I/O operations waiting to be processed in any drive’s queue. Once client load has decreased again, the number of worker threads is correspondingly increased to the maximum ”low” impact threshold.
In summary, detailed resource utilization telemetry allows the Job Engine to automatically tune its resource consumption to the desired impact level and customer workflow activity.
Throttling and Flow Control
Certain jobs, if left unchecked, could consume vast quantities of a cluster’s resources, contending with and impacting client I/O. To counteract this, the Job Engine employs a comprehensive work throttling mechanism that is able to limit the rate at which individual jobs can run. Throttling is employed at a per-manager process level, so job impact can be managed both granularly and gracefully.
Every 20 seconds, the coordinator process gathers cluster CPU and individual disk I/O load data from all the nodes across the cluster. The coordinator uses this information, in combination with the job impact configuration, to decide how many threads may run on each cluster node to service each running job. This can be a fractional number, and fractional thread counts are achieved by having a thread sleep for a given percentage of each second.
Using this CPU and disk I/O load data, every 60 seconds the coordinator evaluates how busy the various nodes are and makes a job throttling decision, instructing the various Job Engine processes as to the actions they need to take. This enables throttling to be sensitive to workloads in which CPU and disk I/O load metrics yield different results. Additionally, there are separate load thresholds tailored to the different classes of drives utilized in Isilon clusters, including high-speed SAS drives, lower-performance SATA disks, and flash-based solid-state drives (SSDs).
The Job Engine allocates a specific number of threads to each node by default, thereby controlling the impact of a workload on the cluster. If little client activity is occurring, more worker threads are spun up to allow more work—up to a predefined worker limit. For example, the worker limit for a low-impact job might allow one or two threads per node to be allocated, a medium-impact job might allow from four to six threads, and a high-impact job might allow a dozen or more. When this worker limit is reached (or before, if client load triggers impact management thresholds first), worker threads are throttled back or terminated.
Say, for example, a node has four active threads, and the coordinator instructs it to cut back to three. The fourth thread is allowed to finish the individual work item that it is currently processing, but it will then quietly exit, even though the task as a whole might not be finished. A restart checkpoint is taken for the exiting worker thread’s remaining work, and this task is returned to a pool of tasks requiring completion. This unassigned task is then allocated to the next worker thread that requests a work assignment, and processing continues from the restart checkpoint. This same mechanism applies in the event that multiple jobs are running simultaneously on a cluster.
Job progress tracking and reporting is covered in the following blog article: