In an Isilon cluster, not all Job Engine jobs run equally fast.
For example, a job based on a file system tree walk will run more slowly on a cluster with a very large number of small files than on a cluster with a small number of large files. Similarly, jobs which compare data across nodes, such as Dedupe, will run more slowly where there are many more comparisons to be made.
Many factors play into this, and true linear scaling is not always possible. If a job is running slowly, the first step is to establish the specific context in which the job is running.
There are three main methods for jobs, and their associated processes, to interact with the file system:
- Via metadata, using a LIN scan. An example of this is IntegrityScan, when performing an on-line file system verification.
- Traversing the directory structure directly via a tree walk. For example, QuotaScan, when performing quota domain accounting.
- Directly accessing the underlying cylinder groups and disk blocks, via a linear drive scan. For example, MediaScan, when looking for bad disk sectors.
Each of these approaches has its pros and cons, and will suit particular jobs. The specific access method influences the run time of a job. For instance, some jobs are unaffected by cluster size, others slow down or speed up as nodes are added to the cluster, and some are highly influenced by file counts and directory depths.
For a number of jobs, particularly the LIN-based ones, the job engine will provide an estimated percentage completion of the job during runtime. This is covered in the following blog article:
With LIN scans, even though the metadata is of variable size, the job engine can fairly accurately predict how much effort will be required to scan all LINs. The data, however, can be of widely variable size, so any estimate of how long each task will take to process is only a best reasonable guess.
For example, the job engine might know that the highest LIN is 1:0009:0000. Assuming the job will start with a single thread on each of three nodes, the coordinator evenly divides the LINs into nine ranges: 1:0000:0000-1:0000:ffff, 1:0001:0000-1:0001:ffff, etc., through 1:0008:0000-1:0009:0000. These nine tasks would then be divided between the three nodes. However, there is no guarantee that each range will take the same time to process.
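The range arithmetic above can be reproduced with a short Python sketch. This is an illustration only, not the coordinator's actual code: it assumes a LIN can be treated as one hexadecimal integer once the colons are removed, and it assumes tasks are handed to nodes round-robin. All function names are hypothetical.

```python
def parse_lin(text):
    """Treat a LIN such as '1:0009:0000' as one hex integer (an assumption)."""
    return int(text.replace(":", ""), 16)


def format_lin(value):
    """Render an integer back into the colon-separated form used in the text."""
    digits = f"{value:09x}"
    return f"{digits[:-8]}:{digits[-8:-4]}:{digits[-4:]}"


def split_lin_space(lowest, highest, task_count):
    """Divide [lowest, highest] into task_count contiguous, inclusive ranges."""
    low, high = parse_lin(lowest), parse_lin(highest)
    step = (high - low) // task_count
    ranges = []
    for i in range(task_count):
        start = low + i * step
        # The final range absorbs the remainder and ends at the highest LIN.
        end = high if i == task_count - 1 else start + step - 1
        ranges.append((format_lin(start), format_lin(end)))
    return ranges


def assign_round_robin(ranges, node_count):
    """Hand out tasks to nodes in turn (hypothetical assignment policy)."""
    nodes = [[] for _ in range(node_count)]
    for i, lin_range in enumerate(ranges):
        nodes[i % node_count].append(lin_range)
    return nodes


ranges = split_lin_space("1:0000:0000", "1:0009:0000", 9)
per_node = assign_round_robin(ranges, 3)
for number, tasks in enumerate(per_node, start=1):
    print(f"node {number}: " + ", ".join(f"{a}-{b}" for a, b in tasks))
```

Each of the three nodes ends up with three ranges of equal width in LIN space, which says nothing about how much work each range actually contains.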
For example, the first range may contain fewer actual LINs, as a result of old LINs having been deleted, and so complete unexpectedly fast. Perhaps the third range contains a disproportionate number of large files and so takes longer to process. And maybe the seventh range has heavy contention with client activity, also resulting in an increased execution time. Despite such variances, the splitting and redistribution of tasks across the node manager processes alleviates this issue, mitigating the need for perfectly fair divisions at the outset.
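To see why splitting and redistribution reduces the need for a fair initial division, consider the toy simulation below. It is a sketch under stated assumptions, not the Job Engine's real scheduler: the per-range costs are invented to mirror the scenario above (a fast first range, a slow third range, a contended seventh range), each node completes one unit of work per tick, and an idle node takes half of the largest task from the busiest node. The `simulate` function and its parameters are hypothetical.

```python
def simulate(costs_per_node, rebalance):
    """Return the number of ticks needed to finish all tasks.

    costs_per_node: one list of task costs (work units) per node.
    rebalance: if True, idle nodes take half of the busiest node's largest task.
    """
    remaining = [list(tasks) for tasks in costs_per_node]
    ticks = 0
    while any(sum(queue) > 0 for queue in remaining):
        if rebalance:
            for idle in remaining:
                if sum(idle) > 0:
                    continue  # this node still has work of its own
                donor = max(remaining, key=sum)
                largest = max(donor)
                if largest < 2:
                    continue  # nothing left that is worth splitting
                index = donor.index(largest)
                half = largest // 2
                donor[index] = largest - half  # donor keeps the rest
                idle.append(half)              # idle node takes the other part
        for queue in remaining:
            # Each node works one unit on its first unfinished task.
            for index, cost in enumerate(queue):
                if cost > 0:
                    queue[index] = cost - 1
                    break
        ticks += 1
    return ticks


# Invented costs for ranges 1..9, dealt round-robin to three nodes:
# range 1 is cheap (1), range 3 is expensive (8), range 7 is contended (6).
initial = [[1, 3, 6], [3, 3, 3], [8, 3, 3]]

static_ticks = simulate(initial, rebalance=False)
balanced_ticks = simulate(initial, rebalance=True)
print("static:", static_ticks, "with redistribution:", balanced_ticks)
```

Without redistribution the job takes as long as the most heavily loaded node, here 14 ticks. With redistribution, nodes that finish early pick up part of the outstanding work, so the total moves toward the ideal of 11 ticks (33 work units spread over three nodes).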
Priorities play a large role in job initiation, and it is possible for a high-priority job to significantly impact the running of other jobs. This is by design, since FlexProtect should be able to run with a greater level of urgency than SmartPools, for example. However, even with intelligent impact management, this can sometimes be an inconvenience, which is why the storage administrator has the ability to manually control the impact level and relative priority of jobs.
Certain jobs, such as FlexProtect, have a corresponding job whose name carries the suffix ‘Lin’, for example FlexProtectLin. This indicates that the job will automatically, where available, use an SSD-based copy of metadata to scan the LIN tree, rather than the drives themselves. Depending on the workflow, this will often significantly improve job runtime performance.
The following blog post provides a discussion of the Job Engine's architecture and distributed work allocation: