Hello world,

 

I’m an engineer at EMC.  This blog will demonstrate some use cases of the iiq_data_export utility for exporting FSA data for offline analysis and custom charge-back processing.

 

Back story

 

If you run OneFS 7.x code with InsightIQ, you can use the iiq_data_export utility to extract performance and file system reports.  While that's great when it works, a very common gripe we hear is that the FS Analyze (FSA) job can take "forever" to complete.  The FSA job is a OneFS job that crawls the filesystem to gather metadata about files.  InsightIQ consumes the results of the FSA job for its file system reports.  Therefore, when the FSA job takes a long time to finish (days or even weeks), up-to-date file system reports aren't readily available.  In OneFS 7.x, the FSA job conducts a LIN-tree based scan of the filesystem to gather metadata about every file on the cluster.  If your cluster has a few billion files, that LIN-tree based scan can take a while to finish.

 

With the general availability of OneFS 8.0.0 earlier this year, anyone can now deploy the latest version of OneFS in test environments and production clusters.  One of the first things people notice is the drastic improvement in FSA job completion time.  And there are good reasons for it.

 

For OneFS 8.0, Isilon Engineering dedicated substantial efforts to overhaul how the FSA job works.  The FSA job now performs a change-list based scan instead of a LIN-tree based scan.  What this means is that only the files that have changed since the last FSA job completion are scanned.  This cuts down the amount of time it takes to aggregate metadata, even for a cluster that hosts a few billion files.

 

Of course, in order for this feature to work, your cluster must be on OneFS 8.x code, and the first LIN-tree based scan has to complete.  Subsequent FSA jobs will only touch the files that have changed since the last FSA completion.  I should also note that subsequent FSA jobs are fully independent of previous FSA results.  Meaning, if FSA result "A" comes from a full LIN-tree based scan and FSA result "B" comes from a subsequent change-list based scan, you can delete or unpin result "A" and still have full file system reporting capability by using result "B".

 

Reports from the field tell us that FSA jobs now complete on a daily basis, which makes daily export of FSA data possible.  This opens the door to interesting uses of the FSA result data.  One potential use of reliable, refreshed-daily FSA results is tracking application and project use of Isilon storage for charge-back purposes.

 

This blog answers a very common inquiry: how to programmatically export storage consumption for project folders under /ifs instead of using the InsightIQ web interface and iterating through individual directories.

 

The location of the iiq_data_export utility

 

The iiq_data_export utility resides on the Linux server that runs the InsightIQ application.  Depending on your deployment method, your InsightIQ might be running on a physical Linux box or a virtual machine.  In either case, you need ssh access to the server.  The iiq_data_export utility can be executed by a non-root user.
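
For example, once you have ssh'd into the InsightIQ server, a quick sanity check confirms the utility is available.  The hostname and account below are made up for illustration; any non-root account with shell access will do:

# Log in to the InsightIQ server (hypothetical hostname and user)
ssh iiquser@insightiq.example.com

# Confirm the utility is on the PATH; the exact install location can vary by deployment
which iiq_data_export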

 

The FSA export option of iiq_data_export

 

The iiq_data_export utility has two major functional areas: it allows you to export either performance stats or file system analytics data.  We'll cover some uses of iiq_data_export for file system analytics in this blog.  Specifically, we will look at the "directories" data-module export option.

 

Using iiq_data_export, list FSA results for the cluster being monitored

 

The command to list available FSA results for the cluster is:

               iiq_data_export fsa list --reports <cluster_name>

 

For example, let's suppose I have a cluster named "tme-sandbox":

[10:41:39] rchang@VNODE0100:[~]:iiq_data_export fsa list --reports tme-sandbox

 

    Available Reports for: tme-sandbox Time Zone: EDT
================================================================================
    |ID |FSA Job Start                |FSA Job End               |Size        |
================================================================================
    |449 |Jun 06 2016, 10:00 PM        |Jun 06 2016, 10:31 PM     |4.806G      |
--------------------------------------------------------------------------------
    |455 |Jun 07 2016, 10:00 PM        |Jun 07 2016, 10:31 PM     |4.819G      |
--------------------------------------------------------------------------------
    |461 |Jun 08 2016, 10:00 PM        |Jun 08 2016, 10:30 PM     |4.817G      |
--------------------------------------------------------------------------------
    |467 |Jun 09 2016, 10:00 PM        |Jun 09 2016, 10:32 PM     |4.801G      |
--------------------------------------------------------------------------------
    |473 |Jun 10 2016, 10:00 PM        |Jun 10 2016, 10:30 PM     |92.933G     |
--------------------------------------------------------------------------------
    |479 |Jun 11 2016, 10:00 PM        |Jun 11 2016, 10:31 PM     |4.908G      |
--------------------------------------------------------------------------------
    |486 |Jun 12 2016, 10:00 PM        |Jun 12 2016, 10:31 PM     |4.816G      |
--------------------------------------------------------------------------------
    |492 |Jun 13 2016, 10:00 PM        |Jun 13 2016, 10:32 PM     |4.794G      |
--------------------------------------------------------------------------------
    |498 |Jun 14 2016, 10:00 PM        |Jun 14 2016, 10:30 PM     |4.816G      |
================================================================================

 

The ID column is the job number associated with that particular FS Analyze Job Engine job.  This is the ID number that you will provide to iiq_data_export to extract capacity information for your directories.
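
If you plan to script against this, you can also pull the newest report ID out of the listing instead of reading it off the screen.  Here's a minimal bash sketch; it assumes the table layout shown above, where data rows start with "|<ID> |" and the newest report is the last row:

# Extract the ID of the most recent FSA report (assumes the table layout shown above)
latest_id=`iiq_data_export fsa list --reports tme-sandbox | grep -E '^ *\|[0-9]+ ' | tail -1 | awk -F'|' '{gsub(/ /, "", $2); print $2}'`
echo "Latest FSA report ID: $latest_id"

You could then feed $latest_id to the export commands below instead of a hard-coded job ID.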

 

Exercise 1: Export first-level directories under /ifs

The command to export the first-level directories under /ifs from a specified cluster, for a specific FSA job is:

               iiq_data_export fsa export -c <cluster_name> --data-module directories -o <jobID>

 

Let's suppose I want a listing of all first-level directories under /ifs from FSA job ID 473.  I would use the "directories" data-module as follows:

 

[14:34:56] rchang@VNODE0100:[~/blog-work]:iiq_data_export fsa export -c tme-sandbox --data-module directories -o 473

    Successfully exported data to: directories_tme-sandbox_473_1467236098.csv


The resulting CSV file can be parsed in Excel or processed programmatically to derive the capacity consumption of the directories.  The output includes directory counts, file counts, and logical and physical capacity consumption.  Example:

[Image: CSV-parse.png (sample of the exported CSV parsed in a spreadsheet)]
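
If you prefer to stay on the command line, a quick awk pass over the same CSV produces a similar summary.  This is just a sketch; it assumes the exported file uses the same column layout as the filtered export shown in the next exercise, with phys_size_sum (bytes) as the seventh column:

# List first-level directories by physical capacity, largest first
# (assumes phys_size_sum in bytes is column 7 of the CSV)
awk -F, 'NR>1 {printf "%-50s %12.2f GiB\n", $1, $7/1024/1024/1024}' directories_tme-sandbox_473_1467236098.csv | sort -k2 -rn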

 

Exercise 2: Export specific directories under /ifs including 2nd and 3rd level directories

 

Now suppose you want the capacity information for a specific directory that is nested somewhere under the /ifs branch.  In that case, you would use the "directory filter" option (shorthand -r).  The syntax is as follows:

 

iiq_data_export fsa export -c <cluster-name> --data-module directories -o <jobID> -r directory:<directory_under_ifs>

 

For example, the command below will extract directory information for /ifs/data/hdfs_dogfooding:


[16:14:31] rchang@VNODE0100:[~]:iiq_data_export fsa export -c tme-sandbox --data-module directories -o 473 -r directory:data/hdfs_dogfooding

    Successfully exported data to: directories_tme-sandbox_473_1466032486.csv

 

A quick look at this output file shows:

 

path[directory:/ifs/data/hdfs_dogfooding/],dir_cnt (count),file_cnt (count),ads_cnt,other_cnt (count),log_size_sum (bytes),phys_size_sum (bytes),log_size_sum_overflow,report_date: 1465610442
/ifs/data/hdfs_dogfooding/user,52,2202,0,0,105103857955,136633898496,0
/ifs/data/hdfs_dogfooding/Shipit,14,12,0,0,337408759,340820992,0
/ifs/data/hdfs_dogfooding/benchmarks,5,22,0,0,104859576,134982144,0
/ifs/data/hdfs_dogfooding/tmp,35,21,0,0,4514730,7365120,0
/ifs/data/hdfs_dogfooding/hbase,27,32,0,0,11551,1380864,0
/ifs/data/hdfs_dogfooding/solr,1,0,0,0,0,2560,0
/ifs/data/hdfs_dogfooding/pyhdfs,1,0,0,0,0,2560,0
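
Because phys_size_sum is reported in bytes, turning this output into a simple charge-back figure is a one-liner.  The rate below is purely hypothetical; substitute whatever $/GiB-month your organization charges:

# Hypothetical charge-back at $0.05 per GiB-month of physical capacity
# (phys_size_sum in bytes is column 7 of the exported CSV)
awk -F, -v rate=0.05 'NR>1 {gib=$7/1024/1024/1024; printf "%-45s %10.2f GiB  %8.2f USD/month\n", $1, gib, gib*rate}' directories_tme-sandbox_473_1466032486.csv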

 

Caveats

 

There are a number of caveats around the iiq_data_export command that I should note:

 

  1. Currently, a single execution of the iiq_data_export command cannot extract more than one specific directory with the directory filter.  For example, if you issue -r directory:home -r directory:data/hdfs_dogfooding, only the second filter is picked up by the command.  I offer a quick bash script below to iterate through a user-defined list of directories.
  2. How far down the /ifs tree you can go depends on the FSA configuration within InsightIQ.  By default, InsightIQ sets the "directory filter maximum depth" to 5, meaning you can extract directory information as deep as /ifs/level1/level2/level3/level4/level5.  Should you need to extract project directories deeper than level 5, you can configure the FSA job to crawl deeper from within InsightIQ, as shown in the following image:

[Image: FSA-depth-config.png (InsightIQ FSA configuration showing the "directory filter maximum depth" setting)]

Keep in mind that the larger the maximum depth, the more storage each individual FSA result will consume on the cluster.

 

FSA Extraction Script

 

Here’s a simple bash script I whipped up to iterate through a list of directories.

 

Create a simple file with the list of directories under /ifs you’d like to extract:

[15:33:28] rchang@VNODE0100:[~/blog-work]:cat dir_list.input
data/hdfs_dogfooding/Shipit/DB-Directory
data
home

 

 

Create this script (remember to substitute the cluster name and the job ID):

[15:33:33] rchang@VNODE0100:[~/blog-work]:cat export-fsa.bash
# Iterate through dir_list.input and export each directory from FSA job 473
for i in `cat dir_list.input`
do
   echo "Processing $i..."
   j=`basename $i`
   echo "Basename is $j"
   # Timestamp keeps each output file name unique
   current_date_time=`date +%Y_%m_%d_%H%M%S_`
   iiq_data_export fsa export -c tme-sandbox --data-module directories -o 473 -r directory:$i -n fsa_export_$current_date_time$j.csv
done

 

Once executed, each resulting CSV file is named with the timestamp plus the directory's base name:

[15:34:05] rchang@VNODE0100:[~/blog-work]:. export-fsa.bash
Processing data/hdfs_dogfooding/Shipit/DB-Directory...
Basename is DB-Directory
    Successfully exported data to: fsa_export_2016_06_29_153408_DB-Directory.csv
Processing data...
Basename is data
    Successfully exported data to: fsa_export_2016_06_29_153410_data.csv
Processing home...
Basename is home
    Successfully exported data to: fsa_export_2016_06_29_153411_home.csv
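
Since FSA results are now refreshed daily, you could schedule the export to run after the nightly FSA job completes.  The crontab entry below is only a sketch; the run time, working directory, and log file are assumptions, and for a fully hands-off setup you would replace the hard-coded job ID in the script with the "latest report ID" lookup shown earlier:

# Run the export script every day at 6:30 AM, after the nightly FSA job has finished
# (working directory and log file are hypothetical)
30 6 * * * cd "$HOME/blog-work" && bash export-fsa.bash >> fsa_export.log 2>&1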

 

 

That’s it.  I welcome any comments and feedback.  Happy exporting!