Also review: OneFS 8.1.1.0 and Cloudera CDH 5.13+ Support for Cloudera Navigator



The OneFS 8.1.0.1 MR adds OneFS support for Cloudera's Navigator application, enabling metadata management with Isilon OneFS. The Cloudera Navigator Data Management component is a comprehensive data governance and data stewardship tool. This blog post highlights how Navigator works and how integration with Isilon OneFS is enabled. It is intended to provide a high-level overview of the basic capabilities and how Navigator can be used in conjunction with Hadoop clusters and HDFS data.

 

With the inclusion of Navigator support in OneFS we now support the following data management tasks:

  • Browse and search data - Find owner, creation and modification dates, understand data origins and history
  • Lineage and provenance - Track data from its source and monitor downstream dependencies
  • Discovery and exploration - Add, review, customize, and update metadata about the objects contained in the Hadoop data store
  • Custom metadata and tagging - Add custom tags and information to data and objects in HDFS

 

As of this release of OneFS 8.1.0.x we do not support Navigator data audit capabilities.

 

Overview of Cloudera Navigator

Cloudera Navigator is a tool that supplements Cloudera's Hadoop distribution, CDH. It is a licensed feature that can be integrated to provide additional data tools to administrators and end users of Hadoop data sets. Navigator currently recognizes HDFS, YARN, Impala and Hive as sources of information it can manage. It periodically extracts information from these services to provide additional insight into how data was created and managed, when it was manipulated, and by whom. Metadata and job history, along with HDFS data, feed into the various components of Navigator.

 

 

The main Search page of Cloudera Navigator allows you to search and filter on many criteria (source, type, owner, etc.) to find the data you are looking for.

1.png


An example of the search interface filtering on one metadata attribute: HDFS objects created in the last day

2.png

 

View detailed information on a File System Object

3.png

 

View additional Information on an Impala database

4.png


View Table information on Impala tables

5.png

 

 

Data Governance

One of the primary uses of Navigator is to monitor and track data in an HDFS workflow. One of the unique challenges with very large data sets is tracking and monitoring how data moves through the data analytics workflow. A key Navigator feature is the ability to link input data to output data across analytics jobs such as MapReduce, or across data transformations on table-based data in Hive or Impala databases.

 

Navigator internally analyzes metadata and job history and links them together to generate lineage.

 

An example of simple lineage is how jobs create, ingest and output data in a simple terasuite run: teragen-terasort-teravalidate

6.png

 

 

Review table based lineage data (Impala), how data was ingested and transformed by database workflows: A simple TPC-DS table ingest and query.

7.png

 

Custom Metadata and Tagging

A very useful feature of Navigator is the ability to add custom metadata to objects such as HDFS files, YARN jobs and tables. This allows for easier searching and classification of data and can make it simpler to track and monitor your data sets and their usage.

 

To add a custom tag, select a piece of data in Navigator and select Actions

8.png

 

 

9.png

 

Search on that custom tag

10.png

 

 

 

Review Data Analytics on the HDFS data

From an administrative perspective, the ability to monitor and review who moved data through the system, when, and how can be very useful. Navigator provides a number of interfaces to review this information.

 

HDFS Analytics - When, Who, Size of Data View

11.png


Review Data Stewardship - Review data trends within the HDFS data

12.png


13.png


 

Navigator can also enforce policies to manage data and automatically tag new data to facilitate data management. A number of Navigator capabilities, including role-based access, purge management and custom properties, are beyond the scope of this blog; consult Cloudera's documentation for additional information.

 

Additional information on Cloudera Navigator can be found here:

Cloudera Navigator: Documentation Hub

Cloudera Navigator Data Management

 

 

 

Traditional HDFS Metadata Management

In a Direct Attached Storage (DAS) Hadoop cluster with a NameNode (NN) integrated deployment of HDFS, the NN's main role is to store the HDFS namespace: directory structures, file permissions and the mapping of block IDs to files, that is, all the metadata of the underlying data blocks. While this data is held in memory for operational use, it is critical that it is also persisted to disk for recovery and fault tolerance.

 

In normal HDFS this metadata is stored in two ways: an FSImage and an edit log (INotify stream). The FSImage is a complete point-in-time representation of the HDFS file system's metadata. The FSImage file is very efficient to read and is used on NN startup to load the metadata into memory, but it is very poor at handling incremental updates. So, rather than rewriting the FSImage all the time, all modifications to the HDFS file system are recorded in a transaction log (INotify stream). This gives the NN two capabilities: modifications can be tracked without constantly regenerating the FSImage file, and in the event of an NN restart, the latest FSImage can be loaded and the INotify log replayed on top of it to provide an accurate view of the file system up to that point in time.
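The relationship between a snapshot and its transaction log can be illustrated with a toy model. The Python sketch below is not HDFS code: the `Edit` record, the operation names and the `replay` function are hypothetical names chosen for illustration, and the namespace is reduced to a mapping of path to file size. It only shows how a point-in-time image plus an ordered log reproduces the current state.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Edit:
    """One logged modification (hypothetical, heavily simplified)."""
    txid: int      # monotonically increasing transaction ID
    op: str        # "create" or "delete"
    path: str
    size: int = 0


def replay(fsimage: Dict[str, int], edits: Iterable[Edit]) -> Dict[str, int]:
    """Apply edits in transaction order to a copy of the snapshot."""
    namespace = dict(fsimage)  # the snapshot itself is never modified
    for edit in sorted(edits, key=lambda e: e.txid):
        if edit.op == "create":
            namespace[edit.path] = edit.size
        elif edit.op == "delete":
            namespace.pop(edit.path, None)
        else:
            raise ValueError(f"unknown op: {edit.op}")
    return namespace


# Snapshot taken at some earlier point in time.
snapshot = {"/data/a.csv": 100}

# Modifications recorded since the snapshot was written.
log = [
    Edit(1, "create", "/data/b.csv", 250),
    Edit(2, "delete", "/data/a.csv"),
]

current = replay(snapshot, log)
print(current)
```

Recording each change as a small log entry keeps writes cheap, while the snapshot keeps startup fast; neither structure alone gives both properties.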

 

Eventually, the HDFS cluster needs to construct a new FSImage that consolidates all INotify log entries with the old FSImage, providing an updated point-in-time representation of the file system. This is known as checkpointing and is a resource-intensive operation. During checkpointing the NN would have to restrict user access to the system, so instead of restricting access to the active NN, HDFS offloads this operation to the Secondary NameNode (SN), or to a standby NN when operating in HA mode. The SN merges the existing FSImage with the INotify transaction logs and generates a new, complete FSImage for the NN. From then on, the latest FSImage can be used in conjunction with newer INotify log files to represent the current file system. It is important that checkpoints occur; otherwise, on restart the NN has to reconstruct the entire HDFS metadata from the available FSImage and all INotify logs. This can take a long time, and the HDFS file system is unavailable while it occurs.
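As a rough illustration of why checkpointing shortens restart, the Python sketch below models a checkpoint as a merge of all logged edits into a new image, after which only later edits need replaying. It is a toy model under stated assumptions: edits are `(txid, op, path)` tuples, the namespace is a set of paths, and `apply_edits` and `checkpoint` are hypothetical helper names, not HDFS functions.

```python
from typing import List, Set, Tuple

Edit = Tuple[int, str, str]  # (txid, op, path); op is "create" or "delete"


def apply_edits(paths: Set[str], edits: List[Edit]) -> Set[str]:
    """Apply edits in transaction order and return the resulting path set."""
    result = set(paths)
    for _txid, op, path in sorted(edits):
        if op == "create":
            result.add(path)
        elif op == "delete":
            result.discard(path)
    return result


def checkpoint(image: Set[str], image_txid: int, edits: List[Edit]):
    """Merge every edit newer than the image into a new image.

    Returns (new_image, new_image_txid, remaining_edits). The remaining
    log is empty because everything has been folded into the image.
    """
    pending = [e for e in edits if e[0] > image_txid]
    new_image = apply_edits(image, pending)
    new_txid = max((e[0] for e in pending), default=image_txid)
    return new_image, new_txid, []


old_image, old_txid = {"/a"}, 0
edits = [(1, "create", "/b"), (2, "delete", "/a"), (3, "create", "/c")]

# Without a checkpoint, a restart replays all three edits.
restart_without_checkpoint = apply_edits(old_image, edits)

# With a checkpoint, the edits are merged once, ahead of time.
new_image, new_txid, edits_after = checkpoint(old_image, old_txid, edits)

# Only edits made after the checkpoint need replaying on restart.
edits_after.append((4, "create", "/d"))
restart_with_checkpoint = apply_edits(new_image, edits_after)
```

In this model the restart after the checkpoint replays one edit instead of four, which is the effect the checkpoint is meant to achieve at scale.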

 

 

How Navigator Works

The Navigator Metadata Service accesses information in a number of ways: YARN application logs, Hive and Impala applications, and HDFS metadata obtained by polling the FSImage file and INotify transaction logs. It collects all this information and stores it within a Solr database on the Hadoop cluster. Navigator then runs additional extractions and analytics on this data to create the views seen in Navigator. The ability to collect the underlying HDFS metadata from FSImage and INotify is critical to Navigator's ability to view the file system, and is why, up until now, OneFS based Hadoop clusters were unable to provide HDFS file system data to Navigator. Navigator's primary behavior is to read an initial FSImage and then use the INotify logs to pick up all file system updates that have occurred since. Under specific situations Navigator may be required to refresh its data from a full FSImage rather than leveraging the INotify log, but this does not occur normally.
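The read pattern described here (one full image, then incremental events, with an occasional full refresh) can be sketched as follows. This is an illustrative Python model, not Navigator or HDFS code: `MetadataSource` and `Consumer` are hypothetical classes, and the trigger for a full refresh (the reader needing events that are no longer retained) is an assumption made for the example, since the specific situations are not spelled out above.

```python
from typing import List, Optional, Set, Tuple

Event = Tuple[int, str, str]  # (txid, op, path); op is "create" or "delete"


class MetadataSource:
    """Hypothetical stand-in for a service exposing an image and an event stream."""

    def __init__(self, paths: Set[str]) -> None:
        self.paths = set(paths)
        self.events: List[Event] = []  # retained events, oldest first
        self.last_txid = 0

    def change(self, op: str, path: str) -> None:
        """Apply one create/delete and record it in the event stream."""
        self.last_txid += 1
        if op == "create":
            self.paths.add(path)
        else:
            self.paths.discard(path)
        self.events.append((self.last_txid, op, path))

    def purge_through(self, txid: int) -> None:
        """Drop events up to and including txid (models limited retention)."""
        self.events = [e for e in self.events if e[0] > txid]

    def fsimage(self) -> Tuple[int, Set[str]]:
        """Return (txid the image is current to, copy of the namespace)."""
        return self.last_txid, set(self.paths)

    def events_since(self, txid: int) -> Optional[List[Event]]:
        """Return events newer than txid, or None if some are no longer retained."""
        if txid >= self.last_txid:
            return []
        if not self.events or self.events[0][0] > txid + 1:
            return None
        return [e for e in self.events if e[0] > txid]


class Consumer:
    """Toy reader: one initial image, then incremental polling."""

    def __init__(self, source: MetadataSource) -> None:
        self.source = source
        self.txid, self.view = source.fsimage()
        self.full_reads = 1

    def poll(self) -> None:
        events = self.source.events_since(self.txid)
        if events is None:
            # Assumed trigger: gap in the stream, so re-read a full image.
            self.txid, self.view = self.source.fsimage()
            self.full_reads += 1
            return
        for txid, op, path in events:
            if op == "create":
                self.view.add(path)
            else:
                self.view.discard(path)
            self.txid = txid


source = MetadataSource({"/data/a"})
reader = Consumer(source)

source.change("create", "/data/b")
reader.poll()                            # incremental update from the stream
incremental_view = set(reader.view)

source.change("create", "/data/c")
source.purge_through(source.last_txid)   # events expire before the next poll
reader.poll()                            # gap detected, full image re-read
```

The reader's view is only as fresh as its last poll, which matches the periodic, non-real-time behavior of the extraction.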

 

It is important to recognize that Navigator data is not real-time; it is updated periodically through polling and extraction to create the data views. This behavior is the same for both DAS and Isilon based deployments and is how Navigator is designed to operate.

 

 

OneFS and Metadata

When integrated into a Hadoop cluster, Isilon OneFS provides the cluster's storage file system, which is based on OneFS rather than on an HDFS based file system. Its layout and protection scheme are fundamentally different from those of HDFS, and so is its management of metadata and blocks. OneFS is not an NN based HDFS file system and no NN is present in the Hadoop cluster; instead, OneFS provides NN and DataNode (DN) like functionality on top of the native OneFS system, which the remote Hadoop cluster accesses via the HDFS services and protocols. Our approach to handling file system allocation, block location and metadata management is therefore fundamentally different from how a traditional Apache based HDFS file system manages its data and metadata.

 

The long and short of this is that we don't rely on FSImage and INotify transaction log based metadata management within OneFS for HDFS data. In order to support the native OneFS capabilities described in the Enterprise Features for Hadoop whitepaper and provide multiprotocol access, we use the underlying OneFS file system presented to the HDFS protocol for Hadoop access. Therefore, until now, we had no capability to provide an FSImage and INotify log for consumption by Navigator. With the release of MR 8.1.0.1, OneFS includes the capability to integrate with Navigator by enabling an FSImage and INotify log stream on OneFS in an HDFS Access Zone. Enabling this feature in effect tells OneFS to create an FSImage file and start tracking HDFS file system events in an INotify log file, both of which are available for consumption by Navigator.

 

 

OneFS Support for Navigator

Since these components are now accessible to Navigator, OneFS based Hadoop can provide the required HDFS metadata to Navigator for inclusion and analytics. Once we enable an HDFS Access Zone root for FSImage and INotify integration, OneFS effectively begins to mimic the behavior of a traditional NN deployment: an FSImage file is generated by OneFS and all HDFS file system operations are logged into an INotify stream. Periodically OneFS regenerates a new FSImage. This is not true checkpointing and merging of the INotify log as an HDFS NN performs it, because the actual file system and its operations are still handled by the core OneFS file system. The FSImage and INotify logs are generated to provide the required data to Navigator in the required format.

 

The FSImage regeneration job runs daily to recreate a current FSImage which, combined with the current INotify logs, represents the current state of data and metadata in the HDFS root from an HDFS perspective. At its heart OneFS is a true multi-protocol file system that provides unified access to its data through many protocols: HDFS, NFS, SMB and others. Since only HDFS file system operations are captured by the INotify log, Navigator will initially see only this metadata; any metadata created in the HDFS data directories by NFS or SMB is not included in the INotify stream. However, when an FSImage is regenerated, these files are included in it, and Navigator will see them the next time it reads a refreshed FSImage. Since Navigator's primary method of obtaining updated metadata is the INotify log, it may be some time before data that did not originate over HDFS is included. This is expected behavior and should be taken into account if multiprotocol workflows are in use.
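The visibility gap for multiprotocol writes can be made concrete with a small model. The Python sketch below is illustrative only: `ZoneModel` is a hypothetical class, not a OneFS interface, and it reduces the zone to a set of files, a list of paths logged for HDFS writes, and the last generated image.

```python
class ZoneModel:
    """Toy model of an access zone written to over several protocols."""

    def __init__(self) -> None:
        self.files = set()     # what actually exists on the file system
        self.inotify = []      # paths logged for HDFS operations only
        self.fsimage = set()   # contents of the last generated image

    def write(self, path: str, protocol: str) -> None:
        """Create a file; only HDFS writes are recorded in the event stream."""
        self.files.add(path)
        if protocol == "hdfs":
            self.inotify.append(path)

    def regenerate_fsimage(self) -> None:
        """Rebuild the image from the file system, whatever protocol wrote the data."""
        self.fsimage = set(self.files)


zone = ZoneModel()
zone.regenerate_fsimage()                     # initial, empty image

zone.write("/hadoop/from_hdfs.txt", "hdfs")
zone.write("/hadoop/from_nfs.txt", "nfs")

# What a reader of image + event stream can see before the next regeneration.
seen_incrementally = zone.fsimage | set(zone.inotify)

zone.regenerate_fsimage()                     # for example, the next daily run
seen_after_regeneration = zone.fsimage | set(zone.inotify)
```

In the model, the NFS-written file is invisible until the image is regenerated and re-read, which is the delay to plan for when data lands in the HDFS root over NFS or SMB.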

 

 

Using Navigator with OneFS

Within OneFS the FSImage and INotify features are Access Zone aware and should be enabled only on Hadoop enabled Access Zones that will use Navigator. There is no reason to enable them on a zone that is not being monitored by Navigator; doing so just adds overhead to the cluster for a feature that is not being consumed. To enable Navigator integration, both FSImage and INotify need to be enabled on the HDFS Access Zone. Once enabled, they should not be disabled unless the use of Navigator is to be permanently discontinued.

 

No additional configuration changes are required within Cloudera Manager or Navigator to enable functionality. When integration is first enabled, it will take some time for the initial HDFS data to show within Navigator and additional time to generate linkage. As new data is added, it will show and be linked based on the polling and extraction period within Navigator.

 

The following section outlines how to enable this feature within OneFS:

 

Enable FSImage on the HDFS Access Zone:

isi hdfs fsimage settings modify --enabled=true --zone=zone1-cdh --verbose

14.png

Review the status of FSImage:

isi hdfs fsimage settings view --zone=zone1-cdh      

15.png

Review the status of the FSImage job:

isi hdfs fsimage job view --zone=zone1-cdh       

16.png

Review the frequency of the FSImage job:

isi hdfs fsimage job settings view --zone=zone1-cdh

17.png

Review the latest FSImage:

isi hdfs fsimage latest view --zone=zone1-cdh

18.png

 

It may take some time for the initial FSImage to be generated.



Enable INotify on the HDFS Access Zone:

isi hdfs inotify settings modify --enabled=true --zone=zone1-cdh --verbose

19.png

Review the configuration of the INotify stream:

isi hdfs inotify settings view --zone=zone1-cdh

20.png

Review the INotify stream:

isi hdfs inotify stream view --zone=zone1-cdh

21.png

The Sync and Current IDs will update periodically as you run this command.



When enabling integration of Cloudera Navigator and OneFS, it can take a few hours for initial HDFS data to show up within Navigator, depending on the generation of an FSImage and INotify stream. This is expected behavior.
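For sites that script this procedure, the two enable commands shown earlier can be assembled per access zone. The Python sketch below is a convenience wrapper and an assumption of this example, not part of OneFS: `enable_commands` and `run_all` are hypothetical helper names, the only `isi` arguments used are the ones that appear in the commands above, and the default is a dry run that prints the commands without executing them.

```python
import subprocess
from typing import List


def enable_commands(zone: str) -> List[List[str]]:
    """Build the FSImage and INotify enable commands for one access zone."""
    return [
        ["isi", "hdfs", "fsimage", "settings", "modify",
         "--enabled=true", f"--zone={zone}", "--verbose"],
        ["isi", "hdfs", "inotify", "settings", "modify",
         "--enabled=true", f"--zone={zone}", "--verbose"],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False.

    Executing requires running on a OneFS node where the isi CLI exists.
    """
    lines = []
    for argv in commands:
        line = " ".join(argv)
        lines.append(line)
        print(line)
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failure
    return lines


# Dry run for the example zone used in this post.
planned = run_all(enable_commands("zone1-cdh"))
```

Building both commands from a single zone name keeps FSImage and INotify enabled together, which is the state Navigator integration needs.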


After enablement and generation, you will see HDFS objects in Navigator for browsing.

navenabled.png

 



An overview of FSImage and INotify commands in OneFS:

 

rsteven-3u7xf1k-1# isi hdfs fsimage job settings modify -

--generation-interval      -- The interval between successive FSImages.

--help                 -h  -- Display help for this command.

--verbose              -v -- Display more detailed information.

--zone                     -- Access zone.

 

 

rsteven-3u7xf1k-1# isi hdfs fsimage settings modify -

--enabled      -- Allow access to FSImage and start FSImage generation

--help     -h -- Display help for this command.

--verbose  -v  -- Display more detailed information.

--zone         -- Access zone.

 

 

rsteven-3u7xf1k-1# isi hdfs inotify settings modify -

--enabled            -- Enable the collection of edits over HDFS and access to the edits via HDFS INotify stream.

--help           -h -- Display help for this command.

--maximum-delay      -- The maximum duration until an edit event is reported in INotify.

--retention          -- The minimum duration edits will be retained.

--verbose        -v -- Display more detailed information.

--zone               -- Access zone.

 

 

Requirements:

In order to implement OneFS and Navigator integration, the minimum required versions are:

 

As of October 2nd:

OneFS 8.1.0.1 MR + Navigator DA Patch (see your account team to obtain the patch)

Cloudera Manager CDH 5.12


FSImage and INotify functionality will be integrated into an upcoming major release of OneFS, removing the requirement for any DA patches to expose this functionality.

 

 

Best Practices

- Enable FSImage and INotify prior to Cloudera deployment, after setting the HDFS root directory and before placing any user data in it.

- There is no known need to adjust the duration of HDFS FSImage and INotify jobs unless instructed to by support.


 

Conclusion

The integration of FSImage and INotify capabilities into OneFS allows Isilon OneFS based Hadoop cluster deployments to provide metadata management and data lineage through Cloudera Navigator. This integration extends the enterprise capabilities of OneFS based Hadoop deployments, providing parity with native DAS based HDFS file systems and their data management options. Additional information on Cloudera Navigator integration with OneFS can be obtained from your account team.

 

 

 

Using Hadoop with Isilon - Isilon Info Hub

russ_stevenson
