In my last blog (https://community.emc.com/blogs/keith/2016/07/28/hadoop-on-isilon--configuring-a-command-line-gateway) I outlined how to deploy Cloudera CDH on Isilon shared storage and set up a Linux CLI gateway host that lets users submit Hadoop jobs from a perimeter edge host.  This allows users to do their work on a non-critical gateway host that runs no significant Hadoop roles other than gateway services.  To keep things simple, I assumed I would not use Kerberos in that configuration.

 

Is it realistic to ignore Kerberos for a production Hadoop cluster?  No!  I'll again reference the "Cloudera Security" document (http://www.cloudera.com/documentation/enterprise/5-5-x/PDF/cloudera-security.pdf) and point out that without any authentication or authorization we do not have a secure cluster and are running at "Level 0".  With Hadoop on Isilon, implementing authentication (Kerberos) and authorization (file and directory permissions) is necessary for moving to "Level 1" and getting your Hadoop cluster production ready while taking advantage of Isilon shared storage for multiprotocol access.

 

This post will cover more abstract concepts and less "how-to" content than normal since there is already an excellent guide written by Russ Stevenson on the topic (Cloudera CDH 5.7 with Isilon 8.0.0.1 and Active Directory Kerberos Implementation).  Consider this blog post supplemental; my goal is to explain how things work.  The first few times I went through this process I was lost and had more questions than answers.  Hopefully this will be a shortcut for those who find it and will consolidate some of the background information.

 

Also, please contact Cloudera and EMC when planning this type of work.  My goal is to help customers understand how Hadoop on Isilon works with Kerberos, but I do not intend this as a guide to use for your production environment.  Please talk to Cloudera and EMC (or your system integrator/partner) about planning this work and having their professional services folks help out.  It will save you time!

 

Concepts

 

Authentication

 

When you connect to a NAS share or HDFS filesystem, how does the system know if you are who you say you are?  If you are familiar with NFSv3 you will understand that asking an NFS client to announce its user ID (UID) and group ID (GID) is not secure since there is no real authentication involved.  A NAS system hosting NFSv3 shares will simply give a user access provided they present the correct UID/GID (among other things; this is oversimplified to make a point).  HDFS is similar in that it uses user and group account names instead of numbers.  So if a rogue user connected to a "Level 0" Hadoop cluster as a superuser account ("hdfs", "mapred") then HDFS would give that account full access without any further authentication.

 

Kerberos is the de facto standard for authentication due to its strong security capabilities.  Instead of connecting to a share as user ID '501' or as user 'hdfs', the Kerberos client connects to the Kerberos Key Distribution Center (KDC) and performs an encrypted authentication exchange using principals and keys (stored in keytabs or derived from passwords).  Do you have Active Directory in your environment?  Then you are already using Kerberos when your Windows users authenticate their accounts with a domain controller.  Active Directory not only provides directory services (a centrally managed list of accounts and groups) but also uses Kerberos to authenticate all the accounts in the directory.  MIT Kerberos is an alternative to using an Active Directory KDC, but MIT Kerberos is not a directory service and is usually combined with a directory provider (AD or LDAP).

 

Hadoop is typically configured to use Kerberos to provide legitimate secure authentication.  Isilon is also typically configured to use Kerberos authentication by joining the Isilon to an Active Directory domain or by configuring an MIT KDC realm (https://mydocuments.emc.com/pub/en-us/isilon/onefs/7.2.1/ifs-pub-administration-guide-gui/05-ifs-br-authentication.htm).  When running Cloudera CDH on Isilon, it's best to configure and test an insecure installation first ("Level 0").  Then the planning can start for enabling Kerberos for secure authentication.  Whether you are using Hadoop on DAS (direct attached storage) or Hadoop on Isilon (for shared storage), the Kerberos concepts and configurations are very similar, but I will highlight the differences where appropriate.

 

Authorization

 

Once you've authenticated yourself through Kerberos, what data are you authorized to access?  Think of authenticating through Kerberos as the first step and authorization as the next step.  Authorization is handled through POSIX permissions on HDFS data and access control lists (ACLs).  POSIX permissions are the well-known Unix-style permissions on HDFS filesystem data that you manage through 'chown' and 'chmod'.  HDFS ACLs (disabled by default) allow you to implement permissions that differ from traditional POSIX users and groups.
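For example, here is a minimal sketch of managing those POSIX permissions on an HDFS home directory from the Hadoop side (the path and mode are just illustrations based on this lab, not required settings):

# Illustrative only: give user "keith" sole access to his HDFS home directory
hdfs dfs -chown keith /user/keith
hdfs dfs -chmod 700 /user/keith
hdfs dfs -ls /user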

 

HDFS ACLs are not implemented within Isilon OneFS and cannot be used with Hadoop on Isilon!  See this link (Using CDH with Isilon Storage) for a reference from Cloudera.  Isilon is a multi-protocol scale-out NAS array that supports simultaneous access to the same data over SMB, NFS, and HDFS.  So with Isilon we really don't need HDFS ACLs since we can permission the data with Windows ACLs which are very common in most IT shops.

 

Let's take our CLI gateway setup as an example (https://community.emc.com/blogs/keith/2016/07/28/hadoop-on-isilon--configuring-a-command-line-gateway).  We created a Windows user in Active Directory, created an HDFS home directory (in /user) for that user, and then changed the owner on that user's home directory to match the Active Directory user ID as pulled from the Isilon.  Because the Isilon supports HDFS, NFS, and SMB access, ACL capabilities already exist outside HDFS since the HDFS data lives on shared Isilon storage.  So if I want my AD user "keith" to use Unix-style POSIX permissions I can set them from the Isilon side or from HDFS.  If I want to use Windows ACLs I can share this user directory (the HDFS path /user/keith) out via SMB and manage ACLs through Windows Explorer or the Isilon console.  The Isilon already has ACL capabilities because it's a NAS system designed for NFS and SMB access, so we can use Windows ACLs, which are very common, and in most cases don't need HDFS ACLs.  Bottom line, one of the great things about Hadoop on Isilon is that Windows users can access their HDFS user directory through Windows Explorer while also accessing the exact same data through HDFS on Hadoop.
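As a rough illustration of that flexibility, the same directory can be managed from either side.  The /ifs path and SMB share name below are assumptions based on a typical access zone layout, and the 'isi smb shares create' flags can vary by OneFS version, so treat this as a sketch rather than exact syntax:

# From any Hadoop client, over HDFS
hdfs dfs -chmod 750 /user/keith

# Directly on the Isilon (the /ifs path is illustrative)
chmod 750 /ifs/zone1/hadoop/user/keith

# Optionally share the same directory out over SMB so "keith" can browse it in Windows Explorer
isi smb shares create keith --path=/ifs/zone1/hadoop/user/keith --zone=zone1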

 

Note, there are other ACLs in Hadoop that have nothing to do with file and directory permissions (such as job queue ACLs).  These are still supported; we are only talking about HDFS file and directory ACLs here.

 

Simple vs Kerberos Authentication

 

Hadoop (and Isilon) supports two types of HDFS authentication: simple and Kerberos.  Simple authentication means there really is no authentication, as we mentioned above; Hadoop "trusts" that a user is who they say they are with no verification.  You could even put a username in the HADOOP_USER_NAME environment variable to impersonate other users; there is no further verification, and jobs will run as any user you like, including built-in superuser accounts like "hdfs" and "mapred".  A Cloudera CDH installation that has not yet had Kerberos enabled will default to simple authentication, and an Isilon access zone by default will also support simple authentication.  The Isilon also matches UIDs and GIDs for HDFS data, so it's important to keep the UIDs and GIDs on the Isilon permissions in sync with your external directory for authorization (we did this in the last blog post when setting the UID/GID in HDFS instead of just the friendly account name).  It's best to set up your Cloudera on Isilon cluster with simple authentication first to make sure everything is working before moving to Kerberos authentication.
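To make the point concrete, this is what impersonation looks like on a simple-authentication ("Level 0") cluster; no password or ticket is involved anywhere:

# With simple authentication, HDFS trusts whatever identity the client claims
export HADOOP_USER_NAME=hdfs
hdfs dfs -ls /            # runs with superuser privileges, no verification at all
unset HADOOP_USER_NAME    # go back to your login identity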

 

Kerberos authentication is enabled both on the Cloudera cluster (Cloudera wizard to enable Kerberos) and on the Isilon (access zone setting to switch from simple authentication to Kerberos).  Once enabled, all HDFS tasks require a Kerberos ticket from the user submitting the command or job.  What does this mean?  Simply put, you can't do any work without first obtaining a ticket.  If you are running jobs from a command line, that means you need to 'kinit' prior to performing any work that requires access to HDFS.  Without this ticket ('kinit') all commands and jobs will fail with a "PriviledgedActionException" error, even a simple 'hdfs dfs -ls' command.
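On the Isilon side, the access zone change looks something like the following on OneFS 7.2.  The zone name is an assumption and the exact flags differ between OneFS releases (OneFS 8.0 moves this under 'isi hdfs settings'), so check the OneFS CLI guide for your version:

# On the Isilon: switch the access zone's HDFS authentication from simple to Kerberos only
isi zone zones modify zone1 --hdfs-authentication=kerberos_only

# Confirm the zone settings
isi zone zones view zone1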

 

Simple authentication mode allows users to very easily impersonate superuser accounts (hdfs, mapred), and this happens when users submit jobs to the various Hadoop ecosystem components.  Say my user "keith" submits a MapReduce job: Hadoop will run parts of the job as the "mapred" account by default, without the user knowing or maliciously trying to impersonate that superuser account (this gets complicated, comments welcome).  When Kerberos is turned on this still happens, but it needs to be done securely via delegation tokens.  Why do we need this?  Say a user logs into a CLI gateway host, runs 'kinit', gets a Kerberos ticket, and submits a job using data in HDFS that they are authorized to access.  This job may run for a few minutes or a few hours.  That user may log off or destroy their Kerberos ticket, but we want the job to continue to run since it was submitted securely.  So a delegation token is used, which simply impersonates the user securely for a fixed amount of time (1 day) but can be renewed.  Kerberos authentication means delegation tokens will now be used between Cloudera and Isilon for an additional layer of security when users submit Hadoop jobs.
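If you want to see a delegation token outside of a job submission, the HDFS client can fetch one to a file once you hold a Kerberos ticket.  This is purely illustrative (the job client does this for you automatically, as the teragen output later in this post shows), and I have not verified the behavior against every Isilon release:

# After kinit, request a delegation token from the HDFS service and store it in a file
hdfs fetchdt /tmp/keith.token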

 

Proxy Users

 

Superuser accounts in Hadoop (mapred, hdfs, etc.) need the ability to submit Hadoop jobs on behalf of end users and groups.  As we mentioned above for simple authentication mode, this happens behind the scenes because there simply is no authentication in simple mode.  Superusers impersonate end users who submit jobs with no authentication.  If user "keith" submits a MapReduce job, the superuser account "mapred" impersonates user "keith" when necessary.  When Kerberos is configured the same thing happens, except the superuser account must obtain Kerberos credentials.  That's not a problem because the Cloudera Kerberos wizard will create these superuser accounts in Active Directory and authenticate them via Kerberos when necessary (you don't get the password as the Hadoop admin!).

 

When using Hadoop on DAS (again, direct attached storage) this works automatically since the HDFS filesystem is self-contained within the compute nodes.  Introducing Isilon as shared storage requires an extra step since the Isilon does not automatically know which superuser accounts are allowed to impersonate end users.  To enable this, the Isilon has the concept of a proxy user to mimic this behavior (https://mydocuments.emc.com/pub/en-us/isilon/onefs/7.2.1/ifs-pub-administration-guide-cli/25-ifs-br-hadoop.htm).  When configuring a secure Kerberized setup you need to create proxy users for every Hadoop superuser account that needs to securely impersonate end users, based on the type of jobs you plan on running.  The documentation describes how to do this; don't skip it or your jobs will fail!
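As a hedged sketch (the zone and group names are from this lab, and flags can differ between OneFS releases, so follow the EMC documentation linked above), creating a proxy user on the Isilon looks roughly like this:

# On the Isilon: let the "mapred" superuser impersonate members of an AD group of Hadoop users
isi hdfs proxyusers create mapred --zone=zone1 --add-group=hadoop-users

# Confirm the proxy users and their members
isi hdfs proxyusers list --zone=zone1
isi hdfs proxyusers members list mapred --zone=zone1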

 

Putting it all together

 

So how do we take our "Level 0" Cloudera cluster with a CLI gateway (one last time -> https://community.emc.com/blogs/keith/2016/07/28/hadoop-on-isilon--configuring-a-command-line-gateway) and secure it with Kerberos?  It's not that difficult now that we've explained all the info above, and we already know things are working without Kerberos.

 

Recap of our environment

 

- CDH 5.5.4 running on four (4) CentOS 6.8 VMs

- OneFS 7.2.1.1 running on three (3) virtual Isilon nodes

- One (1) Windows 2008 R2 host acting as a domain controller, DNS server, and certificate authority

 

Our CDH cluster nodes are:

 

- One (1) Linux VM running Cloudera Manager

- Two (2) Linux VMs running all CDH Hadoop roles (master nodes / compute nodes)

- One (1) Linux VM acting as our CLI gateway running only gateway roles; this is where users submit their Hadoop jobs via the CLI

 


 

Our Kerberos setup will use the Windows domain controller as the Key Distribution Center (KDC) and we will not use MIT in this example (although you could).  Our steps will look like:

 

Configure SSSD on all nodes

 

If you remember, we had SSSD and all its required packages installed and configured on our CLI gateway host so that Active Directory users could log into the Linux host with their AD credentials and submit Hadoop jobs.  Substitute your favorite commercial package for SSSD if you like.  We don't need SSSD on the Cloudera Manager host.


Why do we need SSSD on all hosts and not just the CLI gateway host?  With all the info we've covered in this post regarding simple versus Kerberos authentication, we now know that even though we were submitting jobs on the CLI gateway host as an AD user, the jobs were subsequently run by superuser accounts without authentication and without our knowledge.  The Hadoop nodes running core services had no knowledge of our Active Directory accounts; they were using superuser accounts like "mapred" and "hdfs".  However, if we want to use Kerberos we need every node in the Hadoop cluster (except the Cloudera Manager host) to recognize all of our directory accounts (AD accounts), so we must install and configure our SSSD package everywhere.


See my last blog post for how to configure SSSD; you can most likely copy the .conf files to your other hosts and start the services without much trouble.
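As a rough sketch (the hostnames are from this lab, and it assumes the required packages are already installed as covered in the previous post), pushing the working configs from the gateway to the other CDH nodes can look like this:

# Copy the known-good SSSD and Kerberos client configs from the CLI gateway to each CDH node,
# then enable and start SSSD (CentOS 6 style service management)
for host in slaveone.keith.com slavetwo.keith.com; do
  scp /etc/sssd/sssd.conf root@${host}:/etc/sssd/sssd.conf
  scp /etc/krb5.conf root@${host}:/etc/krb5.conf
  ssh root@${host} 'chmod 600 /etc/sssd/sssd.conf && chkconfig sssd on && service sssd restart'
done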

 

Create Isilon proxy users for Hadoop superusers and add end users

 

Just follow the process described in the link (Using CDH with Isilon Storage).  The process is straightforward: create the proxy superusers with the commands given and then add your Hadoop users.  Better yet, put your Hadoop users in groups and assign groups to the proxy users (less work in the long run).  At this point you can't nest groups within groups, so just keep things simple.  You will need to know which AD users need to run Hadoop jobs and make sure you permission their HDFS /user home directories with the correct UIDs/GIDs from the Isilon perspective ('isi auth mapping token...').
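To double-check the UID/GID an AD user resolves to on the Isilon before fixing up their HDFS home directory, something like the following works.  The zone name and domain are from this lab, the numeric IDs are placeholders, and the 'isi auth mapping token' flag syntax may differ slightly by OneFS version:

# On the Isilon: show the identity token (UID, GID, groups) the zone resolves for an AD user
isi auth mapping token --user='KEITH\keith' --zone=zone1

# On a Hadoop client: make the HDFS home directory ownership match the IDs reported above
hdfs dfs -chown 1000001:1000000 /user/keith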

 

Follow Cloudera's documented process for enabling Kerberos through the wizard (Enabling Kerberos Authentication Using the Wizard).


Note that OneFS 8.0 and up supports AES-256 encryption but OneFS 7.x does not!  This is not really a problem if you follow the default steps and are OK with other encryption types; just use OneFS 8.0 if you explicitly need AES-256.  You will also need to create an AD user, preferably in a new Active Directory organizational unit (OU), that has the ability to "Create, Delete and Manage User Accounts".  The documentation describes this well, so just follow Cloudera's process.


Note, you may stall during the Cloudera Kerberos wizard when starting up the Hue service.  Again, follow the step-by-step process in this excellent blog (Cloudera CDH 5.7 with Isilon 8.0.0.1 and Active Directory Kerberos Implementation) to get past that error; it contains a link that specifically addresses how to start Hue.  The overall process is very well documented and can be used for the entire process of getting Kerberos enabled on CDH on Isilon (great job Russ!).

 

Test with your AD users

 

Now the moment of truth!  Log into your CLI gateway host as your AD user ("keith" in my case) and try to run an 'hdfs' command.  It will fail of course since we did not obtain a Kerberos ticket (if it succeeds there is something wrong or you have a Kerberos ticket cached).  Before you can do anything on the secure cluster you must obtain a ticket from the KDC (our Windows domain controller) so we need to run the 'kinit' command.

 

-sh-4.1$ kinit keith@KEITH.COM

Password for keith@KEITH.COM:

-sh-4.1$ klist -e

Ticket cache: FILE:/tmp/krb5cc_710801189

Default principal: keith@KEITH.COM

 

Valid starting     Expires            Service principal

08/04/16 08:40:10  08/04/16 18:40:11  krbtgt/KEITH.COM@KEITH.COM

  renew until 08/11/16 08:40:10, Etype (skey, tkt): arcfour-hmac, aes256-cts-hmac-sha1-96

-sh-4.1$ hdfs dfs -ls /user/keith

Found 3 items

drwx------   - keith domain users          0 2016-08-04 08:28 /user/keith/.staging

drwxrwxrwx   - keith domain users          0 2016-08-04 08:25 /user/keith/QuasiMonteCarlo_1470313498632_1021303185

 

Success!  Now to submit a test job.  Our Kerberos ticket is valid for a fixed time period, so we don't have to run 'kinit' every time; we can continue to use the same ticket.  I'm going to submit a sample teragen job as my user "keith".  Notice the creation of the delegation token (as discussed above).


-sh-4.1$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 1000 /user/keith/out

16/08/04 08:40:39 INFO client.RMProxy: Connecting to ResourceManager at slavetwo.keith.com/192.168.0.20:8032

16/08/04 08:40:39 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 17 for keith on 192.168.0.54:8020

16/08/04 08:40:39 INFO security.TokenCache: Got dt for hdfs://cloudera.isilon.keith.com:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 192.168.0.54:8020, Ident: (HDFS_DELEGATION_TOKEN token 17 for keith)

16/08/04 08:40:39 INFO terasort.TeraSort: Generating 1000 using 2

16/08/04 08:40:39 INFO mapreduce.JobSubmitter: number of splits:2

16/08/04 08:40:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470244871757_0028

16/08/04 08:40:40 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: 192.168.0.54:8020, Ident: (HDFS_DELEGATION_TOKEN token 17 for keith)

16/08/04 08:40:40 INFO impl.YarnClientImpl: Submitted application application_1470244871757_0028

16/08/04 08:40:40 INFO mapreduce.Job: The url to track the job: http://slavetwo.keith.com:8088/proxy/application_1470244871757_0028/

16/08/04 08:40:40 INFO mapreduce.Job: Running job: job_1470244871757_0028

16/08/04 08:40:56 INFO mapreduce.Job: Job job_1470244871757_0028 running in uber mode : false

16/08/04 08:40:56 INFO mapreduce.Job:  map 0% reduce 0%

16/08/04 08:41:07 INFO mapreduce.Job:  map 50% reduce 0%

16/08/04 08:41:13 INFO mapreduce.Job:  map 100% reduce 0%

16/08/04 08:41:14 INFO mapreduce.Job: Job job_1470244871757_0028 completed successfully

16/08/04 08:41:14 INFO mapreduce.Job: Counters: 31

  File System Counters

  FILE: Number of bytes read=0

  FILE: Number of bytes written=229962

  FILE: Number of read operations=0

  FILE: Number of large read operations=0

  FILE: Number of write operations=0

  HDFS: Number of bytes read=164

  HDFS: Number of bytes written=100000

  HDFS: Number of read operations=8

  HDFS: Number of large read operations=0

  HDFS: Number of write operations=4

  Job Counters

  Launched map tasks=2

  Other local map tasks=2

  Total time spent by all maps in occupied slots (ms)=11587

  Total time spent by all reduces in occupied slots (ms)=0

  Total time spent by all map tasks (ms)=11587

  Total vcore-seconds taken by all map tasks=11587

  Total megabyte-seconds taken by all map tasks=11865088

  Map-Reduce Framework

  Map input records=1000

  Map output records=1000

  Input split bytes=164

  Spilled Records=0

  Failed Shuffles=0

  Merged Map outputs=0

  GC time elapsed (ms)=67

  CPU time spent (ms)=640

  Physical memory (bytes) snapshot=227663872

  Virtual memory (bytes) snapshot=3020079104

  Total committed heap usage (bytes)=121634816

  org.apache.hadoop.examples.terasort.TeraGen$Counters

  CHECKSUM=2173251765740

  File Input Format Counters

  Bytes Read=0

  File Output Format Counters

  Bytes Written=100000

 

Success again!  We now have a secured Cloudera cluster communicating with Isilon shared storage both using Kerberos!

 

Still a CLI gateway host?

 

With our simple authentication setup, our Active Directory users could only log into the CLI gateway host to submit jobs because the SSSD package was not deployed anywhere else.  Kerberos requires that we deploy SSSD everywhere if we want to integrate with AD.  That means users can now log into any of our Hadoop hosts and submit jobs (if they have ssh access, of course).  After enabling Kerberos it's probably a good idea to lock down the critical Hadoop hosts running core CDH services so that users can only access the CLI gateway host.
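One simple way to do that is at the SSH layer on the core Hadoop nodes.  The group name below is an assumption for illustration (SSSD's simple access provider is another way to achieve the same thing):

# /etc/ssh/sshd_config on the core CDH nodes: only admins get shell access
AllowGroups hadoop-admins

# Then reload sshd (CentOS 6)
service sshd reload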

 

Thanks for reading, comments welcome!