We have likely all seen this error when starting a Kerberized service from Ambari, whether with AD-based Kerberos and LDAP or with a KDC implementation. The service it is most commonly seen on is the YARN Application Timeline Server. If this service fails, many other services will also fail, and you will be unable to use any services that depend on YARN. In this blog I'll explore some of the reasons you may see this error and how to correct it.

 

Since these issues are related to a Kerberized WebHDFS call, and many of them come down to authentication and authorization, often no log entries are found in hdfs.log. The request fails before it ever reaches HDFS, so it leaves no trace there, and other log files and methods of troubleshooting must be used. Some of these issues are beyond the scope of this blog; additional documents and content (docs, whitepapers, KBs and other blog posts) should be consulted to address them. Most AD, KRB5 and LDAP environments are different and often highly complex, and each setup will likely need additional configuration to meet its specific requirements.

 

Version of OneFS this post is applicable to: 8.0.x, 8.0.0.x

 

 

You attempt to start a service and the error looks something like this, so what could the error be?

 

1.png

 

The error usually looks like a failed WebHDFS curl call to the Hadoop root file system:

  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 305, in _get_file_status

    list_status = self.util.run_command(target, 'GETFILESTATUS', method='GET', ignore_status_codes=['404'], assertable_result=False)

  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 210, in run_command

    raise Fail(err_msg)

resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X GET --negotiate -u : 'http://rip2-horton1.foo.com:8082/webhdfs/v1/ats/done?op=GETFILESTATUS&user.name=hdfs'' returned status_code=401.

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

<html><head>

<title>401 Authorization Required</title>

</head><body>

<h1>Authorization Required</h1>

<p>This server could not verify that you

are authorized to access the document

requested.  Either you supplied the wrong

credentials (e.g., bad password), or your

browser doesn't understand how to supply

the credentials required.</p>

</body></html>
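To reproduce the failing call outside of Ambari, something along these lines may help. This is only a sketch: the helper names (build_webhdfs_curl, split_body_and_status, get_file_status) are my own and not part of Ambari, the curl flags are copied from the error above, and it assumes curl is installed and that you have already obtained a Kerberos ticket with kinit.

```python
import subprocess


def build_webhdfs_curl(host, port, path, user):
    """Build the same style of curl call that appears in the Ambari error."""
    url = "http://{0}:{1}/webhdfs/v1{2}?op=GETFILESTATUS&user.name={3}".format(
        host, port, path, user)
    # Flags copied from the failing command: SPNEGO (--negotiate -u :),
    # follow redirects (-L), append the HTTP status code to the output (-w).
    return ["curl", "-sS", "-L", "-w", "%{http_code}", "-X", "GET",
            "--negotiate", "-u", ":", url]


def split_body_and_status(output):
    """curl -w '%{http_code}' appends a 3-digit code after the response body."""
    return output[:-3], int(output[-3:])


def get_file_status(host, port, path, user):
    """Run the call. Hypothetical helper: needs curl and a valid ticket."""
    out = subprocess.check_output(build_webhdfs_curl(host, port, path, user))
    return split_body_and_status(out.decode("utf-8", "replace"))
```

Calling get_file_status("rip2-horton1.foo.com", 8082, "/ats/done", "hdfs") returns the response body and the status code separately, so a 401 can be told apart from a 404 or a 200 without reading the HTML.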

 

 

This blog is a high-level overview of the potential causes of this issue; many of the configuration requirements are covered in other documents and Isilon blog posts. Many of these issues will also be covered in the upcoming Isilon and Hadoop Troubleshooting Guide (TSG). I'll add a link once it is published, as it will address many of these exact issues.

 

 

Here are the potential configuration points to check and validate to resolve this issue in no particular order.

 

 

1. Incorrect SPN's

When utilizing an AD provider, validate the following three required SPN's.

 

hdfs/isilon.clustername.fqdn

hdfs/smartconnectzonename.fqdn

HTTP/smartconnectzonename.fqdn

 

Review this post for additional info: Ambari HDP with Isilon 8.0.0.1 and Active Directory Kerberos Implementation

 

While a KDC based setup only requires these two SPN's:

hdfs/smartconnectzonename.fqdn

HTTP/smartconnectzonename.fqdn
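As a rough sketch of the check, the required SPN's listed above can be generated and compared against whatever SPN list you export from AD or the KDC. The function names are hypothetical, and I am assuming the existing SPN's are supplied as plain strings without the @REALM suffix.

```python
def required_spns(smartconnect_fqdn, cluster_fqdn=None):
    """Return the SPNs listed above; pass cluster_fqdn only for an AD provider."""
    spns = ["hdfs/" + smartconnect_fqdn, "HTTP/" + smartconnect_fqdn]
    if cluster_fqdn is not None:
        # AD setups additionally need the hdfs SPN for the cluster name.
        spns.insert(0, "hdfs/" + cluster_fqdn)
    return spns


def missing_spns(existing, required):
    """SPNs in `required` that are not present (exact match) in `existing`."""
    present = set(existing)
    return [spn for spn in required if spn not in present]
```

For example, missing_spns(exported_list, required_spns("zone.foo.com", "cluster.foo.com")) returns the entries that still need creating for an AD provider.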

 

 

 

2. AD Duplicate SPN's

Duplicate SPN's with Isilon AD Kerberos and Hortonworks prevent services from starting

 

If AD has duplicate SPN's, they need removing. Prior to Windows Server 2012, AD let the Ambari-generated and Isilon cluster SPN's exist concurrently, leading to duplicate SPN's. Server 2012+ will not allow this, but you could end up with the required SPN's in the wrong place in AD; see #3 below.

 

If the SPN's are regenerated, you will likely need to re-address this issue.

 

 

3. SPN's in the wrong place

Ambari creates the Isilon SPN's in the OU delegated for Ambari principals. These need removing, and the Isilon-specific SPN's must exist on the Isilon Cluster Computer Object in AD; they can be created and managed from OneFS or from AD directly. Either way, the required SPN's must exist in the correct place in AD, and the duplicates must be removed.

 

If the SPN's are regenerated, you will likely need to re-address this issue.

 

The Isilon specific SPN's should exist here on the Isilon AD Computer Object:

2.png

 

 

4. KeyTab version mismatches between the KDC and Isilon

Since both Ambari & Isilon can generate SPN's, the order of SPN creation is critical.

 

If SPN's were generated by the Ambari Kerberos process, the keytabs on the Isilon may be outdated and incorrect. The required Isilon KDC SPN's must be generated after Ambari's (the Ambari ones may need removing first) so that the Isilon holds the keytab with the latest KVNO. The simple rule here is that Isilon must be the last to generate a keytab, so that its keytab carries the current KVNO.

 

If the SPN's are regenerated, you will likely need to re-address this issue.
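One way to spot a stale keytab is to compare the KVNO held in the keytab with the KVNO the KDC currently issues. The sketch below is an illustration only: it assumes you can obtain a keytab listing in the plain MIT `klist -k` layout (a KVNO column followed by the principal, no timestamps) and that you have collected the KDC-side KVNOs separately into a dict. Both function names are my own.

```python
import re


def parse_klist_keytab(text):
    """Map principal -> highest KVNO found in `klist -k` style output."""
    kvnos = {}
    for line in text.splitlines():
        # Data rows look like "   3 hdfs/zone.foo.com@FOO.COM";
        # header and separator rows do not match and are skipped.
        match = re.match(r"\s*(\d+)\s+(\S+@\S+)", line)
        if match:
            kvno, principal = int(match.group(1)), match.group(2)
            kvnos[principal] = max(kvno, kvnos.get(principal, 0))
    return kvnos


def stale_principals(keytab_kvnos, kdc_kvnos):
    """Principals whose keytab KVNO is missing or older than the KDC's."""
    return sorted(p for p, kvno in kdc_kvnos.items()
                  if keytab_kvnos.get(p, -1) < kvno)
```

Any principal returned by stale_principals is a candidate for regenerating the keytab on the Isilon side, in line with the "Isilon goes last" rule above.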


 

5. Incorrect permissions on the Isilon's krb5.conf

KDC Kerberized Yarn Services Fail to Start on 8.0.1 with Ambari via WebHDFS curl calls

 

A known issue may be observed if the Isilon is updated from 8.0.0.x code to 8.0.0.1: the Isilon krb5.conf is incorrectly re-permissioned by the upgrade, and the WebHDFS service can no longer read this file. See the above post for additional info.

 

 

6. Mapping rules exist on Isilon creating

Let's say the Access Zone has a general mapping for all AD accounts to all other accounts with the same name:

e.g.: DOMAIN\* &= * []

 

This will force mappings between vlab\hbase & hbase, for example.

 

15.png

 

When we look at the local user access token, we see the local hbase account:

12.png

 

But the hbase UPN lookup has an odd looking name.

13.png

 

In this case the hbase UPN is being looked up in AD, but the SAMAccount name is returned as the name, because that is how Ambari created the account. This creates an issue: we need to make the local account equate to the AD UPN to get a valid user map and the correct ID. See issue 7 below for additional details.

 

 

7. The SAMAccount name issue

When Ambari generates the SPN's & UPN's in the AD OU, it creates a valid UPN/SPN for each principal. But AD also requires the SAMAccount name field (the pre-Windows 2000 name) to be populated. To meet this AD requirement, Ambari populates the field with a randomly generated string, as seen below for the ambari-qa account.

 

3.png

 

In this situation, Isilon cannot map the existing local ambari-qa account to the AD based ambari-qa account, as the lookup is using the SAMAccount name. It will see the account but it doesn't map them correctly.

 

This can be seen when viewing an access token for the shortname vs. UPN. These users appear as different users and the AD based SPN does not resolve to the identity of the existing local account that was previously used and permissioned against.

 

SAMAccount names that need changing:

hdfs

ambari-qa

yarn

<other accounts may need this fix also; hive, hbase, etc. depending on services installed>

 

You will likely have to redo this whenever the principals are regenerated.

 

 

1. The local ambari-qa account with the valid UID:513 <-- just the local user info

4.png

 

 

2. The AD-based UPN ambari-qa account, which is now using an auto-generated UID because it is not correctly mapped back to the local ambari-qa account <-- just the AD user info

5.png

 

 

In order to facilitate the correct mapping, we need to modify the AD based SAMAccount name to be equal to the local account name.

 

6.png

 

Additional mapping rules may be required, but without a valid SAMAccount name we will see lookup and mapping issues.

14.png
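If you export the Ambari-created accounts from AD, a small check like the following can flag the ones whose SAMAccount name still needs fixing. It is a sketch under assumptions: each account is a dict carrying the AD attributes userPrincipalName and sAMAccountName, the part of the UPN before the @ is taken to be the local account name, and the function name is hypothetical.

```python
def mismatched_sam_names(entries, local_accounts):
    """Return (local_name, current_sam_name) pairs that need correcting."""
    mismatches = []
    for entry in entries:
        # Assumption: the UPN prefix is the name of the local account.
        short_name = entry["userPrincipalName"].split("@", 1)[0]
        if short_name in local_accounts and entry["sAMAccountName"] != short_name:
            mismatches.append((short_name, entry["sAMAccountName"]))
    return mismatches
```

Passing the list of local accounts you care about (hdfs, ambari-qa, yarn and so on) keeps the check limited to the service accounts that actually exist on the Isilon.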

 

 

8. Legacy ID mapper entries

Prior to Kerberization, local accounts were used without an AD equivalent, and AD account SID's may have picked up incorrect locally generated UID's due to bad mappings or incorrect lookups. As a result, the local ID map database may contain mappings for these users' IDs; some may be incorrect and need cleaning out to avoid incorrect SID:UID and UID:SID mappings.

 

Review the mapping with: isi_classic auth mapping dump or isi_classic auth mapping dump | grep <ID>

10.png

 

Use isi auth mapping delete to clean up bad mappings as required.

 

 

9. SPN case is incorrect

Basically you typo'd it!

 

hdfs   - lowercase

HTTP - uppercase
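A quick way to catch this kind of typo is to check the service part of each SPN against the expected case. This is a minimal sketch with a hypothetical function name; it only knows about the two service classes mentioned above.

```python
# Expected spelling of each service class, keyed by its lowercase form.
EXPECTED_SERVICE_CASE = {"hdfs": "hdfs", "http": "HTTP"}


def spn_case_problems(spns):
    """Return (bad_spn, corrected_spn) pairs for SPNs with the wrong case."""
    problems = []
    for spn in spns:
        service, sep, host = spn.partition("/")
        expected = EXPECTED_SERVICE_CASE.get(service.lower())
        if sep and expected is not None and service != expected:
            problems.append((spn, expected + "/" + host))
    return problems
```

An empty result means the hdfs and HTTP service classes are spelled with the expected case; it says nothing about whether the hostnames themselves are right.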

 

 

 

10. User lookup of the AD UPN account fails outright

1. Shortnames work  (in this case the hdfs >= root mapping kicks in and hdfs is replaced by root), but this could be for any account

8.png

 

2. UPN fails outright (we need hdfs@domain to also map to root in this case), or yarn = yarn@domain

7.png

 

This can be caused by issue 6 or 7 above: a generic mapping does not exist, the SAMAccount name is bad, or user mapping rules are missing.

 

 

 

11. Issues with permissions on the /ats and /ats/done folder

I have seen issues where the curl calls want to chown & chmod the /ats and /ats/done folders in the HDFS root. One potential workaround is to set the required permissions and ownership from Isilon and restart the service, or to delete the folders and let the service attempt to recreate them.

 

 

 

12. Map User into Primary Domain

This may be required to map short names to the AD username; it may depend on the AD environment in use.

 

 

 

13. Provider not pulling assigned UID & GID's from AD

Is rfc2307 enabled on the AD provider?

 

Also make sure indexing and GC replication are set up per the Isilon KB: https://support.emc.com/kb/335338

 

 

 

14. Tokens for shortname and UPN are not equal

 

With valid mappings and user lookups, the local shortname and the UPN will produce equivalent tokens. Additional mapping rules may be required to get Primary Group identities correct (not covered here).

 

Once valid mapping rules, SAMAccount name fixes and ID mapping database cleanups are in place, shortname lookups and UPN lookups create equivalent tokens.

14.png

 

 

15. Are the correct providers attached to the correct Access Zones?

Did an AD provider get created and never added to the Hadoop access zone?

 

 

 

16. Are the mapping rules on the correct providers in the correct Access Zones?

Are the mapping rules on the zone correct or invalid? Test and adjust as needed.

 

 

17. Other...

To be continued...maybe! What are your fixes?

 

update: 4/5/2017

Adding 18 here, can't stress enough the requirements of DNS being correct and accurate! Both forward and reverse.

 

 

 

18. Fully validate DNS

Validate all DNS is fully functional and all records are correct. This includes:

-- All hosts in the compute cluster have forward A and reverse PTR records

-- Isilon Smartconnect Name Delegation is correct, NS record

-- All IP's in the pool assigned to the zone have a PTR record

 

All clients in the compute cluster should be able to resolve all hostnames and the SmartConnect zone name, and to perform reverse IP lookups.

 

Issues with reverse DNS may be more likely to be seen with WebHDFS, as it relies on SPNEGO. You can likely execute Kerberized Hadoop RPC calls successfully (# hadoop fs -ls /) while WebHDFS calls fail with 401 errors.
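The forward and reverse checks above can be scripted from any host in the compute cluster. The following is a minimal sketch using Python's standard socket module; the function name is my own, and the "ok" test (PTR name equals the name you started with) is an assumption that suits ordinary compute hosts. SmartConnect pool IP's may legitimately return a different PTR name, so treat that result as something to inspect, not as a pass/fail.

```python
import socket


def check_forward_reverse(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname -> IP (A record), then IP -> name (PTR record)."""
    result = {"hostname": hostname, "ip": None, "ptr": None, "ok": False}
    try:
        result["ip"] = forward(hostname)
        # gethostbyaddr returns (primary_name, aliases, addresses).
        result["ptr"] = reverse(result["ip"])[0]
    except (socket.herror, socket.gaierror) as exc:
        result["error"] = str(exc)
        return result
    # Compare case-insensitively and ignore a trailing dot on either name.
    result["ok"] = (result["ptr"].rstrip(".").lower()
                    == hostname.rstrip(".").lower())
    return result
```

Run it over every compute host and over the SmartConnect zone name; any entry with an "error" key or a missing "ptr" points at a record that needs fixing in DNS. The forward and reverse parameters exist only so the lookup functions can be swapped out when testing.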

 

 

Hopefully this list provides some additional guidance on determining and troubleshooting Kerberized Service start issues.

 

 

Other configuration to validate and check if you continue to see issues:

Classpath and ip_false settings in Ambari - Ambari HDP with Isilon 8.0.0.1 and Active Directory Kerberos Implementation

All DNS is valid for host A records and SmartConnect delegation

DNS PTR's on hosts and SmartConnect IP's

User parity has been implemented - Isilon and Hadoop Local User UID Parity

Create home directories for any service accounts, some service accounts require home directories to start and depending on how the users were created they might be missing.

 

 

 

 

Using Hadoop with Isilon - Isilon Info Hub

russ_stevenson

Isilon