Connectivity/path loss issues due to Emulex driver ELS timeouts/ BLS timeouts and abort handler timed out  waiting for aborting I/O.

           

   Article Number:     530970                                   Article Version: 2     Article Type:    Break Fix 
   

 


Product:

 

XtremIO Family

 

Issue:

 

 

Connectivity/path loss issues due to Emulex driver ELS timeouts/ BLS timeouts and abort handler timed out   
     waiting for aborting I/O.
                                                           

 

 

Cause:

 

 

 Multiple servers in this OEL 7.5 with UEK4 kernel, 4.1.12-124.17.2.el7uek.x86_64 (SMP) environment with identical   
     builds (kernels vary slightly though) were all exhibiting the same behavior, raising the following Emulex driver   
     error stacks, indicative of a very unstable fabric layer.   
   
     The following kernel errors were raised in the /var/log/messages files   
   
        lpfc 0000:81:00.0: 2:(0):0748 abort handler timed out waiting for aborting I/O (xri:x1940) to complete: ret 0x2003, ID 0, LUN xxxxx   
        lpfc 0000:03:00.0: 0:1303 Link Up Event x3 received Data: x3 x0 x80 x0 x0 x0 0   
        lpfc 0000:03:00.0: 0:(0):2858 FLOGI failure Status:x9/x50000 TMO:x14 Data x40800 x0   
        lpfc 0000:03:00.0: 0:1305 Link Down Event x4 received Data: x4 x7 x810114 x0 x0   
        lpfc 0000:03:00.0: 0:1303 Link Up Event x5 received Data: x5 x0 x80 x0 x0 x0 0   
        lpfc 0000:03:00.0: 0:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x40800 x0   
        lpfc 0000:03:00.0: 0:(0):0203 Devloss timeout on WWPN 51:4f:0c:50:3d:a7:2b:04 NPort xcab800 Data: x0 x8 x3   
        lpfc 0000:81:00.0: 2:3334 Delay fc port discovery for 10 seconds   
        lpfc 0000:03:00.0: 0:(0):0127 ELS timeout Data: x4 xfffffe x8a x7c2   
        lpfc 0000:03:00.0: 0:0449 lpfc_fdmi_on attribute cannot be set to 5, allowed range is [0, 1]   
        lpfc 0000:81:00.0: 2:2538 Received frame rctl:BLS abort (x81), type:BLS (x0), frame Data:81b90c61 00b900c0 00090000 c4000000 1660ffff 00000000 00000000   
        lpfc 0000:81:00.0: 2:(0):0929 FIND node DID Data: xffff883e267f6c00 xb900c0 x80000000 x7000802 ffff883e258eac00   
        lpfc 0000:81:00.0: 2:(0):1200 Send BLS cmd x85 on oxid x1660 Data: x5   
        lpfc 0000:81:00.0: 2:(0):0748 abort handler timed out waiting for aborting I/O (xri:x1d8) to complete: ret 0x2003, ID 0, LUN xxxxx   
        lpfc 0000:81:00.0: 2:(0):0749 SCSI Layer I/O Abort Request Status x2003 ID 0 LUN xxxxx   
        lpfc 0000:81:00.0: 2:(0):0748 abort handler timed out waiting for aborting I/O (xri:x100) to complete: ret 0x2003, ID 0, LUN xxxxx   
        lpfc 0000:81:00.0: 2:(0):0749 SCSI Layer I/O Abort Request Status x2003 ID 0 LUN xxxxx   
   
        The descriptioni of the following Broadcom lpfc driver error raised:   
   
            lpfc 2538 Received frame rctl:BLS abort (x81), type:BLS (x0), frame   
                      Data:81bc0e83 00bc0620 00090000 14000000 1b88ffff 00000000 00000000   
   
        was as follows:   
   
            elx_mes3154: BLS ABORT RSP failed, data <value/value>   
            DESCRIPTION: The driver issued BLS ABORT Response failed to complete.   
            DATA: I/O status and I/O reason   
            ACTION: Contact Broadcom technical support   
   
     Due to these connectivity/path loss issues that were being raised due to Emulex driver ELS (Extended Link Service)   
     timeouts/ BLS (Basic Link Service) timeouts and abort handler timed out waiting for aborting I/O, causing the   
     applications layer to abnormally terminate due to the excessive long I/O wait service times at the kernel I/O   
     driver layer due to excessive queuing of I/O as a result of need to redrive the aborted I/Os.   
     
                                                           

 

 

Resolution:

 

 

  The root cause of this issue had to do with the lpfc driver getting hung in the abort handling logic block.   
      The Abort handler is called and the abort handler subsequently times out with a return Status = x2003 as there is an    
      IOCB timeout condition reported back to the SCSI kernel layer.   
   
      Bug 28733055 [1309:KERNEL:DRV-BROADCOM:Status 11] with the Oracle OEL UEK engineering team was opened for this    
      issue with servers continually losing connection with the XtremIO storage arrays.   
   
      Broadcomm has identified a fix in abort handling, which was submitted to the upstream UEK5 kernel and was   
      backported to the UEK4 kernel.   
   
      The specific Broadcom patch backported is to address the  issue:   
         
                   0004-scsi-lpfc-Fix-list-corruption-on-the-completion-queu.patch.   
   
      The issue addressed by this patch was that the abort handler does not hold the sli4 ring lock while   
      validating the IO so that the IO could complete while the driver is still preping the abort.  The erroneously   
      generated abort, when it completes, has pointers to the original IO that has already completed, and the IO   
      manipulation (for the second time) corrupts the list.   Most systems will run for a while after the corruption   
      occurs, but will eventually crash when a scsi eh reset occurs and the txcmplq is attempted to be flushed.   
      The flush gets stuck in an endless loop.   
   
      Correct by taking the ring lock early in the abort handler so the erroneous abort won't be sent if the io has/is   
      completing.   
         
      At the present time, Oracle is in the process of officially releasing the patched kernel for deployment at General Motors.   
      Contact Oracle UEL UEK Support referencing bug # 28733055 to obtain the patch.   
   
      Inclusion into the product kernel is targeted for the UEK-4.1-QU7 release.