XtremIO: XEnv panic and metadata prefetch leads to latency


   Article Number:     495083                                   Article Version: 7     Article Type:    Break Fix 




XtremIO Family,XtremIO X1,XtremIO HW Gen2 400GB,XtremIO HW Gen2 400GB Encrypt Capbl,XtremIO HW Gen2 400GB Encrypt Disable,XtremIO HW Gen2 400GB Exp Encrypt Disable,XtremIO HW Gen2 400GB Expandable,XtremIO HW Gen2 800GB Encrypt Capbl





High latency may be seen on hosts using the XtremIO array.   
    On the array, you may see alerts for XEnv failure and proactive metadata loading trigger on the array, but without any other alerts (these alerts would normally be seen in conjunction with a Storage Controller reboot, but not by themselves). The XEnv alert may be active anywhere from less than a second to several seconds before clearing:   

      2017-01-11 10:17:02,063: XtremCluster: Raised alert: "XEnv is failed." object: X2-SC1-E1 [5] severity: major threshold:        
        2017-01-11 10:17:02,142: XtremCluster: Removed alert: "XEnv is failed." object: X2-SC1-E1 [5]       
        2017-01-11 10:17:05,633: XtremCluster: Raised alert: "System performing proactive metadata loading" object: XtremCluster [1] severity: information       
        2017-01-11 11:36:30,005: XtremCluster: Removed alert: "System performing proactive metadata loading" object: XtremCluster [1]
      On the Storage Controller referenced in the alert, you may see the following entry in the messages logs during this timeframe (requires collecting a log bundle from the array. In this example, the logs would either be in /latest/X2-SC1/system/logs/messages, or in one of the files in /latest/X2-SC1/archived_logs):   
      <crit>2017-01-21 12:16:54.453349 xtrm0264-x2-n1 xtremapp: C16 [log_id: -1][81994(82092 nb_truck_7)]jr_force_destage_page:4302: PANIC <C6963> csid 16 at journal.c:4302 jr_force_destage_page (timestamp 75518036128184176): ASSERT failed on condition: cnt < JR_MAX_FORCE_DESTAGE_RETRIES. [API] jr_force_destage_page: too many spins in the force_destage loop, max_allowed=1000   






Due to a XtremIO software issue, journal cleanup doesn't wait long enough for completion before panicking on timeout during the snapshot refresh. This leads to a software module reset, and may result in latency. If latency has occurred, it should clear once the proactive metadata loading completes.                                                           






Typically occurs after a snapshot refresh operation was executed, but can occur after operations involving snapshot deletion.                                                           






Latency should clear up automatically after the issue occurs. Depending on the array workload, this may require the proactive metadata loading to finish first. If it does not clear up automatically, or other issues are present that can't be accounted for, please open a Service Request with Dell EMC Global Support.   
    This issue is fixed 4.0.25 and later.