Data Domain : Shutdown timeout when trying to disable FS process on DDOS 6.0.x versions


   Article Number:     539528                                   Article Version: 2     Article Type:    Break Fix 




Data Domain,DD OS 6.0





There is a defect in the code for DDOS 6.0.x releases whereby, when shutting down the filesystem (FS) process (for maintenance, when restarting the FS process, or as part of a DDOS upgrade), a shutdown timeout may occur which, depending on other factors, may cause significantly extended downtime.






From DDOS 6.0 onwards, there is a Physical Locality Repair (PLR) background task within the FS which, compared to traditional Locality Repair (LR), tends to be faster and more efficient at addressing the locality issues some customer data may experience. Locality is a measure of the physical affinity on disk between segments that correspond to adjacent parts of a backup image. Poor locality (segments for adjacent file offsets being physically far apart) is bad for performance, so (physical) locality repair attempts to repack any such segments closer together for improved read performance.
    The problem is that with DDOS 6.0.x the PLR thread may get stuck precisely when locality is particularly bad (and hence repair is needed the most). While this causes no immediate effects, it prevents the FS process from shutting down in a timely fashion when requested. On top of this, because the locality repair thread makes no progress, badly scattered data on disk is never relocated for better performance.
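As a toy illustration of the locality concept described above (this is not DDOS code; the container-id model and function name are invented for explanation only), a backup file can be modelled as the ordered list of on-disk container ids holding its consecutive segments. The larger the average jump between neighbouring segments, the worse the locality and the more random reads a restore requires:

```python
# Toy illustration (not DDOS code): model a backup file as the ordered list
# of on-disk container ids holding its consecutive segments. A larger average
# jump between neighbouring segments means worse locality, i.e. more random
# reads when the file is restored sequentially.
def avg_physical_gap(container_ids):
    jumps = [abs(b - a) for a, b in zip(container_ids, container_ids[1:])]
    return sum(jumps) / len(jumps)

good = [100, 100, 101, 101, 102]   # segments packed close together
bad  = [100, 9000, 250, 47000, 3]  # segments scattered across the disk
```

Locality repair, in these terms, amounts to rewriting the scattered segments so that the id sequence looks more like `good` than `bad`.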






This defect in PLR has been resolved in the code; please see below for the affected releases:

  •         DDOS 6.0.x has not been fixed yet as of December 2019, however, Engineering is considering a backport     
  •         DDOS 6.1.2.x has the fix, so PLR will run as designed and will not cause an FS shutdown timeout     
  •         DDOS 6.2.x is not affected by this defect     

      There is no proactive action available to avoid this issue if the system is affected by it. However, it may be possible to check whether the FS is likely to time out during shutdown: if the number of enqueued PLR jobs does not increase over time, that is a sign an earlier (daily) job has got stuck and is not moving ahead. For example (from ASUPs):

GENERATED_ON=Sun Nov 24 06:06:36 CET 2019
VERSION=Data Domain OS
 06:06:40 up 212 days, 17:50,  0 users,  load average: 16.70, 14.51, 9.37
Filesystem has been up 212 days, 17:48.
PLR Global Stats
----------------
PLR last container id repaired:            423443721
PLR number of enqueued jobs:                       3
PLR number of successful jobs:                     1
PLR number of failed jobs:                         0
PLR number of job repair type T0:                  3
PLR number of job repair type TP:                  0
PLR number of job repair type TO and TP:           0
Number of high priority files:                285276

      So despite PLR kicking off a job every day and the FS having been up for 212 days at the time the ASUP was captured, only three jobs had been enqueued during this time.     
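The check described above can be sketched as a small script. This is a hypothetical helper (not a Data Domain tool): it extracts the "PLR number of enqueued jobs" counter from the PLR Global Stats section of two ASUPs captured on different days, and flags a likely stuck PLR thread when the counter has not grown. The function names are invented for illustration:

```python
import re

# Pattern matching the counter line in the "PLR Global Stats" ASUP section.
PLR_ENQUEUED = re.compile(r"PLR number of enqueued jobs:\s+(\d+)")

def enqueued_jobs(asup_text: str) -> int:
    """Return the enqueued-jobs counter from one ASUP's PLR Global Stats."""
    match = PLR_ENQUEUED.search(asup_text)
    if match is None:
        raise ValueError("PLR Global Stats not found in ASUP")
    return int(match.group(1))

def plr_looks_stuck(older_asup: str, newer_asup: str) -> bool:
    """PLR enqueues roughly one job per day, so a flat counter across
    ASUPs taken days apart suggests an earlier job has hung."""
    return enqueued_jobs(newer_asup) <= enqueued_jobs(older_asup)
```

For the ASUP above, a counter stuck at 3 across successive daily ASUPs would make `plr_looks_stuck` return `True`.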
      If this is the case and the FS process hits this issue, expect some additional waiting time while the FS process times out during shutdown and generates a core file. The logs would in that case show the following during shutdown:

#### PLR subsystem being shut down
12/09 12:19:50.305 (tid 0x48aa350): SYSTEM_SHUTDOWN: ===== Starting shutdown <PLR> - time (1575890390) =====
#### Little to no chatter for a few minutes, then some "Signal" entries in platform/ and timeouts in
#### Finally, the FS process would start up on its own again
12/09 16:48:21.085 (tid 0x48aa350): NOTICE: file descriptor rlimit is set to 2048

      If the FS was being shut down as part of a DDOS upgrade process, chances are the FS process will not come up again, not even after a DD reboot:   

### From platform/infra.log, we can see the FS process is to be shut down
12/09 12:19:38: dd_upgrade[30557-0x52dc90]: [upgrade_progress_log] PROGRESS node: 0 percent: 27 phase: Install wait: ddfs
Dec  9 12:19:38 DD-HOSTNAME ddfs[22807]: NOTICE: MSG-DDR-00003: Shutting down ddfs
#### Shutting down the FS process eventually times out and the upgrade is failed, leaving the DD NVRAM in FAILED state
12/09 12:31:40: dd_upgrade[30557-0x52dc90]: [_upg_disable_ddfs]
12/09 12:31:40: dd_upgrade[30557-0x52dc90]: [_upg_disable_ddfs] [UPGRADE]: DDFS is now disabled!
12/09 12:31:40: dd_upgrade[30557-0x52dc90]: [upg-err] [2] NVS state SHUTDOWN is not clean
12/09 12:31:40: dd_upgrade[30557-0x52dc90]: [upg-err] [3] Filesystem shutdown nvram clean check: FAILED
12/09 12:31:40: dd_upgrade[30557-0x52dc90]: [upgrade_progress_log] STATUS node: 0 stage: none status: filesys_shutdown_error message: "" end: 1575891100
12/09 12:31:41: dd_upgrade[30557-0x52dc90]: [upgrade_post_alert] post_alert request:  alert id=0 version=
12/09 12:31:41: dd_upgrade[30557-0x52dc90]: [upgrade_clear_alert] clear alert id=2 version=
12/09 12:31:43: dd_upgrade[30557-0x52dc90]: [upgrade_run_stage] Completed stage shutting_down_filesystem [filesys_shutdown_error]
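When triaging such a case, the failure signatures shown in the excerpt above can be matched mechanically. The sketch below is a hypothetical helper (not a Data Domain tool) that scans upgrade log lines for those signatures to decide whether the FS shutdown timed out during a DDOS upgrade:

```python
# Hypothetical triage helper: the signature strings come from the
# dd_upgrade log excerpt above; the function name is invented here.
SIGNATURES = (
    "NVS state SHUTDOWN is not clean",
    "Filesystem shutdown nvram clean check: FAILED",
    "filesys_shutdown_error",
)

def upgrade_hit_shutdown_timeout(log_lines) -> bool:
    """True if any line carries one of the known failure signatures."""
    return any(sig in line for line in log_lines for sig in SIGNATURES)
```

Feeding it the infra.log lines above would return `True`, pointing at this defect as a likely cause of the failed upgrade.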

      As a result, it may be the case that the FS process cannot be enabled, not even manually, until Dell EMC Data Domain Support takes further action.