|Article Number: 531450||Article Version: 3||Article Type: Break Fix|
Data Domain,DD3300 Appliance
A DD3300 is an instance of the DDVE (Data Domain Virtual Edition) running on top of a local VMware ESXi host, running on a bespoke DELL PowerEdge Server platform.
During a DDOS upgrade, the upgrade code needs to contact the underlying ESXi hypervisor to deploy upgrades to VMware itself. If this connection from the DDOS upgrade code to the underlying ESXi host fails, the upgrade cannot proceed. Because this happens at a point in the upgrade when the FS process has already been shut down, the DD remains unavailable for backups until manual intervention takes place to work around the problem.
The DDOS upgrade code is expected to connect seamlessly to the underlying ESXi hypervisor during the upgrade. The code does not handle the case where this connection fails: instead of bailing out and re-enabling the filesystem, the upgrade waits indefinitely for a connection that will never be established. This can be seen in the DDOS upgrade logs:
# log watch debug/platform/infra.log
03/18 12:01:21: dd_upgrade[3157-0x630720]: [upgrade_progress_log] PROGRESS node: 0 percent: 59 phase: Install wait: none
03/18 12:01:21: dd_upgrade[3157-0x630720]: [upg-info] precheck check S10_vmappliance starting
2019-03-18 12:01:23,686 package_install INFO : Starting new HTTPS connection (1): 169.254.1.1
The log shows no further output after the last line in the sample above, which indicates that the last thing the upgrade code did was attempt an HTTPS connection. This is the connection that failed and prevented the whole upgrade process from moving on. Listing the processes running on the DD (from SE mode) shows entries like the following, which indicate where the upgrade process is stuck:
# se ps -efa
UID PID PPID C STIME TTY TIME CMD
...
root 5883 3157 0 12:01 ? 00:00:00 /bin/sh /secondary/var/upgrade/post_upgrade_newimage_script
root 5891 5883 0 12:01 ? 00:00:00 /usr/bin/perl -w /tmp/var/upgrade/run-tests post_install /tmp/var/upgrade/precheck NEW
root 5928 5891 0 12:01 ? 00:00:00 /bin/sh /tmp/var/upgrade/precheck/S10_vmappliance post_install /tmp/var/upgrade/precheck NEW 5891
root 5930 5928 0 12:01 ? 00:00:00 python /secondary/ddr/bin/install_package update /secondary/ /secondary/ddr/firmware/vulcan/DD3300_PAYLOAD.tar.gz
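The log signature described earlier (the S10_vmappliance precheck starts, an HTTPS connection is attempted, and nothing follows) can be checked mechanically. The following is a minimal sketch, not a Dell-supplied tool. It assumes the contents of infra.log have been copied off the system as plain text. The function name upgrade_looks_stuck is hypothetical, and the marker strings are taken from the sample output in this article.

```python
# Hypothetical helper, not part of DD OS. The marker strings below are
# copied from the sample infra.log output shown in this article.
PRECHECK_MARKER = "precheck check S10_vmappliance starting"
HTTPS_MARKER = "Starting new HTTPS connection"


def upgrade_looks_stuck(log_text):
    """Return True if the log ends at the HTTPS connection attempt made
    after the S10_vmappliance precheck started, with nothing logged later."""
    # Drop blank lines so trailing newlines do not hide the last entry.
    lines = [ln.strip() for ln in log_text.splitlines() if ln.strip()]
    if not lines:
        return False
    # The precheck must have started somewhere in the log ...
    precheck_seen = any(PRECHECK_MARKER in ln for ln in lines)
    # ... and the very last entry must be the connection attempt.
    return precheck_seen and HTTPS_MARKER in lines[-1]
```

A True result only means the log currently ends at the connection attempt. To tell a real hang from an upgrade that is merely in progress, confirm that no new lines appear over a reasonable period and that the processes listed above are still present.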
The issue is being investigated. It will likely result in changes to the upgrade script so that it bails out cleanly if it fails to connect to ESXi, or in a pre-check performed up front so that the upgrade does not start at all if such a connection would not succeed later on.
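To illustrate the kind of up-front check mentioned above, the sketch below tries a TCP connection with a timeout, so that a failure returns promptly instead of blocking forever. It is not the actual fix and is not taken from the DD OS upgrade code. The address 169.254.1.1 comes from the log sample in this article. Port 443 (the HTTPS default), the 10-second timeout, and the function name esxi_reachable are assumptions made for illustration.

```python
import socket


def esxi_reachable(host="169.254.1.1", port=443, timeout=10.0):
    """Return True if a TCP connection to host:port is established within
    `timeout` seconds, and False on refusal, timeout, or any other
    socket-level error. Defaults are illustrative assumptions."""
    try:
        # create_connection applies the timeout to the connect attempt,
        # so an unresponsive peer cannot block the caller indefinitely.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable network and timeouts.
        return False
```

A check of this shape only confirms that something accepts connections on the port. It does not prove that the ESXi management service will answer the requests made later in the upgrade.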
In the meantime, for customers who have hit this problem while upgrading, the only viable workaround is to hard reboot the DD3300 and try the upgrade again after the reboot has completed. Note that a DD "system reboot" will not succeed, because the command does not work while the DD is being upgraded. A hardware reboot is necessary instead.
Until a release with additional checks is available, customers planning to upgrade DD OS on a DD3300 can reboot the DD3300 before starting the upgrade to make sure the upgrade does not get stuck (the internal connection to ESXi will succeed because the necessary service is freshly started after the reboot).
Proactively rebooting the hardware is in any case part of the recommended procedure to upgrade DD OS as stated in the release notes, so customers strictly following the upgrade recommendations will avoid this potential problem during upgrades.