Long distance recovery for virtualized SAP HANA made simple: what we learned from Project RUBICON
I just returned from SAPPHIRE 2014 in Orlando, where SAP CEO Bill McDermott told the assembled 25,000 people that “SAP wants to run simple!” He then proclaimed that “We can, we will, beat complexity” to help SAP customers “run simple” too.
Run simple! Can it be applied to a solution for providing long distance (as in over 100 miles) business continuance & disaster recovery (BC/DR) for SAP HANA?
In my global travel these past 12 months talking to customers, one thing has become very clear: the lack of a solid, simple, and easy-to-understand long distance disaster recovery solution is making it difficult for customers to take SAP Suite on HANA into Production at large scale.
In my pre-SAPPHIRE blog post “Virtualized SAP HANA in Production: the momentum continues and we have crossed the Rubicon!”, I had invited those of you going to SAPPHIRE to join us on Wednesday June 4th, 2014 at 4:15PM ET in Theater 2 in the ASUGHub to hear the results of our Proof of Concept (POC) called Project RUBICON.
Deloitte, EMC, VMware, & Cisco worked closely together to prove that you can recover virtualized SAP HANA instances over a long distance: in this case 550 km (341 miles), the distance between the Deloitte datacenter in Suwanee, GA (outside of Atlanta) and the EMC datacenter in Durham, NC.
The POC planning team felt that the results and outcome of Project RUBICON needed to be discussed more in business and application terms and less in technical terms. Deloitte Consulting is therefore the key partner in this project, in which the RTO (recovery time objective) and RPO (recovery point objective) metrics must be relatable to a business user.
Deloitte has an existing VCE Vblock System 300 in its datacenter in Suwanee which is used to show its clients how they could benefit from Deloitte’s Cloud ServiceFabric, a new private cloud solution based on VMware vCloud Suite, launched in May 2014. To start Project RUBICON, EMC, VMware, & Cisco built a DR site at a distance far enough to make this project ‘real world’ (we did NOT want to do any distance simulation), and we chose the EMC datacenter in Durham, located 550 km away.
With Deloitte bringing its SAP HANA application expertise to the team, the remaining stakeholders, EMC, VMware, and Cisco invested almost $2M in equipment and expertise to build the brand new Project RUBICON lab in Durham. This POC aims to demonstrate how BC/DR, a key component of any SAP application implementation strategy, is an integral part of the Cloud ServiceFabric offering.
The Deloitte team created an SAP HANA data mart running on a 512GB virtual machine (VM), supported by a smaller SAP BOBJ (Business Objects) VM, for this POC. In addition, an SAP Business Suite on HANA virtual machine was created to be part of the test and demo for SAPPHIRE. The ‘disaster’ was to abruptly disrupt the VMs in Suwanee while a sales report was being run in BOBJ against 250 million records already existing in the HANA data mart, and while a data load of an additional 200 million records into the data mart was in progress.
Let’s first review the high level architecture created for Project RUBICON:
The Vblock in Suwanee is powered by Cisco UCS servers with EMC VNX5300 storage, and the POC was conducted primarily on a pair of Cisco UCS B440M2 set up in TDI mode – in other words, the VMDK files for the SAP HANA & other supporting virtual machines (VMs) reside on specific LUNs protected by a pair of Gen5 RecoverPoint appliances (RPAs).
In Durham, a V+C+E environment was built, with a pair of Cisco UCS B440M2 set up in TDI mode on a VMAX 20K, & the LUNs set up to receive the data stores from Suwanee as replicated by RecoverPoint over a VPN tunnel between the sites.
VMware vSphere 5.5 was used at both sites since that is the required version supporting virtualized SAP HANA, but it was VMware SRM (Site Recovery Manager) and its integration with RecoverPoint that proved to be the key piece of the puzzle and delivered the astounding results of this POC.
Essentially, the POC team created an “Easy Button” in VMware SRM which can be pushed (in this case, mouse-clicked) once a disaster has been declared by someone with authority to make that decision. Once the Easy Button has been pushed, VMware SRM takes over and completely automates the entire recovery & orderly restart of the VMs in Durham – this automation eliminates the need to consult and implement complicated and error-prone DR run books. It was amazingly simple to watch the progress of the recovery on the vCenter console, without any human intervention, until we were notified on the console that SAP HANA was up and running again without any error!
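To make the idea of an “orderly restart” a bit more concrete, here is a minimal sketch of how a recovery plan might group VMs into start-up steps by priority. This is only an illustration and not the SRM API: the `ProtectedVm` class, the `restart_order` function, the VM names, and the choice to start the HANA VMs before the BOBJ VM are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectedVm:
    """Hypothetical record for a VM covered by a recovery plan."""
    name: str
    priority: int  # lower number is started first (illustrative convention)


def restart_order(vms):
    """Group VM names by priority, lowest priority number first."""
    groups = {}
    for vm in vms:
        groups.setdefault(vm.priority, []).append(vm.name)
    # Sort names inside a group only to make the output deterministic.
    return [sorted(groups[p]) for p in sorted(groups)]


# Hypothetical plan: databases first, then the reporting front end.
plan = [
    ProtectedVm("sap-bobj", 2),
    ProtectedVm("hana-datamart", 1),
    ProtectedVm("suite-on-hana", 1),
]

order = restart_order(plan)
for step, names in enumerate(order, start=1):
    print(f"step {step}: power on {', '.join(names)}")
```

The point of encoding the order once in a plan is the same one made above: nobody has to look it up in a run book during a disaster.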
So what were the results? Let’s take a look at the table below, which was shown at our ASUG session at SAPPHIRE, perhaps the first time that any concrete metrics were being discussed by anyone publicly regarding long distance BC/DR for SAP HANA.
The results of the POC, especially the RTO and RPO metrics, were simply astounding!
- It took under 15 minutes for an end-to-end recovery & restart of the SAP HANA data mart (512GB VM) and the supporting SAP BOBJ portal in Durham, which is now the Production site. We (as the Business Users) were able to log in and immediately attempt to rerun an existing sales report that had been interrupted by the ‘disaster’
- The initial rerun of the report in Durham was slower than its baseline in Suwanee (29 seconds vs. 10 seconds) which is to be expected since the 250 million records needed to be read into SAP HANA from the Persistence Layer
- But subsequent runs reduced the report run time to 18 seconds, and eventually to the same 10 seconds as in Suwanee
- Inspection of the SAP HANA data mart showed that the disrupted data load only resulted in less than 5% of in-flight data (the data in SAP HANA not yet committed to logs in the Persistence Layer) – this is remarkable given the asynchronous nature and the long distance (550km) of the data replication, and a powerful testament to RecoverPoint’s efficient data compression
- It also took under 15 minutes for the HANA developer to log in to HANA Studio and resume the data load, proving that development teams can quickly resume their work in Durham
- The SAP Basis and infrastructure team inspected every component of the V+C+E infrastructure in Durham after the automated recovery by VMware SRM, and could not find a single error or fault
- Once all the tests and inspections were completed, the team decided to have VMware SRM orchestrate a failback from Durham to Suwanee in order to make Suwanee the Production site once again. This ‘reprotection action’ (using VMware terminology) took roughly 15 minutes to perform after VMware SRM instructed EMC RecoverPoint to reverse the direction of the data replication
- We tested the failover from Suwanee to Durham, and then the failback from Durham to Suwanee 3 times, and each time, the results were consistent! It was quick, simple, and fully automated
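The asynchronous replication mentioned in the results above is the usual choice at this kind of distance, and a rough propagation-delay calculation shows why. The sketch below assumes light travels at about 200,000 km/s in optical fibre and that the path is a straight 550 km line; both are assumptions for illustration, and a real route over an Internet VPN will be longer and slower. The function name `round_trip_ms` is made up for this example.

```python
# Approximate speed of light in optical fibre (assumption: about 2/3 of c).
FIBRE_KM_PER_S = 200_000


def round_trip_ms(distance_km, km_per_s=FIBRE_KM_PER_S):
    """Best-case round-trip propagation delay in milliseconds."""
    return 2 * distance_km / km_per_s * 1000


# Suwanee to Durham, using the 550 km figure from this post.
rtt = round_trip_ms(550)
print(f"best-case round trip: {rtt:.1f} ms")
```

With synchronous replication, every write would have to wait at least this long for the remote acknowledgement, before any network equipment or storage latency is added. Asynchronous replication avoids that wait, at the cost of a small amount of in-flight data.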
In my almost 18-year career in SAP Basis, I have chosen to specialize in 2 areas: application performance and disaster recovery. I have always known that planning BC/DR is hard and costly for any SAP application, but doing it with SAP HANA, which is an in-memory database, posed new challenges!
One key design goal was to show that this BC/DR solution can even be implemented in a hybrid cloud scenario, and therefore, we set up a VPN tunnel between Deloitte and EMC to take advantage of both companies’ existing Internet connectivity. You can imagine the security concerns from both parties as firewall rules were modified by the security teams for this POC, but in the end, it all worked!
It is worth noting that the initial replication of the approximately 6TB of VMs from Suwanee to Durham took about 16 hours – after that initial replication, delta replication of the logs took significantly less time. Obviously, your mileage will vary, especially if a VPN is involved, since you will be at the mercy of the throughput of the Internet, but the cost of creating a VPN is far less than that of a dedicated leased line.
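For planning purposes, the initial replication figures can be turned into an effective data rate. The sketch below is a back-of-the-envelope calculation only: it assumes 6TB means 6 × 10^12 bytes and that the copy ran for exactly 16 hours, and the function name is made up for this example. Because RecoverPoint compresses data, the rate actually seen on the wire may well have been lower than this effective rate.

```python
def effective_throughput(total_bytes, seconds):
    """Return the effective data rate as (MB/s, Mbit/s), decimal units."""
    bytes_per_s = total_bytes / seconds
    return bytes_per_s / 1e6, bytes_per_s * 8 / 1e6


INITIAL_COPY_BYTES = 6e12         # approximately 6TB (decimal units assumed)
INITIAL_COPY_SECONDS = 16 * 3600  # about 16 hours

mb_s, mbit_s = effective_throughput(INITIAL_COPY_BYTES, INITIAL_COPY_SECONDS)
print(f"{mb_s:.0f} MB/s, about {mbit_s:.0f} Mbit/s")
```

Plugging in your own data volume and the throughput you can realistically expect from your link gives a first estimate of how long your initial synchronization would take.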
At the end of our presentation at SAPPHIRE, one customer came up and told me that he has had Suite on HANA in Production on an appliance since July 2013, but without meaningful BC/DR, and so he has been constantly worried! He stated his delight at finding a real world solution with actual metrics, and best of all, it is a solution that he can implement immediately!
I am proud to be part of a team of passionate, dedicated, and talented people from 4 great companies, Deloitte, EMC, VMware, and Cisco, who have made Phase 1 of Project RUBICON such a success, and you can get more details of our work in the PDF of our presentation below as well as in a white paper.
But I still have a lot more to share with you all about Project RUBICON, so please do stay tuned!
Tim K. Nguyen
SAP Global Technical Evangelist & EMC Cloud Architect
EMC Global Solutions Marketing - SAP