My fellow blogger @Noel Wilson recently posted part one of a three-part series that hopes to explain how EMC views the Private, Hybrid, and Public Cloud scenarios that our customers deal with every day.  Unlike Noel's excellent overview of Cloud scenarios, this entry is part two of my multipart series, which focuses not on constructs, concepts, and scenarios but on the actual tires on the proverbial road.

Back on the 11th of September, I posted a blog about "Corporate Availability" and mentioned that the white paper would be published soon.  Well, it's out, and it's more detailed than even I expected.  It's so detailed that we will be releasing a series of video demos to explain all of the aspects the white paper covers.  Just in case someone forwarded this blog to you and the links got stripped out, the paper is on emc.com (http://www.emc.com/collateral/white-paper/h13360-cont-avail-ms-apps-wp.pdf).  So here is a brief summary of what you'll find in this mammoth white paper:

 

Microsoft Business Applications:

  • Microsoft Exchange Server 2013
  • Microsoft SharePoint Server 2013
  • Microsoft SQL Server 2012

EMC Technologies:

  • EMC VNX
  • EMC RecoverPoint
  • EMC VPLEX Metro

VMware Technologies:

  • VMware vSphere

Here is the basic visual representation of what the paper explains:

[Figure: overview of the three data center configuration]

The idea is that a total of three data center (DC) locations will be deployed: two within 60 miles of each other (the Primary pair), and a third (the Tertiary) as far away as is possible and feasible.  A typical separation between the Primary and Tertiary data centers would be 600 to 900 miles (about 1,000 to 1,500 km).  The two "near proximity" data centers handle the production load in a shared-resources model: either data center can use the resources of the other, and workloads can migrate at a moment's notice from one DC to the other.  The tertiary data center is only put into production when both Primary DCs are offline for network, electrical, or physical reasons.

Please note that the tertiary data center will be "moments behind" (in replication terms) the two Primary data centers, so reporting, test, development, analytics, cube builds, and the like could all be housed and operational at the tertiary location.  There is one additional workload that would be running at the tertiary data center by default: Backup!  Operational backups should always be made in the building that DID NOT just catch fire...  If backups can be taken AT the offsite location, all the better.  Why back up, then replicate?  Why not replicate, then back up?
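For readers who think in code, the placement rule described above can be sketched in a few lines of Python.  This is only an illustration of the rule as stated in this post, not anything taken from the white paper: the site names and the production_sites function are hypothetical, and the real failover is handled by the EMC and VMware tooling, not by a script like this.

```python
from enum import Enum


class Site(Enum):
    """Hypothetical labels for the three data centers."""
    PRIMARY_A = "primary-a"
    PRIMARY_B = "primary-b"
    TERTIARY = "tertiary"


# The two "near proximity" data centers that share the production load.
PRIMARIES = (Site.PRIMARY_A, Site.PRIMARY_B)


def production_sites(online):
    """Return the sites that should carry production.

    Production stays on whichever Primary sites are online. The tertiary
    site takes over only when both Primary sites are offline.
    """
    up_primaries = [site for site in PRIMARIES if site in online]
    if up_primaries:
        return up_primaries
    if Site.TERTIARY in online:
        return [Site.TERTIARY]
    return []  # nothing is online, so nothing can run


if __name__ == "__main__":
    # One Primary site lost: the surviving Primary carries production.
    print(production_sites({Site.PRIMARY_B, Site.TERTIARY}))
    # Both Primary sites lost: the tertiary site is brought into production.
    print(production_sites({Site.TERTIARY}))
```

Note that the tertiary site never appears in the result while a Primary site is still up, which matches the "only when the pair is offline" rule above.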

In this three-site configuration, the two Primary sites are coupled via Ethernet, and the Fibre Channel fabric is extended between them as well.  There are several ways to accomplish this (none of which are expensive or complex here in the year 2014...).  The third site, however, is connected only via Ethernet, as there is no need to extend Fibre Channel over a distance of 900 miles!  Your exact bandwidth requirements will be set by the actual traffic that is "write oriented": only the changed blocks ever travel to the third site (under normal operations).  Only when the third site is brought into production does read traffic ever "back flow" through the Ethernet to either Site A or Site B, unless of course you have a tertiary iNet point of presence at that third site for use during failures.  In that case, the tertiary site can become the sole remaining resource (furnishing both Primary workloads as well as all backup infrastructure).
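To make the "write oriented" sizing point concrete, here is a rough back-of-the-envelope sketch in Python.  Everything in it is an assumption for illustration: the function name, the 25 MB/s write rate, and the 25% headroom factor are hypothetical, and it ignores any compression or deduplication the replication layer may apply.  Measure your own change rate before sizing a real link.

```python
def replication_bandwidth_mbps(write_mb_per_s, headroom=1.25):
    """Estimate the link bandwidth (megabits/s) to the third site.

    write_mb_per_s: sustained rate of changed data, in megabytes per second.
    headroom: multiplier for bursts and protocol overhead (assumed value).
    Only writes are counted, because reads are served locally under
    normal operations.
    """
    if write_mb_per_s < 0:
        raise ValueError("write rate cannot be negative")
    if headroom < 1:
        raise ValueError("headroom must be at least 1")
    return write_mb_per_s * 8 * headroom  # 8 bits per byte


if __name__ == "__main__":
    # Hypothetical workload: 25 MB/s of changed blocks.
    print(replication_bandwidth_mbps(25))  # 250.0
```

The point of the sketch is simply that read-heavy workloads cost nothing on the long-distance link; only the change rate does.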

The two Primary data centers would be connected like this:

[Figure: connectivity between the two Primary data centers]

Anyway… the scenarios that the paper explains are:

  • A single VM failure (and its recovery)
  • A vSphere host failure (and the resulting recovery)
  • An entire Storage Array outage (like it got hit by a swinging gorilla or backhoe — hey, it could happen!!) and the resulting recovery
  • An entire site failure (like it got taken away by an alien space ship — or backhoe) and the resulting recovery
  • An entire regional outage that takes both primary data centers offline — like a massive grid failure (unfortunately these actually do happen) and the resulting recovery

In all but the last scenario, recovery of services happens within minutes — actually, that's not true… in ALL scenarios, recovery happens within minutes — it's just that the dual DC failure takes about an hour to restart everything at the tertiary site… so "minutes" is more like "tens of minutes"… BUT, in all cases, recovery is COMPLETELY AUTOMATED.  I'm serious.  No hot-line, no midnight con calls with 47 people from four states and nine cities.  It's all automated — and wonderfully dynamic.

 

Please scamper (OK, don't scamper)… Just click.  The white paper is at http://www.emc.com/collateral/white-paper/h13360-cont-avail-ms-apps-wp.pdf and demos will follow in the next few weeks.  We'll do SQL Server first!

 

Please — comment below if you like it — and please submit your criticisms if you can muster any — seriously.  This is the first paper of this magnitude to be published by a major Cloud enabler — we want your feedback so we can continue to bring you the solutions you've asked for!

 

Cheers!