A discussion started recently on the “Everything Oracle at EMC” community as to whether Oracle RAC is becoming a corner case. This was an excellent topic and brought about many discussions with my peers at EMC and around the globe.
Oracle Real Application Clusters (RAC) is primarily used for high availability (HA). There are some use cases where it is also used for scalability. We have two of these environments, but they are the corner cases and must be handled with caution, since most of the outages in these environments are actually caused by RAC itself.
In the cases where Oracle RAC is used for high availability, the question becomes: how much availability is required? In IT we usually discuss high levels of availability in terms of 9’s. This ranges from 99% (87.6 downtime hours/year) to 5x9’s or 99.999% (about 0.09 downtime hours/year, or roughly 5 minutes) uptime. Generally, almost all businesses can easily survive their most critical applications being down, for an unexpected reason, for 5 or 10 minutes occasionally. Since almost all Linux-based x86 servers have become very reliable, a business can expect that a server crash will be a rare event. This is even more true when the server is diskless, since hard disks are the least reliable component and in that case sit in a storage array that protects them. I actually cannot remember the last time I lost one of my x86 servers to a non-RAC related crash.
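The conversion from a number of 9’s to a yearly downtime budget is simple arithmetic, and a few lines of Python make it easy to check. This is only a sketch: the function name is made up for illustration, and it assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (an assumption)


def downtime_hours_per_year(availability_pct):
    """Return the yearly downtime budget, in hours, for an uptime percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


# Print the budget for two through five 9's.
for pct in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% uptime: {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Under these assumptions, 99% works out to about 87.6 hours a year and 5x9’s to a little over 5 minutes a year.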
This raises the question of whether we really need Oracle RAC for HA. With the virtualization technologies in VMware’s vSphere, such as vMotion and DRS, we are now protected from an impending server failure. vMotion allows an administrator to perform live migrations from a failing ESX server to another server. These migrations do not impact the end user, except for a slight performance drop for a few seconds. In addition, if a physical server does fail, vSphere will automatically restart the virtualized OS on another physical ESX server.
With this kind of technology, it is easy to achieve 4x9’s uptime, meaning 52.5 minutes a year of downtime. Assuming that a VM can be restarted and the application fully restored in 10 minutes, that allows the application 5 unplanned outages per year. Almost all of EMC’s applications automatically reconnect to a restarted database, so our average is about 3 minutes to restore the application to working order. Keeping in mind that I cannot remember the last time we had a server failure, we can meet this SLA every time.
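The outage budget can be worked out the same way. The sketch below counts how many whole outages of a given length fit inside a 4x9’s SLA. The function name is hypothetical, and the 120-minute figure is only an illustrative stand-in for an outage measured in hours, not a measured value.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_outages(availability_pct, minutes_per_outage):
    """Return how many whole outages of the given length fit in the SLA."""
    budget_minutes = MINUTES_PER_YEAR * (100 - availability_pct) / 100
    return math.floor(budget_minutes / minutes_per_outage)


print(allowed_outages(99.99, 10))   # 10-minute VM restart and full restore
print(allowed_outages(99.99, 3))    # 3-minute average restore
print(allowed_outages(99.99, 120))  # illustrative 2-hour outage (assumed)
```

With a 10-minute restore the budget covers 5 outages a year, with a 3-minute restore it covers 17, and a single outage lasting hours uses up more than the whole year's budget.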
Let’s now compare this to our corner case of Oracle RAC. With our RAC databases, we generally have one or more unplanned outages per year. They tend to be caused by either a bug in RAC or a bug in CRS. In many cases it starts with one node hanging while it waits for some resource. The other nodes in the cluster then try to evict that node, or they do not. Evictions do not always occur, and when they do not, all instances of the database start hanging and there is no easy way to figure out the source of the problem. This generally leads to killing processes by best guess and/or just restarting all instances. When one of these scenarios happens, total downtime tends to be measured in hours.
This is the primary reason we are moving to vSphere for all mission critical databases that do not require Oracle RAC for scalability reasons. Since vSphere 5, we can now have virtual machines with 32 vCPUs, which is more than enough for 95% of all of our databases.
So is Oracle RAC becoming a corner case? It sure is for EMC IT.