
Oracle Architect

November 2011

When we talk about virtualization technologies, the main theme is always efficiency. This efficiency is centered on utilization of compute and memory resources. With traditional deployments on physical servers, one server processes only a single workload. These workloads rarely utilize more than 10-20% of a server’s capacity, and most servers don’t even come close to that level. This leaves massive amounts of compute cycles unused in our data centers.

 

In the past, we could tap into these unused resources by stacking multiple workloads on these servers.  This could be multiple unrelated databases or application servers or both.  This definitely improves the server usage and increases our efficiency, but comes with many pitfalls.  The most obvious of these is shared maintenance windows.  All workloads running on these multi-tenant servers now have to coordinate downtimes for patching and upgrades. 

 

Another drawback to multi-tenancy is that resource utilization now has to be managed. Since there are multiple workloads running on a single operating system, we need to make sure utilization does not rise much past 60%. The reason for this has to do with how operating systems schedule processes to run on a CPU. The busier a system gets, the harder it is to find an available thread to execute a request; for most operating systems, this threshold is around 60%. This means that if one workload starts utilizing too much compute, the other workloads are impacted and performance degrades.

 

With virtualization, these multi-tenant workloads can be run in different operating systems on the same physical server. In this case, it is the virtualization layer that has to schedule each virtual CPU on a physical thread. VMware has focused very aggressively on making this process extremely efficient and lightweight, which is likely the reason why vSphere has limitations on how many vCPUs can be allocated to a single VM. Due to this highly efficient virtual scheduling, vSphere is able to fully utilize all physical cores without any significant transaction degradation. Remember, from my last post, that if a particular ESX server does become loaded, vSphere will utilize vMotion to do a live migration of one or more VMs to another, less loaded ESX host. This eliminates the manual process of managing utilization.

 

So, by virtualizing, we can now have the best of both worlds.  We get maximum utilization of compute and memory resources, and at much higher levels than multi-tenancy and even other virtualization technologies.  We also get back to maintenance windows that do not require negotiation and cooperation between different groups.  This allows us to focus on other, more important, areas and less on capacity monitoring.

So what would you focus on? 

 

Let me know here:  If some technology gave you 20% of your day back, what would you use that time for?

EMC’s IT group has published a paper outlining how they deploy Oracle in a virtualized world. This paper describes four distinct models to consider when deciding how to make Oracle databases highly available. The paper begins by describing the journey EMC went through as it began to virtualize its applications. The focus is primarily on Oracle databases, but the journey is virtually the same for any organization virtualizing its entire stack.

 

The first deployment model is by far the simplest.  This model discusses virtualizing single instance Oracle databases.  The benefits of virtualizing a single instance are tremendous.  With VMware virtualization, you immediately get high availability.  With vSphere, if the VM goes down for any reason, vSphere will automatically bring that VM back online.  With vSphere 5, you can add database awareness.  This awareness comes in the form of VMware GuestAppMonitor.  I started a discussion of this feature in the Everything Oracle at EMC community, but basically, you can write a script to check the availability of any application or process.  If this script does not complete in a predetermined time, vSphere will restart the VM and/or send out a notification.  No more worries about a hung database node.
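The script-with-a-deadline idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual GuestAppMonitor interface: the check command is whatever probe you would write for your database, and `send_heartbeat` is a hypothetical placeholder for however the guest signals "still healthy" to vSphere. The point is the shape of the logic: a check that hangs or fails produces no heartbeat, and the missing heartbeat is what lets the platform react.

```python
import subprocess
import sys
from typing import Callable, Sequence


def check_passes(command: Sequence[str], timeout_s: float) -> bool:
    """Return True only if the check command exits 0 within timeout_s seconds."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # a hung check is killed and treated as a failure
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def monitor_once(
    command: Sequence[str],
    timeout_s: float,
    send_heartbeat: Callable[[], None],
) -> bool:
    """Run one check cycle; signal health only when the check succeeds in time.

    send_heartbeat is a hypothetical callback standing in for the real
    mechanism that tells the virtualization layer the application is alive.
    """
    healthy = check_passes(command, timeout_s)
    if healthy:
        send_heartbeat()
    return healthy


if __name__ == "__main__":
    # Stand-in probe: a trivial command that exits 0. A real probe would
    # connect to the database and run a lightweight query.
    probe = [sys.executable, "-c", "import sys; sys.exit(0)"]
    monitor_once(probe, timeout_s=10.0, send_heartbeat=lambda: print("heartbeat"))
```

In practice a loop would call `monitor_once` on a fixed interval; the interval and timeout are tuning choices that depend on how quickly a hung database should be detected.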

 

Also, VMware virtualization comes with a technology called vMotion. This feature allows an administrator to migrate a running VM from one physical ESX host to another. This movement does not require any downtime and is completely transparent to the user. The same capability is also used in an automated fashion to migrate VMs from a heavily loaded ESX server to a less loaded one. That automated feature is called Distributed Resource Scheduler, and it allows organizations to achieve much higher physical-to-virtual server consolidation ratios than other virtualization technologies.

 

The benefits continue with many more features and technologies to make Oracle databases run better, faster and cheaper, but I will save that for my next entry.  Please take the time to check out this paper.  I guarantee you will find it interesting.

The question of whether Oracle is supported on VMware comes up every time I have a discussion about virtualization. There is even a discussion running on the “Everything Oracle at EMC” community. The simple answer is, of course it is. The more difficult discussion is: what makes someone think that it is not supported or is not fully supported?

 

I think that the main cause of confusion is Oracle’s official support statement, note 249212.1 on My Oracle Support. This note is the official support statement for VMware virtualized environments and, as of November 10th 2011, explicitly states “Oracle has not certified any of its products on VMware virtualized environments.”

 

Let’s put aside the fact that this is an overly broad statement for a database support statement. Oracle does not certify the Oracle database on any hardware vendor’s platform except its own, and vSphere is hardware virtualization. The Oracle database certifications are on operating systems, not hardware vendors. A quick look at the database certifications on My Oracle Support confirms this.

 

The next issue from this note is the line: “If that solution does not work in the VMware virtualized environment, the customer will be referred to VMware for support.” This is true for any hardware environment. For example, EMC had an issue with Oracle 10.2.0.3 running on a Sun Sparc E25K. At the time, this was Sun’s largest Solaris offering, and the database was a two-node RAC database with 128 CPU cores on each node. Oracle identified the bug and gave us the “known to work” patch. This patch did not solve our issue, so Oracle Support insisted that we reproduce the problem on a non-E25K environment. So we went through great pains to try to reproduce it on another environment. Occurrences of Oracle Support asking the customer to reproduce in another environment are very rare for any hardware vendor, including VMware.

 

Of all the Oracle customers I have spoken with, and I have spoken with countless customers, only one has ever had Oracle Support even suggest this to them. It was first-level support, and when the customer pushed back, support gave in and worked with them to solve the problem. EMC is one of Oracle’s biggest customers, and this has never been suggested to us even once.

 

If we look at the actual database certifications for Oracle databases, we will easily find that many Oracle database versions are fully supported and certified with Oracle Linux. This makes perfect sense, since the database is certified against operating systems. If we look for support of Oracle Linux on VMware virtualization, we will find support note 417770.1, Oracle Linux Support Policies for Virtualization and Emulation. This official support statement, also as of November 10th 2011, clearly states that “Operating system support for RHEL3 (and higher), Oracle Linux 4 (and higher) under the Oracle Linux Support Program on VMware vSphere (ESX Server).”

 

Now if that still does not give you a good feeling, we can look at VMware’s support position on virtualized Oracle products at: http://www.vmware.com/support/policies/oracle-support.html. This document clearly states that if Oracle Support refers your service request to VMware, VMware will take complete ownership of the issue until resolution. I have asked people I know at VMware how often this has happened, and the response was: tens of times. Now we know there are literally hundreds of thousands of virtualized Oracle databases out there, so there must be thousands of support cases generated on these databases. I know that, at EMC, we generate hundreds of service requests per year and we have only a few hundred Oracle databases. So with only tens of referrals, the support statement would appear intentionally misleading. I also know that of all of the support requests that have been made with Oracle database on VMware, only one issue has ever been identified as a VMware issue, and that was back in the early days of virtualization.

A discussion started recently on the “Everything Oracle at EMC” community as to whether Oracle RAC is becoming a corner case.  This was an excellent topic and brought about many discussions with my peers at EMC and around the globe. 

Oracle Real Application Clusters (RAC) is primarily used for high availability (HA). There are some use cases where it is also used for scalability. We have two of these environments, but they are the corner cases and must be handled with caution, since most of the outages in these environments are actually caused by RAC itself.

 

In the cases where Oracle RAC is used for high availability, the question becomes how much availability is required. In IT we usually discuss high levels of availability in terms of 9’s. This ranges from 99% (87.6 downtime hours/year) to 5x9’s or 99.999% (0.09 downtime hours/year, or about 5 minutes) uptime. Generally, almost all businesses can easily survive their most critical applications being down, for an unexpected reason, for 5 or 10 minutes occasionally. Since almost all Linux-based x86 servers have become very reliable, a business can expect a server crash to be a rare event. This becomes even more obvious when the server is diskless, as hard disks are the least reliable component and are protected in some sort of storage array. I actually cannot remember the last time I lost one of my x86 servers to a non-RAC related crash.

 

This then raises the question of whether we really need Oracle RAC for HA. With virtualization technologies that exist within VMware’s vSphere, such as vMotion and DRS, we are now protected from an impending server failure. This is due to the vMotion technology, which allows an administrator to perform live migrations from a failing ESX server to another server. These migrations do not impact the end user, except for a slight performance drop for a few seconds. In addition, if a physical server were to fail, vSphere will automatically restart the virtualized OS on another physical ESX server.

With this kind of technology, it is easy to achieve 4x9’s uptime, meaning 52.5 minutes a year of downtime. Assuming that a VM can be restarted and the application fully restored in 10 minutes, that gives the application 5 unplanned downtimes per year. Almost all of EMC’s applications automatically reconnect to a restarted database, so our average is about 3 minutes to restore the application to working order. Keeping in mind that I cannot remember the last time we had a server failure, we can meet this SLA every time.
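The downtime budget arithmetic above is easy to check with a few lines of Python. This is a minimal sketch that assumes a 365-day year of around-the-clock operation; the function names are mine and purely illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year of 24x7 operation


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR * 60


def outages_allowed(availability_pct: float, minutes_per_outage: float) -> int:
    """Whole outages of a given length that fit inside the yearly budget."""
    return int(downtime_minutes_per_year(availability_pct) // minutes_per_outage)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% uptime allows {downtime_minutes_per_year(pct):.1f} minutes/year")
    print("10-minute outages at four nines:", outages_allowed(99.99, 10))
    print("3-minute outages at four nines:", outages_allowed(99.99, 3))
```

At four nines the budget works out to roughly 52.6 minutes, which is where the figure of five 10-minute outages per year comes from. With a 3-minute recovery, the same budget stretches to considerably more incidents.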

 

Let’s now compare this to our corner case of Oracle RAC. With our RAC databases, we generally have one or more unplanned outages per year. They tend to be caused by either a bug in RAC or a bug in CRS. In many cases it starts with one node hanging while waiting for some resource; the other nodes in the cluster then try to evict that node, or not. Evictions do not always occur, but all instances in the database then start hanging, and there is no easy way to figure out what the source of the problem is. This generally leads to killing processes by best guess and/or just restarting all instances. Total downtime tends to be measured in hours when one of these scenarios happens.

 

This is the primary reason we are moving to vSphere for all mission critical databases that do not require Oracle RAC for scalability reasons.  Since vSphere 5, we can now have virtual machines with 32 vCPUs, which is more than enough for 95% of all of our databases.

 

So is Oracle RAC becoming a corner case?  It sure is for EMC IT.