In an effort to open the discussion I will offer a snippet of information that speaks to Availability.
Availability measures the percentage of time the system is able to return data when requested by
the client. Degraded performance is not included in the metric.
For example, five 9s availability means 99.999 percent availability. This percentage translates into
a total data-not-available time of about five minutes and fifteen seconds per year.
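The arithmetic behind that figure can be checked with a short sketch. The function name and the 365.25-day average year are assumptions made for illustration; they are not part of any EMC tool.

```python
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly data-not-available time, in minutes."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for label, pct in [("three 9s", 99.9), ("four 9s", 99.99), ("five 9s", 99.999)]:
    minutes = downtime_minutes_per_year(pct)
    print(f"{label} ({pct}%): {minutes:.2f} minutes per year")
```

For five 9s this works out to roughly 5.26 minutes, which is the five minutes and fifteen seconds quoted above.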
Redundancy has many levels. Our first priority is the data, and after that the operations that use the data. We protect the data with different RAID types based on priority and add the extra protection of a hot spare. Today this is common throughout the industry and available with virtually all storage arrays.
With regard to protecting the operations that use the data, we initially used host clusters with shared data, duplicate HBAs, duplicate switches and duplicate network paths. In current environments we see more virtualization, where the virtual host itself is reduced to a set of files that can also be protected at the storage level.
With so many advances in virtualization over the last couple of years, we can now move virtual hosts across the network to different server hardware, either to take advantage of additional processor power during peak operating times or in the event of a server failure. With this capability we have virtual machines clustered across different physical servers, creating redundancy at the environment level.
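As a rough illustration of why duplicate HBAs, switches, paths or clustered servers help, here is a minimal sketch of the standard formula for redundant components in parallel. It assumes the components fail independently, which real hardware sharing power, firmware or a site may not do, and the 99.9% figure is a made-up example value, not a measured one.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of N redundant components, assuming independent failures.

    The group is unavailable only when every copy is down at the same time.
    """
    if copies < 1:
        raise ValueError("at least one component is required")
    return 1.0 - (1.0 - component_availability) ** copies


# Hypothetical example: a single path that is available 99.9% of the time.
single_path = 0.999
for copies in (1, 2, 3):
    combined = redundant_availability(single_path, copies)
    print(f"{copies} path(s): {combined * 100:.7f}% available")
```

Under the independence assumption, two three-9s paths together behave like one six-9s path, which is why duplication is usually the first step toward a high availability target.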
I took a class on implementing Exchange where we had a virtual machine clustered to another virtual machine at a disaster recovery site. We were using an RP splitter attached to the array to replicate the data across the network for the other side of the cluster. My thought is that this can be expensive, so we would need to prioritize the operations in our environment based on our RTO (recovery time objective).
Thank you for your question and your interest.
Thanks for the information. When you design a system to achieve HA, do you consider performance as a factor? Even with redundant components at the hardware, server, connectivity and storage levels, if performance cannot meet the requirements of the application we would still have downtime. An OLTP application, for example, requires strict response times. HA does not help when the response time is that bad.
What do you think?
You are correct; however, with regard to any specific piece of equipment, we do not count degraded performance in the metrics for five nines or 99.999% uptime. I agree that if the performance of the application is less than acceptable, then we must still consider the data unavailable for our requirements.
We can assume that at some point we will experience a failure, so we must consider what our RPO (recovery point objective) is going to be. RPO is determined by the amount of time between data protection events, and reflects the amount of data that could potentially be lost during an outage. These intervals can be as small as seconds or as long as hours depending on our priorities and the SLA requirements that we have set for a specific application and our business model. For example, our online sales might take priority over a backup application. While we look at the amount of data that must be available, we must also consider the RTO (recovery time objective): the time required to return the application and its data to a functioning state, which also includes the performance of the application, as you indicated.
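To make the RPO idea concrete, here is a minimal sketch that treats protection events as timestamps. The function names, the four-hour snapshot schedule and the failure time are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import Sequence


def data_loss_window(protection_events: Sequence[datetime],
                     failure_time: datetime) -> timedelta:
    """Time between the last protection event before the failure and the failure."""
    earlier = [t for t in protection_events if t <= failure_time]
    if not earlier:
        raise ValueError("no protection event precedes the failure")
    return failure_time - max(earlier)


def worst_case_rpo(protection_events: Sequence[datetime]) -> timedelta:
    """Largest gap between consecutive protection events."""
    ordered = sorted(protection_events)
    if len(ordered) < 2:
        raise ValueError("need at least two protection events")
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


# Hypothetical schedule: a protection event every four hours.
start = datetime(2024, 1, 1, 0, 0)
events = [start + timedelta(hours=4 * i) for i in range(4)]  # 00:00 to 12:00
failure = datetime(2024, 1, 1, 11, 30)

print("Data at risk for this failure:", data_loss_window(events, failure))
print("Worst-case RPO for this schedule:", worst_case_rpo(events))
```

Shortening the interval between events shrinks both numbers, which is where the cost trade-off against the SLA comes in.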
I think it is important to note here that a larger amount of data, a shorter RPO and a shorter RTO will each increase our cost substantially. A simple backup and restore has little cost, but the RPO is the time between backups, and the data written in that interval can potentially be lost. At the other end of the spectrum we are looking at a clustered environment with duplicate hardware and asynchronous copies of the data at the array level. This duplicated hardware environment doubles our initial capital cost, which may be worth the expense depending on the business requirements. My personal thought is that we will need as much as 20% of the data center back online immediately if possible, and another 40% available within hours with no data loss. Since we have many options, it will always come down to our business requirements and the cost of our chosen recovery solution.
Your five 9s: is this the VNX's 99.999% or the TOTAL availability from host to LUN? If it's the total, you also need to take into account the availability of the HBAs, cables and SAN. If your SAN only has three 9s, you can't get five 9s availability, even if the VNX has five! It'll be 99.9% x 99.999% = 99.899001%. In fact, to be scientifically correct, it's 99.9%, because of the way the three 9s were presented (one decimal digit).
So besides the five 9s of your LUN accessibility, you also need to take your whole SAN infrastructure into account.
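The multiplication above generalizes to any chain of components that must all be working for the host to reach its LUN. Here is a minimal sketch, assuming the components fail independently; the component names and figures simply repeat the example from this post.

```python
from math import prod
from typing import Iterable


def series_availability(component_availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(component_availabilities)


# Figures from the example above: a three-9s SAN in front of a five-9s array.
path = {"SAN fabric": 0.999, "VNX array": 0.99999}
total = series_availability(path.values())
print(f"Host-to-LUN availability: {total * 100:.6f}%")
```

The chain can never be more available than its weakest component, so adding HBAs and cables to the dictionary only lowers the total further.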
This discussion began with the VNX product line and how we can best achieve five 9s at the array level. However, as you can see from the prior posts, the discussion has moved on to the environment as a whole. Even if we try to reach five 9s in the environment, losing a path can degrade performance below an acceptable level and render the data essentially unavailable according to our business model or an SLA. I agree with your post. I also think that in today's environments we must go a step further than just looking at an individual host environment.
I find this conversation very enlightening. In my many years of working in the industry I have spent more than 20 of those years in Tech Support. I came to believe that I was not able to address the things that I thought were important to a Customer because I was limited to the single environment that I was supporting. I had the opportunity and the privilege to make a change and I became a Technical Account Manager. Here I work to improve the entire data processing environment for the Customer and much of what I do today follows the conversation that we have engaged in.
In your response you had asked about software. Before I speak to software, please know that my experience is primarily with EMC and its different product lines. I am aware of the virtual offerings from Microsoft and IBM, but my experience is with VMware. In addition, the cost of hardware continues to come down, which allows us to offload work from the processors.
Regarding software, I believe that we will continue to become more virtualized, and that a great deal of the traditional SAN will move to 10 gig and faster Ethernet networks. I believe that our consumer interfaces will move to the Cloud and to mobile apps, and that the environments that require the most security will remain on the traditional SAN. I have to decide what my business model is, or will become, and address those needs. I would create as much of my environment as I can on VMware vSphere, and I would also include vMotion, HA, FT and DRS. There may be some new features that I did not address, but please remember that my expertise is in storage.
vMotion technology allows the movement of live VMs from one server to another as long as both servers are in the same cluster and share the same datastore. This is particularly useful for providing availability during hardware maintenance windows or for moving VMs off a failing server. vMotion can also be used for load balancing when there is performance degradation on an ESX host.
High Availability (HA) is a technology that allows you to create a cluster without any vendor-specific clustering solution. It provides cluster functionality by enabling the OS and applications to restart on a secondary server if the primary server fails.
Fault Tolerance (FT) provides data protection and business continuity. FT technology ensures zero data loss for the application. FT is based on vLockstep technology, in which the primary and secondary virtual machines maintain the same data and instruction state at any given point in time.
In the virtual environment I would also include a Replication Server to manage the backups. In a previous comment I spoke about our need to have 20% of our business model online and functioning as close to immediately as possible. In this regard I would like to see replication at the array level, kept in the same consistency group as the host cluster. EMC has several different offerings for this function and has an entire organization devoted to BRS. They would have a much better idea as to which offering would be best for any specific environment. I am afraid that I do not have the knowledge to be precise in this area.
Again, thank you.
I know it's not really in the midrange arrays, but what about MirrorView? It leans heavily on a single port on SPA and a single port on SPB. I've experienced cases where MirrorView problems ultimately required both SPs to be rebooted, which led to downtime for a selected group of servers. OK, these probably were not configured correctly, that's a fact, but what I meant to say is that external factors can cause downtime as well.
Software for the VNX….
Unified and FLARE code should be at the latest release. The release notes are available, and you can see what is being corrected in each release. This is the latest as I write today: VNX BLOCK OE Bundle 05.32.000.5.201. If used on VNX File or Unified systems, ensure that the VNX OE File code is version 126.96.36.199
Virtual Provisioning (Thin LUNs) has been the most cost-effective method of provisioning disk in years. I really like FAST VP; it allows us to segregate high-I/O data from less active data within the same LUN structure. I like it with the EFDs, the 10K rpm FC drives and the 7.2K SATA drives. It is an extra expense, but it is constantly looking at our I/O profile and improves performance even if we change the application.
With regard to redundancy, backup/recovery and test environments, I believe that SnapView is indispensable. It works great where we create consistency groups and then couple it with MirrorView or an RP appliance. How we use it depends on our goal: are we going to a DR site or a local test environment? The distance between the arrays will be the primary determining factor for which remote copy option we choose.
Please know that these are only my personal thoughts and that the environment and business model will be the determining factor for any decisions.