Welcome everybody! We don't pretend to know it all, but we like to think that we understand how things work. Feel free to comment if you think something needs to be added to our story and ask questions if you wonder how to do the math.
To start the discussion I would like to open with "performance, what exactly is it?"
Performance is how users / admins perceive responsiveness of applications. One person might think everything is working just fine while another person thinks it's very bad. Can we say that I/O response times of 30 ms are bad? No we can't. "It depends" is a popular answer EMC often uses and they are right: it depends!
Storage performance in the end always comes down to the physical storage medium, whether it is a rotating disk or an EFD (SSD). The more heavily your storage array is used, the smaller the relative performance improvement from the cache becomes. What I mean is this: if you have a single LUN of 1 GB, the cache is used for that single LUN only and performance is great. But when you have 200 TB sliced up into LUNs and you're using them all, the same amount of cache has to serve all of them. The FIFO (first in, first out) mechanism deletes the oldest data from the cache first, and with that many cached LUNs the "oldest" data might not be old at all, which you will notice when you need to access that data again!
I always do my calculations by adding up the physical storage devices. The cache improves the outcome, which is always nice, but at least I'm being cautious and I'm on the safe side.
The magic word in performance discussions is "IOps". 1 IOps is 1 "input or output operation per second". A rule of thumb EMC uses is that a single disk rotating at 15k RPM can handle 180 IOps, a 10k disk can handle about 140 IOps, a 7200 RPM disk can do 80 and a power efficient 5400 RPM drive can only do 40 IOps. An EFD (flash drive or SSD) can handle 2500 IOps. As said, this is a rule of thumb based on small random (4kB) blocks. When the blocksize increases the number of IOps will go down, but the amount of MBps goes up. When the blocksize decreases the amount of IOps goes up and the amount of MBps goes down.
When you combine multiple disks in a RAID group (RG) or a pool, you simply multiply the numbers by the number of disks.
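To make the rule of thumb concrete, here is a minimal Python sketch. The per-drive numbers are the ones quoted above; the function names and the 1 MB = 1024 kB convention are just choices made for this illustration, and cache effects and write penalty are deliberately ignored.

```python
# Rule-of-thumb IOps per drive type as quoted in this thread
# (small random 4 kB I/O). These are planning figures, not measurements.
RULE_OF_THUMB_IOPS = {
    "15k": 180,
    "10k": 140,
    "7200": 80,
    "5400": 40,
    "efd": 2500,
}


def raw_group_iops(drive_type, drive_count):
    """Raw back-end IOps of a RAID group or pool: per-drive IOps times drives."""
    return RULE_OF_THUMB_IOPS[drive_type] * drive_count


def throughput_mbps(iops, block_size_kb):
    """Throughput in MB/s for a given IOps rate and block size (1 MB = 1024 kB)."""
    return iops * block_size_kb / 1024


print(raw_group_iops("15k", 5))   # five 15k drives -> 900
print(throughput_mbps(180, 4))    # one 15k drive at 4 kB blocks
print(throughput_mbps(180, 64))   # same IOps at 64 kB would be far more MB/s
```

The last two lines only show the relationship between IOps, block size and MBps. In practice a drive will not sustain the same IOps at 64 kB as at 4 kB, which is exactly why the rule-of-thumb numbers have to be lowered for larger blocks.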
But the issue isn't what your array can deliver at its best, but what your applications need! And most of the time the administrators don't know what they need. "I need 1TB and it needs to be fast!" That sounds familiar, doesn't it? But how fast is "fast"? Asking them how many IOps they need almost always gets you a weird look, as if you suddenly started speaking a foreign language. On a Windows host you can do some measurements using the tool "perfmon". You need to find the number of read and write operations per second and you should check the I/O size as well. If the I/O size is around 4kB you can use the rule of thumb provided by EMC; if the block size is larger you need to use lower numbers. The ratio between read and write IOps is important too, since 1 write I/O will trigger more I/Os on the storage back end, depending on the RAID level used for that particular LUN. This handicap for write I/Os is called the "Write Penalty" (WP).
The Write Penalty is the mechanism that makes write I/Os slower than reads. The random write WPs for the following RAID levels are:
- RAID10 = 2
- RAID5 = 4
- RAID6 = 6
- There is no such thing as read penalty. A single read I/O will always be only 1 actual I/O.
- For RAID10 each I/O is written to 2 disks, so that explains the WP of 2 for that particular RAID level;
- For RAID5 the mechanism works as follows: 1 random block needs to be written, but this has an impact on the parity as well. So the old block is read, as well as the parity block; then the new data is compared to the old data and the change in parity is calculated; then the new data as well as the new parity are written to disk. So 2 read and 2 write I/Os took place, which makes 4 in total for this single host write I/O;
- For RAID6 this works about the same as RAID5, only here 2 parity blocks need to be read and written, so the WP will be 6.
For sequential writes the WP is usually lower than what I mentioned earlier, since a whole stripe is written at once and the parity changes only once. For a 4RAID5 RG (4+1) the WP for sequential write I/O is 1.25 (25% overhead because of the 1 extra disk on 4 data disks). In an 8RAID5 RG (8+1) the WP is only 1.125 (12.5% overhead). In RAID10 the WP is still 2, since there is no parity, but true mirroring.
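The write penalties above fit in a few lines of Python. This is only a sketch of the arithmetic in this post: the random-write values are the ones listed, and the sequential case assumes ideal full-stripe writes on RAID5 with exactly one parity disk per stripe. The names are made up for the example.

```python
# Random-write penalties per RAID level, as listed above.
RANDOM_WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}


def sequential_write_penalty_raid5(data_disks):
    """Full-stripe write on RAID5 (data_disks + 1 parity disk).

    One parity write is added per full stripe of data writes, so the
    overhead is (data_disks + 1) / data_disks.
    """
    return (data_disks + 1) / data_disks


print(sequential_write_penalty_raid5(4))  # 4+1 -> 1.25
print(sequential_write_penalty_raid5(8))  # 8+1 -> 1.125
```

Real workloads are rarely 100% full-stripe, so the effective penalty will sit somewhere between the sequential and the random value.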
If a host has 90% reads and 10% writes, only 10% of the I/Os will suffer from the WP, but if the host has only 10% reads and 90% writes, 90% of the I/Os will suffer from the WP. Choose your RAID level carefully, since choosing the wrong one can be disastrous for your performance!
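To see how much the read/write mix matters, here is a small sketch that turns host IOps into back-end IOps using the random write penalty. The 1000 host IOps figure and the function name are hypothetical, picked only to make the two mixes comparable.

```python
def backend_iops(host_iops, read_pct, write_penalty):
    """Back-end IOps = reads + writes * write penalty (worst case, all random)."""
    reads = host_iops * read_pct / 100
    writes = host_iops - reads
    return reads + writes * write_penalty


# Same 1000 host IOps on RAID5 (WP = 4), two very different mixes.
print(backend_iops(1000, 90, 4))  # 90% reads: 900 + 100 * 4 = 1300.0
print(backend_iops(1000, 10, 4))  # 10% reads: 100 + 900 * 4 = 3700.0
```

On RAID10 (WP = 2) the write-heavy host would need 1900 back-end IOps instead of 3700, which is the kind of difference the RAID level choice is about.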
Thank you for the elaborate post, Rob.
I agree with you when you say, "Storage performance in the end always comes down to the physical storage medium"
With that in mind, could you explain the importance of RTO and RPO? As a customer or implementation specialist, what do you keep in mind at the planning stage?
Performance comes at a cost! Having said that, it is also up to storage admins to make the right choice depending on the storage system.
I always recommend the following configuration, depending on the customer's requirements:
Performance: RAID 1 + 0 (Expensive compared to RAID 5 but gives better performance)
Capacity + Performance: RAID 5.
Also, could you share your experience with best practices for performance on EMC CLARiiON storage systems for different types of servers/applications (MS Exchange, database servers, etc.)?
Ankit, RTO and RPO are usually used in backup scenarios. I can imagine that during restores (RTO) the storage is hammered hard because of all the data coming back, and you might take that into consideration while designing your storage config, but I've never come across such a scenario. Backing up, however, does cost performance on a regular basis, but then again: the research you did before creating the design should have revealed that, and you should have designed your storage to handle the extra I/O during backups. If you want backups to go faster, make sure you're looking at the right bottleneck. Using an old LTO-1 tape drive will not get you anywhere near backing up at 750GB per hour. Look at all components if you want to create a new design (and you have the time for it). If time is the issue, make sure you make the right assumptions.
You could design for handling peak I/O without delays, but you can also consider a lighter config and accept that utilization flatlines at 100% for a bit longer during peak I/O, if that saves you money. But be aware that production can feel the impact when you do so, and taking that chance might be undesirable.
Your advice about RAID10 for performance and RAID5 for Capacity + Performance is a rule of thumb. You should always investigate what you need. RAID5 isn't necessarily bad for performance. If you have enough drives in a RAID5 configuration, this can even outperform a RAID10 config; simply do the math and see for yourself! Jon will publish a new post shortly and he will explain how to do the calculation. If you have questions, ask! We can give examples if you want.
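As a quick illustration of "do the math", the sketch below estimates how many host IOps a RAID group can sustain for a given read/write mix. It uses the rule-of-thumb 180 IOps per 15k drive and the random write penalties from earlier in the thread; the drive counts (8 versus 12) and the 70:30 mix are hypothetical, and cache is ignored.

```python
def max_host_iops(drive_count, iops_per_drive, read_fraction, write_penalty):
    """Host IOps a group can sustain when every write costs write_penalty back-end I/Os."""
    raw = drive_count * iops_per_drive
    write_fraction = 1 - read_fraction
    return raw / (read_fraction + write_fraction * write_penalty)


# Hypothetical comparison at 70% reads / 30% writes on 15k drives.
raid10 = max_host_iops(8, 180, 0.70, 2)    # 8-drive RAID10
raid5 = max_host_iops(12, 180, 0.70, 4)    # 12-drive RAID5
print(round(raid10), round(raid5))         # 1108 1137
```

With these assumed numbers the larger RAID5 group comes out slightly ahead of the smaller RAID10 group. Shift the mix towards writes and RAID10 wins again, so the outcome really does depend on what you measured.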
Thirdly: I can't say you need RAID10 for Exchange and RAID5 for a file server. Look at the previous explanation. Investigate before making decisions!
Let’s build on Rob’s post with some performance troubleshooting. For example, a server administrator has asked us to create a LUN that can handle 1000 IOps in a R:W ratio of 70:30. We calculated that he would end up with 700 read IOps and 300 write IOps from the server perspective. Since we only have RAID5 at our disposal, we calculated that the back-end write IOps would not be 300 IOps but in fact 300 x 4 = 1200 IOps. Remember, this is an “all random I/O”, worst-case calculation. Real life will probably have some sequential write I/O, lowering the write penalty a bit. But since making assumptions can cost you dearly, if you don’t have facts, assume the worst.
So Rob designed a LUN that needs to handle 700 + 4x300 IOps = 1900 IOps. Calculating with 15k FC/SAS drives at 180 IOps each, he ended up with 1900 / 180 = 10.56 drives. Since EMC doesn’t sell parts of a drive, he built an 11-disk RAID 5 group, created the LUN and allocated it to the server. Job done!
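For anyone who wants to redo this sizing with their own numbers, the arithmetic can be sketched as follows. It assumes the rule-of-thumb 180 IOps per 15k drive and the random RAID5 write penalty of 4; the function name is made up for the example.

```python
import math


def drives_needed(backend_iops, iops_per_drive):
    """Round up, since you can't buy a fraction of a drive."""
    return math.ceil(backend_iops / iops_per_drive)


read_iops = 700            # 70% of 1000 host IOps
write_iops = 300           # 30% of 1000 host IOps
raid5_write_penalty = 4    # worst case, all random writes

backend = read_iops + write_iops * raid5_write_penalty
print(backend)                      # 1900
print(drives_needed(backend, 180))  # 11
```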
Or is it? A couple of weeks later, the server admin comes back:
“my customers are complaining about the performance of the application. It’s the storage, fix it please!”
So how do you continue?
Customers (or server admins for that matter) usually don’t care about utilization, throughput, R:W ratios, block sizes and the lot. They want only one thing: low response time. Think of yourself: opening Google, you only want the page to be there quickly. You don’t care how many CPU cycles it takes, whether it’s coming from cache or whether it’s efficient on the back-end.
So I always start by looking at response times. If I can, I prefer to start on the server end, using perfmon or an equivalent. This gives me the best view from the customer perspective, and it allows me to look at the storage as a whole (SAN switches included). And, not unimportant, it allows me to check the assumption the server admin made: if I see <10 ms response times to the storage at all times, my money is on a different server component being the bottleneck.
So let’s assume the server does show response times of 100 ms. Ok, it’s a storage bottleneck. Make a note of the other perfmon data you have at your disposal: the server will also show you how many writes and/or reads it’s sending. This may come in handy during the following steps, or might give you some clues where to look. For example, if you know your RAID group can deliver 1900 IOps, you know it’s dedicated to this server, you verified that the R:W ratio is in the neighborhood of 70:30 and you only see 1000 IOps going towards the storage, my money is not on the disks or the RG. Probably it’s a bottleneck somewhere else: storage processor utilization, cache configuration, an overloaded SAN ISL, etc. On the other hand, if you see the LUN requesting 5000 IOps all the time, with your knowledge of the disk setup you can already assume with fair certainty that the disks aren’t keeping up.
So after perfmon on the server side, I usually jump straight into Unisphere Analyzer, skipping the SAN switches. I haven’t had too many SAN bottlenecks yet, so I save myself the time. Again, I start at response times for the LUN. If the server is seeing 100 ms response times and the storage is reporting 20 ms response times, I know I have to double back to the SAN for the 80 ms gap. If the storage is also reporting 100 ms response times, I know I can focus there. From there on, I start at the storage processors and drill my way down to the disks, checking metrics such as utilization, throughput and queue lengths. Don’t forget to look at the LUN/RAID group configuration as well: maybe at some point someone disabled write cache!
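The first triage step described above (compare what the host sees with what the array reports) can be written down as a tiny decision helper. This is only a sketch: the 10 ms "healthy" threshold comes from the post, but the rule "blame the path when the gap is larger than the array's own response time" is an assumption made for this example, as are the function name and the return strings.

```python
def locate_delay(host_ms, array_ms, healthy_ms=10.0):
    """Rough first guess at where the response time is being spent.

    host_ms:  response time measured on the server (e.g. perfmon)
    array_ms: response time the array reports for the same LUN
    """
    if host_ms < healthy_ms:
        return "not storage: check other server components"
    gap_ms = host_ms - array_ms
    # Assumption for this sketch: if most of the time is spent outside
    # the array, look at the path (SAN switches, ISLs, HBAs) first.
    if gap_ms > array_ms:
        return "path: check SAN switches, ISLs and HBAs"
    return "array: drill down from storage processors to disks"


print(locate_delay(100, 20))   # 80 ms gap -> path
print(locate_delay(100, 100))  # no gap -> array
print(locate_delay(8, 5))      # below 10 ms -> not storage
```

It obviously doesn't replace looking at the graphs, but it does force you to collect both numbers before pointing at the disks.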
At this point I think there’s no single flow of troubleshooting: it entirely depends on what you find and even more so on how you “connect the dots”. For example, in the previous example where the server was experiencing high response times but wasn’t using that many IOps, we also experienced slowdowns for approximately 50% of the environment and the problem was apparent on two storage systems at the same time. Checking some servers, we quickly came to the conclusion that the problem was only experienced on all LUNs attached to storage processor A. We used the VMware Infrastructure Client for this, which combined with a consistent VMFS data store naming convention made connecting the dots easy.
It turned out that a DBA was restoring a database to a virtual machine, doing so at the perfect speed of 300+ MB/s. The downside was that the insane amount of write throughput completely overloaded the storage processor, causing all other LUNs on that SP to slow down. The reason the other storage system also experienced a slowdown was because both systems were mirroring to each other, causing the writes on that system to slow down as well.
So what is the most common troubleshooting you have to do? What do you find easy or hard, or would you recommend to other storage troubleshooters? Which tools do you use? Let us know!
Thank you for the detailed response, Rob and Jon. Rob provided the theoretical part, whereas Jon provided the scenario, which helped to understand a storage admin's requirements, and a flow for narrowing down what causes a problem (connecting the dots)!
So what is the most common troubleshooting you have to do? What do you find easy or hard, or would you recommend to other storage troubleshooters? Which tools do you use? Let us know
- When it comes to troubleshooting there are many things we take into account, right from the physical layer up to the application. (I cannot share the detailed procedure.)
- I am still new to all this, but I try my level best to provide the best recommendation as per EMC Best Practice.
- Some proprietary tools & Unisphere Analyzer.
So based on all this, can I drop the idea that log disks of database servers have to be placed on a RAID10 LUN?
At this moment, the policy at our company is to use RAID10 for production databases (high performance), while for testing, deployment and acceptance databases we are using 4RAID5 for higher storage utilization, which saves some money.
As a rule of thumb it's an understandable rule, but if you really want to be sure, you should calculate what you need!
Suppose you have a database server that has 100% writes on the log disk and 85R/15W on the database drive, but the number of write IOps to the log drive is only 100 as seen on the host. Now it doesn't matter whether you use RAID10 or RAID5: a 2-drive RAID10 (or RAID1) on 15k RPM drives can handle 2 x 180 = 360 IOps, so the 200 back-end write IOps (2 x 100 because of the WP of 2) are no problem at all. On 4RAID5 (4+1) the RG can deliver 5 x 180 = 900 IOps, while even at the worst-case random WP of 4 only 400 are needed (4 x 100). So RAID5 is also not a problem if you have IOps to spare. But if all IOps of an RG are already accounted for and you still have some space left, do not think you can safely hand that space out to new servers: your calculations say there are no IOps left, so eventually you will get a performance problem somewhere.
The scenario I think you have in mind is a dedicated RAID10 RG of several disks (for example 6+6) which is only used for the logs of more than 1 server. Add up all the required host write IOps and see if your RAID10 RG can handle that and if the space is sufficient. If you have a dedicated pool with RAID10 RGs especially for log drives I think you're safe, because you can easily expand the pool to provide more space or IOps. With the new Flare 32 the existing LUNs will even be redistributed across all drives after the addition of new drives.
Calculation is always the best method, but if you have lots of servers and the amount of space for the logs and the IOps they need are in balance (you don't have room left when the IOps are all accounted for) you just might have a best practice for your company to do what you do now.
When it comes to optimizing performance you eventually will run into storage bottlenecks.
To name a few:
1. Maximum bandwidth of the SAN / storage ports
2. Maximum number of outstanding I/Os on storage ports (throughput)
3. The amount of cache
4. Maximum performance of a disk / RG
5. Maximum size of metaluns, and so the same as bottleneck number 4
1. We can answer this one very quickly: if the number of MBps to/from your array falls short, you need more bandwidth. You can accomplish this by spreading the load over the existing storage ports more efficiently or by adding more ports to your array. One thing you can be sure of: if you're hammering the storage ports so hard that they actually reach the max they can handle, the disks are keeping up! So the disk / LUN layout is not a bottleneck at this point.
2. Each storage port can only handle a certain number of outstanding I/Os. Mostly this number is 2048 for a vast range of arrays. For Clariion / VNX you need to do the math on 1600. If a port gets more than that, the port will issue a QFULL reply to the HBA that requested the extra I/O. This QFULL will trigger an action on the operating system of that particular host. In the old days such a reply could make the OS lose access to its disk (LUN), but modern OSs can deal with this and will slow down sending I/Os to the array to allow the array to "recover" from this overload. This will however slow down I/Os on the host, and so slow down the application. Ways to avoid this are configuring the maximum queue depth / execution throttle in the HBA / OS or moving hosts to other storage ports. In EMC Midrange arrays you can use Analyzer to see if QFULLs are an issue. Set Analyzer to advanced before starting it, open the 3rd tab and select the storage ports. You will see a metric called QFULL which you can select.
3. The amount of cache is always a bottleneck, but you might never encounter a problem. When you do, check if forced flushing is taking place. If it is, the disks may be too slow to handle the load. A wide variety of causes and solutions exist, ranging from changing storage tiers and adding disks to changing the layout of your LUNs. Traditional LUNs are carved out of a single RG, but a RAID group can only handle so many IOps, and so can the LUNs that reside on it. When cache performance is an issue, this usually is a trigger to go look for problems on the disk side.
4. As each disk can only handle, for example, 180 IOps, a LUN which resides on an RG can handle at most that number of IOps times the number of disks. If other LUNs are on the same RG, the RG's maximum performance is shared between all LUNs on that particular RG. Furthermore there's a little caveat, usually brought up together with "Little's Law" from queueing theory: even though 1 disk can handle a certain number of IOps, this comes at a cost. Up to about 70% of the maximum IOps the response times are reasonably ok, but above that they go up steeply. If you plot the response times in a graph you will see that the curve bends upward from the very beginning, but until around 70% this is acceptable. Above 70% a small increase in IOps causes much higher response times. So even though you can actually get 180 IOps from a single disk, consider that actually reaching this value might hurt. A lot of performance graphs in the market today have a threshold at around 70%, indicating that above it the performance is reaching critical levels.
5. When you have an array with multiple RGs you will notice that certain RGs perform very well while others suffer from heavy-duty hosts which hammer the LUNs (and so the RGs) so hard that performance goes bad. A way to avoid this in the past was to implement metaluns. A metalun is actually 2 or more LUNs connected together so the I/Os will go to 2 or more RGs, so in the end more disks will handle the I/Os. If the load is spread evenly, chances are that hot spots will no longer occur, but careful planning and rearranging will take up a lot of time from the storage admins. A metalun can be formed in 2 ways: concatenated and striped. If you need to enlarge a LUN for performance reasons and you need to do this on a single RG, you'd better use concatenated expansion. If you want the better-performing option, striped expansion, you need to expand the LUN using equally large component LUNs on other RGs. Striping makes sure that all disks in the involved RGs are used equally.
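Two of the bottlenecks above lend themselves to quick arithmetic: the port queue limit from item 2 and the response-time knee from item 4. The sketch below is a rough illustration only. The 1600 limit is the figure quoted above; the idea that the HBA queue depth applies per LUN and that all hosts fill their queues at once is a pessimistic assumption, and the response-time curve uses a simple single-queue (M/M/1-style) model, service time / (1 - utilization), which is not taken from EMC documentation. Host counts, queue depths and function names are hypothetical.

```python
PORT_QUEUE_LIMIT = 1600  # per-port figure quoted above for Clariion / VNX


def worst_case_outstanding(hosts):
    """Sum of lun_count * queue_depth over all hosts sharing one storage port.

    hosts is a list of (lun_count, queue_depth_per_lun) tuples. Assumes
    every host fills every queue at the same moment (pessimistic case).
    """
    return sum(lun_count * queue_depth for lun_count, queue_depth in hosts)


def port_can_qfull(hosts, limit=PORT_QUEUE_LIMIT):
    """True if the worst case exceeds what the port can queue."""
    return worst_case_outstanding(hosts) > limit


def estimated_response_ms(iops, max_iops):
    """Single-queue estimate: service time / (1 - utilization)."""
    utilization = iops / max_iops
    if utilization >= 1:
        return float("inf")
    service_ms = 1000 / max_iops
    return service_ms / (1 - utilization)


# 20 hypothetical hosts, 4 LUNs each, queue depth 32 per LUN.
hosts = [(4, 32)] * 20
print(worst_case_outstanding(hosts), port_can_qfull(hosts))

# Response time of one 180 IOps disk at 50%, 70% and 90% load.
for pct in (50, 70, 90):
    print(pct, round(estimated_response_ms(180 * pct / 100, 180), 1))
```

In this model a single 15k disk goes from roughly 11 ms at 50% to about 19 ms at 70% and 56 ms at 90%, which matches the shape of the curve described in item 4: acceptable up to around 70%, painful beyond it.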
The storage pool technology is in fact a set of RGs which work together to handle I/Os for all LUNs that reside in this pool. In this pool each RG has its own RAID protection. A single LUN will be spread across all available RGs as will all other LUNs. This way the pool provides a way of load balancing and all disks are used and almost no hot spots will spoil the performance of 1 or more LUNs.
Thanks for your answer...
For me, this whole discussion is very helpful. For our storage systems I am in a kind of transition from traditional RAID groups to storage pools. There are several RAID groups with only two disks for logging.
I was planning to create a new storage pool for logging containing all the disks of the old raid groups. The idea was to use RAID10 again, but I will check the IOPS needs of the servers.
Maybe it is an option to use RAID5 after all.
Yeah, this is one of the things I keep struggling with myself as well. In the field it is not always possible to calculate it perfectly.
Especially now with the storage pools, FAST VP, FAST Cache etc coming up, we've got most DB servers (logs, db, everything) up in the pool with the rest of the servers. My experience is that the sheer amount of disks in the pool absorbs any write penalties with ease. Furthermore, if you have some SSD in the pool my experience (especially with the CX4 range) is that you'll run into storage processor utilization problems way before the disks become a problem. So there may still lie a problem: RAID5 and RAID6 need to do much more CPU intensive parity calculation than a “simple” mirror to another disk.
Of course if you know beforehand that you need many thousands of IOps that are very write intensive and random, I'd still without a doubt go for a dedicated RAID10 RAID group. But don't forget that LUN Migrate is still working; give it a shot on RAID5 and move to RAID10 if it's not working out. (Or the other way around if you have to play it very safe.)
Most of the log disks are now in a dedicated 4RAID5 pool, because FAST Cache has to be disabled on log LUNs, as per EMC best practice. (At least, that is what they told me during implementation.)
To create a RAID1/0 Storage pool, I have to use the RAID5 storage pool anyway for migration purposes... So I am curious what RAID5 will bring for the log disks.
If you don't need the IOps RAID5 isn't that bad at all. I've even seen log disks that were several hundred GBs in size compared to the db disk which was perhaps 200GB bigger than the log disk, so you simply can't say 1 size fits all. If you don't have bottlenecks now, check (perfmon for example) how many read and write operations as well as the R/W ratio you have.
Hmmmmm, I'm not aware that FAST Cache should be disabled on the log LUNs. I would expect FAST Cache to be really helpful there as it absorbs excessive writes from RAM cache to EFD drives instead of the slower rotating drives.
Let's check if anyone can help us here: does FAST Cache need to be disabled on LUNs that are used for database logging?