No Fluff - How to Build All-Flash Nodes with Infiniband

I have not been very active on these forums, but I have been designing all-flash ScaleIO nodes (including NVMe) for about a year and a half now.  I certainly didn't work alone on these builds and want to give a shout-out to Jason Pappolla (Sys-Admin) and Jun Sato (DBA).  These guys have a huge breadth of knowledge and experience when it comes to VMware, SQL, and storage fabrics.  Earlier this year I wrote a blog on the design approach we used to choose hardware components.  Since then, we have learned a few things, including how to implement an Infiniband storage fabric.  I wanted to write up a quick document on what we have found to be the best hardware, without any sugar coating.  If anything below is inaccurate or has changed with new releases, please let me know and I will update the doc.  After all, this is an EMC ScaleIO forum and we all come here for straight answers.

 

I am going to list the components in order of how important I believe they are in a node design.

 

Storage Controller:

DON'T USE A RAID CONTROLLER

 

Use an HBA with the LSI SAS-3008 chipset, period. Software-defined storage works best when the software can access the disks directly. Putting a RAID card in pass-through mode just causes performance problems, especially under write-heavy loads. Not to mention, HBAs are cheaper than RAID controllers. Here are the model numbers of SAS-3008-based cards by vendor.

    • Supermicro: AOC-S3008L-L8i
    • Dell: PERC H330 controller
    • LSI: SAS 9300-8i
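
If you want to double-check that a node is actually presenting a SAS-3008 HBA to the OS, here is a minimal sketch of the kind of check we run. It assumes a Linux host with lspci installed and matches on the "SAS3008" string lspci normally prints for these cards; adjust for your environment.

import subprocess

def has_sas3008() -> bool:
    """Return True if a SAS3008-based controller shows up on the PCI bus."""
    out = subprocess.run(["lspci"], capture_output=True, text=True).stdout
    return any("SAS3008" in line for line in out.splitlines())

if __name__ == "__main__":
    print("SAS3008 HBA detected" if has_sas3008() else "No SAS3008 HBA found")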

Infiniband Storage Fabric:

 

First, let's address the misconception about the cost of an Infiniband fabric versus a high-quality 10Gb fabric.  When we compared the total cost of a 10Gb fabric built on two Dell N4064F switches against an Infiniband fabric built on two Mellanox SX6036 FDR 56Gbit switches, the Infiniband option was only 14% more.  We didn't care so much about the throughput (nice as it was); what we were after was Infiniband's extremely low latency.  Here's why...

 

Latency between nodes is the silent killer of write performance in a ScaleIO system.  This is especially true with SQL because of TempDB's synchronous write load.  For example, if a developer writes a reporting query that JOINs 6 tables averaging 250 million rows each (yes, this really happened), it generates a huge number of synchronous write operations and will destroy the performance of a ScaleIO system sitting on a high-latency fabric (300ms and above).  Invest in good fabric if you plan on running SQL, especially for reporting servers.  This statement is right out of the ScaleIO Networking Best Practices document...


"Write operations therefore require three times more bandwidth to SDSs than read operations. However, a write operation involves two SDSs, rather than the one required for a read operation" - h14708-scaleio-networking-best-practices.pdf

 

Things to know about Infiniband:

    • Max MTU of IPoIB is 4K, NOT 9K (a quick way to check this is shown after the list)
    • Only buy managed Infiniband switches. This is important
    • If buying used ConnectX-3 cards, avoid the HP/Dell versions because of their locked firmware. Get true Mellanox cards
    • There are a lot of high-quality used QDR (40Gbit) Infiniband parts on eBay, perfect for a proof of concept
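
On the MTU point, here is a minimal sketch for confirming what your IPoIB interface is actually running at on a Linux node. It assumes the interface is named ib0 and the standard sysfs path; both are assumptions to adjust for your environment.

from pathlib import Path

def ipoib_mtu(iface: str = "ib0") -> int:
    """Read the current MTU for an IPoIB interface from sysfs."""
    return int(Path(f"/sys/class/net/{iface}/mtu").read_text().strip())

if __name__ == "__main__":
    print(f"ib0 MTU: {ipoib_mtu()} (expect ~4K, not the 9K you'd set on Ethernet)")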

 

Mainboards:

Use a board that supports SATADOMs. Putting the OS on SATADOMs frees up drive bays for more ScaleIO storage and lets the HBA deal with only one type of IO load. An added benefit on the Supermicro boards is that you can use the on-board Intel controller to RAID-1 two SATADOMs together. You can't do this on a Dell R630XD because it only has a single SATADOM connector, so invest in a high-quality one (I'd spring for SLC).

    • Supermicro X10 and above (dual SATADOM connectors with RAID)
    • Dell R630XD (Single SATADOM connector)
    • Look for boards with integrated features, like Infiniband and SFP+ ports
    • NVMe ports are a major plus

SSDs:

I have crunched a lot of numbers on this, and with the current cost of enterprise SSDs compared to 10K-15K SAS drives, it pretty much doesn't make sense to use up your licensing on spinners (if you are building your own nodes).  Now for some math... Let's take the Samsung SM863 1.92TB SSD, which has a 5-year, 3-DWPD endurance rating and costs $1136.22 (wiredzone.com). That works out to about $0.59 per gigabyte of enterprise SSD storage.
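
Here is the same math spelled out, plus the total endurance implied by the quoted 3-DWPD / 5-year rating (the variable names and layout are just my own sketch):

price_usd = 1136.22        # wiredzone.com price quoted above
capacity_gb = 1920         # Samsung SM863 1.92TB
dwpd = 3                   # rated drive writes per day
warranty_years = 5

cost_per_gb = price_usd / capacity_gb
endurance_tb_written = dwpd * (capacity_gb / 1000) * 365 * warranty_years

print(f"Cost per GB: ${cost_per_gb:.2f}")                                    # ~ $0.59
print(f"Rated endurance: ~{endurance_tb_written:,.0f} TB written over the warranty")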

    • 8 SSDs per SAS-3008 controller
    • Keep in mind the max IOPS the ScaleIO software can deliver per node and don't over-provision
    • SSDs are expensive, hardware is cheap (relatively). In other words, don't throw 24 SSDs into one node; spread them across 3 nodes
    • If it's a production build, use enterprise drives. This should go without saying, but the question always seems to come up

 

Wrapping up:

This is by no means a complete list, but it gives a general outline to get you started.  I have a bunch of documentation, so feel free to hit me up with questions.  I attached some pics of an 8-node beast of a system I will be helping tune in January: 126TB usable of all-flash nodes (Toshiba 4TB) with dual Infiniband FDR ports, built on Dell R630 servers.