In this article I am going to attempt to thread together a few developments in computer hardware and software history and relate those to a set of technology that is available today for doing analysis of large data sets with a very simple architecture. This should be of interest to anyone working with big data, analytics and the systems that are required to support those capabilities. We are going to cover some history of data science and two new technologies from Microsoft and EMC.
Data Science is a relatively new profession that combines the best of applied statistics and computer science. Data scientists are engaged in an effort to better understand the information buried in the very large data collections made possible in the digital world. For more background on data science, see the Forbes online article titled A Very Short History Of Data Science. What is more important for this article is the realization that many IT organizations are struggling to provide robust environments for data scientists to work in. This is not a new problem. Data analytics has been driving hardware and software developers to provide more capability with less complexity since the introduction of vacuum tube computing machines. A UNIVAC I computer with 5,200 vacuum tubes was used to successfully predict election results in 1952, long before Nate Silver of the FiveThirtyEight blog was born.
The pioneers of data analytics had to deal with all aspects of the hardware as well as writing the software they used. The need to make data analytics accessible to a broader audience of statisticians and scientists, many of whom were not proficient in software programming, provided a large commercial software market opportunity for the rapidly growing "mainframe" computer sector in the 1960s. Many of the most popular tools used by data scientists today, including SAS and SPSS, were first developed for mainframes in the mid-to-late 1960s.
Immediately following the introduction of personal computers in the early 1980s, these same software developers, together with a host of new startups, rushed to introduce commercial software for data analysis and statistics on PCs. The early versions of the DOS operating systems developed by Microsoft presented major challenges for mainframe software developers. Two important innovations widely available for mainframes were missing on the PC platform: 1) systems with multiple processing cores coupled with multi-threaded operating systems and 2) virtual memory management. These two features are still very relevant to the way we do analytics for big data today, which leads us to why Microsoft R Server together with EMC DSSD D5 is an interesting solution. First I want to cover a little bit of R background and then bring the pieces together to explain why I'm excited about this integration.
If you don't work in the field of data analytics, you may never have heard of the R programming language, despite the fact that it is the most frequently used analytics/data science software according to the 2016 KDnuggets Software Poll. If you have now, or are going to have, a significant data science initiative, chances are there is going to be a need for infrastructure to run R analytics.
There are a couple of factors that make R so popular. First, R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS (see The R Project for Statistical Computing for more details). R functionality is organized into packages. There are approximately eight packages supplied with the base R distribution. Since R is also an open source project, anyone can write and contribute packages to the R project. The Comprehensive R Archive Network (CRAN) currently lists 8,820 available packages in the repository. This is the second reason R is so popular: it's not about the syntax, it's the ready availability of packages for almost anything you can imagine needing.
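To make this concrete, here is a small sketch of the package model described above: contributed packages are installed once from CRAN and loaded per session, while a lot of day-to-day statistics needs nothing beyond the base distribution. The choice of 'data.table' is just an illustrative example of a popular CRAN package.

```r
# Install a contributed package from CRAN (one time), then load it.
# 'data.table' is used here purely as an example of a popular package.
install.packages("data.table")
library(data.table)

# Base R already covers common statistics with no extra packages:
x <- rnorm(1000)   # 1,000 draws from a standard normal distribution
summary(x)         # five-number summary plus the mean
hist(x)            # quick histogram from the base 'graphics' package
```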
The downside to R relates back to the background I provided above: the vast majority of R packages are single-threaded and lack virtual memory management. For more details on this, refer to the R for Windows FAQ. There are many active projects under way in the R community "related to pushing R a little further: using compiled code, parallel computing (in both explicit and implicit modes), working with large objects as well as profiling" that are documented on the R Project website (see High-Performance and Parallel Computing with R). There are downsides here as well. These newer packages, for the most part, do not bring together parallel algorithm implementations and virtual memory management. That integration is clearly evident in the Microsoft R Server product.
Microsoft R Server is built using technology that Microsoft acquired in 2015 when they purchased Revolution Analytics. Here is a link to a one-hour Channel 9 recorded presentation on the value of the acquired intellectual property. For the purpose of finishing up this discussion, I want to talk about just the RevoScaleR package. This package couples parallel algorithms with virtual memory management. The virtual memory management relies on a special binary file format (XDF) for storing data. Large data sets get imported into XDF files prior to performing analysis. The file format provides very fast access to a selected set of rows and/or a specified set of columns, and new rows and columns can be appended to the file without rewriting the entire file. The capabilities and efficiency of the XDF file implementation support the "chunking" functions used by algorithms to move data into and out of memory, and support multi-threaded access. This is a big advantage over open source R's constraint that all data for an analysis must fit in physical memory at run time. However, we know from experience with RDBMSs that the benefits of chunking data between disk and memory evaporate if the disk subsystem is too slow to meet the demand of highly parallel code running on servers with many processor cores. This is where EMC DSSD D5 arrays fit in the solution.
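The import-then-analyze workflow described above can be sketched as follows. This is a minimal sketch, assuming Microsoft R Server (with the RevoScaleR package) is installed; the file name and column names are hypothetical placeholders, not part of any shipped data set.

```r
# Sketch of the RevoScaleR workflow (requires Microsoft R Server;
# "flights.csv" and its columns are hypothetical).
library(RevoScaleR)

# 1. Import a large CSV into the XDF binary format once, up front.
rxImport(inData = "flights.csv", outFile = "flights.xdf",
         overwrite = TRUE)

# 2. Analysis functions then stream the XDF file chunk by chunk
#    across threads, so the full data set never has to fit in RAM.
fit <- rxLinMod(ArrDelay ~ DepDelay + Distance, data = "flights.xdf")
summary(fit)
```

The key design point is step 1: paying the import cost once buys fast, column- and row-selective chunked reads for every analysis that follows.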
DSSD D5 is a rack-scale flash appliance that combines the performance profile of direct-attach flash with the availability and reliability of shared storage. For existing servers and applications like Microsoft R Server, DSSD provides a Block Driver that enables you to use DSSD D5 as a block device. The DSSD Block Driver manages the interaction between the client applications and the DSSD appliance; this is how R Server would access DSSD D5. Microsoft R Server for Linux and DSSD D5 both support 64-bit Red Hat Enterprise Linux 6.x. You can connect up to 48 hosts via multi-path attach to a DSSD D5 storage appliance, giving them access to up to 100 TB of usable persistent flash storage at more than 10 million IOPS and 100 GB/sec of sustained throughput, all at ~100 μsec latency on average. This level of scale and performance in a multi-host shared configuration is perfect for supporting a team of data scientists who need the flexibility to change configurations in response to changing data analysis needs. The tight coupling of direct-attached flash with a single server is not flexible enough for most enterprise big data environments.
- Highly scalable data science systems require software that is multi-threaded and uses virtual memory management coupled with hardware that can support the resulting workload demand.
- The Microsoft R Server RevoScaleR package includes parallel processing implementations of many popular data science algorithms with integrated virtual memory management support.
- EMC DSSD D5 rack-scale flash provides ultra-dense, high-performance, highly available, and very low latency shared flash storage accessed through PCIe Gen3 that leverages NVMe™ technology.
- Microsoft R Server for Linux combined with EMC DSSD D5 provides a powerful, flexible and easy to implement and maintain platform for supporting teams of data scientists working on enterprise big data analytics solutions.
Thanks for reading.
Phil Hummel, EMCDSA