OneFS MRs and How We Drive Continuous Improvement with QA

NOTE: This topic is part of the Uptime Information Hub.

 

[Photo: Todd Dillon]

Todd Dillon, Senior Director of Software Engineering, brought the Uptime Bulletin to EMC Isilon after twelve years of experience in the Symmetrix and Storage Managed Service division of EMC. In this article, he talks about OneFS quality assurance testing and how Engineering determines what goes into a OneFS maintenance release.

 

Like all other product business units at EMC, Isilon uses the EMC Total Customer Experience (TCE) process to triage every customer-impacting OneFS event that occurs worldwide each day, whether it is data unavailability, data loss, a performance issue, disaster recovery unavailability, or another high-impact event. We filter and track anything that affects the customer workflow.

 

The TCE process is our overall measure of quality improvement. Isilon Engineering measures and monitors the TCE process, and the EMC Total Customer Experience corporate team, acting as an impartial third party, tracks how our releases are improving over time. The TCE team owns the metrics that Engineering reports to senior EMC executive management. Therefore, while many action items are driven by Isilon Engineering and Global Support, the corporate TCE team acts as our unbiased auditing arm to help ensure that when Isilon says we are getting better, you can be sure that we are.

 

If You Can't Measure It, You Can't Manage It

 

We measure our quality based on daily TCE triage. We track exactly which events are affecting our customers at a very detailed level (see the sketch after this list), including:

  • The impacts for every maintenance release (MR), and how each MR performs in comparison to prior releases.
  • The number of bugs, events, and duration of events, by release family, at the component level of the product.
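
As a minimal sketch, not Isilon's actual tooling, here is how daily triage data could be rolled up by release family and component; the event schema, field names, and component names below are assumptions for illustration only:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CustomerEvent:
    """One customer-impacting event from daily TCE triage (hypothetical schema)."""
    release_family: str   # e.g. "7.1.1"
    mr: str               # e.g. "7.1.1.5"
    component: str        # e.g. "SyncIQ", "SmartConnect"
    duration_hours: float

def summarize(events):
    """Roll events up by release family and component: count and total duration."""
    summary = defaultdict(lambda: {"events": 0, "duration_hours": 0.0})
    for e in events:
        key = (e.release_family, e.component)
        summary[key]["events"] += 1
        summary[key]["duration_hours"] += e.duration_hours
    return dict(summary)

# Example: compare how components in one release family are trending.
triage_log = [
    CustomerEvent("7.1.1", "7.1.1.4", "SyncIQ", 3.5),
    CustomerEvent("7.1.1", "7.1.1.5", "SyncIQ", 1.0),
    CustomerEvent("7.1.1", "7.1.1.5", "SmartConnect", 0.5),
]
for (family, component), stats in summarize(triage_log).items():
    print(family, component, stats)
```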


How We Drive Continuous Improvement with QA

[Figure: continuous improvement cycle]
We have a three-phase process for fixing and testing customer-impacting bugs that we find internally (a simple sketch of this gating follows the list):

  • We write a fix and test the functionality of the fix.
  • We reproduce the issue and test the fix against the reproduced issue.
  • We run that fix, along with any other fixes that have gone through this process, through a final two-week certification run.
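
Conceptually, a fix clears all three gates before it ships in an MR. The sketch below is illustrative only; the phase names and gating logic are ours, not Isilon's internal tooling:

```python
from dataclasses import dataclass, field

# The three gates described above; the names are illustrative, not internal terminology.
PHASES = ("functional_test", "reproduction_test", "certification_run")

@dataclass
class FixRecord:
    bug_id: str
    passed: set = field(default_factory=set)

    def record_pass(self, phase: str) -> None:
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        self.passed.add(phase)

    def ready_for_mr(self) -> bool:
        # A fix ships in an MR only after all three phases have passed.
        return all(phase in self.passed for phase in PHASES)

fix = FixRecord("BUG-12345")
fix.record_pass("functional_test")
fix.record_pass("reproduction_test")
print(fix.ready_for_mr())   # False: the two-week certification run is still pending
fix.record_pass("certification_run")
print(fix.ready_for_mr())   # True
```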

 

For every OneFS issue that surfaces at a customer site, we create a field test escape (FTE) record. We then build test automation that covers not only that issue but also similar ones, so that every future software certification run tests new code against prior defects.
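
As a rough illustration (the registry, decorator, FTE ID, and the toy check below are hypothetical, not Isilon's actual framework), an FTE can be pinned as a permanent regression test that every future certification run re-executes:

```python
# Illustrative sketch of pinning a field test escape (FTE) as a permanent
# regression test, so each certification run re-checks the old defect
# against new code.
REGRESSION_SUITE = {}

def fte_regression(fte_id):
    """Decorator that registers a test under the field test escape it covers."""
    def register(test_fn):
        REGRESSION_SUITE[fte_id] = test_fn
        return test_fn
    return register

@fte_regression("FTE-4211")
def test_snapshot_delete_frees_blocks():
    # Stand-in for a real reproduction of the original field issue.
    allocated = [1, 2, 3]
    allocated.clear()          # the behavior the fix is supposed to guarantee
    assert allocated == []

def run_certification():
    """Certification runs every registered FTE regression, old and new."""
    for fte_id, test_fn in sorted(REGRESSION_SUITE.items()):
        test_fn()
        print(f"{fte_id}: pass")

if __name__ == "__main__":
    run_certification()
```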

 

During our eight-week integration testing of a new release, such as OneFS 7.2.0, if we find an area of the code that requires a significant fix, we backport that fix (that is, take it from the newer version of OneFS and port it to an older version) to previous maintenance releases (MRs). This way, we build some of the solid stability fixes from future product releases into the maintenance releases we're working on right now. We always take fixes from newer releases and backport them, as long as the fix does not introduce new OneFS features.
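
Backporting is conceptually the same as cherry-picking a commit in a version-control system. The sketch below is purely illustrative; OneFS branch names, tooling, and the commit hash are invented, and it assumes it is run inside a Git checkout:

```python
import subprocess

# Hypothetical sketch only: real backports also get review and retesting.
FIX_COMMIT = "abc1234"
MR_BRANCHES = ["onefs-7.1.1-mr", "onefs-7.0.2-mr"]

def backport(commit, branches):
    for branch in branches:
        subprocess.run(["git", "checkout", branch], check=True)
        result = subprocess.run(["git", "cherry-pick", commit])
        if result.returncode != 0:
            # A conflict means the fix needs to be hand-ported for this MR.
            subprocess.run(["git", "cherry-pick", "--abort"])
            print(f"{branch}: manual backport required")
        else:
            print(f"{branch}: fix {commit} backported")

if __name__ == "__main__":
    backport(FIX_COMMIT, MR_BRANCHES)
```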

 

OneFS QA and How We Determine What Goes into an MR

 

What goes into an MR is determined partly by issues and events found during our internal testing and partly by issues discovered at customer sites. Any bug found in the field and fixed through a patch is integrated into subsequent MRs, so that there is parity across MRs and customers are able to upgrade.
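
As a toy illustration of that parity goal (the bug IDs below are hypothetical), the check amounts to confirming that every fix delivered in a field patch is also present in the next MR:

```python
# Illustrative parity check: every bug fixed by a field patch should also be
# fixed in the next MR, so customers can upgrade without losing fixes.
field_patch_fixes = {"BUG-101", "BUG-117", "BUG-142"}
next_mr_fixes = {"BUG-101", "BUG-117", "BUG-142", "BUG-155"}

missing = field_patch_fixes - next_mr_fixes
if missing:
    print("MR is missing patch fixes:", sorted(missing))
else:
    print("MR has parity with all field patches; safe upgrade path.")
```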

 

Ultimately, our MRs are intended to drive broad stability across the entire install base and give customers the confidence to move to a new MR, knowing they are taking not a new feature release, but a release that has full QA accountability: for issues seen in the field, for issues we found during our internal testing, and for valuable insights drawn from future releases.