Friday, 16 October 2009

Fighting Bad Memories: The Stressful Application Test

We've just released Stressful Application Test (or stressapptest), a hardware test used here at Google to test a large number of components in a machine. The test tries to maximize random traffic to memory from the processor and disks, with the intent of creating a realistic high-load situation. The source code is available under the Apache license.

stressapptest may be used for various purposes:
  • Stress test for machines.
  • Hardware qualification and debugging.
  • Memory interface test.
  • Disk testing.


The stressapptest team (from left to right): Matthew Blecker, John Huang, Raphael Menderico, Nick Sanders, John Hawley and James Vera

Photo credit: Taral Joglekar


stressapptest is a user-space test, primarily composed of threads doing memory copies and direct I/O disk reads and writes. Many hardware issues reproduce infrequently, or only under corner cases, so the idea behind the test is to maximize bus and memory traffic: more transactions mean a higher probability that at least one of them fails. The test loads memory with specially designed patterns that cause the signal lines to switch rapidly between 1 and 0, drawing the maximum amount of power and causing maximal noise on the nearby voltage rails. Noise on the voltage rails and coupling with nearby lines are likely to cause signaling problems on marginal lines. Also, given some probability that any signal-level transition fails, these patterns produce the most transitions per unit of time, and are thus more likely to expose a failure.
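The two ideas in the paragraph above (patterns that flip as many bits as possible, and more transactions meaning more chances to catch a failure) can be illustrated with a small sketch. This is not stressapptest's implementation: the real tool is multi-threaded native code with its own pattern set, and Python cannot meaningfully stress a memory bus. The function names and the 64-bit word size below are assumptions chosen purely for illustration.

```python
import array

WORD_BITS = 64                      # assumed word size for this sketch
ALL_ONES = (1 << WORD_BITS) - 1     # 0xFFFFFFFFFFFFFFFF


def alternating_pattern(num_words):
    """Build words that flip every bit on each step: 0x00..0, 0xFF..F, ..."""
    return array.array(
        "Q", (ALL_ONES if i % 2 else 0 for i in range(num_words))
    )


def bit_transitions(words):
    """Count how many bits change between each pair of consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


def copy_and_verify(src):
    """Copy a buffer, then return the indices where the copy differs.

    On healthy hardware the list is empty; a non-empty list would mean
    a word was corrupted somewhere between the write and the read back.
    """
    dst = src[:]  # stands in for the memory copy a worker thread would do
    return [i for i, (a, b) in enumerate(zip(src, dst)) if a != b]


def p_at_least_one_failure(p, n):
    """Chance of seeing at least one failure in n independent transactions,
    each failing with probability p."""
    return 1.0 - (1.0 - p) ** n
```

With 8 words, `bit_transitions(alternating_pattern(8))` counts 7 word-to-word steps of 64 flipped bits each, the maximum possible for that length, while a constant buffer counts zero. Likewise, `p_at_least_one_failure(1e-6, n)` grows toward 1 as `n` increases, which is the reasoning behind driving as much traffic as possible.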

This test was designed to exercise all memory available on a machine, which a CPU-intensive application (for instance, compiling the kernel on multiple threads) is not guaranteed to do. Moreover, it focuses on testing the memory interface and connections rather than the internals of the memory, which is what a test like memtest86 targets. As a consequence, Stressful Application Test can detect errors that regular memory tests or extended runs do not. A comparison with some other memory reliability tests showed that about 20% of the DIMM-related failures detected on the machines tested were detected only by Stressful Application Test, and that it reported 70% of all DIMM errors detected by all tests combined.

We hope this software will be useful to system administrators who need to diagnose and repair faulty DIMMs or other components. We look forward to your questions and feedback in our discussion group. Happy hacking and may your testing be less stressful!
