Whipping the SciClone

Living with a complex: Rachel Taylor and Tom Crockett get ready to pull a SciClone node. Taylor has identified anomalies that might speed up the computer cluster.

Anomalies may hold the key to speeding up William & Mary’s scientific computing complex

Combining the power of 159 computers and 475 individual processors, SciClone, William & Mary’s scientific computing complex, is an important resource for the College and a unique feature for a campus this size. Rachel Taylor ’11 has developed a suite of software to monitor the performance of SciClone. Tests performed by this software have led Taylor to believe that SciClone holds the potential to speed up computational time considerably.

Taylor is an undergraduate research assistant working with Tom Crockett, the College’s manager of high performance computing. Crockett points out that speed is one of the driving forces in computational science. Both agree that a faster SciClone would have important implications for the William & Mary research community.

“One thing that computers are really good at is doing a series of single tasks really fast,” explains Taylor. “But, eventually, because of the laws of physics, you can’t make them do those tasks significantly faster. If you want to be able to have your computer do more stuff in a shorter amount of time, you need more computers. You take whatever your computational task is and you divide it up into a lot of little bits. You give one little piece to each computer in the cluster.”

Crockett explains how this general approach, splitting a complex problem into manageable bite-sized pieces, predates the computer. “In fact, before computers were available there were folks whose job title was ‘computer,’” says Crockett. “They would sit there with slide rules or mechanical calculators, and they’d do exactly what we’re doing now with computers. Each of them would work on a different piece of the problem and then they’d put all the results together.”
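The divide-and-combine idea that Taylor and Crockett describe can be illustrated with a small sketch. This is not SciClone code: the function names are hypothetical, a thread pool on one machine stands in for the nodes of a cluster, and the "task" (a sum of squares) is chosen only because it splits cleanly into independent pieces.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Divide a list into n_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The work done on one piece of the problem."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, n_workers=4):
    """Hand one piece to each worker, then put all the results together."""
    chunks = split_into_chunks(items, n_workers)
    # Assumption: threads stand in for cluster nodes in this illustration.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    return sum(partial_results)
```

Each worker sees only its own piece, and the final answer is assembled from the partial results, much as the human "computers" combined their separate calculations.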

Wide range of projects

SciClone is available to anyone on campus who needs it. “Anybody that has a worthy project gets time to run it and is supported by the College and by grant funding,” explains Crockett. “We have a very wide range of projects, and it’s constantly changing. Our biggest users recently have been the Virginia Institute of Marine Science and applied science. Over a period of more years, the physics department has been one of the heaviest users.” Other users include mathematics, computer science, psychology and economics.

A math major, Taylor created monitoring software to evaluate the speed at which nodes (individual computers within the complex) can relay messages back and forth. She explains that most of the time, nodes will relay a message back and forth at the same rate, regardless of the individual node or the message being sent. However, Taylor discovered that sometimes the message is delayed. Interestingly, when a delay occurs, it is always by the same amount.

Perhaps more significantly, sometimes the messages are communicated faster—again, always by the same amount.
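One way to picture this kind of analysis is a routine that compares each measured message time against the typical value and flags the outliers. The sketch below is an assumption about how such a check might look, not a description of Taylor's software: the function name, the use of the median as the baseline, and the 5 percent tolerance are all hypothetical.

```python
from statistics import median


def classify_timings(times, tolerance=0.05):
    """Label each round-trip time as 'typical', 'slow', or 'fast'.

    The baseline is the median of all measurements; a time counts as an
    anomaly when it differs from the baseline by more than `tolerance`
    (a fraction of the baseline). Both choices are illustrative assumptions.
    """
    baseline = median(times)
    labels = []
    for t in times:
        if t > baseline * (1 + tolerance):
            labels.append("slow")
        elif t < baseline * (1 - tolerance):
            labels.append("fast")
        else:
            labels.append("typical")
    return labels


# Made-up timings (arbitrary units), for illustration only.
example_times = [10.0, 10.1, 9.9, 12.0, 10.0, 8.0]
example_labels = classify_timings(example_times)
```

With the made-up values above, the fourth measurement is flagged as slow and the last as fast, while the rest fall within the tolerance band around the median.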

“When everything is working, they’re all the same, all close to the peak performance,” Taylor says. “But, sometimes they’re slower, just a little bit. But, a significant enough amount that we’re wondering: ‘Hey, why is this happening?’ And then, sometimes, they will be faster, which is really weird.”

“The question is why?” says Crockett. “An even more interesting question is: How can we get them all to go fast? All the time.”

Taylor is devising a series of experiments to figure out the causes of both the fast and the slow anomalies. Her ultimate goal is to figure out a way to make the fast anomaly the default running mode.

“A lot of projects run for days on end,” continues Crockett. “Some VIMS applications will run for ten to fifteen days. If it’s running for ten days and you can get your results back a day earlier, that’s pretty helpful.”

It’s always in use

There is one big problem: Examining SciClone is a challenge, because SciClone is constantly being used. “It can be hard sometimes to do controlled experiments because the system is always in use,” explains Taylor. “If other people are doing stuff on the system, we can’t just kick everybody off.”

Crockett agrees: “The system is just so busy now, we don’t have the luxury to do pristine experiments which is what you would like to do if you were doing a real scientific study. We’re just trying to understand the behavior of the system.”

Ultimately, Taylor’s results will enhance our understanding of SciClone—and perhaps other cluster computer systems.

“You like to think that computers are deterministic,” Crockett said. “We have this collection of identical hardware and you run the same experiment ten times. You would like to think that you would get the same performance ten times, and we’re not. And that’s what is making us unhappy.”