Training for Ultrascale Computing: Q&A with Donald Frederick of the National Center for Computational Sciences
Jun 30th, 2009
The Cray XT Jaguar, the world’s fastest computer for open research, with a peak performance of 1.6 quadrillion calculations per second (1.6 petaflops) and more than 150,000 processors, is a cutting-edge resource for scientists. Jaguar is operated by the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL), but scientists can use the system from their home research institutions. In 2008, through the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program, the U.S. Department of Energy allocated processing time to 38 scientific research projects that require Jaguar’s speed to solve complex problems ranging from the mysteries of protein folding to the somewhat unpredictable physics of fusion.
Most scientists who are allocated time on Jaguar have some knowledge of parallel computing (computing with many processors), but they still have questions about how to effectively use supercomputers with thousands to hundreds of thousands of processors. The NCCS offers workshops for users of high-performance computing systems to acclimate them to Jaguar’s mammoth scale and to the way the center as a whole functions.
We asked Donald Frederick, NCCS training coordinator for high-performance computing, to discuss such workshops and what they offer Jaguar’s users.
Tell me a little about the workshops.
Frederick: The workshops take place at or near ORNL for 3 to 4 days a few times a year. Some of the people who register work at ORNL, others at the University of Tennessee, but many are users from outside the Oak Ridge area who travel in for the workshops. We try to assist users by giving them detailed information about the features of Jaguar that will help them get the most out of using the machine.
Who assists users during a workshop?
Frederick: When users come here they get the chance to speak directly with experts from the NCCS. Users also get to talk to experts from Cray, the company that built the high-performance computing system, about the details of the Cray operating system, as well as to the compiler vendors and to Advanced Micro Devices (AMD), the chip manufacturer. This center has a rather unique setup because we assign computational scientists as liaisons to our users. The liaisons are scientists in fields including biology, physics, chemistry, and applied mathematics who use supercomputers to do their own research, so they understand both the science and how the supercomputer operates. They primarily work with the developers on each project to make the codes function better or to solve whatever other problems the developers might have. At the workshops users can meet their liaisons face to face.
What is a typical workshop day like?
Frederick: The workshops are organized around morning presentations, with talks from experts at the institutions that have contributed to the supercomputer. These include the Department of Energy’s NCCS; the National Institute for Computational Sciences, the National Science Foundation institute that operates Kraken, the most powerful academic supercomputer; the Portland Group Incorporated, a company that makes compilers to translate computer source language into programs that people can use; AMD, the company that makes computer processor chips; and, of course, Cray.
In the afternoons we have hands-on sessions. If new users haven’t been running on the system, they’re encouraged to bring their codes and get them compiled for the system. They may even be able to port their programs to Jaguar if there’s time. Compiling a code, even if it is complex, takes a relatively short period of time.
How much knowledge of supercomputing does a scientist usually have when starting out on Jaguar?
Frederick: It varies. Our user groups tend to have more knowledge than the average research group. Jaguar users are scientific researchers who require large-scale computing, so their projects have been running at computing centers with thousands of processors for a few years before they get to this level. They have already written mature codes to compute their data.
You don’t write a Jaguar-type code in an afternoon. It takes a while. Writing a code is often a multiyear effort that takes many people.
We assume that the project developers already know the basics of parallel computing. What we’re helping them with are the specifics of how to get the best performance for their applications on this particular computer architecture. This system has unique features. For instance, the best way to compile a code depends on whether it will run on the Cray XT4 or the XT5. The XT4 system was upgraded from 119 to 263 teraflops with 36,000 processing cores in June, and the first petascale machine, the XT5 with 150,000 processing cores, was added in November. By comparison, most personal computers have only two processing cores. Together, the two high-performance machines make up the 1.6-petaflop Jaguar XT.
What are some common questions researchers ask?
Frederick: One of the common questions is how to get the best performance out of either the CPU (central processing unit), which, in our case, is an AMD quad-core chip, or out of MPI (Message Passing Interface), meaning “How can I get the processors to efficiently communicate information to one another?” Also, input/output (I/O), or the process by which data is written to disk, is becoming increasingly important. What appears to be a simple task when it involves a small number of cores and a small number of disks can become a challenging one when you’re dealing with more than 100,000 cores and thousands of disks. In the last workshop, we gave a talk with tips on how best to use I/O. When you’re using tens of thousands of processing cores and up to thousands of disks, there are some ways of doing it that are better than others.
Users can try out different methods during the workshop. It gives them avenues to explore in the future when they have more time. I can certainly say that I’ve seen some people really step up their work after they’ve gone through the workshop. They are able to run larger and more complex jobs on Jaguar.
Will any changes be made to the workshops to help with user outreach?
Frederick: We might do more web-based training. At the last workshop we used Live Meeting, a Microsoft program that links video and document files online. We set up a conference call with some of the Cray presenters and showed their slides over the internet. We could hear them, they could hear us, and they could still answer questions immediately. We’re probably going to do more workshops like that. That would allow people to get the benefit of live training without the hassle and expense of traveling.
Where can users find more information?
Frederick: If users visit the NCCS website at www.nccs.gov, they can browse upcoming workshop dates and topics under Training and Education under the User Support tab on the homepage. We have also provided instructional videos and links to websites that explain different aspects of parallel computing.
— by Katie Freeman
Katie Freeman is a science writing intern with the National Center for Computational Sciences.

