Building Software and Personnel for the Innovative Use of ICT in the Next Decade
Submitted by fherwig 2010–07–14 08:56:06 EDT
Theme(s): Building Digital Skills
Summary
To increase the power of ICT to drive innovation in our country, Canada must ensure that there is a community of highly qualified personnel who can turn computer rooms full of equipment into finely-tuned machines for driving discovery, mining enormous databases for new knowledge, or powering the newest web services.
The skills required to make use of new, enormous computational resources for innovative research, whether by analyzing torrents of data or by crafting computational models of unprecedented fidelity, will be the foundation for innovation across the economy; these skills are highly transferable, and increasingly valuable.
By developing teams of people focused on using Canada's present and future Compute Canada centres to their fullest in developing computational tools for particular problems, we have the opportunity to ensure a constant stream of personnel with the most current and advanced digital skills, skills that transfer immediately to fields from engineering and business analytics to digital infrastructure management and disruptive e-services. And by focusing that training on particular research problems of national importance, we gain, as a byproduct, tools that our researchers can use to drive innovation in those problems for years after the training is completed.
In recent years, Canada has begun increasing its investment in large-scale computing hardware, and to some extent in the development of highly qualified personnel (HQP) around such systems. Many of Canada's top researchers have been able to make immediate use of such systems through the community codes of their discipline, that is, publicly available research software around which a community develops. The availability of such community codes can not only radically accelerate research but actually shape the direction of a field, much as large-scale experimental facilities determine what measurements are possible at any given moment.
Such codes are seldom written in Canada.
Developing such advanced software takes time and a breadth of expertise quite different from the skills needed to do the research that the software enables, and this is only becoming more true as hardware becomes more specialized. Without a national program pushing the development of these community codes, both for simulation science and for massive data analysis, our researchers lose important leverage in setting an international research agenda.
We outline here a framework for developing personnel with the skills needed to use large-scale computing in many areas of industry and academia, by building computational teams to develop tools in particular areas of national research importance. As a pilot, we propose such a team for astrophysics: a strongly interdisciplinary research field with an enormous history of computational achievement, and one in which Canada routinely punches well above its weight. If the pilot is successful, teams for a variety of research programs determined to be in the national interest could be developed and deployed, building HQP and enormously powerful research tools in those areas simultaneously. US researchers are already beginning to call for similar efforts.
We propose funding several small interdisciplinary teams of lead researchers and junior researchers at various stages of their careers, down to undergraduate students, housed at or near universities that have made, or are about to make, a commitment to programs in computation-powered research, whether in applied math, computational science, or modeling science. These teams can, while training computational HQP, focus on building tools that have been deemed high priorities by their communities.
Submission
1. Codes as Tools
The Canadian computational landscape has changed remarkably in the past decade; Canadian researchers in academia and industry now have access to much larger data sets and computational power, and there is some level of staff support for these new large facilities (for an overview, see Dursi et al., 2010; Compute Canada, 2010).
This white paper aims to build on these unquestioned successes by using them to train HQP at the current cutting edge of large-scale computing ('supercomputing', or High Performance Computing, HPC) on problems deemed suitably important, and in doing so to generate tools that will greatly empower Canadian researchers.
The need to develop and increase funding to a new generation of simulation and data–analysis tools has been long recognized internationally, e.g. in the NSF report Computation As a Tool for Discovery in Physics1 that was the result of an expert workshop almost 10 years ago:
Simply put, we must move from a mode where we view computational science as an applied branch of theory to a mode where its true resource needs as a distinct research mode are recognized. Concretely, this means providing support for building the infrastructure (software) of computational science at levels commensurate with their true costs, just as we support construction and operation of experimental facilities.
The last decade has seen much-needed investment in the hardware platforms for scientific computing. There has not been, however, a concomitant increase in funding for developing the tools that will continue to make full use of these new engines for innovation. The Compute Canada consortia have support staff (technical analysts) who have been game-changers in the Canadian computational landscape, enabling transfer of expertise to researchers who would not on their own have been able to take advantage of these new tools; but we currently have no mechanism in place for developing a constant stream of new HQP at this level, or for making use of them in a focused way to solve particular problems.
2. Writing professional scientific code requires professionals
So, what is so different about building simulation and application codes, compared to the regular research activities of NSERC Discovery Grant funded researchers or knowledge workers in industry?
Building scalable, efficient simulation codes, or tools for analyzing torrents of data from new sources, requires expertise in the domain science and the mathematical techniques used therein; expertise in the basics of numerical methods (approximation with finite precision, domains of stability, convergence and consistency, validation and verification); knowledge of the current state of the art in numerical methods for all the mathematical operations used in the code; knowledge of the computer architecture features necessary for efficient computation (cache architecture and performance; vectorization/SSE operations); techniques for efficient parallel computation (shared- and distributed-memory, and increasingly heterogeneous and hybrid architectures such as GPGPU and the IBM Cell); and the basics of software engineering (unit testing; version control; software architecture; refactoring). This represents a volume and complexity of expertise that requires a team effort and sustained support to build up.
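To give one small, concrete illustration of why this expertise matters, the sketch below (our own illustrative example, not drawn from any of the codes discussed later) shows how a naive single-precision sum of twenty million ones stalls at 2^24, how a compensated (Kahan) summation recovers the correct answer, and how a one-line test catches the failure. Issues of exactly this kind, multiplied across stability, convergence, cache behaviour, and parallel decomposition, are what make a team effort necessary.

    /* C sketch: finite precision and a minimal unit test. */
    #include <assert.h>
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const long N = 20000000;             /* the true sum is 2.0e7 */

        /* Naive single-precision accumulation: once the running total
           reaches 2^24 = 16777216, adding 1.0f no longer changes it. */
        float naive = 0.0f;
        for (long i = 0; i < N; i++)
            naive += 1.0f;

        /* Kahan compensated summation carries the lost low-order bits
           along in a correction term and recovers the full sum here. */
        float sum = 0.0f, comp = 0.0f;
        for (long i = 0; i < N; i++) {
            float y = 1.0f - comp;
            float t = sum + y;
            comp = (t - sum) - y;
            sum = t;
        }

        printf("naive = %.1f   compensated = %.1f\n", naive, sum);

        /* A one-line 'unit test' of the kernel, checked to a tolerance. */
        assert(fabsf(sum - 2.0e7f) < 1.0f);
        return 0;
    }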
3. Simulation Tools: A case study of the importance of community codes to astrophysics modeling
But when quality software is developed and publicly released, and subsequently earns the trust of the community, it can start a chain reaction which greatly contributes to simulation and theory research in its sub–field. We use here as an example theoretical astrophysics, which we've argued is an ideal pilot area for the proposed project.
The MESA stellar evolution code, presently pushed forward by a small team around a privately funded master programmer (Bill Paxton), is an example (Paxton et al., in prep).2 Over time such community codes become used by a growing community of scientists. Any serious bugs get shaken out, or, more colloquially, "given enough eyeballs, all bugs are shallow" (Raymond, 1999). These codes then increase the efficiency of groups in solving science problems, thereby addressing the often-stated need of observers to have better models and simulations at their disposal to interpret observations. The community code becomes a platform on which to build extensions by adding new physics or methods, and these extensions in turn get fed back to the rest of the community.
The resulting impact of community codes can be characterized by looking at their use. One of the authors keeps track of computational astrophysics papers, particularly simulation papers, on astro-ph; in the four years from 1 April 2006 to 30 March 2010, there were approximately 3,435 of these. Using the ADS, one can count the citations over that same period of the code papers for some of the most-used community codes; these citations usually represent papers that use the code. The citation counts include Gadget-2 (715; Springel, 2005), Cloudy (483; Ferland et al., 1998), FLASH (235; Fryxell et al., 2000), PKDGRAV (98; Stadel, 2001), Ramses (95; Teyssier, 2002), Zeus (59; Hayes et al., 2006), and Enzo (42; O'Shea et al., 2004). There are some issues with comparing these numbers: the simulation paper count may be an undercount, but it also includes some non-simulation papers, such as papers on new algorithms for data analysis; and the citation counts of the code papers certainly overcount, though probably only modestly, the number of papers using the codes. If we nonetheless take these numbers at face value, the implication is that the top three community codes are responsible for roughly 40% of simulation papers during that period, rising to roughly 50% if one includes the next three. Given the diversity of objects and physics that comprise the field, this is remarkable, and remains so even if a more careful accounting adjusts the percentages downwards somewhat.
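Taking the citation and paper counts above at face value, the arithmetic behind those percentages is simply

    (715 + 483 + 235) / 3435 = 1433 / 3435 ≈ 0.42
    (1433 + 98 + 95 + 59) / 3435 = 1685 / 3435 ≈ 0.49

i.e. roughly 40% of the simulation papers for the top three codes, and roughly 50% for the top six.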
That trusted community codes are so often used should not be surprising; once a code becomes accepted within the community as a good way to do certain types of simulations, it becomes a go-to approach for researchers. Indeed, the very existence of a well-tested, freely available code to perform a certain task may encourage certain lines of research, while other, possibly equally or more important, research directions simply languish for lack of a comparably available tool. Experience also shows that once a certain seed investment has been made in a computational tool, community-driven growth eventually becomes self-perpetuating.
Therefore, building or even seeding a successful community code can actually shape research patterns; it provides a simulation tool that is not only less work for researchers than stopping other projects to write their own, but generally of higher quality as well. When it comes to research tools, "better, for cheaper" is something the astronomical community in Canada should, as a matter of policy, encourage and fund.
4. Analyzing the new floods of Data: The importance of community codes to astronomical data analysis
Besides the prospects for theory, astronomy faces the equally exciting and terrifying prospect of enormous floods of very different kinds of data from a variety of new observatories. This is not unique to astronomy: business analytics is another field struggling to make use of previously unexplored streams of data, and in genomics and proteomics, next-generation sequencing is producing the data flood. New web services such as Facebook or YouTube consist entirely of such floods of data, which must be handled and which are rich sources of enormously valuable information if they can be analyzed.
Community codes play an important role on the observational side of astronomical research. As pointed out in a white paper for the US decadal survey (Weiner et al., 2009), several software packages or libraries that have been released to the community "have enabled easily as much science as yet another large telescope would have, at considerably lower cost". These can take the form of complete data reduction packages such as the venerable IRAF, AIPS, or MIRIAD; photometry and analysis packages such as DAOPHOT (by Peter Stetson at DAO), SExtractor, and GALFIT; and various libraries used to build packages, such as FITSIO, PGPLOT, the IDL Astronomy Library, and IDLUTILS. Many of these packages are used so routinely that it is almost impossible to track their usage in the literature.
5. Industrial/commercial implications
Systematically building up and nurturing a Canadian simulation and application code capability in astronomy would likely have HQP training implications beyond the immediate needs of the astronomy community. Some of our theory and simulation graduate students and post-docs end up using their computational skills in industry and commerce. Again employing the instrument-building metaphor: the skills and training involved in code building and maintenance have the same potential for technological spin-offs to the private sector as the high-tech skills and capabilities whose development is viewed as a desired and essential side effect of many instrument-building funding decisions.
From the experience of one post-doc at UVic we know that scientists with experience in mathematical modeling and numerical simulation of multi-physics processes are a sought-after human resource in the research and development groups of companies that need to solve problems, or use methods, similar to those traditionally met in physical sciences such as astrophysics. Examples of companies that this particular post-doc interviewed with include Maplesoft (a software company in Waterloo, Ontario), which has a contract with Toyota for mathematical modeling of cars and needed a person who could solve a complex system of PDEs using their computer algebra software product Maple (similar to Mathematica). The Automotive Fuel Cell Cooperation (AFCC) in Vancouver wanted to hire a simulation scientist to model the various physical and electrochemical processes in the hydrogen fuel cells it is developing for cars in cooperation with Ballard, Daimler, and Ford. General Fusion Inc. in Vancouver was looking for a computational plasma physicist to perform magneto-hydrodynamic simulations of the collision and interaction of two compact plasma configurations for a future thermonuclear reactor. Similar positions were recently advertised, e.g. with Atomic Energy of Canada Limited (AECL) and U.S. Steel. Another post-doc, who worked in the same simulation group at Los Alamos as one of the authors, was hired by a German subsidiary of Siemens to build a small research group doing hydrodynamic simulations of industrial fans. These examples show that the same technological and industrial benefits commonly associated with instrumentation engineering projects can also be associated with appropriately funded and organised simulation code development and maintenance projects. In fact, we argue that an astrophysics program in this area could address the underserved human resource needs of some Canadian high-tech companies in the coming decade.
6. The increasing difficulty of developing scalable codes for large data/large computing
Building a high-quality, correct, reliable piece of scientific software that others can use for their problems, even one that runs only on the desktop, is difficult. Increasing a problem's complexity by 25% can increase the complexity of the software needed to solve it by a factor of two (Woodfield, 1979). Building software to be reusable by others is roughly three times more time-consuming than writing something that will only be used once (Thomas, Delis & Basili, 1997), but if the software is to be used by more than three people, it is obviously worth it. Similarly, 40%-80% of software development cost is maintenance (Boehm, 1973). But for a piece of software to be useful to the community, it must be general enough to apply to many problems, reusable, and maintained.
Truly cutting edge scientific software that can take advantage of modern parallel computers, however, is still harder. "There is widespread agreement that trends in both hardware architecture and programming environments and languages have made life more difficult for scientific programmers" (Graham, Snir, & Patterson, 2004).
The most obvious increase in complexity is in parallel computing. The primary library for parallel programming on large clusters is the Message Passing Interface (MPI), but this is an extremely difficult way to program complex parallel codes. Programmers who are new to parallel programming show a very wide spread in development times (a factor of 10!) when developing parallel MPI code from scratch, even at a fixed level of quality; the spread between novice and experienced programmers is presumably even greater (Almeh, 2007). Even in a mature, parallelized code, the MPI sections require constant work; apparently roughly 25% of the maintenance of FLASH is done on the MPI parts (Hochstein, Shull, & Reid, 2008).
Even for simple problems, OpenMP, a shared-memory approach widely viewed as much easier than MPI, takes only about 60% less effort than MPI, at the cost of being impossible to scale to large clusters (Hochstein, 2006).
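As a rough illustration of the difference in programming models (a sketch of our own, not taken from any of the codes discussed in this paper), the two small programs below compute the same simple reduction. The MPI version must manage process start-up, data decomposition, and explicit communication; the OpenMP version is a single directive on the serial loop, but runs only within one shared-memory node.

    /* MPI version: explicit start-up, decomposition, and communication. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long N = 100000000;
        double local = 0.0;
        for (long i = rank; i < N; i += size)   /* each rank sums a strided slice */
            local += 1.0 / (1.0 + (double)i);

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %.6f\n", total);
        MPI_Finalize();
        return 0;
    }

    /* OpenMP version: one directive on the serial loop, shared memory only. */
    #include <stdio.h>

    int main(void) {
        const long N = 100000000;
        double total = 0.0;
        #pragma omp parallel for reduction(+:total)
        for (long i = 0; i < N; i++)
            total += 1.0 / (1.0 + (double)i);
        printf("sum = %.6f\n", total);
        return 0;
    }

Real simulation codes are dominated not by such clean reductions but by irregular communication, load balancing, and parallel I/O, which is where much of the difficulty described above lies.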
But the true challenge of the coming decade for efficient simulation and application code development will be the increasing heterogeneity of computing hardware. Even now a single compute node may have 8-32 processing cores, suggesting different parallelization techniques within a node versus between nodes. In addition, accelerators such as GPUs within a node are becoming more common; the IBM Cell architecture is another example of a hybrid compute node.4 This shift to heterogeneous architectures makes programming even more difficult: for instance, the second biggest computer in the world has no fewer than three types of processors, and using it to its fullest requires using all three.
None of these hurdles is insurmountable, but it is becoming simply impossible for one scientist to be at the leading edge of all aspects of simulation science at once, or to write high-quality, correct, reliable software from scratch for the current crop of supercomputers while staying on top of their discipline. To employ once more our instrument metaphor, we have to implement in the computing arena the kind of collaboration and division of labor that is well established between observers and instrument builders in astronomy.
In some other countries the era of every researcher using home-grown software is mostly gone; take for example the US, with its long-established system of national supercomputing centers that serve at the same time as nucleation points for innovative code development, complemented by the DOE SciDAC programs and, previously, the ASC program. Canada should follow those examples and enter an era of professionalization of the scientific software community, in which scientists can rely increasingly on community codes.
7. Case Studies: What it takes to develop a community code
To understand what is involved in developing a community code, we briefly discuss the development history of the top three simulation codes by citations, listed above. The three are very different types of code, with quite different development histories and distribution models, but all demonstrate that building a successful community code requires amounts of work measured in person-years, not months; extensive documentation; continuous maintenance; test cases so that users can ensure the code is working properly on their systems; and that the code must have been written to address a real need to perform specific research.
7.1 Gadget–2
Gadget-2 is an SPH (smoothed particle hydrodynamics) code which evolves ideal-gas hydrodynamics under the effect of self-gravity, as well as gravity-only dark matter dynamics. It includes two gravity solvers: one purely tree-based, and one that also involves a (possibly periodic) mesh. It consists of 12,500 source lines of code (i.e., omitting comments and blank lines), and comes with 46 pages of documentation, the 30-page code paper, and data for most of the test cases covered in the paper.
Gadget-2 was developed essentially single-handedly by Volker Springel, at the time at the MPA in Garching, and was written for his own requirements in performing galactic and cosmological simulations. The coding, testing, and writing of documentation required several person-years of labour by someone already well experienced in the area; the original Gadget-1 was produced during Springel's PhD work. The development was supported by the unusually independent and long-term nature of postdocs at the CfA and MPA. As is often the case, the author published several results with Gadget-2 before widely releasing the code, striking a balance between publicly releasing the code and still benefiting scientifically from the development (e.g., not getting 'scooped' by users of the code he wrote). Gadget is distributed with source code for reading results into various packages (IDL, sm, VTK) for analysis, and recently a third-party package has started to include support for visualizing results.
Gadget has a 'one-way' code distribution mechanism: the essentially complete Gadget-2 code was released once finished, and it has been updated only with very minor bug fixes since. A community of users has built up around it; many users of Gadget add their own subgrid models or other physics, sometimes making these more widely available to the community, but usually not, partly because of proprietary concerns and partly because of the time required to maintain and support publicly released code. The author and his collaborators continue to develop a proprietary version with additional physics. There is a community mailing list where users post questions, which are often answered by other users.
7.2 Cloudy
Cloudy is a spectral synthesis / photoionization simulation code: it calculates the spectrum of material with a given density/composition/temperature profile. Cloudy has been actively developed since 1978; it started in early versions of FORTRAN and has since been rewritten in C++. It currently consists of 142,000 source lines of code, with 500 pages of documentation extending over four volumes. In addition to the usual amount of work that goes into a scientific code, the Cloudy developers have also by necessity spent a great deal of time curating a database of atomic rates and line widths that are essential inputs to the code; they provide this database in a format which can be used fairly easily by other codes.
Estimating the amount of effort that has gone into a code over three decades is almost impossible, but simple software engineering models (Wheeler, 2004) suggest that several dozen person-years would be required, which seems immediately plausible given the length of time and the number of developers involved; there are typically three or so active developers at any given time. The NSF has funded Cloudy development continuously for 29 years; this sort of stable funding for computational work is almost impossible to imagine in the current Canadian astrophysics funding model.
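As a rough check, assuming the basic 'organic-mode' COCOMO relation underlying Wheeler's estimates, in which effort in person-months scales as 2.4 × (KSLOC)^1.05, Cloudy's 142,000 source lines give

    2.4 × 142^1.05 ≈ 2.4 × 182 ≈ 440 person-months ≈ 36 person-years,

consistent with the several dozen person-years suggested above.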
The Cloudy community is centred around a wiki and a web forum where the developers and knowledgeable users answer user questions. The code is distributed with its extensive documentation, several test suites, and tools for analyzing and visualizing results. Suggested updates are accepted from the community and incorporated into the code distribution. Major releases, involving significant rewrites, come out infrequently (every several years), while minor versions and 'hotfixes' (urgent patches) are released more often.
7.3 FLASH
FLASH is a multi-physics adaptive mesh hydrodynamics code, including self-gravity through multipole and multigrid solvers, hydrodynamics (several solvers), MHD (several solvers), several equations of state (including one for degenerate matter), optically thin heating/cooling, and nuclear reactions. The FLASH code consists of well over 400,000 source lines of code and a 330-page manual. The user community is arranged around a web page and several mailing lists, and the FLASH centre gives tutorials which are posted online. A test suite is run on the code daily, identifying problems as soon as they are checked into the code base. The FLASH-centre internal version of the code contains test versions of solvers which are not publicly released; typically the internal users publish using these solvers before releasing them at large. An analysis/visualization package based on IDL is distributed with the code, and an open-source visualization/analysis package is available separately. The FLASH code is not as freely available as the other two codes on this list: its funding agency (the US DOE NNSA) requires users who want access to sign a license agreement forbidding redistribution, and to fax the result back in. Users from countries the US considers 'unfriendly' cannot get access to the code.
FLASH was funded as part of a larger research project, and it is difficult to disentangle the costs and contributions of the code development from the rest of the project. However, well over two dozen people have committed significant portions of code to the code base over the ten-year period of development, which in turn began with existing code bases from PROMETHEUS, PARAMESH, and microphysics packages previously written by original FLASH team members. The numerical method for hydrodynamics (the piecewise parabolic method, PPM; Colella & Woodward, 1984) goes back more than twenty years. Over most of the official ten-year period of funding of the project, at least two or three people were employed full time to develop the code and associated tools; thus, not including the scientists who contributed physics solvers, this already constitutes 20-30 person-years of supported effort.
FLASH is continuously updated with bug fixes and new solvers; only FLASH team members can add code. Typically the developers have a chance to test, and then publish with, new solvers before they are incorporated into the main distribution; new versions of the FLASH code are released every year or so. Very occasionally new code is accepted from the community, but this typically comes from close collaborators. Many users have added their own solvers for, e.g., radiative transport; but as with Gadget, these are very seldom publicly released.
7.4 Commonalities
We see that a successful community code needn't be especially large; the development team size can vary, as can the distribution method and the frequency (if any) of updates. All that seems to be required is that the code be of high quality; be moderately easy to use and understand, including having appropriate documentation; come with tests to make sure it is installed properly; and come with all the tools needed to perform the necessary workflow (e.g., create initial conditions, visualize/analyze results). Most importantly, however, the code must address a real research need of a broad community, and high quality means supporting person-years of effort. For future projects, these person-years will involve different people working in a team with the combination of capabilities and expertise described above. Such teams can only work with the required stability if they are matched with a corresponding funding instrument.
8. Developing Community Code Development Teams
We recommend creating teams with the following three-part mission: (a) build up the simulation and data analysis capability needed to take full advantage of the increasing capability of HPC; (b) enable large-scale application code development, maintenance, and user support; and (c) provide the Canadian community each team aims to serve with the software it needs, directed through a broad advisory and consultation program. We argue that astronomy would be an ideal test bed for such a program; if successful, it could easily be copied and moved to other fields, from process engineering to business analytics to web services.
The result of such a program would be highly qualified (and widely employable) personnel in the increasingly important areas of simulation science and large-scale analytics, and software that serves the needs of Canadian industrial and academic research communities. A specific plan for executing such a program would be to fund, through a competitive allocation process, 1-2 teams in year one, each involving 5-7 interdisciplinary members ranging from undergraduate and graduate students to more senior researchers and one team leader, preferably at or near an institution which already has a relevant academic program. These teams could be headquartered at a university or at one of the existing Compute Canada consortia. Over three years, this process would ramp up to 3-5 teams across Canada. These teams would generate cutting-edge computational tools over the duration of the program, to be deployed on Compute Canada systems. They would focus their work on tools that are not readily available from other sources.
There would be recompetition every five years, with the intention of providing competitive continuity where feasible. The code teams and user groups could be loosely joined by a national virtual institute, and if successful this strategy could serve as a template for other disciplines. Besides the immediate benefits of software development and training, this would help lay out the pathway to scalable, peta-scale academic computing in the next decade.
9. Conclusions
In this white paper we describe the role of simulation and application code development and maintenance as a critical component of the national computational landscape. Presently this component is underdeveloped, and this poses a risk to the healthy progression of the field in the next decade. Addressing it will take the concerted effort of researchers, teachers, and practitioners on the ground, as well as changes to the funding environment. Over the past decade an important step has been made with funding priorities for computational hardware. These developments must be sustained, and need to be complemented now by comparatively modest investments in a science-oriented, systematic software development program. The benefits of such a program would reach through the entire economy.
Notes
4 Recently, Los Alamos National Lab, in collaboration with IBM, demonstrated what the future of parallel computing could look like by combining 5 different architectures to work together in one hydrodynamic simulation code (SC09 OpenCL Fluid Simulation).
References
Weiner, B., et al., 2009. "Astronomical Software Wants To Be Free: A Manifesto", in Astro2010: The Astronomy and Astrophysics Decadal Survey, arXiv:0903.3971.
Dursi, L.J. et al. 2010, Computing, Data, and Networks LRP2010 Whitepaper.
Colella, P. & Woodward, P.R. 1984, Journal of Computational Physics, 54, 174.
Compute Canada, 2010. Compute Canada Midterm Report.
Raymond, Eric S., 1999. The Cathedral & the Bazaar. O'Reilly. ISBN 1–56592–724–9.
Wheeler, D., 2004. "More than a Gigabuck: Estimating GNU/Linux's Size".
Springel, V., 2005. "The cosmological simulation code GADGET–2", MNRAS 364, 1105–1134.
Ferland, G.J., et al., 1998. "CLOUDY 90: Numerical Simulation of Plasmas and Their Spectra", PASP 110, 761–778.
Fryxell, B., et al., 2000. "FLASH: An Adaptive Mesh Hydrodynamics Code for Modeling Astrophysical Thermonuclear Flashes", ApJS 131, 273–334.
Stadel, J.G., 2001. "Cosmological N–body simulations and their analysis", Ph.D. thesis, University of Washington.
Teyssier, R., 2002. "Cosmological hydrodynamics with adaptive mesh refinement. A new high resolution code called RAMSES", A&A 385, 337–364.
Hayes, J.C., et al., 2006. "Simulating Radiating and Magnetized Flows in Multiple Dimensions with ZEUS-MP", ApJS 165, 188–228.
O'Shea, B.W., et al., 2004. "Introducing Enzo, an AMR Cosmology Application", eprint arXiv:astro-ph/0403044.
Woodfield, S.N., 1979. "An experiment on unit increase in problem complexity", IEEE Transactions on Software Engineering, 5:2, 76–79.
Thomas, W.M., Delis, A. & Basili, V.R., 1997. "An analysis of errors in a reuse–oriented development environment", Journal of Systems and Software 38:3 page 211–224.
Boehm, B.W. "The high cost of software". In Proceedings of Symposium on High Cost of Software, Monterey, California, 1973, page 27–40.
Graham, S.; Snir, M. & Patterson, C.A., 2004, "Getting Up To Speed: The Future of Supercomputing", Technical report, [U.S.] National Research Council.
Almeh, R., 2007. "Investigating the effects of novice HPC programmer variations on code performance". M.Sc. thesis, University of Maryland, College Park.
Hochstein, L., 2006. "Development of an empirical approach to building domain-specific knowledge applied to high-end computing". Ph.D. thesis, University of Maryland, College Park.
Hochstein, L. and Shull, F. and Reid, L.B., 2008. "The role of MPI in development time: a case study", in Proceedings of the 2008 ACM/IEEE conference on Supercomputing, IEEE Press, page 1–10.