Compilation Technology for Scalable Parallel Computing
Research in open systems has become increasingly important in the last five years as computer networks with heterogeneous nodes and multiple styles of programming have become commonplace. This project involves developing compiler technology to provide efficient execution of actor programs on heterogeneous distributed networks. In particular, it focuses on efficient execution of fine-grain concurrent programs. Current implementations are on multicomputers. Issues in resource management, such as automatic reclamation of inaccessible potentially active objects, are being investigated.
We are part of an evaluation team that studies scalable systems and optimal performance for applications with minimal reprogramming. This team is using the Convex Exemplar to examine system software and hardware, synchronization mechanisms, caching strategies, communication and message capabilities, and I/O from the standpoint of scalability and performance.
The rate of increase in the computing power of massively parallel machines has outstripped advances in input/output systems. We are studying the input/output patterns of large-scale scientific computations to develop the new abstractions, data management algorithms, and data organizations required to achieve scalable performance. In particular, the new abstractions must expose application access pattern information. The input/output system must implement the abstractions as best possible subject realities of hardware performance characteristics. We are examin-
ing a variety of organizational questions that drive the
design and implementation of high-performance parallel I/O
system.
Highly regular, array-oriented computations constitute a significant majority of computation-intensive scientific and engineering applications. Data parallel languages such as High Performance Fortran simplify and support the parallelizations of such applications. We are exploring novel techniques for making such languages more efficient and flexible. One of the techniques developed causes different parallel subcomputations to overlap, increasing the tolerance of the performance to communication latencies. A rich set of intrinsics is being developed. The system provides flexible intermodule connectivity, so a high degree of reuse of parallel software becomes possible. Also, data parallel components can be integrated with nondata parallel components at will.
Debugging parallel programs and improving their performance is a daunting task. We are developing debugging
and performance feedback tools that can maintain and exploit refined application-specific data better than is possible with traditional techniques. We believe that such tools can be significantly beneficial in developing and improving parallel programs. This work is carried out using the Charm/Charm++ object parallel programming system. It is being extended to other languages in a multilingual framework. By exploiting ``specificity'' at language constructs, and by recording specific events, it becomes possible to provide visual feedback on performance and to suggest improvement paths via expert analysis.
A small grain of parallelism is essential and natural in executing irregular symbolic computations. The efficiency of a parallel processing system depends on how uniformly these granules of action are distributed to processors. We have developed a dynamic load-balancing scheme that speedily distributes newly created work to the ``needy'' processors. It employs a corrective redistribution component and saturation control, i.e., not moving pieces of work around while everyone has sufficient work for himself. Also, new strategies appropriate for the current generation of multicomputers are being designed.
Some of these strategies support prioritized load balancing, and control memory requirements, simultaneously.
We are extending our previous research on Charm, a portable parallel programming language based on message-driven objects. Message-driven execution makes it possible to adaptively overlap computation and communication, even across multiple modules. Charm supports dynamic creation and load balancing of parallel objects and specific information-sharing abstractions. Its ``branched chare'' and ``chare array'' constructs facilitate interfacing parallel modules and implementation of distributed data structures. Charm is highly suitable for irregular parallel computations. Several applications and libraries are written using Charm. Current research includes support for heterogeneity and client-server environments.
There is a large class of interesting computational problems with the following property: if one attempts to solve them in parallel, one often ends up solving subproblems that are not needed or that are not solved in a sequential execution. This leads to anomalous behavior. The speedups may vary from sublinear to superlinear from run to run and may increase or decrease with addition of processors. We aim at obtaining consistent and monotomically increasing speedups for such computations. Such problems arise in state-space search, branch-and-bound, game-tree search, planning, and theorem proving. Each of them requires a different set of techniques. We have succeeded in obtaining very satisfactory results for state-space search and are working on other problems.
To tackle the difficult problem of parallel software development, many parallel languages are being developed, each with its own unique features and advantages. To benefit from this multitude of languages, one should be able to compose modules written in different languages into a single application program. This is difficult because of different scheduling models assumed by each language runtime. We are developing Converse, a framework for facilitating such interoperability. Converse also simplifies development of the runtime systems for new languages. Converse includes components supporting flexible threads, message passing, and processor scheduling. Several languages are being implemented using Converse.
We have developed a compiler that translates MATLAB code into Fortran 90. The compiler makes use of several advanced techniques for type inference and generates code for runtime disambiguation when static analysis is not accurate. The compiled code is up to three orders of magnitude faster than the interpreted code and is much faster than code generated by commercial counterparts. Work is underway to link the output of the compiler with Polaris to produce parallel code in order to obtain additional speedups.
We have gathered from researchers at various institutions a set of Fortran programs used in current scientific applications. We are using these codes along with industry-standard benchmark programs from the Perfect and SPEC suites to study the transformation techniques needed to optimize them for efficient execution on multiprocessor supercomputers. We have analyzed a substantial fraction of these programs and plan to complete the analysis by the end of 1996. Each loop has been analyzed separately to identify limitations of current parallelization technology by contrasting the transformations and speedups obtained manually versus automatically. The results of this study should be valuable to improve the effectiveness of today's parallelizers.
The lack of cache coherence mechanisms complicates the programming of distributed memory machines with a globally addressable memory because in these machines the user program has to explicitly issue communication
operations to reduce latency and enable cacheability. Manually generating these operations is cumbersome and may easily lead to errors. Compiler techniques to automatically generate the communication operations are needed. We have completed in Polaris the implementation of a first version of a parallelizing compiler for the Cray T3D, which has a global address space but not cache coherency. Our experimental results obtained on several codes from the Perfect and SPEC benchmarks proved that these techniques are quite effective.
Current parallelizing compilers cannot identify many parallelizable loops because their access patterns are complex or unknown at compile time. We have developed schemes for speculative parallel loop execution followed by a test for correct execution and for serial re-execution if the test fails. The test correctly identifies variables that can be privatized or that are involved in reduction operations, and handles them properly in the parallel loop execution. In addition, speculative run-time techniques for parallelization of WHILE loops and DO loops with conditional exits have been added to our framework for run-time parallelization. We have implemented these techniques for the Polaris compiler and are analyzing performance for a large set of application programs.
Extensive analysis of codes from the SPEC CFP95, Perfect Club Benchmark Suites, and Grand Challenge applications has, as expected, revealed the presence of a number of loops for which recurrences prevent DOALL parallelization. As part of our research in the automatic parallelization of FORTRAN programs, we are developing techniques to solve such recurrences in parallel or to overlap their serial solution with other computation using DO
ACROSS and other similar techniques. We are, for example, extending our current recognition and solution algorithms to handle additional classes of associative recurrences, such as min/max reductions.
Our goal is to develop a Fortran restructurer that generates efficient parallel code from ordinary sequential Fortran programs. We have developed several new translation techniques and implemented them in the Polaris system. Polaris is written in C++, which uses classes and class hierarchies to represent programs internally. By using classes, we can change the internal representation without modifying the program transformation modules, which reduces development costs. Preliminary results suggest that Polaris performs substantially better than PFA, the native parallelizer of the SGI machines. Code-generation modules for several shared memory machines and the Cray T3D are under development.
Though expensive, flow-sensitive interprocedural information significantly increases the accuracy of program analysis. To improve program analysis while lowering its cost, we are working to expand the use of gated SSA (GSA) to allow flow-sensitive program information to be computed on demand throughout Polaris and to implement the following techniques: power ranges, which determines if a variable is a power of 2; demand-driven substitution of interprocedural constants into loop nests; region aggregation and dependence tests to break most interprocedural dependence arcs; and selective inline expansion of subprogram statements involved in interprocedural dependence arcs that could not be broken using region dependence tests.
Complex systems are developed and evolved by groups, often separated in time and space and using disparate tools and techniques. No circumscribed tool suite can encompass all future needs. Instead, software support and collaboration environments must themselves evolve and grow to meet the needs of collaborative groups. As a solution to this problem, we are developing an extensible collaboration framework that supports the full range of activities undertaken by current team members, integrates the wide variety of tools and artifacts they use to realize their goals, and supports the evolution of the development team and its processes over time.
T
he complexity of parallel systems makes performance prediction difficult and experimental performance analysis crucial; performance data capture and data analysis environments are badly needed. We are developing a performance environment that allows data transformation modules to be interconnected in user-specified ways. The environment is written in C++ with the graphical displays based on X-windows and the Motif toolkit. The environment supports sonic data presentation and visual techniques. Performance data are captured by portable source code instrumentation. The performance instrumentation and analysis software are integrated with the Rice Fortran D and other HPF compilers to provide a seamless environment for data parallel program development and
tuning.
To support realistic applications, massively parallel computer systems must provide sufficient input/output (I/O) services that scale with computational power. We have conducted extensive simulation studies of parallel I/O architectures for large-scale shared memory parallel systems, with emphasis on the placement of data on disk arrays. We are extending this study to distributed memory parallel systems, investigating the interaction of file system design and potential performance by capturing and analyzing application input/output patterns, and studying mechanisms for real-time, performance-directed control of system behavior.
Software for a growing number of problem domains has complex time-varying behavior and unpredictable resource demands. While current performance analysis tools provide insights into application dynamics and the causes of poor performance, with a posteriori analysis one cannot adapt to temporally varying application resource demands and system responses. We believe that the solution to this performance optimization conundrum is integration of dynamic performance instrumentation and on-the-fly performance data reduction with real-time adaptive control mechanisms that select and configure resource management algorithms automatically, based on observed application behavior. We are developing a prototype, called Autopilot, to test this hypothesis with large-scale software systems.
Current software tools provide an increasingly myopic, unnecessarily limited window on the complex world of large-scale heterogeneous metacomputing, with few options to study application software structure or to modify behavior during execution. To address this problem, we are developing virtual environments that allow software developers to directly manipulate software components and their behavior while immersed in scalable, hierarchical representations of software structure and real-time performance data. Such a virtual environment will allow users to explore 3-D representations of executing software, interactively replace software components via physical manipulation, and modify software parameters.
We are using the Pablo performance analysis environment to instrument and analyze the input/output request patterns of a suite of scientific applications. The goal of this work is to identify common input/output access patterns and both hardware and system software bottlenecks, a precursor to alleviating the current input/output bottleneck on many parallel systems.
This is an interdisciplinary project to investigate and develop strategies for efficient implementation of I/O intensive applications in computational science and engineering. We have formed a team of computer scientists and applications scientists to characterize the I/O of specific application programs running on large massively parallel computers, define application-level methodologies for efficient parallel I/O, and implement and test application-level I/O tools on large-scale computations. The scalable I/O problem is broad, and this project deals with only one part of the problem, namely the characterization of applications and the definition of application level I/O tools.
We are developing a user-level input/output library that will be portable across several parallel systems, permitting a wide range of experimentation with modest human resources. This input/output library is invoked by user input/output calls and relies on the underlying system software for physical input/output. Interposing the libraries between the application and the system software allows us to quickly experiment with a variety of data distribution and data management algorithms. The system is operational, and we are conducting application performance experiments.
The Scalable I/O (SIO) Initiative is launching a coordinated attack on the biggest obstacle to effective use of teraFLOPS-scale computer systems by scientists and engineers getting data into, out of, and around such systems fast enough to avoid severe bottlenecks. The SIO Initiative will determine applications' requirements and use them to guide development of new programming language features, compiler techniques, system support services, file storage facilities, and high-performance networking software. The result will be implemented on two testbeds parallel with full-scale applications. Technology transfer will be accelerated by directly involving MPP vendors.