New applications such as multimedia systems, along with the benefits of high-performance microprocessors and buses, are limited in current personal computers by existing storage device technology. Low-cost, high-density secondary storage devices, such as 1.5-inch hard disks, flash and PCMCIA cards, and optical disk drives, allow innovative I/O subsystem designs that can help storage performance keep pace with powerful microprocessors and high-bandwidth buses. Example I/O subsystem designs include disk arrays and log file systems. This research examines the design trade-offs involved in these I/O subsystems.
Multiprocessor systems use memory hierarchies to reduce long memory access times. However, even with tuned memory hierarchies, large machines can spend a significant share of their time stalled on memory hierarchy misses. Data prefetching is an effective technique for reducing this waste: specialized hardware and software support brings data close to the processor before the processor actually needs it. The goal is to overlap the fetching of data with other computation so that little or no time is spent waiting for the data. In this research, we perform a realistic study of the potential for data prefetching in numerical codes. We also design hardware support for prefetching.
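The overlap argument can be illustrated with a toy timing model. The sketch below is not the study's methodology; it assumes a loop that touches one memory block per iteration, a fixed miss latency, non-blocking prefetches, and unlimited memory bandwidth. The function name `stall_cycles` and all of its parameters are hypothetical.

```python
def stall_cycles(n_blocks, compute_cycles, miss_latency, prefetch_distance=0):
    """Total cycles the processor spends waiting for memory in a toy loop.

    Assumptions (illustrative only): one block is used per iteration,
    prefetches do not block the processor, and memory bandwidth is unlimited.
    """
    arrival = {}  # block index -> cycle at which the block is near the processor
    now = 0
    stall = 0
    for i in range(n_blocks):
        # Issue a prefetch for the block needed `prefetch_distance` iterations ahead.
        if prefetch_distance > 0:
            target = i + prefetch_distance
            if target < n_blocks and target not in arrival:
                arrival[target] = now + miss_latency
        # Demand access: a block that was never requested is an ordinary miss.
        if i not in arrival:
            arrival[i] = now + miss_latency
        wait = max(0, arrival[i] - now)
        stall += wait
        now += wait + compute_cycles  # wait for the data, then compute on it
    return stall
```

With 100 blocks, 10 compute cycles per block, and a 40-cycle miss latency, the model reports 4000 stall cycles without prefetching and 160 with a prefetch distance of 4, because only the first four blocks are fetched too late. A shorter distance hides only part of the latency.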
Fast process synchronization is critical to the performance of large-scale shared memory multiprocessors. The fastest way to support synchronization is to provide special-purpose hardware. Examples of such hardware are the fetch-and-phi operations of the NYU Ultracomputer and IBM RP3, the full/empty bit of the HEP and Tera Computer machines, or the QOSB primitives of the Wisconsin Multicube. Cedar provides the most complete set of hardware-supported synchronization operations thanks to a special synchronization processor. We evaluate some of this hardware. We also analyze advanced algorithms to minimize interprocessor communication when processors synchronize in scalable shared memory machines.
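As a rough illustration of the fetch-and-phi family, the sketch below models the semantics in software: the operation atomically returns the old value of a shared variable and replaces it with phi(old, operand). A lock stands in for the hardware atomicity, so this says nothing about the performance of the real machines. The names `SyncVar`, `fetch_and_phi`, and `claim_iterations` are hypothetical.

```python
import threading


class SyncVar:
    """Shared variable with a software model of fetch-and-phi."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # stand-in for hardware atomicity

    def fetch_and_phi(self, phi, operand):
        """Atomically store phi(old, operand) and return the old value."""
        with self._lock:
            old = self._value
            self._value = phi(old, operand)
            return old


def claim_iterations(n_iters, n_threads):
    """Threads use fetch-and-add (phi = addition) to claim loop iterations."""
    counter = SyncVar(0)
    claimed = [[] for _ in range(n_threads)]

    def worker(tid):
        while True:
            i = counter.fetch_and_phi(lambda old, k: old + k, 1)
            if i >= n_iters:
                return
            claimed[tid].append(i)  # each index is handed out exactly once

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

Because every call returns a distinct old value, no two threads claim the same iteration, and no separate lock around the loop index is needed in the calling code.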
Scalable shared memory multiprocessors are a popular approach to providing large-scale computing power while maintaining programmability. What makes the base shared memory paradigm attractive is the simplicity of its programming model: memory is shared by all processors. In this project, we design the I-ACOMA multiprocessor, a new scalable shared memory multiprocessor. The issues under investigation include advanced cache coherence protocols, compiler support for the protocols, support for data prefetching and data transfer optimizations, and other memory hierarchy improvements.
Good cache memory performance is essential to achieving high CPU utilization in shared memory multiprocessors. While cache performance is determined by both application and operating system references, most research has focused on the cache performance of applications alone. This is partly because operating system activity is difficult to measure; as a result, the cache performance of the operating system is poorly understood. In this research, we characterize the instruction and data cache performance of the UNIX-based operating system running on the Cedar multiprocessor. Our goal is to measure the cache interference between operating system and application activity.
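One simple way to think about interference is to compare the misses an application takes when it runs alone with the misses it takes when operating system references are interleaved with its own. The sketch below does this for a small direct-mapped cache and a synthetic trace. It is an illustrative model only, not the measurement method used on Cedar; `count_misses`, the trace format, and the cache parameters are all assumptions.

```python
def count_misses(trace, n_sets, block_size=16):
    """Count misses per owner in a direct-mapped cache.

    `trace` is a list of (owner, address) pairs, e.g. ("app", 32).
    """
    tags = [None] * n_sets  # block currently held by each cache set
    misses = {}
    for owner, addr in trace:
        block = addr // block_size
        index = block % n_sets
        misses.setdefault(owner, 0)
        if tags[index] != block:
            misses[owner] += 1
            tags[index] = block  # the new block evicts whatever was there
    return misses


# Synthetic example: the application sweeps four blocks twice.
app_pass = [("app", a) for a in range(0, 64, 16)]
# A hypothetical OS burst touches four blocks that map to the same sets.
os_burst = [("os", 64 + a) for a in range(0, 64, 16)]

alone = count_misses(app_pass + app_pass, n_sets=4)
mixed = count_misses(app_pass + os_burst + app_pass, n_sets=4)
interference = mixed["app"] - alone["app"]  # extra app misses caused by the OS
```

In this trace the application misses 4 times when run alone and 8 times when the OS burst evicts its blocks between the two sweeps, so 4 misses are attributed to interference.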