University Of Tasmania

Superscalar performance in a multi threaded microprocessor

posted on 2023-05-26, 03:18 authored by Gunther, BK
Multithreaded processors, having hardware support for the concurrent execution of fine-grained threaded computations, are noted for their latency tolerance and low-cost synchronization. Multithreading is a technique for improving the utilization of processing elements (PEs) in parallel processing systems, thereby reducing cost/performance ratios. With increasing integrated circuit densities it is becoming feasible to integrate several PEs onto a single die, and further diminish the physical dimensions of parallel systems. However, by eliminating the artificial on-chip PE boundaries and sharing expensive resources in a more tightly coupled multithreaded architecture, even greater performance can be achieved from similar hardware. A multithreaded processor architecture (Concurro) was designed for possible microprocessor implementation with the objective of multiple instruction issues per cycle (sustained superscalar performance) by means of multithreading. This thesis considers the trade-offs necessary for such architectures to achieve high throughput and hardware utilization under scalability and cost constraints. A detailed simulation study was carried out to characterize the architecture and evaluate the impact of implementation decisions. The key to efficiency in Concurro is asynchronous, zero-time context switching among a limited set of contexts, promoting effective use of the storage hierarchy. A 64-bit, register-based, load/store instruction set architecture is augmented with thread manipulation primitives and I-structure synchronization operations. Novel cache architectures and controller algorithms were designed for enhancing latency tolerance in the processor, while maximizing utilization of the most costly resources. When tested on a variety of numerical and integer workloads, Concurro was able to sustain superscalar instruction issue rates for multithreaded operation, yet showed scalar RISC performance on single-thread code.
Even with a simple threading strategy it was frequently possible to extract full utilization from functional units or the instruction cache. The architecture showed size scalability to an order of magnitude while remaining binary compatible across these configurations. Performance of large configurations was shown to be limited ultimately by the bandwidth available from critical shared resources. With an appropriate memory system Concurro attained supercomputer-level floating point throughput operating out of uncached memory. The hardware requirements for this performance are expected to be comparable with those of VLIW machines with similar datapaths.


Publication status

  • Unpublished

Rights statement

Copyright the author - The University is continuing to endeavour to trace the copyright owner(s) and in the meantime this item has been reproduced here in good faith. We would be pleased to hear from the copyright owner(s).

Repository Status

  • Open
