Cooperative prefetching: compiler and hardware support for effective instruction prefetching in modern processors

Bibliographic Details
Published in	Proceedings of the annual International Symposium on Microarchitecture, pp. 182-193
Main Authors	Chi-Keung Luk; T.C. Mowry
Format	Conference Proceeding; Journal Article
Language	English
Published	IEEE 1998
Subjects	Application software; Computer science; Delay; Ear; Electronic switching systems; Filtering; Hardware; National electric code; Prefetching; Read only memory

Summary:	Instruction cache miss latency is becoming an increasingly important performance bottleneck, especially for commercial applications. Although instruction prefetching is an attractive technique for tolerating this latency, we find that existing prefetching schemes are insufficient for modern superscalar processors since they fail to issue prefetches early enough (particularly for non-sequential accesses). To overcome these limitations, we propose a new instruction prefetching technique whereby the hardware and software cooperate to hide the latency as follows. The hardware performs aggressive sequential prefetching combined with a novel prefetch filtering mechanism to allow it to get far ahead without polluting the cache. To hide the latency of non-sequential accesses, we propose and implement a novel compiler algorithm which automatically inserts instruction prefetch instructions into the executable to prefetch the targets of control transfers far enough in advance. Our experimental results demonstrate that this new approach results in speedups ranging from 9.4% to 18.5% (13.3% on average) over the original execution time on an out-of-order superscalar processor, which is more than double the average speedup of the best existing schemes (6.5%). This is accomplished by hiding an average of 71% of the original instruction stall time, compared with only 36% for the best existing schemes. We find that both the prefetch filtering and compiler-inserted prefetching components of our design are essential and complementary, that the compiler can limit the code expansion to less than 10% on average, and that our scheme is robust with respect to variations in miss latency and bandwidth.
ISBN:	9780818686092; 081868609X
ISSN:	1072-4451
DOI:	10.1109/MICRO.1998.742780
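
The summary's point that sequential prefetching needs filtering to avoid polluting the cache can be illustrated with a toy model. The Python sketch below is not the authors' mechanism: the class `ICache`, the helper `run`, the fully associative LRU organisation, and the filter rule (suppress a prefetch for any line that was once prefetched and then evicted unused) are all assumptions made for illustration. Compiler-inserted prefetching of control-transfer targets is not modelled.

```python
from collections import OrderedDict


class ICache:
    """Toy fully associative LRU instruction cache with next-N-line prefetch.

    Hypothetical model for illustration only; not the design from the paper.
    """

    def __init__(self, capacity, depth, use_filter):
        self.capacity = capacity      # number of cache lines
        self.depth = depth            # sequential lines prefetched per fetch
        self.use_filter = use_filter
        self.lines = OrderedDict()    # line -> True if prefetched and not yet used
        self.useless = set()          # lines once evicted as unused prefetches
        self.misses = 0
        self.prefetches = 0

    def _insert(self, line, prefetched):
        if len(self.lines) >= self.capacity:
            victim, unused = self.lines.popitem(last=False)  # evict the LRU line
            if unused:
                self.useless.add(victim)  # remember the wasted prefetch
        self.lines[line] = prefetched

    def fetch(self, line):
        if line in self.lines:
            self.lines[line] = False      # a prefetched line has now been used
            self.lines.move_to_end(line)  # mark as most recently used
        else:
            self.misses += 1
            self.useless.discard(line)    # demand use: worth prefetching again
            self._insert(line, prefetched=False)
        # Sequential prefetch of the next `depth` lines.
        for nxt in range(line + 1, line + 1 + self.depth):
            if nxt in self.lines:
                continue
            if self.use_filter and nxt in self.useless:
                continue                  # filtered: assumed rule, see lead-in
            self.prefetches += 1
            self._insert(nxt, prefetched=True)


def run(trace, use_filter, capacity=4, depth=2):
    """Replay a trace of cache-line numbers; return (misses, prefetches)."""
    cache = ICache(capacity, depth, use_filter)
    for line in trace:
        cache.fetch(line)
    return cache.misses, cache.prefetches
```

On the trace `[0, 10] * 3`, which alternates between two distant lines the way a loop with a far branch might, `run(trace, use_filter=False)` gives 6 misses and 12 prefetches because the useless sequential prefetches keep evicting the two lines that are actually needed. With `use_filter=True` the same trace gives 3 misses and 4 prefetches. The numbers are specific to this toy configuration and say nothing about the speedups reported in the summary.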