Increasing register file immunity to transient errors

Gokhan Memik, Mahmut T. Kandemir, Ozcan Ozturk

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

52 Scopus citations

Abstract

Transient errors are a major cause of downtime in many systems. While prior research has mainly focused on the impact of transient errors on the datapath, caches, and main memories, the register file has largely been neglected. Since the register file is accessed very frequently, the probability of transient errors is high. In addition, errors in it can quickly spread to other parts of the system and cause application crashes or silent data corruption. This paper addresses the reliability of register files in superscalar processors. In particular, we propose to duplicate actively used physical registers in unused physical registers. The rationale behind this idea is that if the protection mechanism (parity or ECC) used for the primary copy indicates an error, the duplicate can provide the data as long as it is not itself corrupted. We implement two types of strategies based on this register duplication idea. In the "conservative strategy," we restrict ourselves to the given register usage behavior and duplicate register contents only in otherwise unused registers. Consequently, there is no impact on the original performance when there is no error, except for the protection mechanism used for the primary copy. Our experiments with two different versions of this strategy show that, with the more powerful conservative scheme, 78% of the accesses are to physical registers with duplicates. The "aggressive strategy" sacrifices some performance to increase the number of register accesses with duplicates. It does so by marking registers that have not been used for a long time as "dead" and using them to duplicate actively used registers. The experiments with this strategy indicate that it raises the fraction of reliable register accesses to 84% and degrades overall performance by only 0.21% on average.
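The duplication idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the names `RegisterFile`, `PhysReg`, `flip_bit`, and the single even-parity bit per register are hypothetical choices made for illustration, and the free-register search is a simple linear scan. It loosely mirrors the conservative strategy, in which a copy lives only in a register that is otherwise unused and is given up as soon as that register is needed for a primary value.

```python
from dataclasses import dataclass
from typing import Optional


def parity(value: int) -> int:
    """Even-parity bit: 1 if the value has an odd number of set bits."""
    return bin(value).count("1") & 1


@dataclass
class PhysReg:
    value: int = 0
    parity_bit: int = 0
    in_use: bool = False                 # holds a live primary value
    duplicate_of: Optional[int] = None   # primary index if this holds a copy


class RegisterFile:
    """Toy model (an assumption, not the paper's design)."""

    def __init__(self, size: int) -> None:
        self.regs = [PhysReg() for _ in range(size)]
        self.dup_index = {}  # maps primary index -> duplicate index

    def _find_free(self) -> Optional[int]:
        # A register is free if it holds neither a primary nor a copy.
        for i, reg in enumerate(self.regs):
            if not reg.in_use and reg.duplicate_of is None:
                return i
        return None

    def write(self, idx: int, value: int) -> None:
        reg = self.regs[idx]
        # If this register was serving as a copy, reclaim it for primary use.
        if reg.duplicate_of is not None:
            self.dup_index.pop(reg.duplicate_of, None)
            reg.duplicate_of = None
        reg.value, reg.parity_bit, reg.in_use = value, parity(value), True
        # Reuse the existing duplicate slot, or look for an unused register.
        dup = self.dup_index.get(idx)
        if dup is None:
            dup = self._find_free()
        if dup is not None:
            copy = self.regs[dup]
            copy.value, copy.parity_bit = value, parity(value)
            copy.duplicate_of = idx
            self.dup_index[idx] = dup

    def read(self, idx: int) -> int:
        reg = self.regs[idx]
        if parity(reg.value) == reg.parity_bit:
            return reg.value
        # Parity flagged the primary: fall back to the duplicate if intact.
        dup = self.dup_index.get(idx)
        if dup is not None:
            copy = self.regs[dup]
            if parity(copy.value) == copy.parity_bit:
                reg.value, reg.parity_bit = copy.value, copy.parity_bit
                return copy.value
        raise RuntimeError(f"unrecoverable error in register {idx}")

    def flip_bit(self, idx: int, bit: int) -> None:
        """Inject a single-bit transient error (for experimentation only)."""
        self.regs[idx].value ^= 1 << bit
```

In this model, a read of a register whose primary copy fails the parity check still returns the correct value whenever an intact duplicate exists; with no spare register available, the same error is detected but cannot be corrected.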

Original language: English (US)
Title of host publication: Proceedings - Design, Automation and Test in Europe, DATE '05
Pages: 586-591
Number of pages: 6
State: Published - 2005
Event: Design, Automation and Test in Europe, DATE '05 - Munich, Germany
Duration: Mar 7 2005 → Mar 11 2005

Publication series

Name: Proceedings - Design, Automation and Test in Europe, DATE '05
Volume: I
ISSN (Print): 1530-1591

Other

Other: Design, Automation and Test in Europe, DATE '05
Country/Territory: Germany
City: Munich
Period: 3/7/05 → 3/11/05

All Science Journal Classification (ASJC) codes

  • Engineering (all)
