Abstract-This paper presents the Compute Cache architecture, which enables in-place computation in caches. Compute Caches use emerging bit-line SRAM circuit technology to repurpose existing cache elements, transforming them into active, very large vector computational units. This also significantly reduces the overhead of moving data between different levels of the cache hierarchy. Compute Caches increase performance by 1.9× and reduce energy by 2.4× for a suite of data-centric applications, including text and database query processing, cryptographic kernels, and in-memory checkpointing. Applications with a larger fraction of Compute Cache operations could benefit even more, as our micro-benchmarks indicate (54× throughput, 9× dynamic energy savings). Much prior work has studied cache replacement, but a large gap remains between theory and practice. The design of many practical policies is guided by the optimal policy, Belady's MIN. We propose that practical policies should replace lines based on their economic value added (EVA), the difference between their expected hits and the average. Drawing on the theory of Markov decision processes, we discuss why this metric maximizes the cache's hit rate.
We present an inexpensive implementation of EVA and evaluate it exhaustively. EVA outperforms several prior policies and saves area at iso-performance. These results show that formalizing cache replacement yields practical benefits. Introduction-As computing today is dominated by data-centric applications, there is a strong impetus for specialization in this important domain. Conventional processors' narrow vector units fail to exploit the high degree of data-parallelism in these applications. They also expend a disproportionately large fraction of time and energy moving data over the cache hierarchy and processing instructions, as compared to performing the actual computation. We present the Compute Cache architecture for dramatically reducing these inefficiencies through in-place processing in caches. A modern processor devotes a large fraction of die area to caches, which are used for storing and retrieving data. Our key idea is to re-purpose and transform the elements used in caches into active computational units. This enables computation in place within a cache sub-array, without transferring data in or out of it. Such a transformation can unlock massive data-parallel compute capabilities, dramatically reduce the energy spent on data movement over the cache hierarchy, and thereby directly address the needs of data-centric applications.
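The effect of bit-line computing can be sketched in software, purely as an illustration: activating two word-lines of a sub-array at once lets the sense amplifiers read out a logical function of both rows, so an entire row-wide bitwise operation completes in a single access. The class, the 4-bit row width, and the operation set below are illustrative assumptions, not the actual circuit.

```python
# Software model of in-place bit-line computation in a cache sub-array.
# Activating two rows simultaneously yields a logical function of both
# rows on the bit-lines -- modeled here as row-wide AND and NOR.

class SubArray:
    def __init__(self, num_rows, row_bits):
        self.rows = [0] * num_rows          # each row stores row_bits bits
        self.mask = (1 << row_bits) - 1

    def write(self, row, value):
        self.rows[row] = value & self.mask

    def compute_and(self, row_a, row_b):
        # Operands share the same bit-lines (operand locality), so the
        # AND of the two entire rows is produced in one access.
        return self.rows[row_a] & self.rows[row_b]

    def compute_nor(self, row_a, row_b):
        return ~(self.rows[row_a] | self.rows[row_b]) & self.mask

# Tiny 4-bit rows keep the example readable:
sa = SubArray(num_rows=8, row_bits=4)
sa.write(0, 0b1100)
sa.write(1, 0b1010)
assert sa.compute_and(0, 1) == 0b1000
assert sa.compute_nor(0, 1) == 0b0001
```

The parallelism comes from width: one such operation processes every bit-line in the sub-array at once, and many sub-arrays can operate concurrently.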
Last-level caches consume significant resources, often over 50% of chip area [24], so it is crucial to manage them efficiently. Prior work has approached cache replacement from both theoretical and practical standpoints. Unfortunately, there is a large gap between theory and practice. Theory-based policies account for uncertainty by using a simplified, statistical model of the reference stream in which the optimal policy can be solved for.
Unfortunately, these policies generally perform poorly compared to empirical designs. The key challenge faced by theoretical policies is choosing their underlying statistical model. This model should capture enough information about the access stream to make good replacement decisions, yet must be simple enough to analyze. For example, some prior work uses the independent reference model (IRM), which assumes that candidates are referenced independently with static, known probabilities. In this model, the optimal policy is to evict the candidate with the lowest reference probability, i.e., LFU. Though useful in other areas (e.g., web caches), the IRM is inadequate for processor caches because it assumes that reference probabilities do not change over time. The efficiency of Compute Caches arises from two main sources: massive parallelism and reduced data movement. A cache is typically organized as a set of sub-arrays, as many as hundreds of sub-arrays, depending on the cache level. An important problem in using Compute Caches is satisfying the operand locality constraint. Bit-line computing requires that the data operands are stored in rows that share the same set of bit-lines.
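Under the placement rule described next (operands at the same page offset map to rows sharing bit-lines), the compiler-visible check reduces to comparing page offsets. A minimal sketch, assuming a 4 KiB page size:

```python
# Illustrative operand-locality check: with page-offset-aligned placement,
# two operands can be combined in place only if they share a page offset.
# PAGE_SIZE is an assumed 4 KiB; the real constraint is on bit-lines, which
# this software rule is designed to satisfy.

PAGE_SIZE = 4096

def same_page_offset(addr_a, addr_b):
    return (addr_a % PAGE_SIZE) == (addr_b % PAGE_SIZE)

# Buffers placed at the same page offset satisfy operand locality:
assert same_page_offset(0x10040, 0x2B040)
assert not same_page_offset(0x10040, 0x2B048)
```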
We architect a cache geometry where ways in a set are judiciously mapped to a sub-array, so that software can easily satisfy operand locality. Our design allows a compiler to ensure operand locality simply by placing operands at addresses that are page-aligned (same page offset). It avoids exposing the internals of a cache, such as its size or geometry, to software. This paper seeks to bridge theory and practice. We take a principled approach that builds on insights from recent empirical designs. First, policies should replace candidates by their economic value added (EVA); i.e., how many more hits one expects from each candidate vs. the average candidate. Second, we design a practical implementation of this policy and show it outperforms existing policies. Survey-We note two relevant trends in recent research. First, most empirical policies exploit dynamic behavior to select a victim, most commonly by using the recency and frequency heuristics. Most policies employ some form of recency, favoring candidates that were referenced recently: e.g., LRU uses recency alone, and RRIP predicts a longer time until reference for older candidates.
Similarly, a few policies that do not assume recency still base their policy on when a candidate was last referenced: PDP protects candidates until a certain age, and IRGD uses a heuristic function of ages. Another common way policies exploit dynamic behavior is through frequency, favoring candidates that were previously reused: e.g., LFU uses frequency alone, and "scan-resistant" policies like ARC and SRRIP favor candidates that have been reused at least once. Second, recent high-performance policies adapt themselves to the access stream to varying degrees. DIP detects thrashing with set dueling, and thereafter inserts most lines at LRU to prevent thrashing. DRRIP inserts lines at medium priority, promoting them only upon reuse, and avoids thrashing using the same mechanism as DIP. SHiP extends DRRIP by adapting the insertion priority based on the memory address, PC, or instruction sequence.
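The two classic heuristics above can be stated as one-line victim-selection rules; the sketch below is illustrative, with made-up per-line metadata:

```python
# Minimal sketches of the recency and frequency heuristics. Each cache line
# carries an age (accesses since last reference) and a reference count.

def lru_victim(lines):
    # Recency: evict the line referenced longest ago (largest age).
    return max(lines, key=lambda l: l["age"])

def lfu_victim(lines):
    # Frequency: evict the line with the fewest past references.
    return min(lines, key=lambda l: l["refs"])

lines = [
    {"addr": 0xA, "age": 3, "refs": 9},
    {"addr": 0xB, "age": 7, "refs": 5},
    {"addr": 0xC, "age": 1, "refs": 2},
]
assert lru_victim(lines)["addr"] == 0xB   # oldest
assert lfu_victim(lines)["addr"] == 0xC   # least reused
```

The two heuristics pick different victims here, which is exactly why policies blending them (e.g., ARC, SRRIP) exist.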
Likewise, SDBP, PRP, and Hawkeye learn the behavior of different PCs, and PDP and IRGD use auxiliary monitors to profile the access pattern and periodically recompute their policy. These two trends show that (i) candidates reveal important information over time, and (ii) policies should learn from this information by adapting themselves to the access pattern. But these policies do not make the best use of the information they capture, and prior theory does not suggest the right policy. We draw on planning theory to design a practical policy that addresses these issues. EVA is intuitive and inexpensive to implement. In contrast to most empirical policies, EVA does not explicitly encode particular heuristics (e.g., frequency). Rather, it is a general approach that aims to make the best use of limited information, so that prior heuristics arise naturally when appropriate.
EVA REPLACEMENT POLICY-Since it is inadequate to consider either hit probability or time until reference alone, there must be some way to reconcile them in a single metric. In general, the optimal metric should satisfy three properties: (i) it considers only future behavior, since the past is a sunk cost; (ii) it prefers candidates that are more likely to hit; and (iii) it penalizes candidates that take longer to hit. These properties both maximize the hit probability and minimize time spent in the cache, as desired.
We achieve these properties by viewing time spent in the cache as forgone hits, i.e., as the opportunity cost of retaining lines. We thus rank candidates by their economic value added (EVA), or how many hits the candidate yields over the "average candidate".
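A simplified per-age EVA calculation can be sketched as follows. This is my reconstruction of the idea from the definitions above (expected hits minus the opportunity cost of occupied cache space), not the paper's exact hardware implementation; the event-count inputs and the per-line hit rate h/N are assumptions consistent with the text.

```python
# Sketch of EVA: rank candidates by expected hits minus forgone hits.
# hits[a] / evictions[a] count lifetimes ending in a hit / eviction at
# age a; per_line_hit_rate is h/N (cache hit rate over number of lines).

def eva_by_age(hits, evictions, per_line_hit_rate):
    max_age = len(hits)
    eva = [0.0] * max_age
    events_above = hits_above = life_above = 0.0
    for a in range(max_age - 1, -1, -1):      # sweep from oldest age down
        events = hits[a] + evictions[a]
        events_above += events
        hits_above += hits[a]
        life_above += a * events
        if events_above > 0:
            p_hit = hits_above / events_above            # expected hits
            remaining = life_above / events_above - a    # expected time left
            # EVA = expected hits minus the hits forgone while the line
            # occupies cache space (the opportunity cost).
            eva[a] = p_hit - per_line_hit_rate * remaining
    return eva

# Lines at age 1 are about to hit, so they are the most valuable to keep:
scores = eva_by_age(hits=[0, 8, 0, 0], evictions=[0, 0, 0, 2],
                    per_line_hit_rate=0.1)
assert scores[1] > scores[0] > scores[2]
```

On a replacement, the policy would evict the candidate whose age has the lowest EVA score.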
EVA is essentially a cost-benefit analysis about whether a candidate's odds of hitting are worth the cache space it will consume. WHY EVA IS THE RIGHT METRIC?-The previous section described and motivated EVA. This section discusses why EVA is the right metric to maximize cache performance under uncertainty. We first present a naïve metric that intuitively maximizes the hit rate, but show that it unfortunately cannot capture long-run behavior. Fortunately, prior work in Markov decision processes (MDPs) has encountered and solved this problem, and we show how EVA adapts the MDP solution to cache replacement. Naive metric: We need a replacement metric that maximizes the cache's hit rate. With perfect knowledge, MIN achieves this by greedily "buying hits as cheaply as possible," i.e., by keeping the candidates that are referenced in the fewest accesses. Stated another way, MIN retains the candidates that get the most hits per access.
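MIN's rule with perfect knowledge is simple to state in code: on a miss, evict the resident line whose next reference is farthest in the future. The sketch below is illustrative (addresses and the future trace are made up):

```python
# Belady's MIN with perfect knowledge of the future reference stream.

def min_victim(cache, future):
    # cache: resident addresses; future: upcoming addresses, in order.
    def next_use(addr):
        try:
            return future.index(addr)        # accesses until next reference
        except ValueError:
            return float("inf")              # never used again: ideal victim
    return max(cache, key=next_use)

assert min_victim({1, 2, 3}, [2, 1, 2, 1]) == 3   # 3 is never referenced again
assert min_victim({1, 2, 3}, [3, 1, 3, 2]) == 2   # 2's next use is farthest
```

Practical policies cannot see `future`, which is exactly the uncertainty EVA addresses.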
Therefore, a simple metric that might generalize MIN is to predict each candidate's hits per access, or hit rate:

    hit rate = lim_{T→∞} (expected hits after T accesses) / T     (Eq. 7)

and retain the candidates with the highest hit rate. Intuitively, keeping these candidates over many replacements tends to maximize the cache's hit rate. The problem: Eq. 7 suffices when considering a single lifetime, but it cannot account for candidates' long-run behavior over many lifetimes. To estimate the future hit rate, it is necessary to compute expected hits over arbitrarily many future accesses. However, as cache lines are replaced many times, they tend to converge to average behavior, because their replacements are generally unrelated to their original contents. Hence, all candidates' hit rates converge in the limit. In fact, all candidates' hit rates are identical: solving this equation as T → ∞ yields h/N, the cache's per-line hit rate, for all ages and classes. In other words, Eq. 7 loses its discriminating power over long time horizons, degenerating to random replacement. So while estimating candidates' hit rates is intuitive, this approach is fundamentally flawed as a replacement metric.
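The degeneration can be shown numerically. In the sketch below (all values are made up), a "good" and a "bad" candidate differ only in their first lifetime; every later lifetime behaves like the average line, so their long-run hit rates converge as the horizon T grows:

```python
# Why the naive hit-rate metric degenerates: after the first lifetime, a
# line's replacements behave like the average line, so every candidate's
# long-run hit rate converges to the same limit.

def long_run_hit_rate(first_hit_prob, avg_hit_prob, lifetime, T):
    # Expected hits over T accesses: one "first" lifetime, then average ones.
    n_lifetimes = T // lifetime
    expected_hits = first_hit_prob + avg_hit_prob * (n_lifetimes - 1)
    return expected_hits / T

for T in (100, 10_000, 1_000_000):
    good = long_run_hit_rate(0.9, 0.5, lifetime=10, T=T)
    bad = long_run_hit_rate(0.1, 0.5, lifetime=10, T=T)
    print(T, good - bad)
# The gap shrinks toward 0 as T grows: the metric stops discriminating.
```

EVA avoids this by comparing differences in hits rather than ratios, as described next.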
The solution in a nutshell: EVA sidesteps this problem by changing the question. Instead of asking "which candidate gets a higher hit rate?", EVA asks "which candidate gets more hits?". The idea is that Eq. 7 compares the long-run hit rates of retaining different candidates:

    lim_{T→∞} hits1(T)/T  vs.  lim_{T→∞} hits2(T)/T

EVA performs consistently well: SHiP and PDP improve performance by correcting LRU's flaws on particular access patterns. Both perform well on libquantum (a scanning benchmark), sphinx3, and xalancbmk. However, their performance varies considerably across apps. For example, SHiP performs particularly well on perlbench, mcf, and cactusADM. PDP performs particularly well on GemsFDTD and lbm, where SHiP exhibits pathologies and performs similarly to random replacement.
EVA matches or outperforms SHiP and PDP on most apps and cache sizes. This is because EVA generally makes the best use of available information, so the right replacement strategies naturally emerge where appropriate. As a result, EVA successfully captures the benefits of SHiP and PDP within a common framework, and sometimes outperforms both. Since EVA performs consistently well, and SHiP and PDP do not, EVA achieves the lowest MPKI of all policies on average. The cases where EVA performs slightly worse arise for two reasons.
First, in some cases (e.g., mcf at 1MB), the access pattern changes significantly between policy updates. EVA can take several updates to adapt to the new pattern, during which performance suffers. But in most cases the access pattern changes slowly, and EVA performs well. Second, our implementation coarsens ages, which can cause small performance variability for some apps (e.g., libquantum). EVA edges closer to optimal replacement: we compare the practical policies against MIN, showing the average gap over MIN across the most memory-intensive SPEC CPU2006 apps, i.e., each policy's MPKI minus MIN's at equal area. One would expect a practical policy to fall somewhere between random replacement (no information) and MIN (perfect information). But LRU actually performs worse than random at many sizes, because private caches strip out most temporal locality before it reaches the LLC, leaving scanning patterns that are pathological in LRU. In contrast, both SHiP and PDP significantly outperform random replacement. Finally, EVA performs best.
On average, EVA closes 57% of the random-MIN MPKI gap. In comparison, DRRIP (not shown) closes 41%, SHiP 47%, PDP 42%, and LRU -9%. EVA saves cache space: because EVA improves performance, it needs less cache space than other policies to achieve a given level of performance. Fig. 12 shows the iso-MPKI total cache area of each policy, i.e., the area required to match random replacement's average MPKI for different LLC sizes (lower is better). For example, a 21.5 mm² EVA cache achieves the same MPKI as a 4MB cache using random replacement, whereas SHiP needs 23.6 mm² to match this performance. EVA is the most area-efficient over the full range. On average, EVA saves 8% total cache area over SHiP, the best practical alternative. However, note that MIN saves 35% over EVA, so there is still room for improvement, though some performance gap is unavoidable due to the costs of uncertainty.
EVA achieves the best end-to-end performance: we show the IPC speedups over random replacement at 35 mm², the area of a 4MB LLC with random replacement. Only benchmarks that are sensitive to replacement are shown, i.e., benchmarks whose IPC changes by at least 1% under some policy. EVA achieves consistently good speedups across apps, whereas prior policies do not. SHiP performs poorly on xalancbmk, sphinx3, and lbm, and PDP performs poorly on mcf and cactusADM. Consequently, EVA achieves the best speedup overall. Gmean speedups on sensitive apps (those shown) are 8.5% for EVA, 6.7% for DRRIP (not shown), 6.8% for SHiP, 4.5% for PDP, and -2.3% for LRU. EVA makes good use of additional state: we sweep the number of tag bits for different policies and plot their average MPKI at 4MB.
(This experiment is not iso-area.) The figure shows the best configuration on the right; EVA and PDP use idealized, large timestamps. Prior policies achieve peak performance with 2 or 3 bits, after which their performance flattens or even degrades. Unlike prior policies, EVA's performance improves steadily with more state, and its peak performance exceeds prior policies by a good margin. With 2 bits, EVA performs better than PDP, similar to DRRIP, and slightly worse than SHiP. Comparing the best configurations, EVA's improvement over SHiP is 1.8× greater than SHiP's improvement over DRRIP. EVA with 8-bit tags performs as well as an idealized implementation, yet still adds only small overheads.
These overheads more than pay for themselves, saving area at iso-performance. With very few age bits, no single choice of age granularity A works well for all applications. To make EVA perform well with few bits, software adapts the age granularity using a simple heuristic: if more than 10% of hits and evictions occur at the maximum age, then increase A by one; otherwise, if less than 10% of hits occur in the second half of ages, then decrease A by one. This heuristic rapidly converges to the right age granularity across all evaluated applications.
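The adaptation heuristic above translates directly into code. The 10% thresholds are from the text; the event-count arrays are illustrative assumptions:

```python
# Adaptive age granularity: grow A when ages saturate at the top, shrink A
# when the upper half of ages goes unused. hits_by_age / evictions_by_age
# count events observed at each (coarsened) age during one update period.

def adapt_granularity(A, hits_by_age, evictions_by_age):
    max_age = len(hits_by_age) - 1
    total_hits = sum(hits_by_age)
    total_events = total_hits + sum(evictions_by_age)
    at_max = hits_by_age[max_age] + evictions_by_age[max_age]
    upper_half_hits = sum(hits_by_age[len(hits_by_age) // 2:])
    if total_events and at_max / total_events > 0.10:
        return A + 1                     # ages saturate: coarsen
    if total_hits and upper_half_hits / total_hits < 0.10:
        return max(1, A - 1)             # upper half unused: refine
    return A

assert adapt_granularity(2, [0, 0, 0, 50], [0, 0, 0, 50]) == 3   # coarsen
assert adapt_granularity(2, [60, 35, 3, 2], [0, 0, 0, 0]) == 1   # refine
assert adapt_granularity(2, [40, 40, 10, 10], [0, 0, 0, 0]) == 2 # keep
```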
Only the configurations with very few age bits use this heuristic, and it is disabled for all other results. Since EVA is most area-efficient with larger tags, the advocated design (8-bit tags) does not employ this heuristic. CONCLUSION-The key challenge faced by practical replacement policies is how to cope with uncertainty. Simple approaches like predicting time until reference are flawed. We argued for replacement by economic value added (EVA), starting from first principles and drawing from prior planning theory.
We further showed that EVA can be implemented with trivial hardware, and that it outperforms existing high-performance policies nearly uniformly on single- and multi-threaded benchmarks. REFERENCES · IEEE Xplore Digital Library · "Maximizing Cache Performance Under Uncertainty" by Nathan Beckmann and Daniel Sanchez · "Compute Caches" by Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das · Wikipedia