This paper presents the
Compute Cache architecture that enables in-place computation in caches. Compute
Caches use emerging bit-line SRAM circuit technology to repurpose existing
cache elements and transform them into active, very large vector computational
units. They also significantly reduce the overheads of moving data between
different levels of the cache hierarchy.
Compute Caches increase
performance by 1.9× and reduce energy by 2.4× for a suite of data-centric
applications, including text and database query processing, cryptographic
kernels, and in-memory checkpointing. Applications with a larger fraction of
Compute Cache operations could benefit even more, as our micro-benchmarks
indicate (54× throughput, 9× dynamic energy savings).
Much prior work
has studied cache replacement, but a large gap remains between theory and
practice. The design of many practical policies is guided by the optimal
policy, Belady’s MIN. The authors argue that practical policies should replace lines based on
their economic value added (EVA), the difference of their expected hits from
the average. Drawing on the theory of Markov decision processes, they discuss why
this metric maximizes the cache’s hit rate. They present an inexpensive implementation of
EVA and evaluate it exhaustively. EVA outperforms several prior policies and
saves area at iso-performance. These results show that formalizing cache
replacement yields practical benefits.
As computing today is
dominated by data-centric applications, there is a strong impetus for
specialization for this important domain. Conventional processors' narrow vector
units fail to exploit the high degree of data parallelism in
these applications. They also expend a disproportionately large fraction
of time and energy in moving data over the
cache hierarchy, and in instruction processing, as compared to the actual
computation. The Compute Cache architecture dramatically reduces
these inefficiencies through in-place processing in caches. A modern processor
devotes a large fraction of die area to caches which are used for storing and
retrieving data. Our key idea is to re-purpose and transform the elements used
in caches into active computational units. This enables computation in-place
within a cache sub-array, without transferring data in or out of it. Such a
transformation can unlock massive data-parallel compute capabilities,
dramatically reduce energy spent in data movement over the cache hierarchy, and
thereby directly address the needs of data-centric applications.
Caches consume significant resources, often over 50% of chip area [24], so it
is crucial to manage them efficiently. Prior work has approached cache
replacement from both theoretical and practical standpoints. Unfortunately,
there is a large gap between theory and practice. Theory-based policies account for uncertainty by using a
simplified, statistical model of the reference stream in which the optimal
policy can be solved for.
In practice, however, these policies generally perform poorly compared to empirical designs. The key
challenge faced by theoretical policies is choosing their underlying
statistical model. This model should capture enough information about the
access stream to make good replacement decisions, yet must be simple enough to
analyze. For example, some prior work uses the independent reference model
(IRM), which assumes that candidates are referenced independently with static,
known probabilities. In this model, the optimal policy is to evict the
candidate with the lowest reference probability, i.e., LFU. Though useful in
other areas (e.g., web caches), the IRM is inadequate for processor caches
because it assumes that reference probabilities do not change over time.
The efficiency of Compute Caches arises from two main sources: massive parallelism and reduced
data movement. A cache is typically organized as a set of sub-arrays, with as many as
hundreds of sub-arrays depending on the cache level.
An important problem in using Compute Caches is satisfying
the operand locality constraint.
Bit-line computing requires that the data operands are stored in rows that
share the same set of bit-lines. The paper proposes a cache geometry in which the ways
in a set are judiciously mapped to a sub-array, so that software can easily
satisfy operand locality. The design allows a compiler to ensure operand
locality simply by placing operands at addresses that are page-aligned (same page
offset). It avoids exposing the internals of a cache, such as its size or
geometry, to software.
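The operand locality rule can be expressed as a simple address check. The sketch below is illustrative only: it assumes a 4 KiB page size (a power of two, so that a bit mask extracts the offset), and the function names are hypothetical rather than part of the Compute Cache design.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; must be a power of two


def page_offset(addr, page_size=PAGE_SIZE):
    """Return the offset of addr within its page."""
    return addr & (page_size - 1)


def satisfies_operand_locality(addr_a, addr_b, page_size=PAGE_SIZE):
    """True when both operands share the same page offset."""
    return page_offset(addr_a, page_size) == page_offset(addr_b, page_size)


# Two operands on different pages but at the same offset (0x040).
print(satisfies_operand_locality(0x10000040, 0x20003040))
# Two operands at different offsets (0x040 and 0x080).
print(satisfies_operand_locality(0x1040, 0x2080))
```

Because the check depends only on the page offset, software needs no knowledge of the cache size or geometry, which matches the goal stated above.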
This paper seeks to bridge theory and practice. It takes a
principled approach that builds on insights from recent empirical designs.
First, it argues that policies should replace candidates by their economic value added
(EVA); i.e., how many more hits one expects from each candidate vs. the average
candidate. Second, it describes a practical implementation of this policy and shows
that it outperforms existing policies.
Two trends in recent research
are relevant. First, most empirical policies exploit dynamic behaviour
to select a victim, most commonly by using the recency and frequency heuristics.
Most policies employ some form of recency, favouring candidates that were
referenced recently: e.g., LRU uses recency alone, and RRIP predicts a longer
time until reference for older candidates. Similarly, a few policies that do
not assume recency still base their policy on when a candidate was last
referenced: PDP protects candidates until a certain age; and IRGD uses a
heuristic function of ages.
The other way policies exploit dynamic behaviour is through frequency, favouring
candidates that were previously reused: e.g., LFU uses frequency alone, and
“scan-resistant” policies like ARC and SRRIP favour candidates that have been
reused at least once. Second, recent high-performance policies adapt themselves
to the access stream to varying degrees. DIP detects thrashing with set
dueling, and thereafter inserts most lines at LRU to prevent thrashing. DRRIP
inserts lines at medium priority, promoting them only upon reuse, and avoids
thrashing using the same mechanism as DIP. SHiP extends DRRIP by adapting the
insertion priority based on the memory address, PC, or instruction sequence.
Likewise, SDBP, PRP, and Hawkeye learn the behavior of different PCs. And PDP
and IRGD use auxiliary monitors to profile the access pattern and periodically
recompute their policy.
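To make the "insert at medium priority, promote only upon reuse" idea concrete, here is a minimal sketch of an RRIP-style policy for a single cache set. It assumes 2-bit re-reference prediction values (RRPVs), insertion at the second-highest value, and promotion to zero on a hit. The class name `RRIPSet` and its parameters are hypothetical and are not taken from the papers discussed here.

```python
class RRIPSet:
    """One cache set managed with an RRIP-style policy (illustrative)."""

    def __init__(self, ways=4, rrpv_bits=2):
        self.ways = ways
        self.max_rrpv = (1 << rrpv_bits) - 1  # "distant" re-reference
        self.lines = []  # each entry is [tag, rrpv]

    def contains(self, tag):
        return any(line[0] == tag for line in self.lines)

    def access(self, tag):
        """Access a tag; return True on a hit, False on a miss."""
        for line in self.lines:
            if line[0] == tag:
                line[1] = 0  # promote only upon reuse
                return True
        if len(self.lines) < self.ways:
            # Free way: insert at medium priority.
            self.lines.append([tag, self.max_rrpv - 1])
            return False
        while True:
            # Evict the first line predicted to be re-referenced furthest out.
            for line in self.lines:
                if line[1] == self.max_rrpv:
                    line[0], line[1] = tag, self.max_rrpv - 1
                    return False
            # No such line yet: age every line and retry.
            for line in self.lines:
                line[1] += 1


s = RRIPSet()
s.access("A")
s.access("A")  # reuse promotes A
for tag in ["B", "C", "D", "E", "F", "G"]:  # a short scan
    s.access(tag)
print(s.contains("A"), s.contains("B"))
```

In this run the reused line A outlives the scanned lines B, C, and D, which is the scan-resistant behaviour described above.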
These two trends
show that (i) candidates reveal important information over time, and (ii) policies
should learn from this information by adapting themselves to the access
pattern. But these policies do not make the best use of the information they
capture, and prior theory does not suggest the right policy. The paper uses planning theory to
design a practical policy, EVA, that addresses these issues. EVA is intuitive and
inexpensive to implement. In contrast to most empirical policies, EVA does not
explicitly encode particular heuristics (e.g., frequency). Rather, it is a
general approach that aims to make the best use of limited information, so that
prior heuristics arise naturally when appropriate.
Since it is inadequate to consider either hit probability or
time until reference alone, there must be some way to reconcile them in a
single metric. In general, the optimal metric should satisfy three properties:
(i) it considers only future behaviour, since the past is a sunk cost; (ii) it
prefers candidates that are more likely to hit; and (iii) it penalizes
candidates that take longer to hit. These properties both maximize the hit
probability and minimize time spent in the cache, as desired. EVA achieves these
properties by viewing time spent in the cache as forgone hits, i.e., as the
opportunity cost of retaining lines. Candidates are thus ranked by their economic
value added (EVA), or how many hits the candidate yields over the “average candidate”.
EVA is essentially a cost-benefit analysis about whether a candidate’s odds of
hitting are worth the cache space it will consume.
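The cost-benefit view can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes ages are measured in accesses, that per-age hit and eviction counts are available, and that the opportunity cost of holding one line for one access equals the cache's hit rate divided by its number of lines. The function name `eva_by_age` is hypothetical.

```python
def eva_by_age(hits, evictions, num_lines):
    """Estimate EVA per age from hit and eviction counts indexed by age.

    EVA(a) = P(hit | reached age a) - cost_per_access * E[remaining lifetime | a]
    where cost_per_access is the assumed per-line hit rate.
    """
    events = [h + e for h, e in zip(hits, evictions)]
    total_events = sum(events)
    eva = [0.0] * len(events)
    if total_events == 0:
        return eva
    hit_rate = sum(hits) / total_events
    cost_per_access = hit_rate / num_lines  # assumed opportunity cost

    hits_above = 0      # hits at ages >= a
    events_above = 0    # hits + evictions at ages >= a
    lifetime_above = 0  # total remaining accesses for lines at ages >= a
    for a in range(len(events) - 1, -1, -1):
        # Lines that survive past age a each spend one more access in the cache.
        lifetime_above += events_above
        hits_above += hits[a]
        events_above += events[a]
        if events_above:
            eva[a] = (hits_above - cost_per_access * lifetime_above) / events_above
    return eva


# Toy histogram: most lines hit at age 1, the rest linger until eviction at age 3.
print(eva_by_age(hits=[0, 8, 0, 0], evictions=[0, 0, 0, 2], num_lines=4))
```

In the toy histogram, a candidate at age 1 still expects a hit and ranks high, while a candidate at age 2 has missed its chance and only costs space, so its EVA is negative.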
WHY EVA IS THE RIGHT METRIC
The previous section described and
motivated EVA. This section discusses why EVA is the right metric to maximize cache
performance under uncertainty. It first presents a naive metric that intuitively
maximizes the hit rate, but shows that it unfortunately cannot capture long-run behaviour.
Fortunately, prior work on Markov decision processes (MDPs) has encountered and
solved this problem, and the section shows how EVA adapts the MDP solution to cache
replacement.
Naive metric: A replacement metric is needed that
maximizes the cache’s hit rate. With perfect knowledge, MIN achieves this by
greedily “buying hits as cheaply as possible,” i.e., by keeping the candidates
that are referenced in the fewest accesses. Stated another way, MIN retains the
candidates that get the most hits per access. Therefore, a simple metric that
might generalize MIN is to predict each candidate’s hits per access, or hit
rate:
\[
\mathbb{E}[\text{hit rate}] = \lim_{T \to \infty} \frac{\text{Expected hits after } T \text{ accesses}}{T} \tag{7}
\]
and retain the candidates with the highest hit rate.
Intuitively, keeping these candidates over many replacements tends to maximize
the cache’s hit rate. The problem: Eq. 7 suffices when considering a single
lifetime, but it cannot account for candidates’ long-run behavior over many
lifetimes. To estimate the future hit rate, it is necessary to compute expected
hits over arbitrarily many future accesses. However, as cache lines are
replaced many times, they tend to converge to average behaviour because their
replacements are generally unrelated to their original contents. Hence, all
candidates’ hit rates converge in the limit. In fact, taking the limit as
T → ∞, all candidates’ hit rates converge to
the cache’s per-line hit rate, for all ages and classes.
In other words, Eq. 7 loses its discriminating power over long
time horizons, degenerating to random replacement. So while estimating
candidates’ hit rates is intuitive, this approach is fundamentally flawed as a
replacement metric. The solution in a nutshell: EVA sidesteps this problem by changing
the question. Instead of asking “which candidate gets a higher hit rate?”, EVA
asks “which candidate gets more hits?”
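The difference between the two questions can be seen in a small numeric sketch. It uses a simplified model that is an assumption, not a formula from the text: a candidate yields `first_hits` expected hits during its current lifetime of `lifetime` accesses, after which the line behaves like an average line and earns `per_line_rate` hits per access. All names and numbers are hypothetical.

```python
def long_run_hits(first_hits, lifetime, per_line_rate, horizon):
    """Expected hits from one cache line over `horizon` accesses."""
    later_accesses = max(horizon - lifetime, 0)
    return first_hits + later_accesses * per_line_rate


def long_run_hit_rate(first_hits, lifetime, per_line_rate, horizon):
    """Expected hits per access over `horizon` accesses."""
    return long_run_hits(first_hits, lifetime, per_line_rate, horizon) / horizon


PER_LINE_RATE = 0.05  # assumed average hits per access per line

for horizon in (10, 1_000, 1_000_000):
    good = (0.9, 10, PER_LINE_RATE, horizon)  # likely to hit soon
    bad = (0.1, 10, PER_LINE_RATE, horizon)   # unlikely to hit
    rate_gap = long_run_hit_rate(*good) - long_run_hit_rate(*bad)
    hits_gap = long_run_hits(*good) - long_run_hits(*bad)
    print(horizon, rate_gap, hits_gap)
```

As the horizon grows, the gap in hit rate shrinks toward zero while the gap in expected hits stays at about 0.8, so counting hits keeps the candidates distinguishable where comparing rates does not.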
The idea is that Eq. 7 compares the long-run hit rates of retaining
SHiP and PDP improve performance by correcting LRU’s flaws
on particular access patterns. Both perform well on libquantum (a scanning benchmark), sphinx3,
and xalancbmk. However, their performance varies considerably across
apps. For example, SHiP performs particularly well on perlbench, mcf, and cactusADM.
PDP performs particularly well on GemsFDTD and
lbm, where SHiP exhibits pathologies and performs similarly to
random replacement. EVA matches or outperforms SHiP and PDP on most apps and
cache sizes. This is because EVA generally makes the most of the available
information, so the right replacement strategies naturally emerge where
appropriate. As a result, EVA successfully captures the benefits of SHiP and
PDP within a common framework, and sometimes outperforms both. Since EVA
performs consistently well, and SHiP and PDP do not, EVA achieves the lowest
MPKI of all policies on average.
The cases where EVA performs slightly worse arise for two
reasons. First, in some cases (e.g., mcf at
1MB), the access pattern changes significantly between policy updates. EVA can
take several updates to adapt to the new pattern, during which performance
suffers. But in most cases the access pattern changes slowly, and EVA performs
well. Second, our implementation coarsens ages, which can cause small
performance variability for some apps (e.g., libquantum).
EVA edges closer
to optimal replacement: The evaluation compares the practical policies
against MIN, showing the average gap over MIN across the most memory-intensive
SPEC CPU2006 apps, i.e., each policy’s MPKI minus MIN’s at equal area. One
would expect a practical policy to fall somewhere between random replacement
(no information) and MIN (perfect information). But LRU actually performs worse than
random at many sizes because private
caches strip out most temporal locality before it reaches the LLC, leaving
scanning patterns that are pathological in LRU. In contrast, both SHiP and PDP
significantly outperform random replacement. Finally, EVA performs best. On
average, EVA closes 57% of the random-MIN MPKI gap. In comparison, DRRIP (not
shown) closes 41%, SHiP 47%, PDP 42%, and LRU –9%.
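The claim that LRU can do worse than random replacement on scanning patterns is easy to reproduce on a toy trace. The sketch below is an illustration with made-up sizes (a 4-line fully associative cache and a cyclic scan over 5 lines), not a model of the evaluated LLC; the function names are hypothetical.

```python
import random
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Count hits for a fully associative LRU cache."""
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[addr] = True
    return hits


def random_hits(trace, capacity, seed=0):
    """Count hits for a fully associative cache with random replacement."""
    rng = random.Random(seed)
    cache = []
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
        elif len(cache) < capacity:
            cache.append(addr)
        else:
            cache[rng.randrange(capacity)] = addr  # evict a random line
    return hits


def cyclic_scan(num_lines, length):
    """A scan that repeatedly walks over num_lines distinct addresses."""
    return [i % num_lines for i in range(length)]


trace = cyclic_scan(num_lines=5, length=1000)
print("LRU hits:", lru_hits(trace, capacity=4))
print("Random hits:", random_hits(trace, capacity=4))
```

LRU always evicts the line that the scan will touch next, so it never hits on this trace, whereas random replacement keeps some useful lines by chance.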
EVA saves cache space: Because EVA
improves performance, it needs less cache space than other policies to achieve
a given level of performance. Fig. 12 shows the iso-MPKI total cache area of
each policy, i.e., the area required to match random replacement’s average MPKI
for different LLC sizes (lower is better). For example, a 21.5 mm² EVA cache
achieves the same MPKI as a 4MB cache using random replacement, whereas SHiP
needs 23.6 mm² to match this performance. EVA is the most area-efficient over
the full range. On average, EVA saves 8% total cache area over SHiP, the best
practical alternative. However, note that MIN saves 35% over EVA, so there is
still room for improvement, though some performance gap is unavoidable due to
the costs of uncertainty.
EVA achieves the best end-to-end performance: The evaluation reports
the IPC speedups over random replacement at 35 mm², the area of a 4MB LLC with
random replacement. Only benchmarks that are sensitive to replacement are
shown, i.e., benchmarks whose IPC changes by at least 1% under some policy. EVA
achieves consistently good speedups across apps, whereas prior policies do not.
SHiP performs poorly on xalancbmk, sphinx3, and lbm, and PDP performs poorly on
mcf and cactusADM. Consequently, EVA achieves the best speedup overall. Gmean
speedups on sensitive apps (those shown) are 8.5% for EVA, 6.7% for DRRIP (not shown),
6.8% for SHiP, 4.5% for PDP, and –2.3% for LRU.
EVA makes good use of additional state: A sweep over the number of tag bits
for different policies plots their
average MPKI at 4MB. (This experiment is not iso-area.) The figure shows the
best configuration on the right; EVA and PDP use idealized, large timestamps.
Prior policies achieve peak performance with 2 or 3 bits, after which their
performance flattens or even degrades. Unlike prior policies, EVA’s performance
improves steadily with more state, and its peak performance exceeds prior policies
by a good margin. With 2 bits, EVA performs better than PDP, similar to DRRIP,
and slightly worse than SHiP. Comparing the
best configurations, EVA’s improvement over SHiP is 1.8× greater than SHiP’s
improvement over DRRIP. EVA with 8 b tags performs as well as an idealized
implementation, yet still adds small overheads. These overheads more than pay
for themselves, saving area at iso-performance. With very few age bits, no
single choice of age granularity A works well for all applications. To make EVA
perform well with few bits, software adapts the age granularity using a simple
heuristic: if more than 10% of hits and evictions occur at the maximum age,
then increase A by one; otherwise, if less than 10% of hits occur in the second
half of ages, then decrease A by one. This heuristic rapidly converges to the
right age granularity across all evaluated applications. Only the tag-bit sweep uses this
heuristic, and it is disabled for all other results. Since EVA is most
area-efficient with larger tags, the design advocated (8 b tags) does not
employ this heuristic.
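The age-granularity heuristic described above can be sketched as a single update step. This is an illustrative reading of the rule, assuming per-age hit and eviction counters where the last index is the maximum age; the function name and the lower bound of 1 on A are assumptions that do not appear in the text.

```python
def adapt_age_granularity(granularity, hits, evictions, threshold=0.10):
    """Return the next age granularity A given per-age event counts."""
    total_hits = sum(hits)
    total_events = total_hits + sum(evictions)
    if total_events == 0:
        return granularity  # nothing observed, keep the current setting

    # Too many events saturate at the maximum age: ages are too fine.
    at_max_age = hits[-1] + evictions[-1]
    if at_max_age > threshold * total_events:
        return granularity + 1

    # Few hits land in the second half of ages: ages are too coarse.
    half = len(hits) // 2
    if sum(hits[half:]) < threshold * total_hits:
        return max(granularity - 1, 1)  # floor of 1 is an assumption

    return granularity


# Half of all events pile up at the maximum age, so A is increased.
print(adapt_age_granularity(4, hits=[0, 0, 0, 5], evictions=[0, 0, 0, 5]))
# All hits occur at the youngest age, so A is decreased.
print(adapt_age_granularity(4, hits=[10, 0, 0, 0], evictions=[0, 1, 0, 0]))
```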
The key challenge faced by practical replacement policies
is how to cope with uncertainty. Simple approaches like predicting time until
reference are flawed. The paper argues for replacement by economic value
added (EVA), starting from first principles and drawing on prior planning
theory. It further shows that EVA can be implemented with trivial hardware,
and that it outperforms existing high-performance policies nearly uniformly on
single and multi-threaded benchmarks.
Xplore Digital Library
on Maximizing Cache Performance Under Uncertainty by Nathan Beckmann and Daniel
Paper on Compute Caches by Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish
Narayanasamy, David Blaauw, and Reetuparna Das