In discussing the organization of the cache hierarchy, we have shown possible ways of reducing the latency of cache misses, along with a couple of proposals that attempt to improve the use of the cache storage.
This section presents a theoretical analysis of cache misses in the left-looking and multifrontal sparse Cholesky factorization algorithms. On the matrices we consider, the left-looking algorithm incurs more level-1 cache misses but fewer level-2 misses. Analyzing both inter- and intra-supernode cache misses simultaneously is hard, because sparse-matrix factorizations are irregular computations.
We measure cache efficiency by the data reuse ratio, which is the ratio of the total number of operations that an algorithm performs to the number of cache misses that it generates.
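As a concrete illustration of this metric, the sketch below pairs an operation count with a miss count from a minimal direct-mapped cache model. The cache parameters, array size, and access trace are all invented for the example, not drawn from the analysis in this section.

```python
# Illustrative (hypothetical) computation of the data reuse ratio:
# total operations divided by cache misses, with misses counted by a
# minimal direct-mapped cache model over word addresses.

LINE_WORDS = 8          # words per cache line (assumed)
NUM_LINES = 512         # lines in the cache (assumed)

def simulate_misses(addresses):
    """Count misses of a direct-mapped cache on a word-address trace."""
    tags = [None] * NUM_LINES
    misses = 0
    for addr in addresses:
        line = addr // LINE_WORDS
        idx = line % NUM_LINES
        if tags[idx] != line:    # tag mismatch: this access misses
            tags[idx] = line
            misses += 1
    return misses

def reuse_ratio(operations, misses):
    """Operations performed per cache miss generated."""
    return operations / misses

# Example: sweep an array of n words twice (one operation per word).
# The array exceeds the cache, so the second sweep misses on every line.
n = 10_000
trace = list(range(n)) * 2
ops = 2 * n
m = simulate_misses(trace)
print(ops, m, reuse_ratio(ops, m))   # 20000 2500 8.0
```

With 8-word lines, each line is reused for 8 consecutive words, so the reuse ratio comes out to the line length: a higher ratio indicates that each miss brings in data that serves more useful work.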
For every element in the update matrix of supernode j, the algorithm may generate up to 3 cache misses to read the element's value and its row and column indices, and up to 4 cache misses to update the destination element in the frontal matrix of the parent.
The number of cache misses and floating-point operations associated with reading elements of A and adding them to the frontal matrix is the same, and is bounded by the O(λ_j) cache misses generated by reading elements of A, writing elements of L, and updating the parent of j.
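The per-element miss counting above can be made concrete with a sketch of the extend-add scatter that moves a child's update matrix into the parent's frontal matrix. The data layout, function name, and index maps here are assumptions for illustration, not a specific implementation from the algorithms analyzed.

```python
# Hypothetical sketch of the extend-add step in a multifrontal
# factorization: each entry of a child's update matrix is scattered
# into the parent's frontal matrix via global row/column indices.

import numpy as np

def extend_add(update, rows, cols, frontal, row_map, col_map):
    """Scatter a dense update matrix into the parent's frontal matrix.

    update  : dense update matrix of the child supernode
    rows    : global row indices of update's rows
    cols    : global column indices of update's columns
    frontal : dense frontal matrix of the parent
    row_map : global row index -> local row in `frontal`
    col_map : global column index -> local column in `frontal`
    """
    for i in range(update.shape[0]):
        gi = rows[i]            # reading the row index may miss
        for j in range(update.shape[1]):
            gj = cols[j]        # reading the column index may miss
            v = update[i, j]    # reading the value may miss (up to 3 so far)
            # locating and updating the destination element can add
            # further misses (the "up to 4" on the write side)
            frontal[row_map[gi], col_map[gj]] += v
    return frontal

# Tiny usage example with invented indices.
upd = np.array([[1.0, 2.0], [3.0, 4.0]])
F = np.zeros((3, 3))
extend_add(upd, rows=[0, 2], cols=[0, 2], frontal=F,
           row_map={0: 0, 2: 2}, col_map={0: 0, 2: 2})
print(F)
```

The comments mark where each of the potential misses arises: three on the read side (value plus two indices) and up to four more when the destination entry and its addressing structures are touched.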
In most designs, directory entries are stored in main memory together with the memory lines they are associated with, which puts the cycles needed to access main memory on the critical path of cache misses.
In , cache misses in cc-NUMA multiprocessors are first classified in terms of the actions the directory performs to satisfy them; a novel node architecture is then proposed that makes extensive use of on-processor-chip integration to reduce the latency of each miss class.
The hierarchical nature of its design and its limited scale make it feasible to connect the handful of nodes with simple interconnects, such as a crossbar switch. By exploiting the extra ordering properties of the switch, certain cache misses can be handled more efficiently than in traditional directory-based multiprocessors.
2] use trace-driven simulation to study cache interference in the context of coarse-grained multithreading, which executes a single thread and switches to another thread on L2 cache misses.
In their experiments, they find that the wasted issue cycles due to instruction cache misses grow from 1% with one thread to 14% with 8 threads in a multiprogrammed workload.
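The switch-on-miss policy described above can be sketched with a toy scheduler: a thread issues instructions until it takes an L2 miss, at which point the processor switches to the next ready thread. The traces, miss penalty, and round-robin thread selection are all invented assumptions for this illustration.

```python
# Hypothetical sketch of coarse-grained ("switch-on-miss")
# multithreading: execute one thread until it misses in the L2,
# then switch to another ready thread.

MISS_PENALTY = 10  # cycles a thread stalls after an L2 miss (assumed)

def run_switch_on_miss(traces):
    """Each trace is a list of booleans: True = instruction misses in L2.

    Returns the schedule as a list of (cycle, thread_id) issue events.
    """
    pcs = [0] * len(traces)        # per-thread program counters
    ready_at = [0] * len(traces)   # cycle at which each thread is ready
    schedule = []
    cycle = 0
    cur = 0
    while any(pc < len(t) for pc, t in zip(pcs, traces)):
        # Pick the next ready, unfinished thread, round-robin from cur.
        for k in range(len(traces)):
            t = (cur + k) % len(traces)
            if pcs[t] < len(traces[t]) and ready_at[t] <= cycle:
                cur = t
                break
        else:
            cycle += 1             # all threads stalled: idle cycle
            continue
        miss = traces[cur][pcs[cur]]
        schedule.append((cycle, cur))
        pcs[cur] += 1
        if miss:                   # L2 miss: stall this thread, switch out
            ready_at[cur] = cycle + MISS_PENALTY
        cycle += 1
    return schedule

# Two threads; thread 0 misses on its second instruction and is
# switched out while thread 1 runs.
sched = run_switch_on_miss([[False, True, False], [False, False]])
print(sched)   # [(0, 0), (1, 0), (2, 1), (3, 1), (11, 0)]
```

The schedule shows the intended behavior: thread 0's miss at cycle 1 hands the pipeline to thread 1, and thread 0 resumes only after its miss penalty expires, which is what lets this scheme hide long-latency misses behind other threads' work.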
The instruction cache support is user-transparent, provides hardware cache management, and incurs no access penalty on cache misses.
Other high-end features include a fully associative level-three cache, which improves performance on level-two cache misses, and a processor-agility feature that allows Pentiums of different clock speeds to be installed and used simultaneously, providing users with a true growth path and investment savings as faster Pentiums become available.