A Brief History of Cache

Cache through the lens of history

Bill Tran
5 min readMay 5, 2021

Imagine you are in the middle of a Math exam where, to your surprise, you come across a really familiar question, which you think you have seen in the practice test. “Can’t be that easy,” you think; checking the question data and numbers, you find out that it’s exactly the question you have practiced, not even a word different. You go ahead and put the answer you memorize into the answer sheet without having to do the problem all over again.

This scenario illustrates the mechanism by which caching works: storing known data somewhere for faster access.

I. Historical Context

Processor — Memory gap through the years. Source: https://www.forrestthewoods.com/blog/memory-bandwidth-napkin-math/

CPUs have always been faster than memories. The imbalance has been preserved ever since, as both CPUs and memories have been enhanced over time. As it becomes possible to place more and more circuits on a chip over time, CPUs get even faster as engineers are using these new capabilities for pipelining and superscalar operations. Memory designers, on the other hand, tend to use new technology advance to increase the capacity of chips, not the speed, making the gap larger and larger. What effects does this hardware phenomenon impose in practice?

The problem is, when a CPU issues a command that requests memory access, it does not get the memory unit it wants right away. Instead, it has to wait for some number of CPU cycles for the memory to serve the request. The slower the memory, the longer the CPU has to wait.

II. Possible Solutions and their Drawbacks

There are two possible techniques for solving the problem of CPU — Memory disparity:

  • Read and fetch from the memory when being requested by the CPU but continue to execute and stall the CPU if it tries to use the memory units requested before they have arrived
  • Design computer systems that do not stall but instead require the compilers not to generate code that use memory units which haven’t arrived yet

Obviously, neither of these approaches actually tackle the problem effectively, as they both have some kinds of penalty. The longer the memory access time, the heavier the penalty.

Modern processors place overwhelming demands on the memory system, in terms of both latency (the delay in supplying an operand) and bandwidth (the amount of data supplied per unit of time) [1]. However, these two aspects are rarely seen in harmony. Many approaches increase bandwidth only at the cost of increasing latency, and vice versa.

III. Breakthrough

Storing data in the memory makes fetching inevitably slow. Even when engineers know how to make memories as fast as CPUs, to be able to run at full speed, they have to be located on the CPU chip.

However, this also invokes a question on economics. Putting a large memory on the CPU chip makes it bigger and more expensive, and thus, not market-suitable, not to mention that there are limits to how big a CPU can be. Large memory on the CPU makes it expensive, whereas none of this kind makes it underperforming. How about combining these aspects into a solution that takes the best of both worlds?

The most effective choice is to have a small amount of fast (on-CPU) memory AND a large amount of slow (off-CPU) memory to get the speed of the fast memory and the capacity of the large one at an affordable and market-friendly price. In this case, the small, fast memory is called cache (pronounced “cash”).

IV. Cache mechanism

Cache is logically between the CPU and Main memory. Source: [1]

The whole Caching idea is based on the locality principle, which observes that memory references made in a short period of time tend to only use a small proportion of the total memory [1]. Therefore, we can store this small chunk of memory somewhere outside the main memory for faster access and query.

The mechanism and process of caching is simple: the most frequently accessed memory words (a group of bits making up a unit of data) are kept in a cache. When the CPU need a certain word, it first looks in the cache. If the queried word is available in the cache, the CPU will grab it for usage and save itself a trip to the main memory. The situation when the CPU find what it needs in the cache is called a cache hit. The hit ratio is the fraction of references that can be served with a cache hit.

Otherwise, if and only if the CPU can’t find what it needs in the cache, it will go to the main memory, get the word as well as inserting it into the cache for possible future use. This case is called a cache miss.

If we denote the cache access time as c, the main memory access time as m, and the hit ratio as h, the mean access time t can be calculated by:

t = c + (1 - h)m

with 1 — h being the miss ratio.

It’s easy to see that if a significant amount of the words are in the cache, the average access time can be substantially reduced. If a word is read n times in a short time interval, the CPU will only have to reference to the slow memory once and to the fast memory n — 1times.

In computer systems, it has been acknowledged that memory are accessed not totally randomly. If the CPU references to a memory word at address N, it’s highly possible that the next memory reference would be one of N’s neighbors. Therefore, it’s common that when the CPU makes a reference to a word A, it and some of its neighbors are brought from the large and slow memory to the cache. This group of A and its friends are called a cache line, as they are located together as a block in the memory. When a cache miss occurs, the whole cache line is loaded from the main memory to the cache, not just the word being referenced.


Technology has always developed hand-in-hand with economics, and the story of balancing fast-slow memory sections to optimize pricing serves as a great example. Thanks for reading and see you in the next articles!


[1] Tanenbaum, Andrew S. Structured Computer Organization. Englewood Cliffs, N.J: Prentice-Hall, 1976. Print.