Disclaimer This is but a simple story that roughly approximates how modern computers work. As a data scientist, I find it a useful mental model.
In plain terms, the CPU is the part of the computer that carries out the computations themselves. It does so in a stream of discrete operations, called CPU instructions. Roughly, one instruction is computed for each clock cycle: a 3 GHz CPU executes 3 billion instructions each second. CPUs are fast.
The RAM is the (temporary) memory of the computer, where the data lives. We can picture it as a grid of buckets, each of which can contain 1 byte of information. (From now on, I will use the terms “bytes” and “buckets” interchangeably, depending on whether I want to emphasize to the metaphor or the data.) The grid aspect of it is important:
- The buckets are arranged into rows of 64, called cache lines.
- The buckets have a natural ordering, hence can be univocally identified by enumerating them, their memory address.
One byte, 8 bits, can take 256 different values. In consequence, it can store an integer from 0 to 255. If we need to store a larger value, we need to use multiple bytes.
The CPU can interact with the RAM in a limited number of ways:
- Load data
- Store data
- Increment/decrement the index
- Jump to a part of the program
When we execute a program, it interacts with the operating systems kernel to handle its resources. The memory assigned to a program is split into two components: the stack and the heap. They work differently, and serve different purposes.
The stack is the memory that serves as scratch space for the program. It handles function calls, local variables and context. It appropriately works as a stack, as a Last-In-First-Out (LIFO) structure with two operations: push and pop. For instance, when a function is called, a new block (a frame) is pushed on top of the stack. This frme will contain local variables and other information. When the function returns, the frame is popped.
The heap is the memory set aside for memory allocation: when a program needs more memory, it places a request to the kernel, which will allocate the requested amount from the heap, i.e., reserve it for the program’s use. Once the program does not need that chunk anymore, it will deallocate it, i.e., hand it back to the operating system. Unlike the stack, allocations and deallocations on the heap do not follow a specific order. Instead, it is the task of the program to keep track of what data is stored where, which parts are still used, and which ones are not and can be deallocated. Specifically, many programming languages have a routine called the garbage collector. The garbage collector monitors which pieces of data won’t be needed anymore, and periodically deallocates them. Note that allocations, deallocations and garbage collector’s runs are expensive.
Cache and prefetching
Reading a byte directly from RAM takes around 500 CPU cycles. However, this can be sped up by copying the data to the CPU’s cache, which is closer to the CPU and faster. However, the cache can only hold a few kilobytes. When the CPU needs a piece of data, it first checks if it is already available in the cache. In that case, retrieving it just takes a couple of cycles. If it is not, we are in a situation known as cache miss, in which the program can’t proceed until the required data is retrieved. The amount of time lost in a single cache miss is minuscule. However, if cache misses are common in a program, they can quickly add up.
CPUs have a component, called prefetcher, which tries to mitigate cache misses. To achieve that, it actively tries to predict which pieces of data the CPU will need in the near future, and preemptively copies them to the cache. For instance it assumes that data that lives together works together: when it needs data stored in a particular memory address it fetches the whole cache line. Also, when data is accessed in a predictable manner, the prefetcher will learn and exploit this pattern.
Registers and SIMD
The registers are the small sized slots on which the CPU acts on. Traditionally they had 8 bytes in size, just enough for one 64-bit integer or float. For instance, adding two such numbers requires the CPU to use two registers. Modern CPUs, however, have specialized larger registers, of up to 64 bytes. While they can hold massive 512-bit numbers, it is more interesting to have them hold vectors, e.g., of eight 64-bit numbers or sixteen 32-bit numbers. This unlocks efficient vectorization operations, or single instruction, multiple data (SIMD).