CPU design

The CPU is usually the most complicated part of a modern microcomputer. It consists of several important and intricate sections, many of which are not well understood even by seasoned computer engineers. Yet a CPU can actually be quite simple if you break it down into its fundamental component parts. The main parts of a CPU are essentially these:

Registers A CPU has several registers inside it, which are really very tiny memory locations that are referred to by a name, rather than by a number, as normal memory locations are. Of these registers, the most important is the instruction pointer (IP), a register which contains the address of the next memory location which the CPU should get an instruction from. The instruction pointer is sometimes also known as the program counter (PC). Another important register which almost every CPU has is the accumulator, a general-purpose holding space which the CPU uses to temporarily store values it is getting ready to do something with.

If you don't want to build your own registers, there are many chips in the famous 7400 series of chips which are already pre-constructed data registers. In his monumental Magic-1 project, Bill Buzbee used 74273 and 74374 chips for registers.

Instruction decoder and control matrix The most fundamental job of a CPU is to obey instructions. The CPU receives instructions from memory, and then acts on them. In order to use the instructions it receives, the CPU needs circuitry to translate each instruction (which is really just a string of 1s and 0s) into action. That circuitry consists of two fundamental parts: The instruction decoder, a system which triggers one of several possible action circuits based on which instruction is currently being processed; and the control matrix, which takes the output from the instruction decoder and activates a series of control signals based on which instruction is being executed. Note that not all CPUs use instruction decoders and control matrices; an alternative to this system is to simply use a ROM chip which outputs the control signals needed for each opcode. In this case, each instruction effectively becomes a memory location within the instruction ROM. However, in the construction of very simple CPUs, an instruction decoder and control matrix system is typically used because it makes it easier to see the cause-and-effect of the various parts of the CPU working together.

Timing circuitry Every CPU has a clock input. Every time this clock input goes through a cycle, the CPU goes through another cycle as well. The faster this clock signal is, the more cycles per second the CPU runs at. You can theoretically make a CPU run faster by increasing the speed of the clock input. In the real world, however, there is a limit to how fast signals can travel through the circuitry, and CPUs, being physical objects, are limited to certain speed ranges, mainly because of heat. The faster a CPU runs, the more heat it generates, and a CPU has a speed limit beyond which it is liable to overheat and break down.

Arithmetic Logic Unit (ALU) Many people think of a CPU as just a math machine: a calculator that performs arithmetic operations on numbers at high speed. While this is only part of the CPU's job, the CPU is indeed where numbers are added, subtracted, multiplied, and divided. This is specifically the job of the ALU, an important but distinct section within the CPU.

If you don't want to build your own ALU, there are a great many chips on the market which serve as pre-constructed ALUs so you don't need to learn how to do arithmetic at the circuit level. In his monumental Magic-1 project, Bill Buzbee used 74381 and 74382 chips for ALUs.
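
If you'd rather describe an ALU in a hardware description language (Verilog is introduced later on this page), here is a minimal sketch of one, assuming an 8-bit data width, a made-up 2-bit operation code, and only a few basic operations; it's purely illustrative and isn't modeled on any particular chip.

module alu(
  input  [7:0] a,       //first operand
  input  [7:0] b,       //second operand
  input  [1:0] op,      //operation select (encoding invented for this sketch)
  output reg [7:0] y,   //result
  output zero           //flag that goes high when the result is zero
);

always @(*)
  case (op)
    2'b00: y = a + b;   //add
    2'b01: y = a - b;   //subtract
    2'b10: y = a & b;   //bitwise AND
    2'b11: y = a | b;   //bitwise OR
  endcase

assign zero = (y == 8'd0);

endmodule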

Memory input/output CPUs read from and write to the computer's memory. A CPU has several physical wires that connect it to the RAM and ROM in the computer (mainly the address bus and the data bus), and through these wires, information is stored and retrieved by the CPU.


Now that we're familiar with some of the sub-systems of a CPU, let's go back and review the list again, this time taking a more detailed look at the structure of each sub-system and the technicalities of how it works.

Registers

Each of the registers in a CPU is essentially just a series of D flip-flops, with one flip-flop for each bit. The data inputs and outputs of these flip-flops are typically connected directly to the data bus. To make it possible to separate the inputs from the outputs, a tri-state buffer is usually put on each flip-flop's input and output line, and the "Enable" pins on the input buffers are then wired together to a single control input, while the Enable pins on the output buffers go to another control input. In this way, you can enable the inputs and outputs on all bits of a register with a single control line.
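
To make this concrete, here is a rough Verilog sketch of one such register, assuming an 8-bit data bus; the port names are made up, but the "load" and "enable" inputs correspond to the input-enable and output-enable control lines described below.

module cpu_register(
  input clk,
  input load,          //input enable: latch the bus into the register
  input enable,        //output enable: drive the stored value onto the bus
  inout [7:0] dbus     //shared data bus
);

reg [7:0] q;           //the D flip-flops themselves

always @(posedge clk)
  if (load)
    q <= dbus;         //capture whatever is on the data bus

assign dbus = enable ? q : 8'bz;  //tri-state buffers: drive the bus only when asked

endmodule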

There are four fundamental registers that usually exist in even the simplest CPUs. Two have already been mentioned: The instruction pointer/program counter and the accumulator. Another two important registers are the MAR (Memory Address Register), which stores memory addresses that will later be placed on the address bus, and the instruction register, which stores an instruction that has been fetched from memory. These registers might seem superfluous at first glance, but they are needed because a CPU does things step by step. You might think that instead of using a register to hold memory addresses, it would be simpler and faster to just pipe the memory address directly onto the address bus. The answer is yes, that would be simpler, but it would not work, because while a memory address is being read out of memory, the address bus is already busy indicating the location that it is being read from. To illustrate this by example, suppose that you are reading in the instruction LDA $1FF (an instruction to load the accumulator with the contents of memory address $1FF) from memory location $3FF. For this to happen, the address bus must have $3FF on it, because that is what causes the memory to produce the bytes that constitute the LDA $1FF instruction. Since the CPU is currently looking at that location in memory, you can't simply dump the address $1FF onto the address bus; you must wait until the address bus is free. Therefore, a MAR is used to temporarily store memory addresses until the address bus is no longer needed for any other purpose and they can actually be used. Similarly, when the CPU fetches an instruction from memory, it cannot start executing that instruction instantly; it must first finish whatever else it is doing, and since the very act of pulling the instruction out of memory constitutes activity, a separate instruction register is used to hold the instruction until the CPU is ready to deal with it.

Since all of these registers get their data over the same bus (the data bus), most of them will need two "Enable" signals: One for their input, and one for their output. Through these Enable wires, it's possible to ensure that only one device places data on the data bus at a time. Since there are several registers and some will have multiple control lines, it makes sense to name these control lines and give the names abbreviations, so that you can easily and specifically refer to any one control line. These names can be anything you want (after all, when you design something, you get to name the various parts of your creation), but for the purposes of this page some standardization is needed so that you'll know what I'm talking about, so let's establish a simple code to refer to all these control lines. Below are the names I'll be using within this document; the actual names you use in your own CPU are up to you. Note that some of these names are semi-standard, and you might see them (or something very similar) in other documentation describing CPU design.

The accumulator's input enable signal might be called LA (Load Accumulator), while the accumulator's output enable signal might be called EA (Enable Accumulator).

The program counter's output enable signal might be called EP (Enable Program counter).

The MAR's input enable signal might be called LM (Load MAR).

The instruction register's input enable signal might be called LI (Load Instruction register), while the instruction register's output enable signal might be called EI (Enable Instruction register).

To summarize, let's list the control lines we've created to turn the registers' inputs and outputs on or off:

EA: Accumulator Output
EI: Instruction register Output
EP: Program counter Output
LA: Accumulator Input
LI: Instruction register Input
LM: MAR Input

Instruction decoder

A "decoder" is actually a generic name for a relatively simple digital logic device. A decoder has a certain number of binary inputs; if we call the number of inputs n, then the decoder has 2^n outputs. The idea is that for every possible combination of inputs, a single output line is activated. For example, in a 2-to-4 decoder (a decoder with 2 inputs and 4 outputs), you have 4 possible input combinations: 00, 01, 10, or 11. Each of these input combinations will trigger a single output wire on the decoder.

An instruction decoder is simply this concept applied to CPU opcodes. The opcodes of a CPU are simply unique binary numbers: A specific number is used for the LDA instruction, while a different number is used for the STA instruction. The instruction decoder takes the electric binary opcode as input, and then triggers a single output circuit. Therefore, each opcode that the CPU supports actually takes the form of a separate physical circuit within the CPU.
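
In Verilog, a fragment of an instruction decoder might look like the sketch below; the opcode width and the opcode values are invented purely for illustration.

module instruction_decoder(
  input  [7:0] opcode,  //from the instruction register
  output lda,           //active when an LDA instruction is being processed
  output sta            //active when an STA instruction is being processed
);

assign lda = (opcode == 8'h00);  //one output circuit per opcode
assign sta = (opcode == 8'h01);
//...and so on, one output for every opcode the CPU supports

endmodule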

Ring counter

If you're good at thinking ahead with logic, you may have already anticipated an interesting problem that arises using the aforementioned instruction decoder: CPU instructions take more than one step to perform. Whether you're loading the accumulator with some value, storing something in memory, or sending data over an I/O line, almost every CPU opcode takes several steps. How can you perform these steps sequentially through the triggering of a single output? The answer lies partially with a key component of CPU control: The ring counter. This device often goes by other names (partly because "ring counter" is actually a generic name for the device that can be used in contexts that are not specific to CPUs), but I'll continue to call it by this name within this context.

A ring counter is somewhat different from a regular digital counter. To review: A regular counter is simply a series of flip-flops that are lined up in such a way that every time the counter's input triggers, the output of the counter goes up by 1. A 4-bit counter has 4 binary outputs, and will count from 0000 to 1111; after that it will loop around to 0000 and start over again. In contrast, a ring counter only has one single output active at any time. Every time the ring counter's input triggers, the active output moves along. The active output simply cycles through a rotating circle of possible locations, hence the name "ring" counter.
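
Here is a rough Verilog sketch of a 4-stage ring counter, including the reset input discussed a little further on; the width and the names are arbitrary.

module ring_counter(
  input clk,
  input reset,              //force the counter back to the first step
  output reg [3:0] step     //exactly one of these outputs is active at a time
);

initial step = 4'b0001;     //power-up state (for simulation)

always @(posedge clk)
  if (reset)
    step <= 4'b0001;                //start over at step 1
  else
    step <= {step[2:0], step[3]};   //rotate the active bit to the next output

endmodule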

As mentioned, every CPU has a clock input. In a simple CPU like the one we're describing, essentially all the clock input does is drive this ring counter. The ring counter, together with the instruction decoder, takes care of sequencing all the activity within the CPU.

Among different CPUs, ring counters tend to have varying numbers of outputs; the number of outputs needed really depends on how many steps the CPU takes to perform each instruction. Since you can't create new outputs out of thin air (they are actual physical wires), the ring counter needs to have as many outputs as there are steps in the lengthiest CPU instruction. Any instruction which doesn't need that many steps could "waste" cycles by letting the ring counter circle all the way around, but this is obviously inefficient, so a separate "reset" input usually exists on the ring counter so that it can start counting from the beginning as soon as the current instruction has finished.

If you don't want to make your own ring counter circuit, there are two chips in the famous 4000 series of chips which act as ring counters: The 4017 is a 10-stage ring counter, while the 4022 is an 8-stage ring counter.

Having made it this far, we now have an instruction decoder that will give us one active wire representing which opcode the CPU is running, and a ring counter that allows us to proceed step-by-step through this instruction process. Obviously, we need some way of merging these control signals so that the actual work can get done. Where does this take place? In the most hairy and complex realm of the CPU: The control matrix.

Control matrix

The control matrix is where things really get crazy inside a CPU. This is where each output from the instruction decoder and the ring counter meet in a large array of digital logic. The control matrix is typically made of many logic gates which are wired in a highly architecture-specific way to activate the correct control lines that are needed to fulfill the opcode being processed.

For example, suppose the instruction decoder has activated the LDA FROM MEMORY output, meaning the CPU has been given an LDA instruction to load the accumulator with the value from a specific memory address. The first step in executing this instruction might be to increment the instruction pointer so the CPU can retrieve the desired memory address from memory. Thus, the LDA FROM MEMORY output from the instruction decoder and the first output from the ring counter might go to an AND gate, which of course will then trigger only during the very first step of an LDA FROM MEMORY instruction. The output of this AND gate might go directly to the count input on the instruction pointer, causing it to increment. The second step in executing this instruction might be to load the MAR with the desired memory address, which the instruction pointer is now helpfully pointing at, so perhaps the LDA FROM MEMORY output and the second ring counter output meet at another AND gate which goes directly to LM. And so on. This is how instructions are performed inside a CPU.
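
A tiny Verilog sketch of this AND-gate arrangement is shown below. The names are assumptions: lda_mem stands for the LDA FROM MEMORY decoder output, step for the ring counter outputs, lm for the LM (Load MAR) line from the list above, and cp is a made-up name for the line that increments the instruction pointer.

module control_matrix(
  input lda_mem,       //LDA FROM MEMORY output of the instruction decoder
  input [3:0] step,    //ring counter outputs
  output cp,           //increment the instruction pointer
  output lm            //load the MAR
);

assign cp = lda_mem & step[0];  //step 1 of LDA: bump the instruction pointer
assign lm = lda_mem & step[1];  //step 2 of LDA: latch the operand address into the MAR
//In a full control matrix, each control line is the OR of every
//(opcode, step) combination that needs it, across all the opcodes.

endmodule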

In smaller hand-made CPUs, the control matrix is often made of an array of diodes and wires, but this quickly becomes impractical for a CPU with more than a dozen or so instructions. Therefore, the control matrix is often actually implemented as a ROM/PROM/EPROM/EEPROM chip. In this arrangement, each "address" of the ROM chip is formed from the opcode bits together with the current step number, and the data stored at each "address" is the correct combination of control outputs. You can then quickly create and reconfigure your control logic by simply reprogramming the ROM chip. This kind of programming is the lowest-level programming possible in a general computer, and is frequently called "microprogramming," while the code that goes into the ROM is usually called "microcode."
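
Here is a rough sketch of the ROM approach in Verilog, assuming a 4-bit opcode, a 3-bit step count, and an 8-bit control word; the sizes and contents are placeholders you would fill in for your own instruction set.

module control_rom(
  input  [3:0] opcode,    //from the instruction register
  input  [2:0] step,      //from the step counter
  output [7:0] controls   //one bit per control line (LA, EA, LM, and so on)
);

reg [7:0] rom [0:127];    //2^(4+3) microcode words

initial begin
  rom[0] = 8'b00000000;   //the real control words ("microcode") go here
  //...
end

assign controls = rom[{opcode, step}];  //opcode and step together form the address

endmodule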


Verilog CPUs

It's fun to create a simple CPU out of a series of electronic components on a breadboard or something similar. However, today people who want to design their own CPU typically do so through software, using a hardware description language like Verilog. Since Verilog is a very powerful language that makes it surprisingly easy to design and model the complex workings of a computer's CPU, it makes sense to examine how we can write a piece of Verilog code that will act like a CPU.

Since this is a section on CPU architecture specifically, we won't be going into the details of Verilog syntax here. If you want to learn a bit more about coding with Verilog, check out my own Verilog section, or another learning resource; several excellent books and websites about Verilog exist, from which you can learn a lot.

Since all the CPU really wants to do is perform instructions, and it needs to check the instruction pointer to know where to look for the next instruction, let's start with the instruction pointer. Assuming we have a 16-bit address bus (as most early microcomputers did) and we can run an instruction from any location in memory, we'll need a 16-bit instruction pointer. We can declare it like this:

reg [15:0] IP;

Because the IP gets its instructions from memory, we'll also need some way of accessing memory. In modern computer design, the memory is often integrated into the same programmable chip as the CPU, so that a single chip becomes the entire computer (a system known as a SOC, or System On a Chip). However, for this discussion, we're only making a true CPU, not an entire SOC, so the device we're making has no memory of its own and thus needs external memory buses. The buses are actual, physical wires, so we can define them as such:

output [15:0] abus; //address bus
inout [7:0] dbus; //data bus

Now we have a way of specifying a memory address: The abus output.

Different CPUs deal in different ways with the problem of deciding where in memory to start getting instructions. Some CPUs have a specific address that they always start executing instructions from; others use a reset vector, a location in memory that stores another memory address from which to begin executing instructions. Some people may argue the virtues of one system over another, but for our current discussion, let's keep things simple and just assume that our CPU begins executing instructions from memory address 0 and proceeds from there. To do this, we'll first need to set our instruction pointer to 0 and then place it on the address bus, so that our memory chips will output whatever instruction is at address 0.

IP = 0;

This line initializes our instruction pointer to its first memory location. From here on, we will need to get our instructions from memory by setting the address bus to reflect whatever is stored in the instruction pointer.

abus = IP;

Assuming the interface between the CPU and the memory now works properly, the data bus should currently be reflecting whatever is at the specified memory address. If the memory has been programmed correctly, this is an opcode, a binary number which corresponds to a specific instruction that the CPU can perform. The CPU needs to be able to act on this opcode and select a course of action based on exactly what that opcode is. This is the job for the instruction decoder, which in Verilog can be programmed quite easily using the case statement.

case(dbus)
  opcode1: ; //code for opcode 1 goes here
  opcode2: ; //code for opcode 2 goes here
endcase

Here's where we need to get creative. We need to invent our own opcodes. Every CPU has a set of opcodes, and each opcode has a specific number associated with it. Since we're not making a world-class CPU right now, we can start with just a very basic set of instructions for the time being. Let's say that the two most fundamental opcodes a CPU can have are LOAD and STORE: An opcode to load the CPU's accumulator with a value from a specific memory address, and an opcode to store the value presently in the accumulator into a specific memory address. (BURY and DISINTER, as Cryptonomicon's Lawrence Pritchard Waterhouse called them.) Since we want to keep things simple, let's furthermore assign the numbers 0 and 1 to opcodes LOAD and STORE, respectively. So, we'd modify the case statement above to look something like this:

case(dbus)
  0: //LOAD instruction
    begin
      //code for the LOAD instruction goes here!
    end
  1: //STORE instruction
    begin
      //code for the STORE instruction goes here!
    end
endcase

Obviously, we need to fill in the actual code to perform the opcodes, but we can get to writing the actual code for the CPU instructions later.

Tying it all together, then, all you really need to make a CPU in Verilog is the initial register and pin declarations, a reset vector (or at least some set location in memory where instruction execution will begin), and then just the code for all of the CPU's individual opcodes. That's really it. You can make a working CPU that simply. You can connect it to a ROM and put some small program in the ROM to test your CPU.

Below is a general idea of what your CPU code might look like overall. This code is obviously just a rough draft and you'll want to improve on it if you plan to use your CPU for anything. This is just to give you an idea of what can be done and how to do it. Be creative!

module cpu(
  input clk,               //CPU clock input
  output reg [15:0] abus,  //address bus
  inout [7:0] dbus,        //data bus
  output reg RW            //read/write output so the memory knows whether to read or write
);

reg [15:0] IP;   //instruction pointer
reg [7:0] A;     //accumulator
reg [7:0] dout;  //value the CPU wants to put on the data bus
reg drive;       //1 while the CPU is driving the data bus

//the CPU only drives the data bus during a write; otherwise it leaves
//the bus floating so the memory can drive it
assign dbus = drive ? dout : 8'bz;

initial begin
  IP = 0;        //initial location of instruction execution
  drive = 0;
end

//NOTE: for clarity each instruction is shown as one burst of sequential steps;
//a real design would spread these steps over several clock cycles
always @(posedge clk) //main CPU loop starts here
  begin

  RW = 0;     //set read/write low so it reads from memory
  drive = 0;  //stop driving the data bus while reading
  abus = IP;  //put IP on the address bus, so that the memory produces
              //the next instruction on the data bus

  case(dbus)
    0: //LOAD instruction
      begin
        IP = IP + 1; //increment IP so it points to the operand
        abus = IP;   //examine the memory location it points to
        A = dbus;    //load the accumulator with the value found there
        IP = IP + 1; //increment IP so it points to the next instruction
      end
    1: //STORE instruction
      begin
        IP = IP + 1; //increment IP so it points to the operand
        abus = IP;   //examine the memory location IP points to
        RW = 1;      //set read/write high so it writes to memory
        drive = 1;   //take over the data bus
        dout = A;    //send the accumulator onto the data bus
        IP = IP + 1; //increment IP so it points to the next instruction
      end
  endcase

  end

endmodule
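
As a starting point for such a test, here is a sketch of a tiny ROM holding a small program written with the made-up opcodes above (0 = LOAD, 1 = STORE); the module, the program, and the addresses are all illustrative only.

module test_rom(
  input  [15:0] abus,   //address bus from the CPU
  input  RW,            //read/write line from the CPU (low = read)
  inout  [7:0] dbus     //shared data bus
);

reg [7:0] mem [0:255];  //256 bytes; only the low 8 address bits are decoded here

initial begin
  mem[0]  = 8'd0;   //LOAD...
  mem[1]  = 8'd16;  //...using location 16
  mem[2]  = 8'd1;   //STORE... (a ROM can't actually be written, but this exercises the bus)
  mem[3]  = 8'd17;  //...using location 17
  mem[16] = 8'd42;  //the value the test program will load
end

assign dbus = (RW == 1'b0) ? mem[abus[7:0]] : 8'bz;  //release the bus during writes

endmodule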
