Please consult my [GitHub](https://pdsmart.github.io) website for more up-to-date information.

The ZPU is a 32bit stack-based microprocessor designed by Øyvind Harboe of [Zylin AS](https://opensource.zylin.com/); the original documentation can be found on the [Zylin/OpenCore website or Wikipedia](https://en.wikipedia.org/wiki/ZPU_\(microprocessor\)). It is intended for FPGA embedded applications, minimising logic element and BRAM usage at the expense of execution speed. Zylin produced two designs which it made open source, namely the Small and Medium ZPU versions. Additional designs, such as the Flex and ZPUino variations, were produced by external developers, each offering enhancements to the original design such as a Wishbone interface, performance etc.

This document describes another design, which I like to call the ZPU Evo(lution) model, whose focus is on performance, connectivity and instruction expansion. It came about as I needed a CPU for an emulator of a vintage computer I am writing, where it would act as the IO processor providing Menu, Peripheral and SD services.

An example of the performance of the ZPU Evo can be seen using CoreMark, which returns a value of 19.1 @ 100MHz on Altera fabric using BRAM, and Dhrystone, which returns 11.2 DMIPS.

Connectivity can be seen via the implementation of both System and Wishbone buses, allowing for connection of many opensource IP devices.

Instruction expansion can be seen by the inclusion of a close coupled L1 cache where multiple instruction bytes are sourced and made available to the CPU, which in turn can be used for optimization (ie. up to 5 IM instructions executed in 1 cycle) or for extended multi-byte instructions (ie. implementation of a LoaD Increment Repeat instruction).

There is room for a lot more improvements such as a stack cache, SDRAM to L2 burst mode and parallel instruction execution (ie. and + neqbranch), which are on my list.

# The CPU

The ZPU Evo follows on from the ZPU Medium and Flex and areas of the code are similar, for example the instruction decoding.
The design differs though due to caching and the implementation of a Memory Transaction Processor (MXP) through which all Memory/IO operations (except for direct Instruction reads if the dual-port instruction bus is enabled) are routed. The original CPUs all handled their memory requirements in-situ as part of the state machine, whereas the Evo submits a request to the MXP whenever a memory operation is required.

The following sections indicate some of the features and changes to the original ZPU designs.

### Bus structure

The ZPU has a linear address space with all memory and IO devices directly addressable within this space. Existing ZPU designs provide either a system bus or a wishbone bus, whereas the Evo provides both.

The ZPU Evo creates up to two distinct regions within the address space depending on configuration, to provide a *system bus* and a *wishbone bus*. All models have the system bus instantiated, which starts at cpu address 0 and extends up to the limit imposed by the configurable maximum address bit (ie. 0x000000 - 0xFFFFFF for 24bit). A dedicated memory mapped IO region is set aside at the top of the address space (albeit it could quite easily be in any location), ie. 0xFF0000 - 0xFFFFFF.

If configured, a wishbone bus can be instantiated and this extends the maximum address bit by 1 (ie. 0x1000000 - 0x1FFFFFF for the 24bit example). This in effect creates 2 identical regions, the lower being controlled via the system bus, the upper via the wishbone bus. As per the system bus, the upper area of the wishbone address space is reserved for IO devices.

A third bus can be configured, which is for instruction reads only. This bus typically shadows the system bus in memory region but is expected to be connected to fast access memory for reading of instructions without the need for L2 Cache. This would typically be the 2nd port of a dual-port BRAM block with the 1st port connected to the system bus.
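
The region layout described above can be sketched as a small address decoder in C. This is only an illustration, assuming the 24bit example from the text with a 64KB IO window at the top of each region; the names `decode_address`, `bus_t` and `decoded_addr_t` are hypothetical and are not taken from the ZPU Evo RTL or software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only, taken from the 24bit example in the text. */
#define MAX_ADDR_BIT   24u                    /* address bits per bus region          */
#define REGION_SIZE    (1u << MAX_ADDR_BIT)   /* 0x1000000 bytes per region           */
#define IO_WINDOW_SIZE 0x10000u               /* 0xFF0000 - 0xFFFFFF within a region  */

typedef enum { BUS_SYSTEM, BUS_WISHBONE, BUS_NONE } bus_t;

typedef struct {
    bus_t    bus;     /* which bus the address belongs to         */
    bool     is_io;   /* true if inside the IO window of that bus */
    uint32_t offset;  /* address relative to the start of the bus */
} decoded_addr_t;

/* Decode a CPU address into bus, IO flag and offset (hypothetical helper). */
static decoded_addr_t decode_address(uint32_t addr, bool wishbone_enabled)
{
    decoded_addr_t d = { BUS_NONE, false, 0u };

    assert(MAX_ADDR_BIT < 31u); /* both regions must fit in a 32bit address */

    /* The wishbone bus, when configured, adds one address bit. */
    uint32_t limit = wishbone_enabled ? 2u * REGION_SIZE : REGION_SIZE;
    if (addr >= limit)
        return d; /* outside the configured address space */

    /* The extra top bit selects the upper (wishbone) region. */
    d.bus    = (addr & REGION_SIZE) ? BUS_WISHBONE : BUS_SYSTEM;
    d.offset = addr & (REGION_SIZE - 1u);
    /* The top of each region is reserved for memory mapped IO. */
    d.is_io  = d.offset >= REGION_SIZE - IO_WINDOW_SIZE;
    return d;
}
```

For example, `decode_address(0x1FF0000, true)` selects the wishbone bus with offset 0xFF0000 flagged as IO, whereas the same address with the wishbone bus disabled decodes to `BUS_NONE`.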
### L1 Cache

In order to gain performance, but more especially for instruction optimisations and extended instructions, an L1 cache is implemented using registers. Using registers consumes fabric space so the cache should be very small, but it allows random access in a single cycle, which is needed, for example, when compacting a 32bit IM load (which can be 5 instructions) into a single cycle. Also, for extended instructions, the first byte indicates an extended instruction and the following 1-5 bytes define the instruction, which is then executed in a single cycle.

### L2 Cache

Internal BRAM (on-board Block RAM within the FPGA) doesn't need an L2 Cache as its access time is 1-2 cycles. As BRAM is a limited resource, it is assumed external RAM or SDRAM will be used; this is much slower and needs to be cached to increase throughput. The L2 Cache is used for this purpose, to read ahead a block of external RAM and feed the L1 Cache as needed.

On analysis, the C programs generated by GCC are typically loops and calls within a local area (unless using large libraries), so implementing a simple direct mapped cache between external RAM and BRAM (used for the L2 Cache), indexed relative to the Program Counter, is sufficient to keep the CPU from stalling most of the time.

### Instruction Set

A feature of the ZPU is its use of a minimal fixed set of hardware implemented instructions and a soft set of additional instructions which are implemented in pseudo micro-code (ie. using the fixed set of instructions). This is achieved by 32byte vectors in the region 0x0000 - 0x0400, and each soft instruction branches to its vector if it is not implemented in hardware. The benefit is reduced FPGA resources but the penalty is performance.

The ZPU Evo implements all instructions in hardware but this can be adjusted in the configuration to use soft instructions if required in order to conserve FPGA resources. This allows for a balance of resources versus performance.
Ultimately though, if resources are tight then the use of the Small/Flex ZPU models may be a better choice.

In addition to the original instructions, a mechanism exists to extend the instruction set using multi-byte instructions of the format:-

***Extend Instruction,,[byte],[byte],[byte],[byte]***

Where ParamSize =

- 00 - No parameter bytes
- 01 - 8 bit parameter
- 10 - 16 bit parameter
- 11 - 32 bit parameter

Some extended instructions are under development (ie. LDIR); an exact opcode value and extended instruction set has not yet been fully defined. The GNU AS assembler will be updated with these instructions so they can be invoked within a C program and eventually, if they have benefit to C, they will be migrated into the GCC compiler (ie. ADD32/DIV32/MULT32/LDIR/LDDR as, from what I have seen, these will have a big impact on CoreMark/Dhrystone tests).

### Implemented Instruction Set

| Name | Opcode | Binary | Description |
|------|--------|--------|-------------|
| BREAKPOINT | 0 | 00000000 | The debugger sets a memory location to this value to set a breakpoint. Once a JTAG-like debugger interface is added, it will be convenient to be able to distinguish between a breakpoint and an illegal (possibly emulated) instruction. |
| IM | 1xxx xxxx | 1xxx xxxx | Pushes 7 bit sign extended integer and sets the «instruction decode interrupt mask» flag (IDIM).<br>If the IDIM flag is already set, this instruction shifts the value on the stack left by 7 bits and stores the 7 bit immediate value into the lower 7 bits.<br>Unless an instruction is listed as treating the IDIM flag specially, it should be assumed to clear the IDIM flag.<br>To push a 14 bit integer onto the stack, use two consecutive IM instructions.<br>If multiple immediate integers are to be pushed onto the stack, they must be interleaved with another instruction, typically NOP. |
| STORESP | 010x xxxx | 010x xxxx | Pop value off stack and store it in the SP+xxxxx*4 memory location, where xxxxx is a positive integer. |
| LOADSP | 011x xxxx | 011x xxxx | Push value of memory location SP+xxxxx*4, where xxxxx is a positive integer, onto stack. |
| ADDSP | 0001 xxxx | 0001 xxxx | Add value of memory location SP+xxxx*4 to value on top of stack. |
| EMULATE | 001x xxxx | 001x xxxx | Push PC to stack and set PC to 0x0+xxxxx*32. This is used to emulate opcodes. See zpupgk.vhd for list of emulate opcode values used. zpu_core.vhd contains reference implementations of these instructions rather than letting the ZPU execute the EMULATE instruction.<br>One way to improve performance of the ZPU is to implement some of the EMULATE instructions. |
| PUSHPC | emulated | emulated | Pushes program counter onto the stack. |
| POPPC | 0000 0100 | 0000 0100 | Pops address off stack and sets PC. |
| LOAD | 0000 1000 | 0000 1000 | Pops address stored on stack and loads the value of that address onto stack.<br>Bit 0 and 1 of address are always treated as 0 (i.e. ignored) by the HDL implementations and C code is guaranteed by the programming model never to use 32 bit LOAD on non-32 bit aligned addresses (i.e. if a program does this, then it has a bug). |
| STORE | 0000 1100 | 0000 1100 | Pops address, then value from stack and stores the value into the memory location of the address.<br>Bit 0 and 1 of address are always treated as 0. |
| PUSHSP | 0000 0010 | 0000 0010 | Pushes stack pointer. |
| POPSP | 0000 1101 | 0000 1101 | Pops value off top of stack and sets SP to that value. Used to allocate/deallocate space on stack for variables or when changing threads. |
| ADD | 0000 0101 | 0000 0101 | Pops two values off the stack, adds them and pushes the result. |
| AND | 0000 0110 | 0000 0110 | Pops two values off the stack, does a bitwise and and pushes the result onto the stack. |
| OR | 0000 0111 | 0000 0111 | Pops two integers, does a bitwise or and pushes the result. |
| NOT | 0000 1001 | 0000 1001 | Bitwise inverse of value on stack. |
| FLIP | 0000 1010 | 0000 1010 | Reverses the bit order of the value on the stack, i.e. abc->cba, 100->001, 110->011, etc.<br>The raison d'etre for this instruction is mainly to emulate other instructions. |
| NOP | 0000 1011 | 0000 1011 | No operation, clears IDIM flag as side effect, i.e. used between two consecutive IM instructions to push two values onto the stack. |
| PUSHSPADD | 61 | 00111101 | a=sp;<br>b=popIntStack()*4;<br>pushIntStack(a+b); |
| POPPCREL | 57 | 00111001 | setPc(popIntStack()+getPc()); |
| SUB | 49 | 00110001 | int a=popIntStack();<br>int b=popIntStack();<br>pushIntStack(b-a); |
| XOR | 50 | | pushIntStack(popIntStack() ^ popIntStack()); |
| LOADB | 51 | | 8 bit load instruction. Really only here for compatibility with C programming model. Also it has a big impact on DMIPS test.<br>pushIntStack(cpuReadByte(popIntStack())&0xff); |
| STOREB | 52 | | 8 bit store instruction. Really only here for compatibility with C programming model. Also it has a big impact on DMIPS test.<br>addr = popIntStack();<br>val = popIntStack();<br>cpuWriteByte(addr, val); |
| LOADH | 34 | | 16 bit load instruction. Really only here for compatibility with C programming model.<br>pushIntStack(cpuReadWord(popIntStack())); |
| STOREH | 35 | | 16 bit store instruction. Really only here for compatibility with C programming model.<br>addr = popIntStack();<br>val = popIntStack();<br>cpuWriteWord(addr, val); |
| LESSTHAN | 36 | | Signed comparison<br>a = popIntStack();<br>b = popIntStack();<br>pushIntStack((a < b) ? 1 : 0); |
| LESSTHANOREQUAL | 37 | | Signed comparison<br>a = popIntStack();<br>b = popIntStack();<br>pushIntStack((a <= b) ? 1 : 0); |
| ULESSTHAN | 38 | | Unsigned comparison<br>long a; //long is here 64 bit signed integer<br>long b;<br>a = ((long) popIntStack()) & INTMASK; // INTMASK is unsigned 0x00000000ffffffff<br>b = ((long) popIntStack()) & INTMASK;<br>pushIntStack((a < b) ? 1 : 0); |
| ULESSTHANOREQUAL | 39 | | Unsigned comparison<br>long a; //long is here 64 bit signed integer<br>long b;<br>a = ((long) popIntStack()) & INTMASK; // INTMASK is unsigned 0x00000000ffffffff<br>b = ((long) popIntStack()) & INTMASK;<br>pushIntStack((a <= b) ? 1 : 0); |
| EQBRANCH | 55 | | int compare;<br>int target;<br>target = popIntStack() + pc;<br>compare = popIntStack();<br>if (compare == 0)<br>{<br>setPc(target);<br>} else<br>{<br>setPc(pc + 1);<br>} |
| NEQBRANCH | 56 | | int compare;<br>int target;<br>target = popIntStack() + pc;<br>compare = popIntStack();<br>if (compare != 0)<br>{<br>setPc(target);<br>} else<br>{<br>setPc(pc + 1);<br>} |
| MULT | 41 | | Signed 32 bit multiply<br>pushIntStack(popIntStack() * popIntStack()); |
| DIV | 53 | | Signed 32 bit integer divide.<br>a = popIntStack();<br>b = popIntStack();<br>if (b == 0)<br>{<br>// undefined<br>}<br>pushIntStack(a / b); |
| MOD | 54 | | Signed 32 bit integer modulo.<br>a = popIntStack();<br>b = popIntStack();<br>if (b == 0)<br>{<br>// undefined<br>}<br>pushIntStack(a % b); |
| LSHIFTRIGHT | 42 | | Unsigned shift right.<br>long shift;<br>long valX;<br>int t;<br>shift = ((long) popIntStack()) & INTMASK;<br>valX = ((long) popIntStack()) & INTMASK;<br>t = (int) (valX >> (shift & 0x3f));<br>pushIntStack(t); |
| ASHIFTLEFT | 43 | | Arithmetic (signed) shift left.<br>long shift;<br>long valX;<br>shift = ((long) popIntStack()) & INTMASK;<br>valX = ((long) popIntStack()) & INTMASK;<br>int t = (int) (valX << (shift & 0x3f));<br>pushIntStack(t); |
| ASHIFTRIGHT | 44 | | Arithmetic (signed) shift right.<br>long shift;<br>int valX;<br>shift = ((long) popIntStack()) & INTMASK;<br>valX = popIntStack();<br>int t = valX >> (shift & 0x3f);<br>pushIntStack(t); |
| CALL | 45 | | Call procedure.<br>int address = pop();<br>push(pc + 1);<br>setPc(address); |
| CALLPCREL | 63 | | Call procedure pc relative.<br>int address = pop();<br>push(pc + 1);<br>setPc(address+pc); |
| EQ | 46 | | pushIntStack((popIntStack() == popIntStack()) ? 1 : 0); |
| NEQ | 47 | | pushIntStack((popIntStack() != popIntStack()) ? 1 : 0); |
| NEG | 48 | | pushIntStack(-popIntStack()); |
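
The IM behaviour described in the table (first IM pushes a sign extended 7 bit value, each following IM shifts left by 7 and fills the low 7 bits) can be sketched in C. This is a minimal software model under stated assumptions, not the Evo's hardware implementation: `im_encode` and `im_decode` are hypothetical helper names, and the model only covers an uninterrupted run of IM instructions, so a 32 bit constant needs at most 5 of them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a 32 bit constant as the shortest run of IM opcodes (1xxx xxxx).
 * Returns the number of opcodes written to out (1..5). */
static size_t im_encode(int32_t value, uint8_t out[5])
{
    size_t n = 1;

    /* Find the smallest n such that value fits in a signed 7*n bit field. */
    while (n < 5) {
        int64_t limit = (int64_t)1 << (7 * n - 1);
        if (value >= -limit && value < limit)
            break;
        n++;
    }

    /* Emit the 7 bit groups, most significant group first. */
    uint32_t bits = (uint32_t)value;
    for (size_t i = 0; i < n; i++) {
        unsigned shift = 7u * (unsigned)(n - 1 - i);
        out[i] = (uint8_t)(0x80u | ((bits >> shift) & 0x7Fu));
    }
    return n;
}

/* Model what a run of consecutive IM opcodes leaves on top of the stack. */
static uint32_t im_decode(const uint8_t *ops, size_t count)
{
    uint32_t tos  = 0; /* top of stack value being built  */
    int      idim = 0; /* IDIM flag, clear before the run */

    for (size_t i = 0; i < count; i++) {
        assert(ops[i] & 0x80u); /* every opcode must be an IM */
        uint32_t imm = ops[i] & 0x7Fu;
        if (!idim) {
            /* First IM: push the sign extended 7 bit immediate. */
            tos  = (imm & 0x40u) ? (imm | 0xFFFFFF80u) : imm;
            idim = 1;
        } else {
            /* Following IMs: shift left by 7 and insert the low 7 bits. */
            tos = (tos << 7) | imm;
        }
    }
    return tos;
}
```

For instance, the value 64 does not fit in a signed 7 bit field, so `im_encode` emits the two opcodes 0x80 and 0xC0, while -1 needs only the single opcode 0xFF.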
### Implemented Instructions Comparison Table

![alt text](https://github.com/pdsmart/ZPU/blob/master/docs/ImplInstructions.png)

### Hardware Variable Byte Write

In the original ZPU designs there was scope, but no implementation, to allow the ZPU to perform byte/half-word/full-word writes. Either the CPU always had to perform 32bit Word aligned operations or it performed the operation in micro-code.

In the Evo, hardware was implemented (build time selectable) to allow Byte and Half-Word writes and also hardware Read-Update-Write operations. If the hardware Byte/Half-Word logic is not enabled then it falls back to the 32bit Word Read-Update-Write logic. Both methods have performance benefits, the latter taking 3 cycles longer.

### Hardware Debug Serializer

In order to debug the CPU or just provide low level internal operating information, a cached UART debug module is implemented. Currently this is output only, but the intention is to tie it into the IOCP for in-situ debugging when Simulation/Signal-Tap is not available.

Embedded within the CPU RTL are statements which issue snapshot information to the serialiser, if enabled in the configuration along with the information level. This is then serialised and output to a connected terminal.

A snapshot of the output information can be seen below (with manual comments):

| ------------------------------------------------------------ |
| 000477 01ffec 00001ae4 00000000 70.17 04770484 046c047c 08f0046c 0b848015 17700500 05000500 05001188 11ef2004

Break Point - Illegal instruction
000478 01ffe8 00001ae4 00001ae4 00.05 04780484 046c0478 08f0046c 0b888094 05000500 05000500 118811ef 20041188

L1 Cache Dump
000478 (480)-> 11 e2 2a 51 11 a0 11 8f <-(483) (004)->11 ed 20 04 05 00 05 00 05 00 05 00 05 00 05 00 20 (46c)->04 11 b5 11 e4 17 70 <-(46f)
(004)-> 11 ed 20 04 05 00 05 00 05 00 05 00 05 00 05 00 20 (46c)->04 11 b5 11 e4 17 70 11 b6 11 c4 2d 27 11 8b <-(473)
05 00 05 00 05 00 05 00 (46c)->20 04 11 b5 11 e4 17 70 11 b6 11 c4 2d 27 11 8b 1c 38 11 80 17 71 17 70 -<(477)
(46c)->20 04 11 b5 11 e4 17 70 11 b6 11 c4 2d 27 11 8b 1c 38 11 80 17 71 17 70 -<(477) 05 00 05 00 05 00 05 00
(470)->11 b6 11 c4 2d 27 11 8b 1c 38 11 80 17 71 17 70 <-(477) -> 05 00 05 00 05 00 05 00 (47c)->11 88 11 ef 20 04 11 88 <-(47f)
(474)->1c 38 11 80 17 71 17 70 05 00 05 00 05 00 05 00 11 88 11 ef 20 04 11 88 11 e2 2a 51 11 a0 11 8f
05 00 05 00 05 00 05 00 11 88 11 ef 20 04 11 88 11 e2 2a 51 11 a0 11 8f 11 ed 20 04 05 00 05 00
11 88 11 ef 20 04 11 88 11 e2 2a 51 11 a0 11 8f 11 ed 20 04 05 00 05 00 05 00 05 00 05 00 05 00
L2 Cache Dump
000000 88 08 8c 08 ed 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000020 88 08 8c 08 90 08 0b 0b 0b 88 80 08 2d 90 0c 8c 0c 88 0c 04 00 00 00 00 00 00 00 00 00 00 00 00
000040 71 fd 06 08 72 83 06 09 81 05 82 05 83 2b 2a 83 ff ff 06 52 04 00 00 00 00 00 00 00 00 00 00 00 |

All critical information such as the current instruction being executed (or not if stalled), Signals/Flags, L1/L2 Cache contents and Memory contents can be output.

# System On a Chip

In order to provide a working framework in which the ZPU Evo could be used, a System On a Chip wrapper was created which allows for the instantiation of various devices (ie. UART/SD card).

As part of the development, the ZPU Small/Medium/Flex models were incorporated into the framework, allowing a choice of CPU when fabric space is at a premium or when comparing CPUs, albeit features such as Wishbone are not available on the original ZPU models. I didn't include the ZPUino, as this design already has a very good ecosystem, or the ZY2000.

The SoC currently implements (in the build tree):

| Component | Selectable (ie not hardwired) |
| ------------------------- | ------------------------------------------------------------ |
| CPU | Choice of ZPU Small, Medium, Flex, Evo or Evo Minimal. |
| Wishbone Bus | Yes, 32 bit bus. |
| (SB) BRAM | Yes, implement a configurable block of BRAM as the boot loader and stack. |
| Instruction Bus BRAM | Yes, enable a separate bus (or Dual-Port) to the boot code implemented in BRAM. This is generally a dual-port BRAM shared with the Sysbus BRAM but can be independent. |
| (SB) RAM | Implement a block of BRAM as RAM, separate from the BRAM used for the boot loader/stack. |
| (WB) SDRAM | Yes, implement an SDRAM controller over the Wishbone bus. |
| (WB) RAM | Implement a block of BRAM as RAM over the Wishbone bus. |
| (WB) I2C | Yes, implements an I2C Controller over the Wishbone bus. |
| (SB) Timer 0 | No, implements a hardware 12bit Second, 18bit milliSec and 24bit uSec down counter with interrupt, a 32bit milliSec up counter with interrupt and a YMD HMS Real Time Clock. The down counters are ideal for scheduling. |
| (SB) Timer 1 | Yes, a selectable number of pre-scaled 32bit down counters. |
| (SB) UART 0 | No, a cached UART used for monitor output and command input/program load. |
| (SB) UART 1 | No, a cached UART used for software (C program)/hardware (ZPU debug serializer) output. |
| (SB) Interrupt Controller | Yes, a prioritized configurable (# of inputs) interrupt controller. |
| (SB) PS2 | Yes, a PS2 Keyboard and Mouse controller. |
| (SB) SPI | Yes, a configurable number of Serial Peripheral Interface controllers. |
| (SB) SD | Yes, a configurable number of hardware based SPI SD controllers. |
| (SB) SOCCFG | Yes, a set of registers to indicate configuration of the ZPU and SoC to the controlling program. |

Within the SoC configuration, items such as starting Stack Address, Reset Vector, IO Start/End (SB) and (WB) can be specified.

Given the wishbone bus, it is very easy to add further opencore IP devices; for the system bus some work may be needed as the opencore IP devices use differing signals.

# Software

The software provided includes:

1. A bootloader, I/O Control Program (IOCP). This is more than a bootloader: in its basic form it can bootstrap an application from an SD card, or it can include command line monitor tools and a serial upload function.
2. An application, ZPUTA (ZPU Test Application). This is a test suite and can be organised as a single application or split into a Disk Operating System where all functionality is stored on the SD card. ZPUTA can be bootstrapped by IOCP or run standalone as the only program in the ROM/BRAM.
3. A disk operating system, zOS (ZPU Operating System). A version of ZPUTA but aimed at production code where all functionality resides as disk applications.
4. Library functions in C to aid in building applications, including 3rd party libs ie. FatFS from El. Chan

### IOCP

The I/O Control Program (IOCP) is basically a bootloader; it can operate standalone or as the first stage in booting an application.
At the time of writing the following functionality and memory maps have been defined in build.sh and within the parameterisation of the IOCP/ZPUTA/RTL, but any other is possible by adjusting the parameters.

- Tiny - IOCP is the smallest size possible to boot from SD Card. It is useful for a SoC configuration where there is limited BRAM and the applications loaded from the SD card would potentially run in external RAM.
- Minimum - As per tiny but adds: print IOCP version, interrupt handler, boot message and SD error messages.
- Medium - As per minimum but adds: a command line processor providing the commands below, and a timer on auto boot so that it can be disabled by pressing a key.

| Command | Description |
| ------- | ------------------------------------------ |
| 1 | Boot Application in Application area BRAM |
| 4 | Dump out BRAM (boot) memory |
| 5 | Dump out Stack memory |
| 6 | Dump out application RAM |
| C | Clear Application area of BRAM |
| c | Clear Application RAM |
| d | List the SD Card's directory |
| R | Reset the system and boot as per power on |
| h | Print out help on enabled commands |
| i | Prints version information |

- Full - As per medium but adds the additional commands below.

| Command | Description |
| ------- | ------------------------------------------ |
| 2 | Upload to BRAM application area, in binary format, from serial port |
| 3 | Upload to RAM, in binary format, from serial port |
| i | Print detailed SoC configuration |

### ZPUTA

ZPUTA started life as a basic test application to verify ZPU Evo and SoC operations. As it evolved and different FPGAs were included in the ZPU Evo scope, it became clear that it had to be more advanced due to limited resources.

ZPUTA has two primary methods of execution: a) as an application booted by IOCP, b) standalone, booted as the ZPU Evo startup firmware. The mode is chosen in the configuration and functionality is identical.
In order to cater for limited FPGA BRAM resources, all functionality of ZPUTA can be enabled/disabled within the loaded image. If an SD Card is present then some/all functionality can be shifted from the loaded image into applets (1 applet per function, ie. memory clear) and stored on the SD card - this mode is like DOS, where typing a command retrieves the applet from the SD card and executes it.

The functionality currently provided by ZPUTA can be summarised as follows.

| Category | Command | Parameters | Description |
| -------- | ------- | ---------- | ----------------------------------------------- |
| Disk IO Commands | ddump | \[ \] | Dump a sector |
| | dinit | \ \[\] | Initialize disk |
| | dstat | \ | Show disk status |
| | dioctl | \ | ioctl(CTRL_SYNC) |
| Disk Buffer Commands | bdump | \ | Dump buffer |
| | bedit | \ \[\] ... | Edit buffer |
| | bread | \ \ \[\] | Read into buffer |
| | bwrite | \ \ \[\] | Write buffer to disk |
| | bfill | \ | Fill buffer |
| | blen | \ | Set read/write length for fread/fwrite command |
| Filesystem Commands | finit | \ \[\] | Force init the volume |
| | fopen | \ \ | Open a file |
| | fclose | | Close the open file |
| | fseek | \ | Move fp in normal seek |
| | fread | \ | Read part of file into buffer |
| | finspect | \ | Read part of file and examine |
| | fwrite | \ \ | Write part of buffer into file |
| | ftrunc | | Truncate the file at current fp |
| | falloc | \ \ | Allocate ctg blks to file |
| | fattr | \ \ \ | Change object attribute |
| | ftime | | Change object timestamp |
| | frename | \ \ | Rename an object |
| | fdel | \ | Delete an object |
| | fmkdir | \ | Create a directory |
| | fstat | \[\] | Show volume status |
| | fdir | \[\] | Show a directory |
| | fcat | \ | Output file contents |
| | fcp | \ \ | Copy a file |
| | fconcat | \ \ \ | Concatenate 2 files |
| | fxtract | \ \ \ \ | Extract a portion of file |
| | fload | \ \[\] | Load a file into memory |
| | fexec | \ \ \ \ | Load and execute file |
| | fsave | \ \ \ | Save memory range to a file |
| | fdump | \ \[\] | Dump a file contents as hex |
| | fcd | \ | Change current directory |
| | fdrive | \ | Change current drive |
| | fshowdir | | Show current directory |
| | flabel | \