The Processor Background. (Preliminary)

The idea is to use more processors and to fetch more instructions simultaneously. This is expected to give a significantly higher speed, because the typical delay is limited by the forward delay only. Delays in the control structures limit the cycle frequency, but the cycle frequency is limited even more by the maximum rate at which instructions can be delivered from the code memory. The cycle time is:
Tcy = fd1 + fd2 + rd1 + rd2
where
fd1 is forward delay on setting data
fd2 is forward delay on removing data
rd1 is reverse delay on setting ack
rd2 is reverse delay on removing ack
In general, fd1 <> fd2 and rd1 <> rd2.

The difference is significant when using the CVSL logic style. CVSL uses N-logic to pull down (setting data) and special control logic to remove the data, and

Tcy = Forward Delay (setting) + Delay in control structure
This makes Tcy and fd1 the only important delays. The delays in the control structures (fd2 + rd1 + rd2) only influence the maximum cycle frequency.
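
As a small numeric illustration of the two formulas above, the following Python sketch computes Tcy from the four delays and separates the forward (setting) delay from the delay in the control structures. The class name and the delay values are invented for illustration only and are not measurements of any circuit.

```python
from dataclasses import dataclass


@dataclass
class HandshakeDelays:
    """The four delays in nanoseconds (illustrative values only)."""
    fd1: float  # forward delay on setting data
    fd2: float  # forward delay on removing data
    rd1: float  # reverse delay on setting ack
    rd2: float  # reverse delay on removing ack

    def cycle_time(self) -> float:
        # Tcy = fd1 + fd2 + rd1 + rd2
        return self.fd1 + self.fd2 + self.rd1 + self.rd2

    def control_delay(self) -> float:
        # Everything except the forward (setting) delay
        return self.fd2 + self.rd1 + self.rd2

    def max_cycle_frequency_mhz(self) -> float:
        # 1 / Tcy with Tcy in ns, expressed in MHz
        return 1000.0 / self.cycle_time()


# Assumed example values, chosen only to make the arithmetic easy to follow
d = HandshakeDelays(fd1=1.0, fd2=0.5, rd1=0.25, rd2=0.25)
print("Tcy =", d.cycle_time(), "ns")
print("control delay =", d.control_delay(), "ns")
print("max cycle frequency =", d.max_cycle_frequency_mhz(), "MHz")
```

With these assumed numbers a dependent result is available after fd1, while the control delay only bounds how often a new cycle can start.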

It is very important to minimize the delay from the previous processor to the next, since this delay determines the minimum instruction time. When two consecutive instructions have no dependency, the processor is even faster, and the path from Pn-2 becomes the limiting one.

Figure 1

Pn  = Output from the current processor
Pn-1= Output from the previous processor
Pn-2= Output from the processor two back
Pn-3= Output from the processor three back

In the worst case, all instructions are executed in order and each depends on the output from the previous ALU, Pn-1. The figure shows the critical path, where all outputs depend on the output from the previous ALU. The output delay from ALU-1 is the most critical, but the output delay from ALU-2 is also very critical and needs to be carefully optimized.

The carry chain is pipelined and not critical. The delay in the carry chain slows down conditional jumps, but only if the flag was affected recently.

Figure 2

The critical bit is the bit from the previous ALU, not the carry chain. The carry chain is pipelined, so a delay between the bits of the ALU does not affect speed. The delay only affects the time it takes to get the carry or flags from the ALU. This makes conditional jumps slower, but only in the few cases where they check a flag from a recently executed instruction. In these cases a default jump direction may be used, and performance is affected only if the jump direction turns out to be unexpected. If the delay is a problem, carry look-ahead may be used.

To minimize the delay of the critical bit from the previous ALU, the design exploits the fact that the output of ALU n depends on the output of ALU n-1 in one of the following ways:

Case 1: Output always high, no instruction dependencies
Case 2: Output always low, no instruction dependencies
Case 3: Output equal to the output from the previous ALU
Case 4: Output equal to the complement of the output from the previous ALU
The structure looks like a carry chain, but it is a 'sum' chain. It may be improved with look-ahead logic (it works like carry look-ahead, but is instruction look-ahead). Without look-ahead, a fast multiplexer structure (CVSL based) is used:

Figure 3

The delay is expected to be less than 1 ns.
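
The four cases and the 'sum' chain can be sketched in software as follows. This is only a behavioural model of one output bit passing through successive ALUs without look-ahead, not a description of the CVSL circuit; the names Case, stage_output, sum_chain and longest_dependent_run are chosen here for illustration.

```python
from enum import Enum
from typing import List


class Case(Enum):
    HIGH = 1     # Case 1: output always high, no dependency
    LOW = 2      # Case 2: output always low, no dependency
    SAME = 3     # Case 3: output equals the previous ALU output
    INVERT = 4   # Case 4: output is the complement of the previous ALU output


def stage_output(case: Case, prev: int) -> int:
    """One multiplexer stage: select 1, 0, prev, or not prev."""
    if case is Case.HIGH:
        return 1
    if case is Case.LOW:
        return 0
    if case is Case.SAME:
        return prev
    return 1 - prev


def sum_chain(cases: List[Case], first_input: int = 0) -> List[int]:
    """Ripple one bit through successive ALUs, without look-ahead."""
    outputs = []
    prev = first_input
    for case in cases:
        prev = stage_output(case, prev)
        outputs.append(prev)
    return outputs


def longest_dependent_run(cases: List[Case]) -> int:
    """Longest run of stages that must wait for the previous ALU."""
    longest = current = 0
    for case in cases:
        current = current + 1 if case in (Case.SAME, Case.INVERT) else 0
        longest = max(longest, current)
    return longest


example = [Case.HIGH, Case.SAME, Case.INVERT, Case.INVERT, Case.LOW, Case.INVERT]
print(sum_chain(example))              # [1, 1, 0, 1, 0, 1]
print(longest_dependent_run(example))  # 3
```

Stages in case 1 or 2 break the chain, so only runs of case 3 and case 4 stages accumulate delay. This is the same observation that makes look-ahead possible in a carry chain.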

A faster multiplexer may be obtained by using analog comparators, by current detection, or by both:

Figure 4 Figure 5

Optimizing both Pn-1 and Pn-2

There might be problems with the influence of noise on Vcc, and x may be charged more than needed.

*) To obtain high speed between the ALUs, the circuit may need to differ slightly from Figure 2.

Because of the asynchronous design, instructions without dependencies are automatically executed faster, but there is a maximum instruction flow, which limits the maximum speed of a cycle. This limits the performance in the cases where performance is not limited by dependencies. It matters most in loops, where it is much better to fold out (unroll) small loops than to insert NOPs, even though a NOP does not cause any dependencies.

The typical output does not depend only on the result from the previous ALU, but usually on two inputs, Pn-x and Pn-y (the ALU has two operands). The circuit is optimized better for one of these inputs, and the primary input is selected automatically when instructions are stored in the cache. To allow operand swapping, subtract commands are required both for subtracting Pn-x and for subtracting Pn-y. The non-optimized input does not need to handle Pn-1.
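
The selection of the primary input could be sketched as below. This is a minimal sketch under several assumptions: the mnemonics ADD, SUB and RSUB (reverse subtract), the Instr record, and the rule "the smallest relative offset goes to the optimized input" are all invented here to illustrate the idea and are not taken from the actual instruction set.

```python
from dataclasses import dataclass
from typing import Dict

COMMUTATIVE = {"ADD", "AND", "OR", "XOR"}
# Hypothetical mnemonics: SUB computes primary - secondary,
# RSUB computes secondary - primary.
SWAPPED_OPCODE = {"SUB": "RSUB", "RSUB": "SUB"}


@dataclass(frozen=True)
class Instr:
    op: str
    primary: int    # relative offset on the optimized input (1 means Pn-1)
    secondary: int  # relative offset on the non-optimized input


def place_operands(instr: Instr) -> Instr:
    """Put the most recent operand (smallest offset) on the primary input."""
    if instr.primary <= instr.secondary:
        return instr
    if instr.op in COMMUTATIVE:
        return Instr(instr.op, instr.secondary, instr.primary)
    if instr.op in SWAPPED_OPCODE:
        return Instr(SWAPPED_OPCODE[instr.op], instr.secondary, instr.primary)
    raise ValueError(f"cannot swap operands of {instr.op}")


def evaluate(instr: Instr, value_at: Dict[int, int]) -> int:
    """Reference semantics, used only to check that swapping keeps the result."""
    a, b = value_at[instr.primary], value_at[instr.secondary]
    if instr.op == "ADD":
        return a + b
    if instr.op == "SUB":
        return a - b
    if instr.op == "RSUB":
        return b - a
    raise ValueError(f"unknown opcode {instr.op}")


original = Instr("SUB", primary=3, secondary=1)   # Pn-3 minus Pn-1
stored = place_operands(original)                 # done when filling the cache
values = {1: 10, 3: 4}
print(stored, evaluate(original, values), evaluate(stored, values))
```

After the swap, Pn-1 arrives on the optimized input and the opcode is changed so that the result is unchanged.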


The Instruction Set.

Example of an instruction set.

Since every ALU needs to get information from ALU-x and ALU-y, this influences the instruction set.

It leads to an instruction set where the indexing of operands is relative to the instruction currently being executed. The instruction set is generalized so that it can be used even if the operand index exceeds the number of processors; in that case the value is taken from a small FIFO.
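
Relative operand indexing can be modelled with a short history of results. The sketch below is an assumption-laden toy model: the class RelativeMachine, the opcodes LOAD, ADD and SUB, and the sizes used are hypothetical, and the real hardware passes values between processors rather than keeping one list.

```python
from collections import deque


class RelativeMachine:
    """Toy model: every instruction produces one result, and operands are
    named by how many instructions back they were produced."""

    def __init__(self, processors: int, fifo_depth: int):
        self.processors = processors
        # Index 0 holds the most recent result; old results fall off the end.
        self.history = deque(maxlen=processors + fifo_depth)

    def operand(self, offset: int) -> int:
        """Return Pn-offset; offset 1 is the previous instruction's result."""
        if not 1 <= offset <= len(self.history):
            raise IndexError(f"Pn-{offset} is not available")
        return self.history[offset - 1]

    def source(self, offset: int) -> str:
        """Where the value would come from in this model."""
        return "processor" if offset <= self.processors else "fifo"

    def execute(self, op: str, x: int = 0, y: int = 0, imm: int = 0) -> int:
        if op == "LOAD":
            result = imm
        elif op == "ADD":
            result = self.operand(x) + self.operand(y)
        elif op == "SUB":
            result = self.operand(x) - self.operand(y)
        else:
            raise ValueError(f"unknown opcode {op}")
        self.history.appendleft(result)
        return result


m = RelativeMachine(processors=2, fifo_depth=2)
m.execute("LOAD", imm=5)
m.execute("LOAD", imm=7)
total = m.execute("ADD", x=1, y=2)        # Pn-1 + Pn-2
diff = m.execute("SUB", x=1, y=3)         # Pn-1 - Pn-3, Pn-3 comes from the FIFO
print(total, diff, m.source(3))
```

Offsets up to the number of processors are served directly, and larger offsets fall through to the FIFO part of the history.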

This method can also transfer parameters to functions or procedures, so only a code stack is needed.

Using loops causes the operands to disappear, so we need to store the values that run out of the loop. This is done by using a memory bank to store all the operand values. In loops it is possible to access operands relative to the current instruction, but also through the stored values. The banks are arranged as a stack, so they can be used in nested loops, and this is efficient for recursive functions too. There is more activity at the top of the stack than at the bottom, so it is possible to extend an internal stack with an external one; the internal stack reduces the speed (external bandwidth) required of the external stack.

The bank does not allow indexed variables, which is why the processor needs to be connected to an ordinary RAM, indexed by operand.

It is difficult to allow multiple access to this memory because of the completely asynchronous design, so the memory is connected to only one of the processors. Only this processor is allowed to execute memory operations, and all access to memory must go through it. The other processors may execute NOPs if they have no other job to do.

There might be more memory modules to allow higher bandwidth, or a small register may be connected to all processors. The configuration may be chosen to optimize the simulation or execution of an existing instruction set.

The flags have the same structure as the data: they work relative to the current instruction, as the data operands do, and they are connected to a stack / bank too. This structure makes conditional jumps independent of when the flags are affected, and it is possible to read and check the condition of an old flag before the new flags are affected. This makes the CPU very efficient.

A standard CPU is able to jump in a default direction while the flags are not yet affected, and to choose the right direction once the flags are affected. If the jump was taken in the wrong direction, the instructions need to be re-run from the jump. One of the ideas behind using a bank for storing the data was that it should allow code to be re-run. This feature is not as important in this processor as in other processors, since a typical jump uses an old flag that is already affected; if not, it may be possible to make the code jump on an old flag and use an extra dummy loop that is ignored when the flag is affected. In most cases this pipelining technique will be as efficient as using a default jump and ignoring the data after an unexpected jump direction. (See software example).

A re-run feature may be effective in other cases, making it possible to use a default value and re-run if it changes. One example is an external memory that returns the value before the worst-case time; the data can then be used as soon as it is returned, and a re-run is done if the data changes. This feature allows higher speed, since there is no need to wait for the worst-case time.


Simulating existing code.

It is very important to have an instruction set that is compatible with other processors, but it is very difficult to make advanced instruction decoding without delays. The delay is most critical at conditional jumps, where the direction is decided by flags. A much cheaper solution may be processor-dependent code, which may even be dedicated to the internal hardware structure.

This may be handled by compiling the chosen instruction set into a sort of cache of processor code.

When instructions are read from standard memory, they are compiled down into a cache in the internal instruction format, and the user does not need to know about the internal code. Because the compilation is done by software, the cache needs to be large.

If the code is not in the cache, a recompile may be needed. For special instructions such as JMP BX, a table is needed to convert the addresses (or a cache, if the address space is large). Stores that modify instructions also need special care (recompilation and updating of the caches). The recompilation does not need to be done at the time the instructions are updated; instead, an interrupt is stored at the affected address to guarantee recompilation. The stack is a bit special too, and needs to handle both physical and emulated addresses. For example, push bx + return = jmp bx, while call + return uses the internal stack.
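
The address table and the delayed recompilation can be sketched as follows. This is a minimal sketch, not the actual mechanism: the class CodeCache, the compile_at callback standing in for the software compiler, and the use of a "dirty" set in place of the stored interrupt are all assumptions made for illustration.

```python
from typing import Callable, Dict, Set


class CodeCache:
    """Maps original code addresses to compiled internal code."""

    def __init__(self, compile_at: Callable[[int], str]):
        self.compile_at = compile_at          # hypothetical software compiler
        self.entries: Dict[int, str] = {}     # original address -> internal code
        self.dirty: Set[int] = set()          # addresses marked for recompilation
        self.compilations = 0

    def lookup(self, address: int) -> str:
        """Resolve a computed jump target (e.g. JMP BX) to internal code."""
        if address not in self.entries or address in self.dirty:
            # Cache miss, or the code was modified since it was compiled
            self.entries[address] = self.compile_at(address)
            self.dirty.discard(address)
            self.compilations += 1
        return self.entries[address]

    def store_to_code(self, address: int) -> None:
        """A store hit compiled code: only mark it, recompile on next use."""
        if address in self.entries:
            self.dirty.add(address)


# Assumed stand-in for the standard memory holding the original instructions
memory = {0x100: "INC AX"}
cache = CodeCache(lambda address: "internal:" + memory[address])

first = cache.lookup(0x100)       # compiled on first use
again = cache.lookup(0x100)       # served from the cache
memory[0x100] = "DEC AX"          # self-modifying store
cache.store_to_code(0x100)        # marked only, no recompilation yet
count_after_store = cache.compilations
updated = cache.lookup(0x100)     # recompiled when it is executed again
print(first, again, updated, cache.compilations)
```

Marking instead of recompiling immediately keeps the store itself cheap, which is the point of deferring the work until the affected address is executed.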

If the processor is required to execute code written for other processors, it is possible to update the compiler and use other instructions. Compatibility with future machines may also be possible.

Register-based instruction sets may demand a common register file, and to simplify the design the code needs to guarantee that data is not stored to the same cell at the same time. All instructions are extended so that they can write to a register cell and read the registers as operands, and the software guarantees a gap between any two stores to the same cell. Until data is stored in the common register file, it is only available in the FIFO, relative to the current instruction. Because the code is simulated through a compiler, the compiler can handle all of these problems and store directly in the code where to get the operands. A hardware protection circuit may be needed to avoid a short circuit when two stores hit the same register.
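
A compiler-side check for the spacing between stores to the same register cell might look like the sketch below. The function name, the representation of the code as one written register (or None) per instruction slot, and the value of min_gap are assumptions; the required gap would depend on the hardware.

```python
from typing import List, Optional, Tuple


def find_write_conflicts(
    writes: List[Optional[str]], min_gap: int
) -> List[Tuple[int, int, str]]:
    """Report pairs of stores to the same register that are too close.

    writes[i] is the register written by instruction slot i, or None.
    Returns (earlier_slot, later_slot, register) for every violation.
    """
    last_write = {}
    conflicts = []
    for slot, reg in enumerate(writes):
        if reg is None:
            continue
        if reg in last_write and slot - last_write[reg] < min_gap:
            conflicts.append((last_write[reg], slot, reg))
        last_write[reg] = slot
    return conflicts


# Hypothetical instruction stream: register written by each slot
stream = ["AX", None, "AX", "BX", None, None, "AX"]
print(find_write_conflicts(stream, min_gap=3))   # [(0, 2, 'AX')]
```

When a conflict is reported, the compiler could reorder instructions or let the later instruction read the value from the FIFO instead of storing it.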


Correcting errors.

To improve the reliability of the processor, all instructions may be executed twice. If the results differ, the processor needs to retry to correct the error. This does not need extra hardware on a multitasking system, since multiple processes may be run and verify each other. The task switch and verification are typically done before any input/output, or after every n instructions. The value of n depends on the typical number of instructions executed between errors.

This method corrects errors such as spikes and induced noise, but not physical defects. The correcting phase is delayed, so the correction works even if there are several errors at the same time, because they will not affect the correcting phase. This kind of error is typical with noise or transients on Vcc. The method covers random errors only, and will not correct an error if the correcting phase produces the same result as the original calculation.
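
The execute-twice-and-compare idea can be sketched in software as below. This is only an illustration under assumptions: run_verified and max_retries are names chosen here, a block of n instructions is represented by a Python function, and the transient error is injected deterministically so that the example is repeatable.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_verified(block: Callable[[], T], max_retries: int = 3) -> T:
    """Run a block twice and accept the result only when both runs agree."""
    for _ in range(max_retries + 1):
        first = block()
        second = block()   # the delayed second run acts as the verification
        if first == second:
            return first
    raise RuntimeError("results kept differing; possibly a permanent fault")


calls = {"n": 0}


def noisy_add() -> int:
    """Adds 2 + 3, with one injected transient error on the second call."""
    calls["n"] += 1
    glitch = 1 if calls["n"] == 2 else 0
    return 2 + 3 + glitch


result = run_verified(noisy_add)
print(result, calls["n"])   # 5 4
```

As noted above, two runs that fail in the same way still compare equal, so this only covers random errors.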

The idea is that if the processor is able to run at twice the speed, but not as reliably, then it may be worth trying. A low noise margin may increase the speed, and to be sure of proper operation the processor then needs to correct any errors.


© 1996-1997 and 1998, Jens Dyekjær Madsen.