The difference is significant when using the CVSL logic style. CVSL uses N-logic to pull down (setting data) and special control logic to remove the data, and
It is very important to minimize the delay from the previous processor to the next, since this delay determines the minimum instruction time. When two instructions have no dependencies between them, the processor is even faster and the path from Pn-2 becomes the limiting one.
Pn = Output from the current processor
Pn-1 = Output from the previous processor
Pn-2 = Output from the processor two back
Pn-3 = Output from the processor three back (ALU-3)
In the worst case, all instructions are executed in order and each depends on the output from the previous ALU, Pn-1. The figure shows the critical path, where every output depends on the output from the previous ALU. The output delay from ALU-1 is the most critical, but the output delay from ALU-2 is also very critical and needs to be carefully optimized.
The carry chain is pipelined and not critical. A delay on the carry chain slows down a conditional jump, but only if the flag was recently affected.
The critical bit is the bit from the previous ALU, not the carry chain. The carry chain is pipelined, so a delay between the bits of the ALU does not affect speed. The delay only affects the time needed to get the carry or flags from the ALU. This makes conditional jumps slower, but only in the few cases where they check a flag from a recently executed instruction. In these cases a default jump direction may be used, and performance is affected only if the jump goes in the unexpected direction. If the delay is a problem, carry look-ahead may be used.
To minimize the delay of the critical bit from the previous ALU, the output of ALU n is made to depend on the output of ALU n-1 in the following way:
The delay is expected to be less than 1 ns.
A faster multiplexer may be obtained by using analog comparators, by current sensing, or by both.
There may be problems caused by noise on Vcc, and x may be charged more than needed.
*) To obtain high speed between the ALUs, the circuit may need to differ slightly from figure 2.
Because of the asynchronous design, instructions without dependencies are automatically executed faster, but there is a maximum instruction flow, which limits the maximum speed of a cycle. This limits performance in the cases where it is not limited by dependencies. This is very important in loops, where it is much better to unroll (fold out) small loops than to insert NOPs, even though a NOP causes no dependencies.
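The interplay between dependency distance and the maximum instruction flow can be illustrated with a small timing model. This is only a sketch: the function name `ready_times` and the values `alu_delay` and `min_step` are assumptions chosen for illustration, not figures from the design.

```python
def ready_times(deps, alu_delay=1.0, min_step=0.25):
    """Return the time at which each instruction's output becomes valid.

    deps[i]   -- distances back that instruction i reads (1 = Pn-1, 2 = Pn-2)
    alu_delay -- assumed delay (ns) from a valid operand to a valid output
    min_step  -- assumed minimum spacing (ns) set by the maximum instruction flow
    """
    ready = []
    for i, dists in enumerate(deps):
        # Throughput limit: never sooner than min_step after the previous output.
        t = ready[i - 1] + min_step if i > 0 else alu_delay
        for d in dists:
            src = i - d
            # Operands from before the program start are taken as valid at t = 0.
            operand_time = ready[src] if src >= 0 else 0.0
            t = max(t, operand_time + alu_delay)
        ready.append(t)
    return ready

print(ready_times([[1]] * 4))  # every instruction waits for Pn-1
print(ready_times([[2]] * 4))  # every instruction waits for Pn-2
print(ready_times([[]] * 4))   # no dependencies: only min_step limits
```

Under these assumed numbers a chain on Pn-1 advances one `alu_delay` per instruction, a chain on Pn-2 advances roughly half of that, and independent instructions are limited only by `min_step`.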
A typical output depends not only on the result from the previous ALU but usually on two inputs, Pn-x and Pn-y (the ALU has two operands). The circuit is optimized better for one of these inputs, and the primary input is selected automatically when instructions are stored in the cache. To allow operand swapping, subtract commands are required both for subtracting with Pn-x and for subtracting with Pn-y. The non-optimized input does not need to handle Pn-1.
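The operand selection done at cache-store time might look roughly like the sketch below. The opcode names (`SUB`, `RSUB`, and the commutative set) and the rule "the nearer operand goes on the fast input" are assumptions for illustration; the text only states that both subtract forms exist and that the non-optimized input never sees Pn-1.

```python
# Hypothetical opcodes: SUB x,y = P(n-x) - P(n-y); RSUB x,y = P(n-y) - P(n-x).
COMMUTATIVE = {"ADD", "AND", "OR", "XOR"}
REVERSED = {"SUB": "RSUB", "RSUB": "SUB"}

def select_primary(op, x, y):
    """Return (op, primary, secondary) with the nearer operand on the fast input.

    x and y are distances back in the result stream (1 means Pn-1).
    """
    if x <= y:
        return op, x, y               # already in the preferred order
    if op in COMMUTATIVE:
        return op, y, x               # swapping does not change the result
    if op in REVERSED:
        return REVERSED[op], y, x     # swap and use the mirrored subtract
    raise ValueError(f"cannot swap operands of {op}")

print(select_primary("SUB", 3, 1))
```

Here `SUB 3,1` becomes `RSUB 1,3`, which computes the same difference but places Pn-1 on the optimized input.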
Since every ALU needs information from ALU-x and ALU-y, this influences the instruction set.
This leads to an instruction set where operand indexing is relative to the instruction currently being executed. The instruction set is generalized so that it can be used even when the operand index exceeds the number of processors; the value is then taken from a small FIFO.
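Relative operand indexing can be sketched with a tiny interpreter. The instruction names, the tuple encoding, and the sizes `NUM_ALUS` and `FIFO_DEPTH` are hypothetical; the point is only that an operand is named by how many results back it lies, and that distances beyond the number of processors are served from the FIFO.

```python
from collections import deque

NUM_ALUS = 4      # assumed number of processors in the chain
FIFO_DEPTH = 8    # assumed depth of the small FIFO behind them

def run(program):
    """Run (op, x, y) tuples; x and y are distances back (1 = Pn-1).

    Returns the results and, per instruction, where each operand came from.
    """
    history = deque(maxlen=NUM_ALUS + FIFO_DEPTH)   # index 0 = newest result
    results, sources = [], []
    for op, x, y in program:
        if op == "CONST":
            value, src = x, ()                      # x is an immediate, y unused
        else:
            a, b = history[x - 1], history[y - 1]
            value = a + b if op == "ADD" else a - b
            src = tuple("ALU" if d <= NUM_ALUS else "FIFO" for d in (x, y))
        history.appendleft(value)
        results.append(value)
        sources.append(src)
    return results, sources

program = [("CONST", 10, 0), ("CONST", 3, 0), ("SUB", 2, 1), ("ADD", 1, 3)]
print(run(program)[0])
```

No register names appear in the program: `SUB 2,1` subtracts the previous result from the one before it, and `ADD 1,3` adds the result just produced to the one three back.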
This method can also pass parameters to functions or procedures, and only a code stack is needed.
Using loops causes the operands to disappear, so the values that run out of the loop must be stored. This is done by using a memory bank to store all the operand values. Within loops it is possible to access operands relative to the current instruction, but also through the stored values. The banks are arranged as a stack. This makes it possible to nest loops inside each other, and it is efficient for recursive functions too. There is more activity at the top of the stack than at the bottom, so an internal stack can be extended with an external one, with the internal stack reducing the speed (external bandwidth) required of the external stack.
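A rough software model of the stacked banks is given below. The class `BankStack`, its method names, the spill policy, and the default `internal_size` are all assumptions made for illustration; the sketch only shows why most traffic stays at the top of the stack and never reaches the external part.

```python
class BankStack:
    """Stack of operand banks; only the top `internal_size` banks stay on chip."""

    def __init__(self, internal_size=2):
        self.internal_size = internal_size
        self.internal = [{}]          # last element is the top of the stack
        self.external = []            # banks spilled to the external stack
        self.external_transfers = 0

    def enter(self):
        """Entering a loop or function pushes a fresh bank."""
        if len(self.internal) == self.internal_size:
            self.external.append(self.internal.pop(0))   # spill the oldest bank
            self.external_transfers += 1
        self.internal.append({})

    def leave(self):
        """Leaving pops the top bank and refills from outside only when empty."""
        self.internal.pop()
        if not self.internal and self.external:
            self.internal.append(self.external.pop())
            self.external_transfers += 1

    def store(self, slot, value):
        self.internal[-1][slot] = value

    def load(self, slot):
        return self.internal[-1][slot]

banks = BankStack()
for i in range(10):                   # ten shallow loop entries and exits
    banks.enter()
    banks.store("i", i)
    banks.leave()
print(banks.external_transfers)       # shallow nesting never touches the external stack
```

Only nesting deeper than `internal_size` causes external transfers, which is the sense in which the internal stack lowers the bandwidth demanded of the external one.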
The bank does not allow indexed variables, which is why the processor needs to be connected to an ordinary RAM, indexed by operand.
It is difficult to allow multiple access to this memory because of the completely asynchronous design, so the memory is connected to only one of the processors. Only this processor is allowed to execute memory operations, and all memory accesses must go through it. Other processors may execute NOPs if they have no other job to do.
There may be several memory modules to allow higher bandwidth, or a small register may be connected to all processors. The configuration may be chosen to optimize the simulation or execution of an existing instruction set.
The flags have the same structure as the data: they work relative to the current instruction, as the data operands do, and they are connected to a stack / bank too. This structure makes conditional jumps independent of when the flags are affected, and it is possible to read and check the condition of an old flag before the new flags are affected. This makes the CPU very efficient.
A standard CPU is able to jump in a default direction while the flags are not yet affected, and to choose the right direction once they are. If the jump was taken in the wrong direction, the instructions need to be re-run from the jump. Allowing code to be re-run was one of the ideas behind using a bank for storing the data. This feature is less important in this processor than in others, since a typical jump uses an old flag that is already affected; if not, it may be possible to make the code jump on an old flag and use an extra dummy loop pass that is ignored once the flag is affected. In most cases this pipelining technique will be as efficient as using a default jump and discarding data after an unexpected jump direction. (See software example).
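The idea of jumping on an old flag with one extra dummy pass can be sketched as follows. This is not the software example referred to above; the function name and the use of a Python list as the flag history are assumptions, and the loop is a plain countdown that requires `n >= 1`.

```python
def loop_with_old_flag(n):
    """Count down from n, deciding the loop jump on the PREVIOUS pass's zero flag.

    Returns (useful_passes, total_passes).
    """
    flags = [False]                 # flag history; flags[-1] is the newest
    useful = passes = 0
    while True:
        passes += 1
        done = flags[-1]            # flag from the previous pass is valid by now
        if not done:                # real work happens only while not finished
            n -= 1
            useful += 1
        flags.append(n == 0)        # this pass's flag, not needed by the jump yet
        if flags[-2]:               # the jump tests the old flag, so it never waits
            break
    return useful, passes

print(loop_with_old_flag(3))
```

The loop runs one pass more than strictly necessary, and that pass does no work; in exchange, the jump never has to wait for a flag that has just been produced.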
A re-run feature may be effective in other cases, making it possible to use a default value and re-run if it changes. One example is an external memory that returns its value before the worst-case time; the data can then be used as soon as it is returned, and a re-run is done if the data changes. This feature allows higher speed, since there is no need to wait for the worst-case time.
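The benefit of using an early value and re-running only on a change can be sketched with a simple cost model. The function name and the time parameters `t_early`, `t_worst`, and `t_compute` are hypothetical, in arbitrary units.

```python
def use_early_value(compute, early, final, t_early=2, t_worst=5, t_compute=3):
    """Return (result, finish_time) when computing from an early memory value.

    early -- value seen before the worst-case access time
    final -- value that is guaranteed valid at the worst-case time
    """
    result = compute(early)                     # start as soon as data arrives
    if final == early:
        return result, t_early + t_compute      # early value held: no waiting
    # The value changed: discard the work and re-run with the final value.
    return compute(final), t_worst + t_compute

print(use_early_value(lambda v: v * 2, 7, 7))   # early value was correct
print(use_early_value(lambda v: v * 2, 7, 9))   # value changed, re-run needed
```

With these assumed times, the common case finishes at 5 instead of 8, and the re-run case costs no more than always waiting for the worst-case time would.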
This may be handled by compiling the chosen instruction set into a kind of cache holding processor code.
When instructions are read from standard memory, they are compiled into a cache in the internal instruction format, and the user does not need to know about the internal code. Because the compilation is done by software, the cache needs to be large.
On a cache miss a recompile may be needed, and special instructions such as JMP BX need a table to convert the addresses (or a cache, if the address space is large). Stores that modify instructions also need special care (recompilation and updating of caches). The recompilation does not need to happen at the time the instructions are updated; instead an interrupt is stored at the affected address to guarantee recompilation. The stack is a bit special too, and needs to handle both physical and emulated addresses. For example, push bx + return = jmp bx, while call + return uses the internal stack.
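The address table and the deferred recompilation can be sketched as a small translation cache. The class `TranslationCache`, the `TRAP` marker standing in for the stored interrupt, and the `compile_one` callback are all hypothetical names used only for illustration.

```python
TRAP = object()   # stands in for the interrupt stored at a modified address

class TranslationCache:
    def __init__(self, memory, compile_one):
        self.memory = memory              # emulated address -> original instruction
        self.compile_one = compile_one    # assumed software compiler, one instruction
        self.table = {}                   # emulated address -> internal code
        self.compilations = 0

    def fetch(self, addr):
        """Look up internal code; compile on a miss or when a trap is found."""
        entry = self.table.get(addr, TRAP)
        if entry is TRAP:
            entry = self.compile_one(self.memory[addr])
            self.table[addr] = entry
            self.compilations += 1
        return entry

    def store(self, addr, instruction):
        """A store into code only plants a trap; recompilation is deferred."""
        self.memory[addr] = instruction
        self.table[addr] = TRAP

cache = TranslationCache({0x100: "ADD", 0x104: "JMP BX"}, str.lower)
print(cache.fetch(0x100))       # compiled on first use
cache.store(0x100, "SUB")       # self-modifying store: nothing compiled yet
print(cache.fetch(0x100))       # recompiled only now
```

An indirect jump such as JMP BX would pass the emulated address in BX to `fetch`, so the same table performs the address conversion.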
If the processor is required to execute code written for other processors, the compiler can be updated to use other instructions. Compatibility with future machines may also be possible.
Register-based instruction sets may require a common register file, and to simplify the design the code must guarantee that data is not stored to it at the same time. All instructions are extended so that they can write to a register cell and read the registers as operands, and the software guarantees spacing between any two stores to the same cell. Until data is stored in the common register file, it is only available in the FIFO, relative to the current instruction. Because code is simulated through a compiler, the compiler can handle all of these problems and store directly in the code where to get the operands. A hardware protection circuit may be needed to avoid a short circuit when storing to the same register.
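The spacing guarantee is something the compiler could check mechanically. The sketch below assumes a hypothetical `min_gap`, the minimum number of instructions that must separate two stores to the same register cell; the text does not give a value.

```python
def check_store_spacing(stores, min_gap):
    """Return stores that follow an earlier store to the same cell too closely.

    stores  -- (instruction_index, register) pairs in program order
    min_gap -- assumed minimum distance between stores to the same cell
    """
    last_store = {}
    conflicts = []
    for index, reg in stores:
        if reg in last_store and index - last_store[reg] < min_gap:
            conflicts.append((last_store[reg], index, reg))
        last_store[reg] = index
    return conflicts

stores = [(0, "r1"), (1, "r2"), (2, "r1"), (9, "r1")]
print(check_store_spacing(stores, min_gap=4))
```

A compiler would react to each reported conflict by rescheduling or by keeping the value in the FIFO, addressed relative to the instruction, instead of writing it back twice.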
This method corrects errors such as spikes and induced noise, but not physical defects. The correcting phase is delayed, so the correction works even if several errors occur at the same time, because they do not affect the correcting phase. Errors of this kind are typical when there is noise or transients on Vcc. The method covers random errors only, and will not correct an error if the result of the correcting phase equals the calculated result.
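The behaviour described above can be mimicked in software as follows. This is a sketch under stated assumptions: a transient is modelled as a bit mask XORed onto the result, and on a mismatch the delayed phase is taken as the correct value. The function names are hypothetical.

```python
def evaluate(a, b, glitch_mask=0):
    """One evaluation of the adder; glitch_mask models bits flipped by a transient."""
    return (a + b) ^ glitch_mask

def corrected_add(a, b, glitches=(0, 0)):
    """Return (value, corrected) using a normal phase and a delayed correcting phase."""
    first = evaluate(a, b, glitches[0])     # normal calculation
    second = evaluate(a, b, glitches[1])    # delayed correcting phase
    if first == second:
        return first, False                 # phases agree: nothing to correct
    return second, True                     # mismatch: delayed result replaces the first

print(corrected_add(5, 3))                         # no disturbance
print(corrected_add(5, 3, glitches=(0b100, 0)))    # spike during the calculation
print(corrected_add(5, 3, glitches=(0b100, 0b100)))  # same error in both phases
```

The last call shows the limitation noted in the text: when both phases produce the same wrong value, the error goes uncorrected.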
The idea is that if the processor can run at twice the speed, but less reliably, it may be worth trying. A low noise margin may increase the speed, and to ensure proper operation the processor may need to correct errors when they occur.