Advantages of transport-triggered architectures
The advantages of TTAs fall into two groups: implementation advantages and new compiler optimization possibilities. The most important implementation advantages are:
Ideal for employing superpipelining
at both operation and data transport level. FU pipelines can be stretched
to make shorter cycle times possible. The only lower bound on the clock
cycle is register-register transfer time across the network. The network
can be superpipelined itself; in that case the achievable clock cycle time
reduces to the register-register transfer time within one network cluster;
this time can be very short.
Ideal for employing functional
parallelism at both operation and data transport level. FUs and transport
capacity can be added to increase parallelism. Unlike OTAs, CPUs using
a TTA do not need three busses and three register ports for each operation
per instruction (e.g. twelve ports for a VLIW that issues four operations per
instruction). Three busses and ports per operation is a worst-case assumption,
since many operations do not need them. For example, some operations need
a single source operand, or do not produce a result. Also, many results
are bypassed directly to the next FU, without needing to be stored in a
GPR. Furthermore, the compiler can perform many optimizations specific to TTAs
which reduce the needed transport capacity, and therefore the network requirements
(like number of transport busses and network connectivity), even further.
As a consequence, the required VLSI area is reduced and the cycle time
is further improved.
Ideal for the design of application
specific processors (ASPs). FU parameters (like number of FUs, supported
operations, latencies, throughput and pipelining degree) and interconnection
network parameters (like topology, number of busses and pipelining degree)
can be set according to the needs of an application domain.
Ideal for automatic generation.
The CPU has a very simple design; this is a consequence of having independent
FUs and of the reduced data transport requirements. Furthermore, the network
has less complexity; e.g. no complex bypassing hardware is needed, bypassing
is done in software. Due to its simplicity, it becomes possible to use
a silicon compiler for automatic layout generation. The inputs to the silicon
compiler are a template description, VLSI building blocks (e.g. FUs), and
values for the architecture parameters. The output is a CPU silicon layout
ready for fabrication of the CPU.
Perfectly suited for incorporating
register mapped interprocessor communication support; the concept of register
mapped functionality is inherently supported by a TTA. Register mapped
communication enables very short latencies and a high communication bandwidth.
This opens the possibility of integrating systolic communication within
a general purpose processor framework.
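The port argument made above for functional parallelism can be checked with a little arithmetic. The following is a minimal sketch, not taken from the MOVE project: the `Op` record, both helper functions, and the four-operation mix are hypothetical and chosen only to illustrate the difference between the worst-case assumption and the ports an instruction actually uses.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """Hypothetical description of one operation in a VLIW instruction."""
    name: str
    sources: int          # operands read from the register file
    writes_result: bool   # True if the result is written back to a GPR


def worst_case_ports(ops, ports_per_op=3):
    """Ports reserved when every operation is assumed to need three."""
    return ports_per_op * len(ops)


def needed_ports(ops):
    """Ports actually used: one per source read plus one per GPR write."""
    return sum(op.sources + (1 if op.writes_result else 0) for op in ops)


# An illustrative (assumed) instruction with four operations.
ops = [
    Op("add", sources=2, writes_result=True),
    Op("neg", sources=1, writes_result=True),     # single source operand
    Op("store", sources=2, writes_result=False),  # produces no result
    Op("mul", sources=2, writes_result=False),    # result bypassed to next FU
]

worst = worst_case_ports(ops)
actual = needed_ports(ops)
print(f"worst case: {worst} ports, actually needed: {actual} ports")
```

For this assumed mix the worst-case reservation is twelve ports, as in the text, while only nine are used. The gap depends entirely on the operation mix and on how many results can be bypassed.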
Besides the traditional compiler optimizations, TTAs offer the following unique optimizations:
More scheduling freedom. TTAs
divide operations into smaller data transport components. This makes parallelism
more fine-grained, and the resulting code schedules are more efficient
and achieve higher CPU execution performance.
Software bypassing. A result
of an operation can be used for another operation in the same cycle if
software bypassing is applied. Bypassing means getting the value from the
FU that produced it instead of the RU where it will be stored. Bypassing
reduces the delay between (true) dependent operations.
Unlike OTAs, TTAs do not need
special (associative) hardware to do bypassing, but all bypassing can be
done in software under control of the compiler. For example, when an operand
move r0 -> add_O uses the result of a result move add_T ->
r0 in the same cycle as the result move, then the operand move needs
to be changed into add_T -> add_O so that the result is taken
from the FU instead of the RU.
Operand sharing. When two
successive operations on the same FU are guaranteed to have the same value
in the operand register, one operand move can be saved. We call this operand
sharing, and it can be viewed as a special form of common subexpression
elimination. When one operand move is shared among all iterations of a
loop, then the operand move is loop invariant and can be placed before
the loop.
Dead result move elimination.
When all uses of a result are served via software bypassing, then the result
move can be eliminated. This saves one move and the usage of one GPR. Since
most results are only used once or twice (e.g. temporaries), dead result
move elimination occurs frequently.
Reduced GPR demand. TTAs need
fewer GPRs since (1) results are directly bypassed from FU to FU, and (2)
operations can stay longer in a pipeline than is needed to perform them,
which makes GPR lifetimes shorter.
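Software bypassing and dead result move elimination, as described in the list above, can be sketched on a small list of moves. This is a minimal illustration under stated assumptions, not the MOVE compiler's implementation: the `Move` record, the helper names, and the convention that GPRs are named `r0`, `r1`, ... are all hypothetical, while the port names `add_T` and `add_O` follow the example in the text. Only same-cycle bypassing is modelled, and the elimination step is deliberately conservative: any read of the GPR in the same or a later cycle keeps the result move.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Move:
    """One data transport: copy src to dst in the given cycle."""
    cycle: int
    src: str
    dst: str


def is_gpr(name):
    """Assumed naming convention: GPRs are r0, r1, ...; all else is an FU port."""
    return re.fullmatch(r"r\d+", name) is not None


def software_bypass(moves):
    """Redirect operand moves that read a GPR written in the same cycle."""
    # (cycle, gpr) -> FU port whose result move writes that GPR in that cycle
    producers = {(m.cycle, m.dst): m.src
                 for m in moves if is_gpr(m.dst) and not is_gpr(m.src)}
    bypassed = []
    for m in moves:
        fu_port = producers.get((m.cycle, m.src))
        if fu_port is not None and not is_gpr(m.dst):
            m = replace(m, src=fu_port)  # take the value from the FU instead
        bypassed.append(m)
    return bypassed


def eliminate_dead_result_moves(moves, live_out=frozenset()):
    """Drop result moves whose GPR is never read afterwards (conservative)."""
    kept = []
    for m in moves:
        is_result_move = is_gpr(m.dst) and not is_gpr(m.src)
        still_read = any(u.src == m.dst and u.cycle >= m.cycle for u in moves)
        if is_result_move and not still_read and m.dst not in live_out:
            continue  # every use was bypassed, so the GPR write is dead
        kept.append(m)
    return kept


# The example from the text: both moves are scheduled in the same cycle.
schedule = [
    Move(3, "add_T", "r0"),   # result move
    Move(3, "r0", "add_O"),   # operand move using that result
]

bypassed = software_bypass(schedule)
optimized = eliminate_dead_result_moves(bypassed)
for m in optimized:
    print(f"cycle {m.cycle}: {m.src} -> {m.dst}")
```

In this sketch the operand move becomes `add_T -> add_O` and the result move into `r0` disappears, saving one move and one GPR. If `r0` is passed in `live_out`, or is read again in a later cycle, the result move is kept.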
Initial experiments [11] have shown that the results of these new optimizations are very promising. The experiments show that TTAs perform 20-50% better than OTAs with similar hardware for scientific code using adapted software pipelining techniques.
In [12] TTAs are analysed for general purpose applications, using a basic block scheduling compiler. It is shown that operation based architectures require up to 30% more transport capacity; however, for a TTA to exploit this capacity efficiently, a scheduling scope far beyond basic block boundaries is needed. Scheduling with this wider scope is currently under development. First results will be published in [13]. They show that scheduling beyond basic blocks gives a 40% performance improvement.
For a more extensive description and evaluation of the MOVE concept see e.g. [15][11][14][10].
Last modified on March 18th, 1997 by Irek Karkowski, email I.Karkowski@et.tudelft.nl