The processor template from which the application-specific processors designed with TCE are derived is called the Transport Triggered Architecture (TTA). For a detailed description of the ideas behind TTA, refer to [Cor97]. A short introduction is presented in [Bou01].
TTA is based on the VLIW (Very Long Instruction Word) processor paradigm but solves its major bottlenecks: the complexity of the register file (RF) and of the register file bypass network. Like VLIW, TTA is statically scheduled at compile time and supports instruction-level parallelism (ILP). TTA instructions are commonly hundreds of bits wide. TTAs can have multiple independent function units (FUs) and a customized interconnection network, as illustrated in Fig. 7.1.
The term transport-triggered means that instruction words control the operand transfers on the interconnection network, and operation execution happens as a side effect of these transfers. An operation is triggered to execute when a data operand is transferred to a specific trigger input port of an FU. Each instruction has one slot for each transport bus. For example, "FU0.out0 -> LSU.trig.stw" moves data from an output port of a function unit to the trigger input port of the load-store unit (LSU). The LSU then starts executing the triggered operation, in this case a store of a word to memory.
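The triggering behaviour can be illustrated with a small Python sketch. It is a toy model rather than TCE code: the class name, the method names, and the zero-cycle latency are assumptions made only for illustration.

```python
class FunctionUnit:
    """Toy FU with one operand port, one trigger port and one output port.

    Hypothetical model: a real FU has a pipeline latency, which is ignored here.
    """

    def __init__(self, operations):
        self.operations = operations  # opcode name -> two-argument function
        self.operand = 0              # value latched in the operand port register
        self.out = None               # value held in the output port register

    def write_operand(self, value):
        # A move to a non-triggering port only latches the value.
        self.operand = value

    def write_trigger(self, opcode, value):
        # A move to the trigger port starts the operation as a side effect.
        self.out = self.operations[opcode](self.operand, value)


alu = FunctionUnit({"add": lambda a, b: a + b, "sub": lambda a, b: a - b})

# Corresponds to the moves: 7 -> ALU.in1 ; 5 -> ALU.in2t.add
alu.write_operand(7)
result_before_trigger = alu.out   # nothing has executed yet
alu.write_trigger("add", 5)
result_after_trigger = alu.out    # the addition ran when the trigger port was written
```

Writing the operand port alone changes no output; only the move to the trigger port causes the operation to run.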
In the basic case, all FU input and output ports are registers, which relieves the pressure on the register file. Thanks to the programming model, operands can be bypassed directly from one FU to another. This is called software bypassing. Additionally, if the bypassed operand is not needed by any other operation, there is no need to write it to a register file. This optimization technique is called dead result elimination. Combining software bypassing with dead result elimination helps to reduce register file traffic. Moreover, the TCE TTA template gives freedom in partitioning register files. For example, there can be several small and simple register files instead of one centralized multiported RF.
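As a rough illustration of how the two optimizations reduce register file traffic, the following Python sketch rewrites a list of moves. It is not the TCE scheduler: the function names, the string format of the moves, and the simplifying assumptions listed in the docstring are all hypothetical.

```python
def bypass_and_eliminate(moves, live_out):
    """Apply software bypassing and dead result elimination to a move list.

    moves: list of (source, destination) strings in program order.
    live_out: RF registers whose values are still needed after the sequence.

    Simplifying assumptions (not TCE behaviour): every RF register is written
    at most once, and FU output ports keep their value until it is read.
    """
    # Map each RF register to the FU output port that produced its value.
    producer = {dst: src for src, dst in moves
                if dst.startswith("RF.") and not src.startswith("RF.")}

    # Software bypassing: read the producing FU port instead of the register.
    bypassed = [(producer.get(src, src), dst) for src, dst in moves]

    # Dead result elimination: drop RF writes that no remaining move reads.
    still_needed = {src for src, _ in bypassed} | set(live_out)
    return [(src, dst) for src, dst in bypassed
            if not (dst.startswith("RF.") and dst not in still_needed)]


def rf_accesses(moves):
    """Count register file reads and writes in a move list."""
    return sum(src.startswith("RF.") + dst.startswith("RF.") for src, dst in moves)


# (r1 + r2) * r4, with the sum first written to r3 and then read back.
original = [
    ("RF.r1", "ALU.in1"),
    ("RF.r2", "ALU.in2t.add"),
    ("ALU.out", "RF.r3"),
    ("RF.r3", "MUL.in1"),
    ("RF.r4", "MUL.in2t.mul"),
    ("MUL.out", "RF.r5"),
]
optimized = bypass_and_eliminate(original, live_out={"RF.r5"})
```

In this example the sum travels directly from ALU.out to MUL.in1, and the write to r3 disappears because nothing reads r3 afterwards, so the number of register file accesses drops from six to four.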
Transport programming is also beneficial because it allows easy scaling of the architecture and the compiler, and supports varying pipeline depths in FUs. At the same time, the number of FU inputs and outputs is not restricted, unlike in most processor templates, which support only instructions with 2 input values and 1 output value. The user can also create instruction set extensions with a special function unit (SFU), which can have an arbitrary number of I/O operands. Instruction set extension is a powerful way of enhancing the performance of certain applications.
Property | Values | Example |
Functional unit | Type, count | 3x ALU, 2x LSU, 1x MUL, 1x ctrl... |
Register file (RF) | # registers, #RFs, #ports, width | 16x 32b RF 2x rd + 2x wr ports, 16x 1b boolean RF |
Interconnection network | #buses, #sockets | 5 buses, total of 43 write and 44 read sockets |
Memory interfaces | Count, type | 2x LSU for SRAM w/ 32b data & 32b addr |
Special FU | User-defined functionality | dct, semaphor_lock, FIFO I/O |
The TTA template supports two ways of transporting program constants in instructions, such as "add value+5". Short immediates are encoded in the move slot's source field, and thus consume a part of a single move slot. The constants transported in the source field should usually be relatively small; otherwise the width of a move slot is dominated by the immediate field.
Wider constants can be transported by means of so-called long immediates. Long immediates are defined using an ADF parameter called an instruction template. The instruction template defines which slots are used for pieces of the long immediate and which are used for defining transports (moves). A slot cannot be used for regular data transports while it is carrying a piece of a long immediate.
An instruction template defining a long immediate also specifies the target to which the long immediate must be transported. The target register resides in a so-called immediate unit, which is written directly from the control unit, not through the transport buses. The immediate unit is like a register file except that it has only read ports and is written only by the instruction decoder in the control unit when it detects an instruction with a long immediate (see Fig. 7.1).
Thus, in order to support the special long immediate encoding, one has to add a) an instruction template that supports transporting the pieces of the immediate using full move slots, and b) at least one long immediate unit (a read-only register file) to which the instruction writes the immediates and from whose registers the immediates can be read to the datapath.
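The difference between the two immediate styles can be sketched in Python as follows. The field widths, the function names, and the assumption that short immediates are sign-extended are illustrative only; the actual widths and extension modes are set in the architecture definition.

```python
def fits_short_immediate(value, field_bits):
    """True if value fits a sign-extended immediate field of field_bits bits.

    Sign extension is an assumption of this sketch.
    """
    return -(1 << (field_bits - 1)) <= value < (1 << (field_bits - 1))


def split_long_immediate(value, width, piece_bits):
    """Split a width-bit constant into piece_bits-sized pieces, most significant first."""
    value &= (1 << width) - 1            # two's-complement encoding of the constant
    count = -(-width // piece_bits)      # ceiling division: number of slots needed
    mask = (1 << piece_bits) - 1
    return [(value >> (piece_bits * i)) & mask for i in reversed(range(count))]


def join_long_immediate(pieces, piece_bits):
    """Reassemble the constant from its pieces, as the receiving side would."""
    value = 0
    for piece in pieces:
        value = (value << piece_bits) | piece
    return value


small_fits = fits_short_immediate(5, 8)            # fits in an 8-bit source field
large_fits = fits_short_immediate(0x12345678, 8)   # too wide for a short immediate
pieces = split_long_immediate(0x12345678, 32, 16)  # two slots carry 16 bits each
restored = join_long_immediate(pieces, 16)
```

With a hypothetical 8-bit source field, the constant 5 travels as a short immediate, while the 32-bit constant occupies two move slots that are then unavailable for regular moves in that instruction.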
Operations in TCE are defined in a separate database (OSAL, Sections 2.2.6 and 4.3) in order to allow defining a reusable database of "operation semantics". The operations are used in processor designs by adding function units (FUs) that implement the wanted operations. Operands of the operations can be mapped to different ports of the implementing FU, which affects programming of the processor. The mapping of operation operands to the FU ports must therefore be described explicitly by the processor designer.
Example. The designer adds an FU called 'ALU' that implements the operations 'ADD', 'SUB', and 'NOT'. The ALU has two input ports called 'in1' and 'in2t' (triggering), and an output port called 'out'. A logical binding of the 'ADD' and 'SUB' operands to the ALU ports is the following:
ADD.1 (the first input operand) bound to ALU.in1
ADD.2 (the second input operand) bound to ALU.in2t
ADD.3 (the output operand) bound to ALU.out
SUB.1 (the first input operand) bound to ALU.in1
SUB.2 (the second input operand) bound to ALU.in2t
SUB.3 (the output operand) bound to ALU.out
However, the operation 'NOT', that is, bitwise negation, has only one input, so that input must be bound to port 'ALU.in2t' in order for the operation to be triggered:
NOT.1 bound to ALU.in2t
NOT.2 (the output operand) bound to ALU.out
Because we have a choice in how we bind the 'ADD' and 'SUB' input operands, the binding has to be explicit in the architecture definition. The operand binding described above defines an architecturally different TTA function unit from the following:
SUB.2 bound to ALU.in1
SUB.1 bound to ALU.in2t
SUB.3 bound to ALU.out
The rest of the operands are bound as in the first example.
Due to the differing 'SUB' input bindings, one cannot run code scheduled for the previous processor on a machine whose ALU uses the latter operand bindings. This small detail is important to understand when designing more complex FUs that implement multiple operations with different numbers of operands of varying sizes, but it is usually transparent to the basic user of TCE.
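The effect of the two 'SUB' bindings can be shown with a short Python sketch. The function is a hypothetical toy model, and it assumes that 'SUB' computes operand 1 minus operand 2.

```python
def run_sub(binding, in1_value, in2t_value):
    """Execute SUB on a toy ALU.

    binding maps the operand index (1 or 2) to the port it is bound to.
    Assumption of this sketch: SUB computes operand 1 minus operand 2.
    """
    ports = {"in1": in1_value, "in2t": in2t_value}
    return ports[binding[1]] - ports[binding[2]]


first_binding = {1: "in1", 2: "in2t"}    # SUB.1 -> ALU.in1, SUB.2 -> ALU.in2t
second_binding = {1: "in2t", 2: "in1"}   # SUB.1 -> ALU.in2t, SUB.2 -> ALU.in1

# The same two moves on both machines: 10 -> ALU.in1 ; 3 -> ALU.in2t.sub
result_first = run_sub(first_binding, 10, 3)
result_second = run_sub(second_binding, 10, 3)
```

The same move sequence yields 7 on the first machine and -7 on the second, which is why the binding is part of the architecture definition and not an implementation detail.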
Reasons for wanting to fine-tune the operand bindings might include using narrower input ports for some operation operands. For example, the width of the address operands in the memory access operations of a load-store unit is often smaller than the data width. Similarly, the second operand of a shift operation, which defines the number of bits to shift, requires fewer bits than the shifted data operand.
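A quick way to estimate such port widths is to count the bits needed for the largest value the operand can take. The helper below is a hypothetical illustration, and the 32-bit data width and 64 KiB memory size are example figures, not values from a particular TCE design.

```python
def bits_needed(max_value):
    """Number of bits needed to represent unsigned values 0..max_value."""
    return max(1, max_value.bit_length())


data_width = 32
shift_amount_bits = bits_needed(data_width - 1)   # shift distances 0..31

memory_bytes = 64 * 1024
address_bits = bits_needed(memory_bytes - 1)      # byte addresses 0..65535
```

Under these example figures, the shift amount port needs only 5 bits and the address port 16 bits, both well below the 32-bit data width.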
The currently identified and supported connectivity levels are, in the order of descending level of connectivity, as follows:
An easy target for the high-level language compiler tcecc. However, it is usually not a realistic design because of its high implementation cost.
An easy target for tcecc.
An easy target for tcecc. However, reduction of bypass connections means that less software bypassing can be done.
Compilation is fully supported by tcecc. The number of copies is not limited by tcecc. However, this style of connectivity results in suboptimal code because the additional register copies introduce extra moves, consume registers, and add dependencies to the code that hinder parallelism.
Not supported by tcecc. However, any connectivity type is supported by the TCE assembler. Thus, one can resort to manual TTA assembly programming in this case.
Pekka Jääskeläinen 2018-03-12