1 TCE Tour

This tutorial goes through most of the tools in TCE using a fairly simple example application. It starts from C code and ends up with a VHDL description of the processor and a bit image of the parallel program. The tutorial also explains how to accelerate an algorithm by customizing the instruction set, i.e., by using custom operations. Going through the whole tutorial takes about 2-3 hours.

The tutorial file package is available at:

http://tce.cs.tut.fi/tutorial_files/tce_tutorials.tar.gz

Fetch and unpack it to a working directory and then enter the directory:

> wget http://tce.cs.tut.fi/tutorial_files/tce_tutorials.tar.gz
> tar -xzf tce_tutorials.tar.gz
> cd tce_tutorials/tce_tour
> ls -la

total 84
drwxr-xr-x 3 tce tce  4096 2010-05-28 11:40 .
drwx------ 7 tce tce  4096 2012-05-18 13:22 ..
-rw------- 1 tce tce  5913 2010-03-08 20:01 crc.c
-rw------- 1 tce tce  1408 2008-11-07 11:35 crc.h
-rw------- 1 tce tce  3286 2008-11-07 11:35 crcTable.dat
-rw-r--r-- 1 tce tce  2345 2010-03-08 13:04 custom_operation_behavior.cc
-rw-r--r-- 1 tce tce   855 2010-05-28 11:41 custom_operations.idf
-rw------- 1 tce tce  1504 2010-03-08 20:01 main.c
-rw-r--r-- 1 tce tce 45056 2010-03-10 16:09 tour_example.hdb
drwxr-xr-x 2 tce tce  4096 2010-05-28 11:40 tour_vhdl

1 The Sample Application

The test application computes a 32-bit CRC (Cyclic Redundancy Check) value for a block of data, in this case 10 bytes. The C implementation was written by Michael Barr and is in the public domain. It contains two different versions of the CRC function, but we will use only the fast version.

The program consists of two separate files: main.c contains the simple main function and crc.c contains the actual implementation. Open crc.c in your preferred editor and take a look at the code. The algorithm performs modulo-2 division a byte at a time, reflects the input data (MSB becomes LSB) depending on the previous remainder, and XORs the remainder with a predefined polynomial (a 32-bit constant). The main difference between the crcSlow and crcFast implementations is that crcFast exploits a precalculated lookup table. This is a common method of algorithm optimization.
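To make the difference between the bitwise and the table-driven variant concrete, here is a minimal sketch. It is not Michael Barr's code: it leaves out the reflection steps and the final XOR, assumes the widely used CRC-32 polynomial 0x04C11DB7, and builds the table at run time instead of reading a precalculated one. The names divide_byte, crc_slow_core, crc_fast_core and crc_table_init are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

#define POLYNOMIAL 0x04C11DB7u /* assumed polynomial, not taken from crc.h */

/* Modulo-2 division of the top byte of the remainder, one bit at a time. */
static uint32_t divide_byte(uint32_t remainder)
{
    for (int bit = 0; bit < 8; ++bit) {
        if (remainder & 0x80000000u)
            remainder = (remainder << 1) ^ POLYNOMIAL;
        else
            remainder <<= 1;
    }
    return remainder;
}

/* Bitwise variant: eight shift/XOR steps for every message byte. */
static uint32_t crc_slow_core(const unsigned char *msg, size_t len,
                              uint32_t remainder)
{
    for (size_t i = 0; i < len; ++i) {
        remainder ^= (uint32_t)msg[i] << 24;
        remainder = divide_byte(remainder);
    }
    return remainder;
}

static uint32_t crc_table[256];

/* Precompute the division result for every possible byte value. */
static void crc_table_init(void)
{
    for (uint32_t dividend = 0; dividend < 256; ++dividend)
        crc_table[dividend] = divide_byte(dividend << 24);
}

/* Table-driven variant: one table load replaces the eight steps. */
static uint32_t crc_fast_core(const unsigned char *msg, size_t len,
                              uint32_t remainder)
{
    for (size_t i = 0; i < len; ++i) {
        unsigned char index = (unsigned char)(msg[i] ^ (remainder >> 24));
        remainder = crc_table[index] ^ (remainder << 8);
    }
    return remainder;
}
```

After calling crc_table_init() once, both functions return the same remainder for the same message and initial value; the table version trades the inner loop for a memory access per byte.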


2 Starting Point Processor Architecture

Copy the minimal.adf file included in the TCE distribution to a new ADF file:

  cp $(tce-config --prefix)/share/tce/data/mach/minimal.adf start.adf

The file describes a minimalistic architecture containing just enough resources that the TCE compiler can still compile programs for it. We use this minimal architecture as the starting point. Function units in minimal.adf are selected from the hardware database (HDB, Section 2.2.3) so we are able to generate a VHDL implementation of the processor automatically later in the tutorial.

You can view the architecture using the graphical Processor Designer (ProDe, Section 4.1) tool:

  prode start.adf &

Fig. 3.1 shows what ProDe should look like. There is 1 bus and 5 units: a global control unit (GCU), 2 register files, 1 ALU, and 1 LSU.

Figure 3.1: Processor Designer (prode) when minimal ADF is opened

You'll learn how to edit the TTA with ProDe later in this tutorial.

If you want to simulate the generated processor with GHDL later (Section 3.1.9), you should reduce the amount of memory in the processor now. Otherwise GHDL may consume a tremendous amount of memory on your computer when executing the generated testbench, and may even crash. Select Edit -> Address Spaces in ProDe. Then edit the bit widths of the data and instruction address spaces and set them to 15 bits (32768 addresses each), which should be plenty for our case.


3 Compiling and simulating

Now we want to know how well the starting point architecture executes our program. We must compile the source code for this architecture with the command:

tcecc -O3 -a start.adf -o crc.tpef -k result main.c crc.c

In addition to the source files, the compiler needs the TTA architecture definition start.adf. It produces a parallel program called crc.tpef which can only be executed on this architecture. The switch -k tells the compiler to keep the result symbol (the variable on line 29 of main.c) in the generated program so that the variable contents can be accessed by name in the simulator later on.

Successful compilation does not print anything, and we can now simulate the program. Let's use the graphical user interface version of the simulator, called Proxim (Section 6.1.6):

proxim start.adf crc.tpef &

The simulator will load the architecture definition and the program and wait for commands. Proxim shows one line per instruction. The minimal TTA has only one bus and hence there is only one instruction slot (column). The total number of instructions is about 1200.

Execute the program by clicking "Run"; Proxim should then look like Fig. 3.2. Write down the cycle count shown in the bottom bar of the simulator for future comparison; you can use Table 3.1.

Figure 3.2: Processor Simulator (proxim) when minimal ADF and crc.tpef are opened and simulated.

You can check the result straight from the processor's memory by writing this command to the command line at the bottom of the simulator:

x /u w result

The command specifies the data width (4 bytes) and the result symbol we told the compiler to preserve before. The correct checksum result is 0x62488e82.

Processor resource utilization data can be viewed with the command:

info proc stats

This will output a lot of information like the utilization of transport buses, register files and function units, for example:

utilizations
------------
buses:
B1              87.7559% (4415 writes)

sockets:
lsu_i1          17.4319% (877 writes)
lsu_o1          10.4949% (528 writes)
...

operations executed in function units:
LSU:
...
TOTAL           17.4319% (877 triggers)

ALU:
...
TOTAL           25.2037% (1268 triggers)
...

register accesses:
RF:
0               855 reads,     0 guard reads,      1 writes       
...

Your output may vary depending on the compiler version, the optimizations used and your architecture. For more information on the simulator's command line tools and syntax, you may use the

help

command.
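The percentages in the sample statistics output above appear to be event counts divided by the total cycle count of the run. The sample figures are consistent with a run of 5031 cycles; that number is inferred from the printed percentages and is not shown in the excerpt. A minimal sketch of the arithmetic, using a hypothetical helper name:

```c
/* Utilization of a resource as a percentage of the simulated cycles.
 * `events` is the number of writes or triggers reported for the resource;
 * `cycles` is the total cycle count of the run (5031 is inferred from the
 * sample output, not stated in it). */
static double utilization_percent(unsigned long events, unsigned long cycles)
{
    if (cycles == 0)
        return 0.0; /* nothing was simulated */
    return 100.0 * (double)events / (double)cycles;
}
```

For instance, utilization_percent(4415, 5031) lands at roughly 87.756, which matches the B1 bus line in the sample, and the same divisor reproduces the LSU and ALU totals.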

Proxim can also show various other pieces of information about the program's execution and its processor resource utilization. For example, to check the utilization of the resources of our architecture, select View>Machine Window from the top menu. The parts of the processor that are utilized the most are shown in a darker red color.

Figure 3.3: Processor Simulator (proxim) shows the utilizations of each FU and bus.


Table 3.1: Cycle counts in different TTA configurations. You will fill in the cycle counts during the tutorial.

ADF               Cycle count   Approx. relative cycle count   Section(s)      Comment
start.adf                       1                              3.1.3
large.adf                       1/2                            3.1.4           start + 3 buses
large.adf                       1/5                            3.1.4           start + 3 buses + RF
large.adf                       1/6                            3.1.4           start + 3 buses + RF + new ALU
custom.adf                      1/5                            3.1.5 - 3.1.9
large_custom.adf                1/50                           3.1.10

There are 3 basic ways to accelerate computation in TTAs:

  1. Modify the algorithm itself. This is beyond the scope of this tutorial.
  2. Add more basic resources (FUs, RFs, buses) to the TTA.
  3. Add custom operation(s) to the TTA.


4 Increasing performance by adding resources

As the current architecture is minimalistic, we can increase the performance by adding resources to the processor.

1 Transport buses.

The architecture has only one transport bus, thus the compiler can't exploit any instruction level parallelism. Let's start the architecture customization by adding another transport bus. After this there can be 2 moves per clock cycle. First copy the current architecture:

cp start.adf large.adf

and open the new architecture in ProDe:

prode large.adf &

A new transport bus can be added simply by selecting the current bus, pressing "ctrl+c" to copy it and then pressing "ctrl+v" to paste it. Add 3 more buses. After you have added the buses you have to connect them to the sockets. The easiest way to do this is to select "Tools->Fully connect IC". Save the architecture and recompile the source code for the new architecture:

tcecc -O3 -a large.adf -o crc.tpef -k result crc.c main.c

This time simulate the program using the command line instruction set simulator:

ttasim -a large.adf -p crc.tpef

The command will load the architecture and program and execute it. Once the simulation is finished, you should see the ttasim command prompt:

> ttasim -a large.adf -p crc.tpef

(ttasim)

Now when you check the cycle count from the simulator:

info proc cycles

You might see a significant drop in cycles. Also check the processor utilization statistics from the simulator with the command:

info proc stats

2 Register files.

From the previous simulator statistics you can see in the "operations" table that a lot of load and store operations are being executed. As the architecture has only 5 general purpose registers, this tells us that there is a lot of register spilling to memory. Let's see how the number of registers affects the cycle count. There are two ways to add registers: we can either increase the number of registers in a register file or add a new register file.

Let's try the latter option because this way we increase the number of registers that can be accessed simultaneously in one clock cycle. This can be done by selecting the RF and using copy and paste. Then connect it to the IC, save the architecture, and recompile and simulate as before. The simulation statistics should indicate a performance increase. As expected, the number of load and store operations decreased. But notice also that the number of add operations decreased quite a lot. The reason is simple: addition is used to calculate stack memory addresses for the spilled registers.

3 Function units.

The next candidate for a bottleneck is the ALU, as all the basic operations are now performed in a single function unit. From the simulator statistics you can see that logical operations and addition are quite heavily utilized. Instead of duplicating the ALU, let's add more specific FUs from the Hardware Database. Select "Edit->Add From HDB->Function Unit...". Select an FU which has operations and(1), ior(1), xor(1) and click "Add". Then select an FU with operation add(1) and click "Add". Close the dialog, connect the function units and save the architecture. Recompile and simulate to see the effect on the cycle count.

The architecture could still be modified further to reduce the cycle count, but let's settle for this now and move on to custom operations.


5 Analyzing the Potential Custom Operations

Custom operations implement application specific functionality in TTA processors. This part of the tutorial teaches how to accelerate the CRC computation by adding a custom operation to the starting point processor design.

First of all, it is quite simple and efficient to implement the CRC calculation entirely in hardware. Naturally, using the whole CRC function as a custom operation would be quite pointless, and the benefits of using a processor based implementation would get smaller. Instead, we will concentrate on accelerating smaller parts of the algorithm, which is the most useful approach for most algorithms.

The steps needed for a custom operation are:

  1. Analyze the code for bottlenecks

  2. Speculatively add a new module and operation to the operation database with OSEd. Define the behavior of the new operation (copy-paste from the original code and add a few macros). Optionally, you can also simulate the operation behavior. Bind the new operation to a new function unit in ProDe.

  3. Modify your source code so that it utilizes the new operation and simulate. Go back to step 2 if you wish to explore other operations

  4. Create VHDL for the operation and add it to HDB.

  5. Finally, generate HDL implementation of the TTA with ProGe.

Finding the operation to be optimized is quite easy in this case if you look at function crcFast(). It consists of a for-loop in which the function reflect() is called through the macro REFLECT_DATA. If you look at the actual function (shown also in Fig. 3.4), you can see that it is quite simple to implement in hardware, but requires many instructions if done with basic operations in software. The function "reflects" the bit pattern around the middle point like a mirror. For example, the bit pattern 1101 0010 becomes 0100 1011. The main complexity of the function is that the bit pattern width is not fixed. Fortunately, the width cannot be arbitrary. If you examine the crcFast() function and the reflect macros you can see that function reflect() is only called with 8 and 32 bit widths (unsigned char and 'crc', which is an unsigned long).

A great advantage of TCE is that the operation semantics, processor architecture, and implementation are separate abstractions. This simplifies designing custom operations since you can simulate your design by simply defining the simulation behaviour of the operation and setting the latency of the operation in the processor architecture definition. You do not need an actual hardware implementation of the operation at this point of the design, but can evaluate different custom operation possibilities at the architectural level. However, this brings up an awkward question: how to determine the latency of the operation? Overly optimistic or overly pessimistic latency estimates can produce inaccurate performance results and bias the analysis.

One approach to the problem is to take an educated guess and simulate some test cases with different custom operation latencies. This way you can determine a latency range in which the custom operation would accelerate your application to a satisfactory level. For example, a custom operation with a latency of 1 cycle might give a 3x speedup, whereas a latency of 10 cycles might give just 3%. After this you can sketch how the operation could be implemented in hardware, or consult someone knowledgeable in hardware design to figure out whether the custom operation can be reasonably implemented within the latency constraint.
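Before running the simulations, a rough back-of-the-envelope estimate can help choose which latencies are worth trying. The sketch below is a deliberately crude model and not something TCE provides: it assumes every invocation of the operation costs a fixed number of cycles and that nothing overlaps with it. The function names and all numbers in the usage note are hypothetical.

```c
/* Crude cycle estimate: cycles spent outside the operation plus a fixed
 * cost per invocation. A real TTA schedule can overlap the operation
 * latency with other moves, so treat the result as a rough guide only. */
static unsigned long estimated_cycles(unsigned long other_cycles,
                                      unsigned long calls,
                                      unsigned long cycles_per_call)
{
    return other_cycles + calls * cycles_per_call;
}

/* Speedup of a candidate design relative to a baseline cycle count. */
static double speedup(unsigned long baseline_cycles,
                      unsigned long candidate_cycles)
{
    return (double)baseline_cycles / (double)candidate_cycles;
}
```

With made-up figures of 1000 cycles outside the operation, 10 invocations and 300 cycles per software invocation, the baseline is 4000 cycles; a 1-cycle custom operation gives 1010 cycles (about 3.96x) and a 10-cycle one gives 1100 cycles (about 3.64x). Simulation with the actual latency set in the ADF remains the authoritative measurement.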

Figure 3.4: Reflect function of CRC

Another approach is to try to determine the latency by examining the operation itself and considering how it could be implemented. This approach requires some insight into digital design.

Besides latency you should also consider the size of the custom function unit. It will consume extra die area, but the size limit is always case-specific, especially when the area taken by memories is taken into account. Accurate size estimation requires the actual HW implementation and synthesis.

Let us consider the reflect function. If we had a fixed width we could implement the reflect by hard wiring (and registering the output) because the operation only moves bits to other locations in the word. This could be done easily in one clock cycle, as on the right side of Fig. 3.4. But we need two different bit widths, so it is somewhat more complicated. We could design the HW in such a way that it has two operations: one for 8-bit data and another for 32-bit data. One way to implement this is to have 32-bit wide crosswiring and register the output. The 8-bit value would then be reflected to the 8 most significant bits of the 32-bit wiring, so we need to move those 8 bits to the LSB end and set the rest to zero. This moving can be implemented using multiplexers. As far as latency is concerned, this can all be done easily within one clock cycle as there is not much logic needed.
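The datapath described above can be modelled in software to check the idea before writing any VHDL. The following sketch is an assumption-laden C model, not the contents of tour_vhdl/reflect.vhdl: crosswire32 stands in for the fixed 32-bit wiring, and the opcode selects whether the top byte is moved down to the LSB end. The opcode values follow the numbering used later in this tutorial (REFLECT32 is 0, REFLECT8 is 1); the function and enum names are illustrative.

```c
#include <stdint.h>

/* Models the fixed crosswiring: input bit i drives output bit 31 - i.
 * In hardware this is pure wiring; here it is done with swap stages. */
static uint32_t crosswire32(uint32_t x)
{
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u); /* bits    */
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u); /* pairs   */
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu); /* nibbles */
    x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu); /* bytes   */
    return (x << 16) | (x >> 16);                            /* halves  */
}

enum reflect_opcode { OPC_REFLECT32 = 0, OPC_REFLECT8 = 1 };

/* Models the multiplexer after the wiring: for REFLECT8 the 8 most
 * significant bits are moved to the LSB end and the rest become zero. */
static uint32_t reflecter_model(enum reflect_opcode opcode, uint32_t operand)
{
    uint32_t wired = crosswire32(operand);
    if (opcode == OPC_REFLECT8)
        return wired >> 24;
    return wired;
}
```

For example, reflecter_model(OPC_REFLECT8, 0xD2) yields 0x4B, which is the 1101 0010 to 0100 1011 example from the text. Because the 8-bit case reuses the 32-bit wiring, both operations share the same short combinational path.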


6 Creating the Custom Operation

Now we have decided which operation to accelerate and what its latency is.

Next we will create a function unit implementing the operation and add it to our processor design. First, a description of the semantics of the new operation must be added at least to the Operation Set Abstraction Layer (OSAL, Sections 2.2.6 and 4.3). OSAL stores the semantic properties of the operation, which include the simulation behavior, operand count etc., but not the latency. OSAL definitions can be added by using the OSAL GUI, OSEd (Section 4.2.1).

Synthesis or simulation at the VHDL level requires that at least one function unit implementing the operation is added to the Hardware Database (Section 2.2.3). In this tutorial we add the FU implementation for our custom operation so that the processor implementation can be generated, but omit the cost data required for cost estimation. In the general case, cost data should be added to the cost database as well.

1 Using Operation Set Editor (OSEd) to add the operation data.

OSEd is started with the command

osed &

Create a new operation module, which is a container for a set of operations. You can add a new module in any of the predefined search paths, provided that you have sufficient file system access permissions.

For example, choose directory /home/user/.tce/opset/custom, where user is the name of the user account being used for the tutorial. This directory is intended for the custom operations defined by the current user, and should always have sufficient access rights.

Figure 3.5: Operation Set Editor when adding new operation REFLECT32.

  1. Click the root in the left area of the main window, which opens the list of paths. Right-click on the path name /home/user/.tce/opset/custom. A drop-down menu appears below the mouse pointer.
  2. Select Add module menu item.
  3. Type in the name of the module (for example, `tutorial') and press OK. The module is now added under the selected path.

2 Adding the new operations.

We will now add the operation definitions to the newly created operation module.

  1. Select the module that you just added by right-clicking on its name, displayed in the left area of the main window. A drop down menu appears.
  2. Select Add operation menu item.
  3. Type `REFLECT8' as the name of the operation.
  4. Add one input by pressing the Add button under the operation input list. Select UIntWord as type.
  5. Add one output by pressing the Add button under the operation output list. Select UIntWord as type.
  6. After the inputs and the output of the operation have been added, close the dialog by pressing the OK button. A confirmation dialog will pop up. Press Yes to confirm the action. The operation definition is now added to the module.
  7. Then repeat the steps for operation `REFLECT32', as shown in Fig. 3.5.

3 Defining the simulation behaviour of the operations

The new operations REFLECT8 and REFLECT32 do not yet have simulation behavior models, so we cannot simulate programs that use these operations with the TCE processor simulator. Open again the operation property dialog by right-clicking REFLECT8, then choosing Modify properties. Now press the Open button at the bottom to open an empty behavior source file for the module. Copy-paste (or type if you have the time!) the following code in the editor window:

#include "OSAL.hh"
OPERATION(REFLECT8)
 TRIGGER

 unsigned long data = UINT(1);
 unsigned char nBits = 8;

 unsigned long  reflection = 0x00000000;
 unsigned char  bit;

 /*
  * Reflect the data about the center bit.
  */
 for (bit = 0; bit < nBits; ++bit)
 {
     /*
      * If the LSB bit is set, set the reflection of it.
      */
     if (data & 0x01)
     {
         reflection |= (1UL << ((nBits - 1) - bit));
     }

     data = (data >> 1);
 }

 IO(2) = static_cast<unsigned> (reflection);

 return true;
 END_TRIGGER;
 END_OPERATION(REFLECT8)

OPERATION(REFLECT32)
 TRIGGER

 unsigned long data = UINT(1);
 unsigned char nBits = 32;

 unsigned long  reflection = 0x00000000;
 unsigned char  bit;

 /*
  * Reflect the data about the center bit.
  */
 for (bit = 0; bit < nBits; ++bit)
 {
     /*
      * If the LSB bit is set, set the reflection of it.
      */
     if (data & 0x01)
     {
         reflection |= (1UL << ((nBits - 1) - bit));
     }

     data = (data >> 1);
 }

 IO(2) = static_cast<unsigned> (reflection);

 return true;
 END_TRIGGER;
 END_OPERATION(REFLECT32)

This code defines the behaviour of both operations. Each behaviour definition reflects the input operand (id 1, assigned to variable 'data'), writes the result to the output operand (id 2, assigned from variable 'reflection'), which is the first output, and signals the simulator that all results were computed successfully.

Open file crc.c in your preferred editor and compare the behaviour definitions of the reflect operations with the original reflect function. They are mostly similar except for parameter passing and a few added macros, such as OPERATION(), TRIGGER, IO(2) etc. A custom hardware operation behaviour reads data from the function unit's input ports and writes to its output ports. The value of nBits is determined by the operation code (REFLECT8 or REFLECT32).

Save the code and close the editor. REFLECT8 and REFLECT32 operations now have TCE simulator behaviour models.

4 Compiling operation behavior.

REFLECT-operations have been added to the test module. Before we can simulate the behavior of our operation, the C++-based behavior description must be compiled to a plugin module that the simulator can call.

  1. Right-click on the module name ('tutorial') displayed in the left area to bring up the drop down menu.
  2. Select Build menu item.
  3. Hopefully, no errors were found during the compilation! Otherwise, re-open the behaviour source file and try to locate the errors with the help of the diagnostic information displayed in the build dialog.

After the operation simulation model has been added and compiled, the operation can be simulated. To save time, we skip the operation simulation here; interested readers are referred to Section 4.2.1. You can close OSEd now.

5 Adding a Customized Function Unit to the Architecture.

Now the operation definitions of the custom operations have been added to the Operation Set Abstraction Layer (OSAL) database. Next we need to add at least one function unit (FU) which implements these operations so that they can be used in the processor design. Note the separation between an "operation" and a "function unit" that implements the operation(s): it allows the same OSAL operation definitions to be used in multiple FUs with different latencies.

First, add the architecture of the FU that implements the custom operations to the starting point processor architecture. Let's take a copy of the starting point processor design which we can freely modify and still be able to easily compare the architecture with and without the custom operation support later on:

cp start.adf custom.adf

Open the copy in ProDe:

prode custom.adf &

Figure 3.6: Adding operation REFLECT8 to a new function unit in ProDe

Then:

  1. Add a new function unit to the design: right-click the canvas and select Add>Function Unit. Name the FU "REFLECTER". In the Function unit dialog, add one output port (named output1) and one input port (named trigger). When adding the input port, set the port as triggering by checking the Triggers checkbox. Writing to this port starts the execution of the operation.
  2. Add the operation "REFLECT8" we defined to the FU: Add from opset>REFLECT8>OK and set the latency to 1. Click on the REFLECT8 operation and ensure that the operation input is bound to the input port and the output is bound to the output port. Check that the operand usage is such that the input is read at cycle 0 and the result is written at the end of the cycle (so it can be read from the FU on the next cycle), as in Fig. 3.6. Thus, the latency of the operation is 1 clock cycle.
  3. Repeat the previous step for operation "REFLECT32".
  4. Now an FU that supports the custom operations has been added to the architecture. Next, connect the FU to the rest of the architecture. This is most easily done by selecting Tools->Fully Connect IC, which connects all FUs to all the buses. Save the architecture description by clicking Save.

7 Use the custom operation in C code.

To get some benefits from the added custom hardware, we must use it from the C code. This is done by replacing a C statement with a custom operation invocation.

Let us first make a working copy of the original C code, so that the original stays intact.

cp crc.c crc_with_custom_op.c

Then open crc_with_custom_op.c in your preferred text editor.

  1. Add #include "tceops.h" to the top of the file. This includes automatically generated macros which allow us to use specific operations from C code without getting our hands dirty with inline assembly.

    Usage of these macros is as follows:

     _TCE_<opName>(input1, ... , inputN, output1, ... , outputN);
    

    where <opName> is the name of the operation in OSAL. Number of input and output operands depends on the operation. Input operands are given first and they are followed by output operands if any.

    In our case we need to write operands into the reflecter and read the result from it. We named the operations ``REFLECT8'' and ``REFLECT32'', thus the macros we are going to use are as follows:

     _TCE_REFLECT8(input1, output);
     _TCE_REFLECT32(input1, output);
    

    Now we will modify the crcFast function to use the custom op. First declare 2 new variables at the beginning of the function:

     crc input;
     crc output;
    

    These will help in using the reflect FU macro.

    Take a look at the REFLECT_DATA and REFLECT_REMAINDER macros. The first one uses the magic number 8 as the bit width, and its argument X is the data. This macro is used in the for-loop of crcFast().

    In the for-loop, the input data of the reflect function is read from the array message[]. Let us modify this so that at the beginning of the loop the input data is read into the input variable. Then we will use the _TCE_REFLECT8 macro to run the custom operation, and finally replace the REFLECT_DATA macro with the output variable. After these modifications the body of the for-loop should look like this:

     input = message[byte];
     _TCE_REFLECT8(input, output);
     data = (unsigned char) output ^ (remainder >> (WIDTH - 8));
     remainder = crcTable[data] ^ (remainder << 8);
    

    Next we will modify the return statement. Originally it uses REFLECT_REMAINDER macro where nBits is defined as WIDTH and data is remainder. Simply use _TCE_REFLECT32 macro before return statement and replace the original macro with the variable output:

     _TCE_REFLECT32(remainder, output);
     return (output ^ FINAL_XOR_VALUE);
    

    And now we are ready. Remember to save the file.

  2. Compile the C code that uses the custom operation into a parallel TTA program for the new architecture, which includes an FU with the custom operation:

    tcecc -O3 -a custom.adf -o crc_with_custom_op.tpef -k result \
    crc_with_custom_op.c main.c

  3. Simulate the parallel program. This time we will use the command line simulator ttasim. We will also enable writing of a bus trace, which means that the simulator writes a text file containing the bus values of the processor for every executed clock cycle. This bus trace data will be used to verify the processor RTL implementation. Start the simulator with the command:

    ttasim

    Then enable the bus trace setting:

    setting bus_trace 1

    Load architecture and program and run the simulation

    mach custom.adf

    prog crc_with_custom_op.tpef

    run

    Verify that the result is the same as before (x /u w result); it should again be 0x62488e82. Check the cycle count with info proc cycles and compare it to the cycle count of the version which does not use a custom operation. You should see a very noticeable drop compared to the starting point architecture without the custom operations. Write this cycle count down for a later step and quit ttasim.

    The simulator execution also created a file crc_with_custom_op.tpef.bustrace which contains the bus trace.
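While editing the C code in step 1, it can be convenient to keep a version that still builds with an ordinary host compiler, so the checksum logic can be tested outside the TTA toolchain. The sketch below wraps the two macros from this tutorial behind small functions with a software fallback. The preprocessor flag USE_TCE_CUSTOM_OPS and the function names are hypothetical; the flag would have to be defined by you on the tcecc command line (assuming it accepts -D like other C compilers), and the sketch assumes a 32-bit operand type.

```c
#include <stdint.h>

#ifdef USE_TCE_CUSTOM_OPS /* hypothetical flag, set only for TTA builds */
#include "tceops.h"
#endif

/* Software fallback: reflect the lowest n_bits of data about the center. */
static uint32_t reflect_sw(uint32_t data, unsigned n_bits)
{
    uint32_t reflection = 0;
    for (unsigned bit = 0; bit < n_bits; ++bit) {
        if (data & 1u)
            reflection |= (uint32_t)1 << ((n_bits - 1) - bit);
        data >>= 1;
    }
    return reflection;
}

static uint32_t reflect8(uint32_t data)
{
#ifdef USE_TCE_CUSTOM_OPS
    uint32_t output;
    _TCE_REFLECT8(data, output);  /* custom operation on the REFLECTER FU */
    return output;
#else
    return reflect_sw(data, 8);
#endif
}

static uint32_t reflect32(uint32_t data)
{
#ifdef USE_TCE_CUSTOM_OPS
    uint32_t output;
    _TCE_REFLECT32(data, output); /* custom operation on the REFLECTER FU */
    return output;
#else
    return reflect_sw(data, 32);
#endif
}
```

With the flag undefined, the functions behave like the original software reflect for 8 and 32 bits; with it defined, each call maps to one custom operation. Whether the wrapper functions are inlined by tcecc is not verified here, so check the cycle count if you use this pattern.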


8 Adding HDL implementation of the FU to the hardware database (HDB).

Now we have seen that the custom operation accelerates our application. Next we'll add a VHDL implementation of the custom FU to the Hardware Database (HDB). This way we will be able to generate a VHDL implementation of our processor.

If you want to skip this phase you can use the given tour_example.hdb instead of creating it yourself.

Start HDBEditor (see Section 4.8):

hdbeditor &

TCE needs some data of the FU implementation in order to be able to automatically generate processors that include the FU.

  1. Add the function unit from the custom.adf file (edit->add->FU architecture from ADF). You can leave the checkboxes "parametrized width" and "guard support" unchecked, which is the default. Then define an implementation for the added function unit entry: right-click reflect -> Add implementation....

  2. Open file tour_vhdl/reflect.vhdl that was provided in the tutorial package with the editor you prefer, and take a look. This is an example implementation of a TTA function unit performing the custom 'reflect8' and 'reflect32' operations.

    Figure 3.7: Adding the implementation of custom operations to HW database

  3. The HDB implementation dialog needs the following information from the VHDL:

    1. Name of the entity and the naming of the FU interface ports.

    Name the implementation after the top level entity: "fu_reflect".

    By examining the VHDL code you can easily spot the clock port (clk), reset (rstx) and global lock port (glock). Operation code (opcode) port is t1opcode. Write these into the appropriate text boxes. You do not have to fill the Global lock req. port field because the function unit does not stall, i.e. does not request a global lock to the processor during its execution.

    2. Opcodes.

    Check that the operation codes match those at the top of the VHDL file. REFLECT32 has operation code "0" and REFLECT8 has operation code "1".

    The operation codes must always be numbered according to the alphabetical order of the OSAL operation names, starting at 0. For example, in this case REFLECT32 comes before REFLECT8 in alphabetical order (i.e. 3 comes before 8).

    3. Parameters.

    Parameters can be found from the VHDL generics. On top of the file there is one parameter: busw. It tells the width of the transport bus and thus the maximum width of input and output operands.

    Thus, add parameter named busw, type it as integer and set the value to 32 in the Parameter dialog.

    4. Architecture ports.

    These settings define the connection between the ports in the architectural description of the FU and the VHDL implementation. Each input data port in the FU is accompanied with a load port that controls the updating of the FU input port registers.

    Choose a port in the Architecture ports dialog and click edit. Name of the architecture port p1 is t1data (as in VHDL) and associated load port is t1load. Width formula is the parameter busw. No guard port is needed in this case.

    Name the output port (p2) r1data; its width formula is also busw because the output port writes to the bus. The output port does not have a load port.

    5. Add VHDL source file.

    Add the VHDL source file into the Source code dialog. Notice that the HDL files must be added in the compilation order (see section 4.8). But now we have only one source file so we can simply add it without considering the compilation order (Add -> Browse -> tour_vhdl/reflect.vhdl).

    Now you are done with adding the FU implementation, see Fig. 3.7. Click OK.
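The opcode numbering rule from the Opcodes step can be checked mechanically. The sketch below sorts operation names and uses the resulting index as the opcode. It assumes that "alphabetical order" corresponds to plain byte-wise strcmp ordering, which matches the REFLECT32/REFLECT8 example but has not been verified against TCE's own implementation; the function names are illustrative.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings (byte-wise ordering). */
static int compare_names(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sorts `names` in place and returns the index of `wanted` in the sorted
 * order, i.e. its opcode under the assumed rule, or -1 if it is absent. */
static int opcode_of(const char **names, size_t count, const char *wanted)
{
    qsort(names, count, sizeof names[0], compare_names);
    for (size_t i = 0; i < count; ++i) {
        if (strcmp(names[i], wanted) == 0)
            return (int)i;
    }
    return -1;
}
```

Given the two names from this tutorial, REFLECT32 sorts first because the character '3' precedes '8', so it gets opcode 0 and REFLECT8 gets opcode 1, as in the VHDL file.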


9 Generating the VHDL and memory images

In this step we generate the VHDL implementation of the processor, and the bit image of the parallel program.

1 Select Function Unit Implementations

You can either use the given custom_operations.idf included in the tutorial files or select the implementations yourself. If you use the given file replace custom.idf with custom_operations.idf in the following commands.

Next, we must select implementations for all components in the architecture. Each architecture component can be implemented in multiple ways, so we must choose one implementation for each component to be able to generate the implementation for the processor.

This can be done in the ProDe tool:

prode custom.adf

Implementations for the FUs are then selected in Tools>Processor Implementation.... Note that the selection window is not currently very informative about the different implementations, so a safe bet is to select an implementation with parametrizable width/size.

  1. Selecting implementations for register files, immediate units and function units can be done in two different ways.

    1. Manual selection - user chooses the implementations from HDB files.

      1. Select implementation for RF: Click the RF name, 'Select Implementation', find the TCE's default HDB file (PREFIX/share/tce/hdb/asic_130nm_1.5V.hdb) from your tce installation path and select an implementation for the RF from there.

      2. Next, select an implementation for the boolean RF as above, but this time select a guarded implementation, i.e., one which has the word ``guarded_0'' in its name.

      3. Similarly, select implementations for the function units from TCE's default HDB. Notice that it is vital that you choose the implementation for the LSU from asic_130nm_1.5V.hdb. Then select an implementation for the reflecter, but this time use the tour.hdb created earlier to find the FU we added that supports the REFLECT custom operations.

    2. Automatic selection - ProDe chooses the implementations from HDB files.

      1. Select implementations for register files and function units all at once: Click Auto Select Implementations, find TCE's default HDB file from your TCE installation path (PREFIX/share/tce/hdb/asic_130nm_1.5V.hdb), make sure the 'Register Files' and 'Function Units' checkboxes are checked and press 'Find'. A dialog should appear saying 4 implementations were found: 2 for register files and 2 for function units. Click 'Yes'. If fewer than 4 implementations were found in your case, refer to the manual implementation selection above.

      2. Browse to the Function Units page if you were not on it already. You should see that the reflecter FU is still without an implementation (because it is not defined in asic_130nm_1.5V.hdb). Select Auto Select Implementations again, find tour.hdb as the HDB file, and press 'Find'. 1 FU implementation should be found (the reflecter); click 'Yes'. Now all your register files and function units should have implementations.

  2. Next select the IC/Decoder generator plugin used to generate the decoder in the control unit and interconnection network: Browse... (installation_path)/share/tce/icdecoder_plugins/base/DefaultICDecoderPlugin.so>OK. This should be selected by default.

  3. Enable bus tracing from the Implementation-dialog's IC / Decoder Plugin tab. Set the bustrace plugin parameter to ``yes'' and the bustracestartingcycle to ``5''. The IC will now have a component which writes the bus value from every cycle to a text file. Notice that this option cannot be used if the processor is synthesized.

    You do not have to care about the HDB file text box because we are not going to use cost estimation data.

  4. Click ``Save IDF...''

2 Generate the VHDL for the processor using Processor Generator (ProGe).

You can start processor generation from ProDe's implementation selection dialog: Click ``Generate Processor''. For Binary Encoding Map, select ``Generate new'' (see Fig. 3.8).

For the target directory, click ``Browse'', create a new directory proge-output and select it. Then click OK to create the processor.

Figure 3.8: Generating TTA VHDL with ProGe

Or alternatively execute ProGe from command line:

generateprocessor -t -i custom.idf -o proge-output custom.adf

Flag -t generates a testbench, -i defines the implementation file, and -o the output directory. Now the proge-output directory contains the VHDL implementation of the designed processor, except for the instruction memory width package, which will be created by Program Image Generator. Take a look at what the directory includes: the RF and FU implementations are collected under the vhdl subdirectory, and the interconnection network generated to connect the units is in the gcu_ic subdirectory. The tb subdirectory contains testbench files for the processor core. In addition, there are 4 shell scripts for compiling and simulating the VHDL code in Modelsim and GHDL.

3 Generate instruction memory bit image using Program Image Generator.

Finally, to get our shiny new processor some bits to chew on, we use generatebits to create instruction memory and data memory images:

generatebits -d -w 4 -p crc_with_custom_op.tpef -x proge-output custom.adf

Flag -d creates data images, -w 4 sets the data memory width to 4 MAUs (Minimum Addressable Units, bytes in this case), -p defines the program, and -x defines the HDL directory. The ADF is given without a flag.

Now the file crc_with_custom_op.img includes the instruction memory image in ``ascii 0/1'' format. Each line in that file represents a single instruction. Thus, you can get the count of instructions by counting the lines in that file:

 wc -l crc_with_custom_op.img
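
Because each line holds one instruction as a string of 0/1 characters, the length of a line also gives the instruction width in bits. The following is a minimal sketch for a POSIX shell, assuming the image ends with a newline and all lines have the same length; the function name imem_stats is hypothetical and not part of TCE:

```shell
# imem_stats: hypothetical helper (not part of TCE).
# Prints the instruction count and the instruction width in bits
# of an "ascii 0/1" instruction memory image.
imem_stats() {
    img="$1"
    # One instruction per line, so the line count is the instruction count.
    count=$(wc -l < "$img" | tr -d ' ')
    # Width = number of characters on the first line (newline removed).
    width=$(head -n 1 "$img" | tr -d '\n' | wc -c | tr -d ' ')
    echo "instructions: $count, width: $width bits"
}
```

It could be called as imem_stats crc_with_custom_op.img after running generatebits.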

Accordingly, the file crc_with_custom_op_data.img contains the data memory image of the processor. Program Image Generator also created the file proge-output/vhdl/tta0_imem_mau_pkg.vhdl, which contains the width of the instruction memory (each designed TTA can have a different instruction width). The _imem_mau_pkg.vhdl file name is prefixed with the top level entity name, which in this case is ``tta0''.

4 Simulation and verification

If you have GHDL or Modelsim installed, you can now simulate the processor VHDL. First cd to the proge-output directory:

cd proge-output

Then compile and simulate the testbench. With GHDL:

./ghdl_compile.sh

./ghdl_simulate.sh

Or with Modelsim:

./modsim_compile.sh

./modsim_simulate.sh

Compilation gives a couple of warnings from Synopsys' std_arith package but that should do no harm. Simulation will take some time as bus trace writing is enabled. If you did not change the memory widths in ProDe (Section 3.1.2), the simulation will crash and print ``Killed''. There will be many warnings at 0 ns due to uninitialized signals but you can ignore them.

After simulation, you should see the message ``./testbench:info: simulation stopped by --stop-time''. The simulation produces the file ``bus.dump'', which looks like this:

0,00000000
1,00000004
2,00007ff8
3,00000018
    ...

As the testbench is run for a constant number of cycles, we need to extract the relevant part of the bus dump for verification. This can be done with the command:

head -n <number of cycles> bus.dump > sim.dump

where <number of cycles> is the number of cycles in the previous ttasim execution (same as the line count of crc_with_custom_op.tpef.bustrace). Then compare the trace dumps from the VHDL simulation and the architecture simulation:

diff -u sim.dump ../crc_with_custom_op.tpef.bustrace

If the command does not print anything, the dumps are equal and the RTL simulation matches the ttasim simulation. Now you have successfully added a custom operation, verified it, and gained a notable performance increase. Well done!
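
The truncate-and-compare steps above can also be combined so that the cycle count is read from the ttasim trace instead of being typed by hand. This is only a sketch for a POSIX shell; the function name compare_traces is hypothetical and not part of TCE:

```shell
# compare_traces: hypothetical helper (not part of TCE).
# Usage: compare_traces <rtl bus dump> <ttasim bus trace>
# Truncates the RTL dump to the length of the ttasim trace and diffs them.
compare_traces() {
    rtl_dump="$1"
    ref_trace="$2"
    # Cycle count of the ttasim run = line count of its bus trace.
    cycles=$(wc -l < "$ref_trace" | tr -d ' ')
    # Keep only the relevant cycles of the RTL dump; diff returns 0 on a match.
    if head -n "$cycles" "$rtl_dump" | diff -u - "$ref_trace"; then
        echo "traces match for $cycles cycles"
    else
        echo "traces differ"
        return 1
    fi
}
```

From the proge-output directory it could be called as compare_traces bus.dump ../crc_with_custom_op.tpef.bustrace.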


10 Further acceleration by adding custom operation to large TTA

As a final step, let us add the custom operation to the large TTA.

cp large.adf large_custom.adf

prode large_custom.adf

Add the reflecter FU from tour.hdb as done earlier. Fully connect the IC, as in Fig. 3.9, and save.

Figure 3.9: Processor Designer (prode) with large_custom.adf

tcecc -O3 -a large_custom.adf -o crc_with_custom_op.tpef -k result crc_with_custom_op.c main.c

ttasim -a large_custom.adf -p crc_with_custom_op.tpef

info proc cycles

Voila! Now you have a lightning fast TTA.

11 Final Words

This tutorial is now finished. Now you should know how to customize the processor architecture by adding resources and custom operations, as well as generate the processor implementation and its instruction memory bit image.

This tutorial used a ``minimalistic'' processor architecture as the starting point. The machine had only 1 transport bus and 5 registers, so it could not fully exploit the parallel capabilities of TTA. Then we increased the resources in the processor and the cycle count dropped to half. Adding 2 simple custom operations to the starting point architecture gives a similar speedup, and in the large machine the speedup in cycle count is about 50x.
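
The speedup figures here are ratios of cycle counts as reported by ttasim's info proc cycles. A minimal sketch of that arithmetic, assuming awk is available; the function name speedup is hypothetical, and the numbers in the usage line are placeholders, not measured results:

```shell
# speedup: hypothetical helper (not part of TCE).
# Usage: speedup <baseline cycles> <new cycles>
# Prints baseline/new with two decimals.
speedup() {
    awk -v old_cycles="$1" -v new_cycles="$2" \
        'BEGIN { printf "%.2fx\n", old_cycles / new_cycles }'
}
```

For example, speedup 1000 500 prints 2.00x; substitute the cycle counts from your own simulation runs.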

Pekka Jääskeläinen 2018-03-12