This tutorial goes through most of the tools in TCE using a fairly simple example application. It starts from C code and ends up with the VHDL of the processor and a bit image of the parallel program. The tutorial also explains how to accelerate an algorithm by customizing the instruction set, i.e., by using custom operations. Going through the whole tutorial takes a few hours.
The tutorial file package is available at:
http://tce.cs.tut.fi/tutorial_files/tce_tutorials.tar.gz
Fetch and unpack it to a working directory and then enter the directory:
> wget http://tce.cs.tut.fi/tutorial_files/tce_tutorials.tar.gz
> tar -xzf tce_tutorials.tar.gz
> cd tce_tutorials/tce_tour
> ls -la
total 84
drwxr-xr-x 3 tce tce  4096 2010-05-28 11:40 .
drwx------ 7 tce tce  4096 2012-05-18 13:22 ..
-rw------- 1 tce tce  5913 2010-03-08 20:01 crc.c
-rw------- 1 tce tce  1408 2008-11-07 11:35 crc.h
-rw------- 1 tce tce  3286 2008-11-07 11:35 crcTable.dat
-rw-r--r-- 1 tce tce  2345 2010-03-08 13:04 custom_operation_behavior.cc
-rw-r--r-- 1 tce tce   855 2010-05-28 11:41 custom_operations.idf
-rw------- 1 tce tce  1504 2010-03-08 20:01 main.c
-rw-r--r-- 1 tce tce 45056 2010-03-10 16:09 tour_example.hdb
drwxr-xr-x 2 tce tce  4096 2010-05-28 11:40 tour_vhdl
The test application computes a 32-bit CRC (Cyclic Redundancy Check) value for a block of data, in this case 10 bytes. The C implementation was written by Michael Barr and is published in the public domain. It contains two different versions of the CRC computation, but we will use only the fast one.
The program consists of two separate files: main.c contains the simple main function and crc.c contains the actual implementation. Open crc.c in your preferred editor and take a look at the code. The algorithm performs modulo-2 division a byte at a time, reflects the input data (MSB becomes LSB) depending on the previous remainder, and XORs the remainder with a predefined polynomial (a 32-bit constant). The main difference between the crcSlow and crcFast implementations is that crcFast exploits a precalculated lookup table. This is a common way of optimizing an algorithm.
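To get an idea of the structure, the core of crcFast() looks roughly like the sketch below. This is a paraphrase of the code in crc.c, not a verbatim copy; the macro and constant names (REFLECT_DATA, REFLECT_REMAINDER, WIDTH, INITIAL_REMAINDER, FINAL_XOR_VALUE, crcTable) follow the ones used later in this tutorial, but details may differ slightly.

crc crcFast(unsigned char const message[], int nBytes)
{
    crc remainder = INITIAL_REMAINDER;
    unsigned char data;
    int byte;

    /* Divide the message by the polynomial, one byte at a time. */
    for (byte = 0; byte < nBytes; ++byte) {
        data = (unsigned char) REFLECT_DATA(message[byte]) ^ (remainder >> (WIDTH - 8));
        remainder = crcTable[data] ^ (remainder << 8);
    }

    /* The final, reflected remainder is the CRC. */
    return (crc) (REFLECT_REMAINDER(remainder) ^ FINAL_XOR_VALUE);
}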
Copy the minimal.adf file included in TCE distribution to a new ADF file:
cp $(tce-config --prefix)/share/tce/data/mach/minimal.adf start.adf
The file describes a minimalistic architecture containing just enough resources that the TCE compiler can still compile programs for it. We use this minimal architecture as the starting point. Function units in minimal.adf are selected from the hardware database (HDB, Section 2.2.3) so we are able to generate a VHDL implementation of the processor automatically later in the tutorial.
You can view the architecture using the graphical Processor Designer (ProDe, Section 4.1) tool:
prode start.adf &
Fig. 3.1 shows what ProDe should look like. There is 1 bus and 5 units: a global control unit (GCU), 2 register files, 1 ALU, and 1 LSU.
You'll learn how to edit the TTA with ProDe later in this tutorial.
If you want to simulate the generated processor with GHDL later (Section 3.1.9), you should decrease the amount of memory in the processor already now. Otherwise GHDL, when executing the generated testbench, might consume a tremendous amount of memory on your computer and even crash. Select Edit -> Address Spaces in ProDe, then edit the bit widths of the data and instruction address spaces and reduce them; a small address space is plenty for our case. Next, compile the application for this architecture:
tcecc -O3 -a start.adf -o crc.tpef -k result main.c crc.c
In addition to the source files, the compiler needs the TTA architecture definition start.adf. It produces a parallel program called crc.tpef which can only be executed on this architecture. The -k switch tells the compiler to keep the result symbol (the variable on line 29 of main.c) in the generated program, so that the contents of the variable can be accessed by name in the simulator later on.
Successful compilation does not print anything, and we can now simulate the program. Let's use the graphical user interface version of the simulator, called Proxim (Section 6.1.6):
proxim start.adf crc.tpef &
The simulator loads the architecture definition and the program and waits for commands. Proxim shows one line per instruction; since the minimal TTA has only one bus, there is only one instruction slot (column), and the total number of instructions equals the number of lines in the program view.
Execute the program by clicking ``Run''; Proxim should now look like Fig. 3.2. Write down the cycle count shown in the bottom bar of the simulator; you can use Table 3.1 for future comparisons.
Figure 3.2: Proxim after executing the program (eps/proxim_scrshot.eps).
You can check the result straight from the processor's memory by writing this command to the command line at the bottom of the simulator:
x /u w result
The /u w option specifies the data width (a word, i.e. 4 bytes), and result is the symbol we told the compiler to preserve earlier. The correct checksum result is 0x62488e82.
Processor resource utilization data can be viewed with command:
info proc stats
This will output a lot of information like the utilization of transport buses, register files and function units, for example:
utilizations
------------
buses:
  B1 87.7559% (4415 writes)
sockets:
  lsu_i1 17.4319% (877 writes)
  lsu_o1 10.4949% (528 writes)
  ...
operations executed in function units:
LSU:
  ...
  TOTAL 17.4319% (877 triggers)
ALU:
  ...
  TOTAL 25.2037% (1268 triggers)
...
register accesses:
RF:
  0 855 reads, 0 guard reads, 1 writes
  ...
Your output may vary depending on the compiler version, the optimizations used and your architecture. For more information on the simulator's command line tools and syntax, you may use the
help
command.
Proxim can also show various other pieces of information about the program's execution and its processor resource utilization. For example, to check the utilization of the resources of our architecture, select View > Machine Window from the top menu. The most heavily utilized parts of the processor are visualized with a darker red color.
Figure: Proxim Machine Window showing resource utilization (eps/proxim_util.eps).
Table 3.1: Cycle counts of the different architectures (to be filled in during the tutorial).

ADF              | Cycle count | Approx. relative cycle count | Section(s)    | Comment
-----------------+-------------+------------------------------+---------------+--------------------------------
start.adf        |             |                              | 3.1.3         |
large.adf        |             |                              | 3.1.4         | start + 3 buses
large.adf        |             |                              | -''-          | start + 3 buses + RF
large.adf        |             |                              | -''-          | start + 3 buses + RF + new ALU
custom.adf       |             |                              | 3.1.5 - 3.1.9 |
large_custom.adf |             |                              | 3.1.10        |
There are 3 basic ways to accelerate computation in TTAs: adding transport buses so more data moves fit in one instruction, adding registers (or register files) to reduce memory traffic, and adding function units, possibly implementing custom operations. Let's start by adding resources to the minimal machine. First take a copy of the starting point architecture:
cp start.adf large.adf
and open the new architecture in ProDe:
prode large.adf &
A new transport bus can be added simply by selecting the current bus, pressing ``ctrl+c'' to copy it, and then pressing ``ctrl+v'' to paste it. Add 3 more buses. After you have added the buses, you have to connect them to the sockets. The easiest way to do this is to select ``Tools->Fully connect IC''. Save the architecture and recompile the source code for the new architecture:
tcecc -O3 -a large.adf -o crc.tpef -k result crc.c main.c
This time simulate the program using the command line instruction set simulator:
ttasim -a large.adf -p crc.tpef
The command will load the architecture and program and execute it. Once the simulation is finished, you should see the ttasim command prompt:
> ttasim -a large.adf -p crc.tpef
(ttasim)
Now when you check the cycle count from the simulator:
info proc cycles
You might see a significant drop in cycles. Also check the processor utilization statistics from the simulator with command:
info proc stats
The ``operations'' table in the previous simulator statistics shows that a lot of load and store operations are executed. As the architecture has only 5 general purpose registers, this tells us that there is a lot of register spilling to memory. Let's see how the number of registers affects the cycle count. There are two ways to add registers: we can either increase the number of registers in the existing register file or add a new register file.
Let's try the latter option, because this way we also increase the number of registers that can be accessed simultaneously in one clock cycle. This can be done by selecting the RF and using copy and paste. Then connect it to the IC, save, and recompile and simulate as before. The simulation statistics should indicate a performance increase. As expected, the number of load and store operations decreased. But notice also that the number of add operations decreased quite a lot. The reason is simple: addition is used to calculate stack memory addresses for the spilled registers.
The architecture could still be modified further to drop the cycle count, but let's settle for this now and move on to custom operations.
First of all, it would be quite simple and efficient to implement the whole CRC calculation in hardware. However, using the entire CRC function as a custom operation would be rather pointless, as the benefits of a processor based implementation would diminish. Instead, we will concentrate on accelerating smaller parts of the algorithm, which is the most useful approach for most algorithms.
The steps needed to add and exploit a custom operation are:
1. Find the part of the code to accelerate and decide the latency of the operation.
2. Define the operation semantics (simulation behavior) in OSAL.
3. Add a function unit implementing the operation to the processor architecture (ADF).
4. Use the custom operation from the C code.
5. Add an implementation of the function unit to HDB so the processor VHDL can be generated.
Finding the operation to be optimized is quite obvious in this case if you look at the function crcFast(). It consists of a for-loop in which the function reflect() is called through the macro REFLECT_DATA. If you look at the actual function (shown also in Fig. 3.4) you can see that it is quite simple to implement in hardware, but requires many instructions if done with basic operations in software. The function ``reflects'' the bit pattern around the middle point like a mirror. For example, the bit pattern 1101 0010 becomes 0100 1011. The main complexity of the function is that the bit pattern width is not fixed. Fortunately, the width cannot be arbitrary: if you examine the crcFast() function and the reflect macros, you can spot that reflect() is only called with 8 and 32 bit widths (unsigned char and 'crc', which is an unsigned long).
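For reference, the software implementation of reflect() in crc.c is essentially the following loop (a sketch; it mirrors the operation behavior code shown later in this tutorial, so the variable names should be close to the original):

static unsigned long reflect(unsigned long data, unsigned char nBits)
{
    unsigned long reflection = 0x00000000;
    unsigned char bit;

    /* Reflect the data about the center bit: bit i moves to bit (nBits - 1 - i). */
    for (bit = 0; bit < nBits; ++bit) {
        if (data & 0x01) {
            reflection |= (1 << ((nBits - 1) - bit));
        }
        data = (data >> 1);
    }
    return reflection;
}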
A great advantage of TCE is that the operation semantics, processor architecture, and implementation are separate abstractions. This simplifies designing custom operations since you can simulate your design by simply defining the simulation behaviour of the operation and setting the latency of the operation to the processor architecture definition. This is nice as you do not need an actual hardware implementation of the operation at this point of the design, but can evaluate different custom operation possibilities at the architectural level. However, this brings up an awkward question: how to determine the latency of the operation? Unrealistic or too pessimistic latency estimates can produce inaccurate performance results and bias the analysis.
One approach to the problem is to take an educated guess and simulate some test cases with different custom operation latencies. This way you can determine the latency range in which the custom operation accelerates your application to a satisfactory level. For example, a custom operation with a 1-cycle latency might give a good speedup, while a higher latency might give only a 3% speedup. After this you can sketch how the operation could be implemented in hardware, or consult someone knowledgeable in hardware design, to figure out whether the custom operation can be reasonably implemented within the latency constraint.
Another approach is to try and determine the latency by examining the operation itself and considering how it could be implemented. This approach requires some insight in digital design.
Besides latency, you should also consider the size of the custom function unit. It will consume extra die area, but the acceptable size is always case-specific, especially when the area taken by memories is accounted for. Accurate size estimation requires the actual HW implementation and synthesis.
Let us consider the reflect function. If the width were fixed, we could implement the reflection by hard wiring (and registering the output), because the operation only moves bits to other locations in the word. This could easily be done in one clock cycle, as in the right side of Fig. 3.4. But we need two different bit widths, so it is somewhat more complicated. We can design the HW in such a way that it has two operations: one for 8-bit data and another for 32-bit data. One way to implement this is to have a 32-bit wide crosswiring and register the output. The 8-bit value would be reflected to the 8 MSB bits of the 32-bit wiring, so we then need to move the 8 MSB bits to the LSB end and set the rest to zero. This move can be implemented with multiplexers. Concerning the latency, all of this can easily be done within one clock cycle, as not much logic is needed.
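To make the idea concrete, the following small C model describes the intended datapath behavior. It is only an illustration of the wiring and multiplexing described above, not the actual VHDL used later in the tutorial:

/* Behavioral model of the reflecter FU datapath: a full 32-bit
 * bit-reversal (pure wiring in hardware) followed by a multiplexer
 * that moves the 8 reflected bits to the LSB end when the 8-bit
 * operation is selected. */
unsigned reflect_fu_model(unsigned data, int is_8bit_op)
{
    unsigned reversed = 0;
    int i;
    for (i = 0; i < 32; ++i) {            /* crosswiring: bit i -> bit 31 - i */
        reversed |= ((data >> i) & 1u) << (31 - i);
    }
    /* For the 8-bit operation the reflected byte ends up in bits 31..24;
     * shift it down and let zeros fill the upper bits. */
    return is_8bit_op ? (reversed >> 24) : reversed;
}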
Now we have decided the operation to be accelerated and its latency.
Next we will create a function unit implementing the operation and add it to our processor design. First, a description of the semantics of the new operation must be added at least to Operation Set Abstraction Layer (Sections 2.2.6 and 4.3). OSAL stores the semantic properties of the operation, which includes the simulation behavior, operand count etc., but not the latency. OSAL definitions can be added by using the OSAL GUI, OSEd (Section 4.2.1).
Synthesis or simulation at the VHDL level requires that at least one function unit implementing the operation is added to the Hardware Database (Section 2.2.3). In this tutorial we add the FU implementation of our custom operation so that the processor implementation can be generated, but omit the cost data required for cost estimation. In the general case, cost data should be added to the cost database as well.
OSEd is started with the command
osed &
Create a new operation module, which is a container for a set of operations. You can add a new module in any of the predefined search paths, provided that you have sufficient file system access permissions.
For example, choose the directory /home/user/.tce/opset/custom, where user is the name of the user account being used for the tutorial. This directory is intended for custom operations defined by the current user, and should always have sufficient access rights. Inside the new module, define two operations named REFLECT8 and REFLECT32, each with one input and one output operand, and give them the following simulation behavior:
#include "OSAL.hh"

OPERATION(REFLECT8)
TRIGGER
    unsigned long data = UINT(1);
    unsigned char nBits = 8;
    unsigned long reflection = 0x00000000;
    unsigned char bit;

    /*
     * Reflect the data about the center bit.
     */
    for (bit = 0; bit < nBits; ++bit) {
        /*
         * If the LSB bit is set, set the reflection of it.
         */
        if (data & 0x01) {
            reflection |= (1 << ((nBits - 1) - bit));
        }
        data = (data >> 1);
    }

    IO(2) = static_cast<unsigned> (reflection);
    return true;
END_TRIGGER;
END_OPERATION(REFLECT8)

OPERATION(REFLECT32)
TRIGGER
    unsigned long data = UINT(1);
    unsigned char nBits = 32;
    unsigned long reflection = 0x00000000;
    unsigned char bit;

    /*
     * Reflect the data about the center bit.
     */
    for (bit = 0; bit < nBits; ++bit) {
        /*
         * If the LSB bit is set, set the reflection of it.
         */
        if (data & 0x01) {
            reflection |= (1 << ((nBits - 1) - bit));
        }
        data = (data >> 1);
    }

    IO(2) = static_cast<unsigned> (reflection);
    return true;
END_TRIGGER;
END_OPERATION(REFLECT32)
This code contains the behaviors of both operations. Each behavior definition reflects the integer input operand (id 1, assigned to the variable 'data'), writes the result to the output operand (id 2, assigned from the variable 'reflection'), which is the first output, and signals the simulator that all results were computed successfully by returning true.
Open the file crc.c in your preferred editor and compare the behavior definitions of the reflect operations with the original reflect function. The code is mostly the same except for the parameter passing and a few added macros, such as OPERATION(), TRIGGER and IO(2). The custom hardware operation behavior reads data from the function unit's input ports and writes to its output ports. The value of nBits is determined by the operation code (REFLECT8 or REFLECT32).
Save the code and close the editor. REFLECT8 and REFLECT32 operations now have TCE simulator behaviour models.
After the operation simulation model has been added and compiled, the operation can be simulated. For the sake of brevity we will skip the operation simulation here; interested readers are referred to Section 4.2.1. You can close OSEd now.
Now the definitions of the custom operations have been added to the Operation Set Abstraction Layer (OSAL) database. Next we need to add at least one function unit (FU) that implements these operations, so that they can be used in the processor design. Note the separation between an ``operation'' and a ``function unit'' that implements the operation(s); it allows the same OSAL operation definition to be used in multiple FUs with different latencies.
First, add the architecture of the FU that implements the custom operations to the starting point processor architecture. Let's take a copy of the starting point processor design, so that we can freely modify it and still easily compare the architectures with and without custom operation support later on:
cp start.adf custom.adf
Open the copy in ProDe:
prode custom.adf &
Then add the custom FU to the architecture: create a new function unit with a 32-bit triggering input port and a 32-bit output port, add the operations REFLECT8 and REFLECT32 to it (they are now available from OSAL), set their latency to one clock cycle as decided above, connect the unit to the interconnection network, and save the architecture.
To get some benefits from the added custom hardware, we must use it from the C code. This is done by replacing a C statement with a custom operation invocation.
Let us first backup the original C code.
cp crc.c crc_with_custom_op.c
Then open crc_with_custom_op.c in your preferred text editor.
The TCE compiler provides a macro for each operation in OSAL, which allows the operation to be invoked directly from C code. Usage of these macros is as follows:
_TCE_<opName>(input1, ... , inputN, output1, ... , outputN);
where <opName> is the name of the operation in OSAL. Number of input and output operands depends on the operation. Input operands are given first and they are followed by output operands if any.
In our case we need to write operands into the reflecter and read the result from it. We named the operations ``REFLECT8'' and ``REFLECT32'', thus the macros we are going to use are as follows:
_TCE_REFLECT8(input1, output);
_TCE_REFLECT32(input1, output);
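If the compiler later complains that these macros are undeclared, make sure the header that declares them is included at the top of the C file. In TCE this header is normally called tceops.h; treat the exact name as an assumption and check your TCE installation if it differs:

#include "tceops.h"   /* provides the _TCE_<opName> operation macros */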
Now we will modify the crcFast function to use the custom op. First declare 2 new variables at the beginning of the function:
crc input;
crc output;
These will help in using the reflect FU macro.
Take a look at the REFLECT_DATA and REFLECT_REMAINDER macros. The first one has the magic number 8 as the bit width and the ``variable'' X as the data; it is used in the for-loop of crcFast().
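For reference, in crc.c these macros are thin wrappers around reflect(), roughly like this (a paraphrase; check the exact definitions in your copy of crc.c):

#define REFLECT_DATA(X)         ((unsigned char) reflect((X), 8))
#define REFLECT_REMAINDER(X)    ((crc) reflect((X), WIDTH))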
The input data of the reflect function is read from the array message[] in the for-loop. Let us modify the loop so that at its beginning the input data is read into the input variable. Then we use the _TCE_REFLECT8 macro to run the custom operation, and finally we replace the REFLECT_DATA macro with the output variable. After these modifications the body of the for-loop should look like this:
input = message[byte];
_TCE_REFLECT8(input, output);
data = (unsigned char) output ^ (remainder >> (WIDTH - 8));
remainder = crcTable[data] ^ (remainder << 8);
Next we will modify the return statement. Originally it uses the REFLECT_REMAINDER macro, where nBits is defined as WIDTH and data is the remainder. Simply use the _TCE_REFLECT32 macro before the return statement and replace the original macro with the variable output:
_TCE_REFLECT32(remainder, output);
return (output ^ FINAL_XOR_VALUE);
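Putting the pieces together, the modified crcFast() should now look roughly like the following (a sketch; the constant names and surrounding declarations follow crc.c and may differ slightly in your copy):

crc crcFast(unsigned char const message[], int nBytes)
{
    crc remainder = INITIAL_REMAINDER;
    unsigned char data;
    int byte;
    crc input;
    crc output;

    /* Divide the message by the polynomial, one byte at a time,
     * using the REFLECT8 custom operation. */
    for (byte = 0; byte < nBytes; ++byte) {
        input = message[byte];
        _TCE_REFLECT8(input, output);
        data = (unsigned char) output ^ (remainder >> (WIDTH - 8));
        remainder = crcTable[data] ^ (remainder << 8);
    }

    /* Reflect the final remainder with the REFLECT32 custom operation. */
    _TCE_REFLECT32(remainder, output);
    return (output ^ FINAL_XOR_VALUE);
}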
And now we are ready. Remember to save the file. Then compile the program for the custom operation architecture:
tcecc -O3 -a custom.adf -o crc_with_custom_op.tpef -k result \
crc_with_custom_op.c main.c
This time we will simulate with bus tracing enabled, so that we can later compare the instruction set simulation against the VHDL simulation. Start the simulator without arguments:

ttasim
Then enable the bus trace setting:
setting bus_trace 1
Load the architecture and the program, and run the simulation:
mach custom.adf
prog crc_with_custom_op.tpef
run
Verify the result with the command x /u w result; it should be the same as earlier (0x62488e82). Check the cycle count with info proc cycles and compare it to the cycle counts of the versions that do not use the custom operation. You should see a very noticeable drop compared to the starting point architecture. Write this cycle count down for a later step and quit ttasim.
The simulator execution also created a file crc_with_custom_op.tpef.bustrace which contains the bus trace.
Now we have seen that the custom operation accelerates our application. Next we'll add a VHDL implementation of the custom FU to the Hardware Database (HDB), so that we can generate a VHDL implementation of our processor.
If you want to skip this phase you can use the given tour_example.hdb instead of creating it yourself.
Start HDBEditor (see Section 4.8):
hdbeditor &
Using HDBEditor, create a new HDB file (name it tour.hdb), add an FU entry for the reflecter, and then add an implementation for that entry. TCE needs the following data about the FU implementation in order to be able to automatically generate processors that include the FU.
Figure: Adding the FU implementation in HDBEditor (eps/hdbeditor_scrshot).
1. Name of the entity and the naming of the FU interface ports.
Name the implementation after the top level entity: ``fu_reflect''.
By examining the VHDL code you can easily spot the clock port (clk), the reset port (rstx) and the global lock port (glock). The operation code (opcode) port is t1opcode. Write these into the appropriate text boxes. You do not have to fill in the ``Global lock req. port'' field, because the function unit does not stall, i.e., it never requests a global lock of the processor during its execution.
2. Opcodes.
Check that the operation codes match the top of the VHDL file: REFLECT32 has operation code ``0'' and REFLECT8 has operation code ``1''.
The operation codes must always be numbered according to the alphabetical order of the OSAL operation names, starting from 0. In this case REFLECT32 comes before REFLECT8 in the alphabetical order (i.e., 3 comes before 8).
3. Parameters.
The parameters can be found from the VHDL generics. At the top of the file there is one generic, busw, which tells the width of the transport bus and thus the maximum width of the input and output operands.
Thus, add a parameter named busw in the Parameter dialog, set its type to integer and its value to 32.
4. Architecture ports.
These settings define the connection between the ports in the architectural description of the FU and the VHDL implementation. Each input data port of the FU is accompanied by a load port that controls the updating of the FU input port register.
Choose a port in the Architecture ports dialog and click Edit. The name of architecture port p1 is t1data (as in the VHDL) and the associated load port is t1load. The width formula is the parameter busw. No guard port is needed in this case.
Name the output port (p2) r1data; its width formula is again busw, because the output port writes to the bus. The output port does not have a load port.
5. Add VHDL source file.
Add the VHDL source file in the Source code dialog. Notice that HDL files must be added in the compilation order (see Section 4.8). Here we have only one source file, so we can simply add it without considering the compilation order (Add -> Browse -> tour_vhdl/reflect.vhdl).
Now you are done with adding the FU implementation, see Fig. 3.7. Click OK.
In this step we generate the VHDL implementation of the processor, and the bit image of the parallel program.
You can either use the given custom_operations.idf included in the tutorial files or select the implementations yourself. If you use the given file, replace custom.idf with custom_operations.idf in the following commands.
Next, we must select implementations for all components in the architecture. Each architecture component can be implemented in multiple ways, so we must choose one implementation for each component to be able to generate the implementation for the processor.
This can be done in the ProDe tool:
prode custom.adf
Then we'll select implementations for the FUs, which can be done in Tools > Processor Implementation.... Note that the selection window is currently not very informative about the different implementations, so a safe bet is to select an implementation with a parametrizable width/size.
You do not have to care about the HDB file text box because we are not going to use cost estimation data.
You can start processor generation from ProDe's implementation selection dialog: click ``Generate Processor''. For the Binary Encoding Map, select ``Generate new''; see Fig. 3.8.
In the target directory field, click ``Browse'', create a new directory called proge-output and select it. Then click OK to create the processor.
Or alternatively execute ProGe from command line:
generateprocessor -t -i custom.idf -o proge-output custom.adf
The -t flag generates a testbench, -i defines the implementation definition file, and -o the output directory. The directory proge-output now includes the VHDL implementation of the designed processor, except for the instruction memory width package, which will be created by the Program Image Generator. Take a look at what the directory includes: the RF and FU implementations are collected under the vhdl subdirectory, and the generated interconnection network that connects the units is in the gcu_ic subdirectory. The tb subdirectory contains testbench files for the processor core. Moreover, there are 4 shell scripts for compiling and simulating the VHDL code with Modelsim and GHDL.
Finally, to get our shiny new processor some bits to chew on, we use generatebits to create instruction memory and data memory images:
generatebits -d -w 4 -p crc_with_custom_op.tpef -x proge-output custom.adf
The -d flag creates the data images, -w 4 sets the data memory width to 4 MAUs (Minimum Addressable Units, bytes in this case), -p defines the program, -x defines the HDL directory, and the ADF is given without a flag.
Now the file crc_with_custom_op.img includes the instruction memory image in ``ascii 0/1'' format. Each line in that file represents a single instruction. Thus, you can get the count of instructions by counting the lines in that file:
wc -l crc_with_custom_op.img
Accordingly, the file crc_with_custom_op_data.img contains the data memory image of the processor. The Program Image Generator also created the file proge-output/vhdl/tta0_imem_mau_pkg.vhdl, which contains the width of the instruction memory (each designed TTA can have a different instruction width). The _imem_mau_pkg.vhdl file name is prefixed with the top level entity name, which in this case is ``tta0''.
If you have GHDL or Modelsim installed you can now simulate the processor VHDL. First cd to proge-output directory:
cd proge-output
Then compile and simulate the testbench. With GHDL:
./ghdl_compile.sh
./ghdl_simulate.sh
Or with Modelsim:
./modsim_compile.sh
./modsim_simulate.sh
Compilation gives a couple of warnings from Synopsys' std_arith package, but they should do no harm. The simulation will take some time, as bus trace writing is enabled. If you did not reduce the memory address space widths in ProDe (Section 3.1.2), the simulation may crash and print ``Killed''. There will also be many warnings due to uninitialized signals, but you can ignore them.
After simulation, you should see a message
./testbench:info: simulation stopped by --stop-time
The simulation produces file ``bus.dump'' which looks like this:
0,00000000
1,00000004
2,00007ff8
3,00000018
...
As the testbench is run for a constant number of cycles, we need to extract the relevant part of the bus dump for verification. This can be done with the command:
head -n <number of cycles> bus.dump > sim.dump
where <number of cycles> is the cycle count of the previous ttasim execution (the same as the line count of crc_with_custom_op.tpef.bustrace). Then compare the trace dumps from the VHDL simulation and the architecture simulation:
diff -u sim.dump ../crc_with_custom_op.tpef.bustrace
If the command does not print anything, the dumps were equal and the RTL simulation matches ttasim. You have now successfully added a custom operation, verified it, and gained a notable performance increase. Well done!
As a final step, let us add the custom operation to the large TTA.
cp large.adf large_custom.adf
prode large_custom.adf
Add the reflecter FU from tour.hdb as done earlier, fully connect the IC as in Fig. 3.9, and save. Then recompile and simulate:
tcecc -O3 -a large_custom.adf -o crc_with_custom_op.tpef -k result crc_with_custom_op.c main.c
ttasim -a large_custom.adf -p crc_with_custom_op.tpef
info proc cycles
Voila! Now you have a lightning fast TTA.
This tutorial is now finished. Now you should know how to customize the processor architecture by adding resources and custom operations, as well as generate the processor implementation and its instruction memory bit image.
This tutorial used a ``minimalistic'' processor architecture as the starting point. The machine had only 1 transport bus and 5 general purpose registers, so it could not fully exploit the parallel capabilities of a TTA. We then increased the resources of the processor and the cycle count dropped to half. Adding 2 simple custom operations to the starting point architecture gives a similar speedup, and in the large machine the speedup in cycle count is even greater.