9 Designing Floating-point Processors with TCE

TCE supports single and half precision floating-point calculations. Single-precision calculations can be performed by using the float datatype in C code, or by using macros from tceops.h, such as _TCE_ADDF.
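As a minimal sketch of the two styles, the snippet below adds two floats first with the plain C operator and then through the _TCE_ADDF macro. It assumes the tceops.h convention of listing the inputs first and the output last, and it assumes that the TCE compiler predefines __TCE__. The #else branch is a hypothetical host stand-in, not part of TCE, so that the file also builds on a workstation.

```c
#ifdef __TCE__   /* assumption: predefined when compiling for a TCE target */
#include <tceops.h>
#else
/* Hypothetical host stand-in so that the sketch builds off-target. */
#define _TCE_ADDF(a, b, out) ((out) = (a) + (b))
#endif

/* Addition written with the float datatype; the compiler selects the operation. */
static float add_with_operator(float a, float b)
{
    return a + b;
}

/* The same addition through the macro, which names the operation explicitly. */
static float add_with_macro(float a, float b)
{
    float sum;
    _TCE_ADDF(a, b, sum);
    return sum;
}
```

Both functions are expected to give the same result; the macro form is mainly useful when a specific operation must be selected by hand.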

If the compilation target architecture does not support these operations, they can be emulated using integer arithmetic in software. Passing the switch --swfp to tcecc enables the software emulation library linkage.

A set of floating-point FU implementations is included with TCE in an HDB file named fpu_embedded.hdb, which can be found at PREFIX/share/tce/hdb/fpu_embedded.hdb. The FUs operate on 32-bit, single-precision floating-point numbers. Supported operations include addition, subtraction, negation, absolute value, multiplication, division, square root, conversion between floats and integers, and various comparisons.

The FUs are based on the VHDL-2008 support library (http://www.vhdl.org/fphdl/), which is in the public domain. Changes include:

Full pipelining.
Radix-2 division changed to Radix-4.
Simple Newton's iteration square root (with a division in each pass) replaced by Hain's algorithm from the paper "Fast Floating Point Square Root" by Hain T. and Mercer D.

The FUs are optimized for synthesis on Altera Stratix II FPGAs, and they have been benchmarked on both a Stratix II EP2S180F1020C3 and a Stratix III EP3SL340H1152C2. Their maximum frequencies are between 190 and 200 MHz on the Stratix II, and between 230 and 280 MHz on the Stratix III. Compared to an earlier implementation based on the Milk coprocessor (coffee.cs.tut.fi), they are between 30% and 200% faster.

1 Restrictions

The FUs are not IEEE compliant. Instead, they comply with the less strict OpenCL Embedded Profile standard, which trades off accuracy for speed. Differences include:

Instead of the default rounding mode round-to-nearest-even, round-to-zero is used.
Denormal numbers as inputs or outputs are flushed to zero.
Division may not be correctly rounded, but should be accurate within 4 ulp.
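The first two differences can be approximated on a workstation when reference values are needed for comparison. The sketch below is a host-side model only, not the behavior of the VHDL itself: flush_denormal_model and round_to_zero_model are hypothetical helper names, and the code assumes an IEEE 754 host whose double-to-float cast rounds to nearest.

```c
#include <stdint.h>
#include <string.h>

static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static float bits_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* A zero exponent field means zero or denormal: keep only the sign bit. */
static float flush_denormal_model(float f)
{
    uint32_t u = float_bits(f);
    if ((u & 0x7F800000u) == 0)
        u &= 0x80000000u;
    return bits_float(u);
}

/* Round a double to float toward zero. The cast rounds to nearest on the
   host; if that moved the value away from zero, step back by one ulp. */
static float round_to_zero_model(double d)
{
    float f = (float)d;
    if ((f > 0.0f && (double)f > d) || (f < 0.0f && (double)f < d))
        f = bits_float(float_bits(f) - 1u);
    return f;
}
```

For example, 1 + 0.75 * 2^-23 rounds to 1 + 2^-23 under round-to-nearest-even, while the model above returns 1.0.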

The TCE Processor Simulator uses IEEE-compliant floats. When a processor is simulated on GHDL or synthesized on actual hardware, the calculation results can therefore differ slightly from those of the Processor Simulator.

2 Single-precision Function Units

The fpu_embedded and fpu_half function units are described in detail below.

fpu_sp_add_sub

Supported operations: addf, subf

Latency: 5

Straightforward floating-point adder.

fpu_sp_mul

Supported operations: mulf

Latency: 5

Straightforward floating-point multiplier.

fpu_sp_div

Supported operations: divf

Latency: 15 (mw/2+3)

Radix-4 floating-point divider.

fpu_sp_mac

Supported operations: macf, msuf

Latency: 6

Single-precision fused multiply-accumulator.

Parameters are ordered so that MACF(a,b,c,d) is equal to d=a+b*c and MSUF(a,b,c,d) to d=a-b*c. Special case handling is not yet supported.

fpu_sp_mac_v2

Supported operations: macf, msuf, addf, subf, mulf

Latency: 6

Single-precision fused multiply-accumulator. Performs addition/subtraction by multiplying by 1, and multiplication by adding 0. fpu_sp_mac_v2 will replace fpu_sp_mac completely if benchmarking shows it to be reasonably fast.

Parameters are ordered so that MACF(a,b,c,d) is equal to d=a+b*c and MSUF(a,b,c,d) to d=a-b*c.
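The operand order, and the way addition and multiplication fall out of a multiply-accumulate, can be illustrated with a small host-side model. The function names are hypothetical, and which operand the hardware ties to 1 or 0 is an assumption made for illustration. The host compiler decides whether the product is rounded before the add, so the model shows operand order only, not the rounding of a fused unit.

```c
/* MACF(a, b, c, d): d = a + b*c */
static float macf_model(float a, float b, float c)
{
    return a + b * c;
}

/* MSUF(a, b, c, d): d = a - b*c */
static float msuf_model(float a, float b, float c)
{
    return a - b * c;
}

/* Addition and subtraction: multiply the second term by 1. */
static float addf_via_mac(float x, float y)
{
    return macf_model(x, y, 1.0f);
}

static float subf_via_mac(float x, float y)
{
    return msuf_model(x, y, 1.0f);
}

/* Multiplication: accumulate the product onto 0. */
static float mulf_via_mac(float x, float y)
{
    return macf_model(0.0f, x, y);
}
```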

fpu_sp_sqrt

Supported operations: sqrtf

Latency: 26 (mw+3)

Floating-point square root FU, using Hain's algorithm.

Note that the C standard function sqrt does not take advantage of hardware acceleration; the _TCE_SQRTF macro must be used instead.

fpu_sp_conv

Supported operations: cif, cifu, cfi, cfiu

Latency: 4

Converts between 32-bit signed and unsigned integers, and single-precision floats. OpenCL embedded allows no loss of accuracy in these conversions, so rounding is to nearest even.
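Round-to-nearest-even only matters for integers that need more than 24 significant bits. The sketch below models the integer-to-float direction on the host with plain casts. cif_model and cifu_model are hypothetical names, and the code assumes an IEEE 754 host in its default rounding mode.

```c
#include <stdint.h>

/* Signed 32-bit integer to single-precision float. */
static float cif_model(int32_t x)
{
    return (float)x;
}

/* Unsigned 32-bit integer to single-precision float. */
static float cifu_model(uint32_t x)
{
    return (float)x;
}
```

With this model, 16777217 (2^24 + 1) lies halfway between two floats and converts to 16777216.0, while 16777219 converts to 16777220.0; in both cases the neighbor with the even mantissa is chosen.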

fpu_sp_compare

Supported operations: absf, negf, eqf, nef, gtf, gef, ltf, lef

Latency: 1

A floating-point comparator. Also includes the cheap absolute value and negation operations.

3 Half-precision Support

A set of half-precision arithmetic units is included with TCE in PREFIX/share/tce/hdb/fpu_half.hdb. In C and C++, half-precision operations can only be invoked with tceops.h macros. It may be helpful to define a half class with overloaded operators to wrap the macros. The test case testsuite/systemtest/proge/hpu is written using such a class. There is ongoing work to add acceleration for the half datatype in OpenCL.

Like their single-precision counterparts, the half-precision FPUs round to zero and lack support for denormal numbers. In addition, they do not yet handle special cases such as INFs and NaNs.
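To prepare half-precision test data on a workstation, a host-side conversion model with the same restrictions can be useful. The sketch below is an illustration, not the FU implementation: the function names are hypothetical, the 1-5-10 bit layout of IEEE half precision is assumed, and clamping overflow to the largest finite half is an assumption that follows from round-to-zero. INFs and NaNs get no special treatment.

```c
#include <stdint.h>
#include <string.h>

/* Single to half: truncate the mantissa (round to zero), flush small values. */
static uint16_t float_to_half_model(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    uint16_t sign = (uint16_t)((u >> 16) & 0x8000u);
    int32_t exp = (int32_t)((u >> 23) & 0xFFu) - 112;  /* rebias 127 -> 15 */
    uint16_t man = (uint16_t)((u >> 13) & 0x3FFu);     /* drop the low 13 bits */

    if (exp <= 0)
        return sign;                        /* below the normal range: signed zero */
    if (exp >= 31)
        return (uint16_t)(sign | 0x7BFFu);  /* assumption: clamp to largest finite */
    return (uint16_t)(sign | ((uint32_t)exp << 10) | man);
}

/* Half to single: exact for normal numbers, denormal inputs flushed to zero. */
static float half_to_float_model(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t man = h & 0x3FFu;
    uint32_t bits = sign;
    float f;

    if (exp != 0)
        bits |= ((exp + 112u) << 23) | (man << 13);    /* rebias 15 -> 127 */
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, 1.9999 converts to 0x3FFF (1.9990234375) in this model, whereas round-to-nearest would give 2.0.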

4 Half-precision Function Units

The fpu_half function units are described in detail below.

fpu_chf_cfh

Supported operations: cfh, chf

Latency: 1

Converter between half-precision and single-precision floating-point numbers.

fpadd_fpsub

Supported operations: addh, subh

Latency: 1

Straightforward half-precision floating-point adder.

fpmul

Supported operations: mulh, squareh

Latency: 2

Straightforward half-precision floating-point multiplier. Also supports a square-taking operation.

fpmac

Supported operations: mach, msuh

Latency: 3

Half-precision fused multiply-accumulator.

Parameters are ordered so that MACH(a,b,c,d) is equal to d=a+b*c and MSUH(a,b,c,d) to d=a-b*c.

fpmac_v2

Supported operations: mach, msuh, addh, subh, mulh

Latency: 3

Half-precision fused multiply-accumulator. Performs addition/subtraction by multiplying by 1, and multiplication by adding 0. fpmac_v2 will replace fpmac completely if benchmarking shows it to be reasonably fast.

Parameters are ordered so that MACH(a,b,c,d) is equal to d=a+b*c and MSUH(a,b,c,d) to d=a-b*c.

fp_invsqrt

Supported operations: invsqrth

Latency: 5

Half-precision fast inverse square root using Newton's iteration.

fpu_hp_compare

Supported operations: absh, negh, eqh, neh, gth, geh, lth, leh

Latency: 1

Half-precision floating-point comparator. Also includes the absolute value and negation operations.

5 Benchmark results

Most of the single-precision FPUs have been benchmarked on the FPGAs Stratix II EP2S180F1020C3 and Stratix III EP3SL340H1152C2. As a baseline, a simple TTA processor was synthesized that had enough functionality to support an empty C program. After this, each of the FPUs was added to the baseline processor and synthesized. The results are shown below in Tables 3.2 and 3.3.

Table 3.2: Synthesis results for Stratix II EP2S180F1020C3

	mul	add_sub	sqrt	conv	comp	div	baseline
Comb ALUTs	1263	1591	4186	1500	1012	2477	907
Total regs	892	967	2444	917	669	1942	567
DSP blocks	8	0	0	0	0	0	0
Fmax (MHz)	196.39	198.81	194.78	191.5	192.2	199.32	222.82
Latency	5	5	26	4	1	15	-

Table 3.3: Synthesis results for Stratix III EP3SL340H1152C2

	mul	add_sub	sqrt	conv	comp	div	baseline
Comb ALUTs	1253	1630	4395	1507	1002	2597	1056
Total regs	819	1007	2401	997	665	2098	710
DSP blocks	4	0	0	0	0	0	0
Fmax (MHz)	272.03	252.4	232.07	232.88	244.32	260.82	286.45
Latency	5	5	26	4	1	15	-

6 Alternative bit widths

The fpu_embedded function units have mantissa width and exponent width as generic parameters, so they can be used for float widths other than IEEE single precision. The FPUs are likely prohibitively slow for double-precision computation, and the fpu_half units should be a better fit for half precision.

The parameters are mw (mantissa width) and ew (exponent width) for all FUs. In addition, the float-int converter FU fpu_sp_conv has a parameter intw, which decides the width of the integer to be converted.
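The latencies quoted for fpu_sp_div (mw/2+3) and fpu_sp_sqrt (mw+3) can be turned into small helpers for estimating other widths. This is a sketch with hypothetical function names. It assumes that the division by two rounds up, because that is what reproduces the quoted latency of 15 for mw = 23.

```c
#include <assert.h>

/* Divider latency mw/2+3, with the halving rounded up (assumption). */
static int div_latency(int mw)
{
    assert(mw > 0);
    return (mw + 1) / 2 + 3;
}

/* Square root latency mw+3. */
static int sqrt_latency(int mw)
{
    assert(mw > 0);
    return mw + 3;
}
```

With mw = 23 the helpers return 15 and 26, matching the single-precision units. With a 10-bit mantissa they return 8 and 13.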

Use of these parameters has the following caveats:

The TCE-CC compiler converts floating-point literals into 32-bit floats, so they have to be entered some other way, for example by casting integer bit patterns to floats, or with a cif operation.
TCE does not include an HDB file for alternative bit widths.
Mantissa width affects the latency of the divider and square root FUs. The divider FU's latency is mw/2+3, and the square root FU's latency is mw+3.
Bit widths other than single-precision have not been exhaustively tested. Half-precision floats appear to work in a simple test case.
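The first caveat, entering constants as integer bit patterns, might look like the following sketch. The helper names are hypothetical, and the field layout (sign, then an ew-bit biased exponent, then an mw-bit mantissa, with the usual bias of 2^(ew-1) - 1) is assumed to follow the IEEE convention for the chosen widths.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Exponent bias for an ew-bit exponent field. */
static uint32_t exp_bias(unsigned ew)
{
    assert(ew >= 2 && ew <= 30);
    return (1u << (ew - 1)) - 1u;
}

/* Assemble sign, biased exponent, and mantissa fields into one word. */
static uint32_t pack_fields(uint32_t sign, uint32_t biased_exp,
                            uint32_t mantissa, unsigned ew, unsigned mw)
{
    assert(ew >= 2 && mw >= 1 && 1 + ew + mw <= 32);
    assert(sign <= 1u);
    assert(biased_exp < (1u << ew));
    assert(mantissa < (1u << mw));
    return (sign << (ew + mw)) | (biased_exp << mw) | mantissa;
}

/* Reinterpret a 32-bit pattern as a float without converting the value. */
static float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For the single-precision layout (ew = 8, mw = 23), pack_fields(0, exp_bias(8), 0, 8, 23) yields 0x3F800000, which float_from_bits turns into 1.0. The same call with ew = 5 and mw = 10 yields 0x3C00.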

7 Processor Simulator and Floating Point Operations

Designers of floating-point TTAs should note that, for speed reasons, ttasim uses the simulator host's floating-point (FP) hardware to simulate floating-point operations. The results may or may not match those of the implemented TTA, depending on standard compliance, default rounding modes, and other differences between floating-point implementations.

Pekka Jääskeläinen 2018-03-12