TCE supports single and half precision floating-point calculations. Single-precision calculations can be performed by using the float datatype in C code, or by using macros from tceops.h, such as _TCE_ADDF.
If the compilation target architecture does not support these operations, they can be emulated in software using integer arithmetic. Passing the --swfp switch to tcecc links in the software emulation library.
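To illustrate what such emulation means in principle, the sketch below multiplies two single-precision floats using only integer operations. This is a simplified illustration of the idea, not the actual emulation library code: it ignores NaNs, infinities and denormals, and truncates the result (round toward zero).

```cpp
#include <cstdint>
#include <cstring>

// Multiply two IEEE-754 single-precision floats using only integer
// arithmetic. Simplified sketch: no NaN/Inf/denormal handling,
// truncating (round-toward-zero) result.
float soft_mulf(float fa, float fb) {
    uint32_t a, b;
    std::memcpy(&a, &fa, 4);
    std::memcpy(&b, &fb, 4);

    uint32_t sign = (a ^ b) & 0x80000000u;
    int32_t  exp  = int32_t((a >> 23) & 0xFF) + int32_t((b >> 23) & 0xFF) - 127;
    uint64_t ma   = (a & 0x7FFFFFu) | 0x800000u;   // restore implicit leading 1
    uint64_t mb   = (b & 0x7FFFFFu) | 0x800000u;

    uint64_t prod = ma * mb;          // 48-bit product of two 24-bit mantissas
    if (prod & (1ull << 47)) {        // product in [2,4): shift one extra bit
        prod >>= 24;
        exp += 1;
    } else {                          // product in [1,2)
        prod >>= 23;
    }
    uint32_t result = sign | (uint32_t(exp) << 23) | (uint32_t(prod) & 0x7FFFFFu);
    float out;
    std::memcpy(&out, &result, 4);
    return out;
}
```

Both test products below are exactly representable, so the truncating sketch agrees with hardware multiplication for them.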
A set of floating-point FU implementations is included with TCE, in an HDB file named fpu_embedded.hdb, which can be found at PREFIX/share/tce/hdb/fpu_embedded.hdb. The FUs operate on 32-bit, single-precision floating-point numbers. Supported operations include addition, subtraction, negation, absolute value, multiplication, division, square root, conversion between floats and integers, and various comparisons.
The FUs are based on the VHDL-2008 support library (http://www.vhdl.org/fphdl/), which is in the public domain, with some TCE-specific modifications.
The FUs are optimized for synthesis on Altera Stratix II FPGAs, and they have been benchmarked both on a Stratix II EP2S180F1020C3 and on a Stratix III EP3SL340H1152C2. They reach maximum frequencies between 190 and 200 MHz on the Stratix II, and between 230 and 280 MHz on the Stratix III. Compared to an earlier implementation based on the Milk coprocessor (coffee.cs.tut.fi), they are between 30% and 200% faster.
The FUs are not IEEE-compliant; instead, they comply with the less strict OpenCL Embedded Profile standard, which trades accuracy for speed: rounding is toward zero, and denormal numbers are not supported.
The TCE Processor Simulator uses IEEE-compliant floats. With a processor simulated in GHDL or synthesized on actual hardware, the calculation results may therefore differ slightly from those of the Processor Simulator.
The fpu_embedded and fpu_half function units are described in detail below.
Supported operations: addf, subf
Latency: 5
Straightforward floating-point adder.
Supported operations: mulf
Latency: 5
Straightforward floating-point multiplier.
Supported operations: divf
Latency: 15 (mw/2+3)
Radix-4 floating-point divider.
Supported operations: macf, msuf
Latency: 6
Single-precision fused multiply-accumulator.
Parameters are ordered so that MACF(a,b,c,d) is equal to d=a+b*c and MSUF(a,b,c,d) to d=a-b*c. Special case handling is not yet supported.
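The operand ordering can be cross-checked on a host against the C library's std::fmaf, which computes x*y + z with a single rounding, matching a fused multiply-accumulate. These reference functions are host-side illustrations, not the TCE macros themselves:

```cpp
#include <cmath>

// Host-side reference for the operand ordering described above:
// MACF(a,b,c,d) computes d = a + b*c, and MSUF(a,b,c,d) computes
// d = a - b*c. std::fmaf(x,y,z) evaluates x*y + z fused.
float macf_ref(float a, float b, float c) { return std::fmaf(b, c, a); }
float msuf_ref(float a, float b, float c) { return std::fmaf(-b, c, a); }
```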
Supported operations: macf, msuf, addf, subf, mulf
Latency: 6
Single-precision fused multiply-accumulator. Performs addition/subtraction by multiplying by 1, and multiplication by adding 0. fpu_sp_mac_v2 will replace fpu_sp_mac completely if benchmarking shows it to be reasonably fast.
Parameters are ordered so that MACF(a,b,c,d) is equal to d=a+b*c and MSUF(a,b,c,d) to d=a-b*c.
Supported operations: sqrtf
Latency: 26 (mw+3)
Floating-point square root FU, using Hain's algorithm.
Note that the C standard function sqrt does not take advantage of hardware acceleration; the _TCE_SQRTF macro must be used instead.
Supported operations: cif, cifu, cfi, cfiu
Latency: 4
Converts between 32-bit signed and unsigned integers, and single-precision floats. The OpenCL Embedded Profile allows no loss of accuracy in these conversions, so rounding is to nearest even.
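Round-to-nearest-even means halfway cases round to the even integer. The rule itself can be observed on a host with the C library's std::lrintf, which follows it under the default FE_TONEAREST rounding mode (this only demonstrates the rounding rule, not the FU):

```cpp
#include <cmath>

// Under the default FE_TONEAREST rounding mode, lrintf() rounds halfway
// cases to the nearest even integer -- the same rule the conversion FU
// follows. E.g. 2.5 rounds down to 2, while 3.5 rounds up to 4.
long round_nearest_even(float x) { return std::lrintf(x); }
```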
Supported operations: absf, negf, eqf, nef, gtf, gef, ltf, lef
Latency: 1
A floating-point comparator. Also includes the cheap absolute value and negation operations.
A set of half-precision arithmetic units is included with TCE in PREFIX/share/tce/hdb/fpu_half.hdb. In C and C++, half-precision operations can only be invoked with tceops.h macros. It may be helpful to define a half class with overloaded operators to wrap the macros. The test case testsuite/systemtest/proge/hpu is written using such a class. There is ongoing work to add acceleration for the half datatype in OpenCL.
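A minimal sketch of such a wrapper class is shown below. On a TTA target the macro would be _TCE_ADDH from tceops.h; here it is stubbed with plain float arithmetic so the sketch compiles on a host. The stub is not the real macro, and a real wrapper would hold the 16-bit half pattern rather than a float:

```cpp
// Sketch of a 'half' wrapper class in the spirit of the one used by the
// testsuite/systemtest/proge/hpu test case. On a TTA target, _TCE_ADDH
// from tceops.h would invoke the half-precision adder FU; the stub below
// only lets the sketch run on a host.
#ifndef _TCE_ADDH
#define _TCE_ADDH(a, b, res) ((res) = (a) + (b))  // host stub, NOT the real macro
#endif

class half {
public:
    explicit half(float v = 0.0f) : val_(v) {}
    half operator+(const half& o) const {
        half r;
        _TCE_ADDH(val_, o.val_, r.val_);  // FU invocation on a real target
        return r;
    }
    float value() const { return val_; }
private:
    float val_;  // a real wrapper would store the 16-bit half pattern
};
```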
Like their single-precision counterparts, the half-precision FPUs round to zero and lack support for denormal numbers. In addition, they do not yet handle special cases such as INFs and NaNs.
The fpu_half function units are described in detail below.
Supported operations: cfh, chf
Latency: 1
Converter between half-precision and single-precision floating points.
Supported operations: addh, subh
Latency: 1
Straightforward half-precision floating-point adder.
Supported operations: mulh, squareh
Latency: 2
Straightforward half-precision floating-point multiplier. Also supports a squaring operation (squareh).
Supported operations: mach, msuh
Latency: 3
Half-precision fused multiply-accumulator.
Parameters are ordered so that MACH(a,b,c,d) is equal to d=a+b*c and MSUH(a,b,c,d) to d=a-b*c.
Supported operations: mach, msuh, addh, subh, mulh
Latency: 3
Half-precision fused multiply-accumulator. Performs addition/subtraction by multiplying by 1, and multiplication by adding 0. fpmac_v2 will replace fpmac completely if benchmarking shows it to be reasonably fast.
Parameters are ordered so that MACH(a,b,c,d) is equal to d=a+b*c and MSUH(a,b,c,d) to d=a-b*c.
Supported operations: invsqrth
Latency: 5
Half-precision fast inverse square root using Newton's iteration.
Supported operations: absh, negh, eqh, neh, gth, geh, lth, leh
Latency: 1
Half-precision floating-point comparator. Also includes the absolute value and negation operations.
Most of the single-precision FPUs have been benchmarked on the FPGAs Stratix II EP2S180F1020C3 and Stratix III EP3SL340H1152C2. As a baseline, a simple TTA processor was synthesized that had enough functionality to support an empty C program. After this, each of the FPUs was added to the baseline processor and synthesized. The results are shown below in Tables 3.2 and 3.3.
The fpu_embedded function units have the mantissa width and exponent width as generic parameters, so they can be used for float widths other than IEEE single precision. The FPUs are likely prohibitively slow for double-precision computation, and the fpu_half units are a better fit for half precision.
The parameters are mw (mantissa width) and ew (exponent width) for all FUs. In addition, the float-int converter FU fpu_sp_conv has a parameter intw, which decides the width of the integer to be converted.
Note that using these parameters with non-standard widths comes with some caveats.
Designers of floating-point TTAs should note that ttasim uses the simulator host's floating-point (FP) hardware to simulate floating-point operations, for speed reasons. It therefore might or might not match the FP implementation of the actual implemented TTA, depending on standard compliance, default rounding modes, and other differences between floating-point implementations.
Pekka Jääskeläinen 2018-03-12