Optimize your operation set for your code. Make sure the performance-critical operations are supported in hardware. Software emulation of an operation is usually 50-100 times slower than hardware execution, and it might also prevent multiple iterations of the core kernel loop from being parallelized.
For example, if you do a lot of floating-point calculations, use hardware floating-point units. If your inner loop contains integer multiplications, include a multiplier in your architecture.
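To make the cost of emulation concrete, the following C sketch contrasts a kernel that uses the native multiplication operator with one that calls a software multiply. The names soft_mul32, dot_hw and dot_sw are hypothetical and used only for illustration. The shift-and-add routine is an assumed stand-in for an emulation routine; the code a particular compiler actually emits for a missing multiplier may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software fallback: shift-and-add multiplication.
 * It needs up to 32 loop iterations (test, add, two shifts) per call,
 * where a hardware multiplier needs a single operation. */
uint32_t soft_mul32(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u)
            result += a;   /* add the shifted multiplicand for each set bit */
        a <<= 1;
        b >>= 1;
    }
    return result;
}

/* Inner loop using the native operator; with a hardware multiplier
 * each iteration maps to one multiply and one add. */
uint32_t dot_hw(const uint32_t *x, const uint32_t *y, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}

/* Same kernel with the multiplication emulated in software; every
 * iteration now contains a data-dependent inner loop. */
uint32_t dot_sw(const uint32_t *x, const uint32_t *y, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += soft_mul32(x[i], y[i]);
    return acc;
}
```

Both kernels compute the same result (modulo 2^32), but the emulated version replaces a single operation with a loop whose trip count depends on the data, which is the kind of code that is hard to overlap across iterations.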
On the other hand, if you do not need a floating-point unit, an integer multiplier or a divider, do not put them in your architecture ``just in case'', as they increase the size of the processor, its power consumption, and the size of the instruction word. They also make the interconnection network more complex, which often reduces the clock speed and therefore performance. The exception, of course, is when you need a general-purpose processor and expect that some of the programs that will run on it in the future can exploit such hardware units.
Once your selection of basic operations suits your program, and you still need more performance or lower power consumption, consider using custom operations.
Pekka Jääskeläinen 2018-03-12