Make sure you have enough general-purpose registers and register file (RF) read ports. Using more than two read ports in one register file may make it big, slow and power-hungry, so consider splitting the RF into multiple register files.
However, note that the current compiler does not distribute variables to the register files intelligently, so if the register files are not equally connected to all the function units that consume each variable, you might end up with ``connectivity register copies'' (see below). The current register distribution method is round robin, which balances the RF load somewhat, but not perfectly, as it doesn't take into account the computation that uses each variable.
To produce better register distribution, the compiler should balance the RF load at the granularity of ``computation trees'' rather than single registers. This would produce more parallelizable code and reduce register copies. Improving this part of the code generation is being worked on with high priority.
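
The difference between the two distribution granularities can be illustrated with a small toy model. The sketch below is not the compiler's actual code. The function names, the variable lists and the connectivity rule (computation tree i runs on a function unit that can only read register file i modulo the number of register files) are all assumptions made for illustration.

```python
from itertools import cycle


def round_robin(variables, register_files):
    """Assign each variable to the next RF in turn, ignoring how it is used."""
    rf_cycle = cycle(register_files)
    return {var: next(rf_cycle) for var in variables}


def per_tree(trees, register_files):
    """Assign all variables of one computation tree to the same RF."""
    assignment = {}
    for index, tree in enumerate(trees):
        rf = register_files[index % len(register_files)]
        for var in tree:
            assignment[var] = rf
    return assignment


def copies_needed(trees, assignment, register_files):
    """Count variables that sit in an RF their tree's function unit cannot read.

    Toy assumption: tree i is executed on a function unit connected only to
    register_files[i % len(register_files)]. Every variable stored elsewhere
    needs one register copy before it can be used.
    """
    copies = 0
    for index, tree in enumerate(trees):
        reachable_rf = register_files[index % len(register_files)]
        copies += sum(1 for var in tree if assignment[var] != reachable_rf)
    return copies


# Hypothetical example: two independent computation trees, two register files.
register_files = ["RF0", "RF1"]
trees = [["a", "b", "c"], ["d", "e", "f"]]
variables = [var for tree in trees for var in tree]

rr_assignment = round_robin(variables, register_files)
tree_assignment = per_tree(trees, register_files)

print(copies_needed(trees, rr_assignment, register_files))    # 2
print(copies_needed(trees, tree_assignment, register_files))  # 0
```

In this toy model both methods place three variables in each register file, so the load is equally balanced, but round robin scatters the variables of each tree across both register files and therefore needs two copies, while the per-tree assignment needs none.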
Pekka Jääskeläinen 2018-03-12