The Reference Implementation is likely to use the simplest design possible, but there are a lot of other options that could be implemented.
This document identifies some of the variations that could be created.
Pretty much all these options trade added complexity for improved performance.
Commercial Companies may wish to implement these ideas to get the extra performance, since they can pay the complexity cost.
This is a simple expansion of the Virtual Unit idea, making the Unit virtual in both Width and Depth.
By storing several inputs in looping memory, a single Unit can have a Virtual Width, behaving as if it were M Units wide and N Units deep. Note: the Unit must be able to store N×M Weights in its loop, which tends to require a very large weights loop.
For the majority of cases, this added complexity gives very little benefit.
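To make the N×M storage cost concrete, here is a minimal Python sketch of one possible reading of a unit that is virtual in both width and depth. It assumes the unit holds M inputs and an N×M weight loop, and produces N accumulated outputs, one multiply-accumulate per cycle. The function name `virtual_unit` and the row-major weight layout are illustrative assumptions, not part of the design.

```python
def virtual_unit(inputs, weight_loop, n_depth):
    """Emulate M units wide by N units deep with a single unit (assumed model).

    inputs      -- the M stored input values (hypothetical looping input memory)
    weight_loop -- flat list of N*M weights, assumed row-major by depth
    n_depth     -- N, the number of outputs produced
    Returns (outputs, cycles), assuming one multiply-accumulate per cycle.
    """
    m_width = len(inputs)
    if len(weight_loop) != n_depth * m_width:
        raise ValueError("weight loop must hold exactly N*M weights")

    outputs = []
    cycles = 0
    for d in range(n_depth):
        acc = 0
        for i in range(m_width):
            # Each stored input is reused once per depth step.
            acc += inputs[i] * weight_loop[d * m_width + i]
            cycles += 1
        outputs.append(acc)
    return outputs, cycles


outputs, cycles = virtual_unit([1, 2], [1, 0, 0, 1, 1, 1], n_depth=3)
```

Under this model, both the weight storage and the cycle count grow as N×M, which is consistent with the note that the weights loop becomes very large.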
It would be simple to define a large weight loop (high VDepth) that can be 'short-circuited' to act as a shorter loop (lower VDepth).
A shorter loop means the Unit must be able to consume inputs more often, since it needs one input value per pass around the loop, but the loop could be reduced as needed (potentially down to VDepth = 1).
I originally intended the Reference design to use this, but the variation in Input feeding seemed like it'd be more complicated. The changes to the actual VSA Unit are pretty simple, but the data infeed needs to be considered.
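The short-circuit idea can be sketched as a circular buffer whose wrap-around point is configurable. This is only an illustration: the class `WeightLoop`, its method names, and the helper `inputs_consumed` are hypothetical, and the one-input-per-pass rule is taken from the description above.

```python
class WeightLoop:
    """Hypothetical looping weight buffer with a configurable active depth."""

    def __init__(self, weights, active_depth=None):
        self.weights = list(weights)
        self.max_depth = len(self.weights)
        self.set_depth(self.max_depth if active_depth is None else active_depth)

    def set_depth(self, depth):
        # Short-circuit the loop so it wraps after `depth` entries.
        if not 1 <= depth <= self.max_depth:
            raise ValueError("depth must be between 1 and the physical loop size")
        self.depth = depth
        self.pos = 0

    def next_weight(self):
        w = self.weights[self.pos]
        self.pos = (self.pos + 1) % self.depth
        return w


def inputs_consumed(cycles, depth):
    """Inputs needed over `cycles` cycles, assuming one input per loop pass."""
    return -(-cycles // depth)  # ceiling division


loop = WeightLoop(range(8), active_depth=2)
first_five = [loop.next_weight() for _ in range(5)]
```

With these assumptions, 64 cycles at VDepth = 32 need 2 inputs, while 64 cycles at VDepth = 1 need 64, which is the infeed pressure that makes this variation harder than the Unit change itself.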
When accumulating the results from one Unit to the next, they must be passed on one at a time. Visually, Inputs are fed from the top and fall through the bottom, while the outputs are passed sideways. Whether outputs pass left to right or right to left has no effect; in fact, the inputs don't even have to come from the top. The orientation only matters for chaining Units together and for their placement on the chip.
However, Dynamic passing, where any Unit can pass either to the right or to the left based on a configured bit, could boost performance. A chain of N Units that has an output to the right and to the left, and that can be dynamically split, could act as 2 chains of different lengths. A chain of 64 Units could be split 32/32, 16/48, or 64/0, which would allow it to be dynamically allocated based on need, similar to Early Exits.
This seems like it would be an easy win to improve performance, but since it adds additional complexity, especially in the compiler, I think it's best left alone until it's proven.
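The chain split described above can be sketched in Python as follows. It assumes each Unit carries one configuration bit for its pass direction, and that Units left of the split point pass left while the rest pass right. The names `split_chain` and `chain_outputs` are hypothetical.

```python
def split_chain(n_units, split_at):
    """Return per-Unit pass directions and the two resulting chain lengths.

    Units [0, split_at) pass left ("L"); Units [split_at, n_units) pass right ("R").
    """
    if not 0 <= split_at <= n_units:
        raise ValueError("split point must lie within the chain")
    directions = ["L"] * split_at + ["R"] * (n_units - split_at)
    return directions, (split_at, n_units - split_at)


def chain_outputs(partials, split_at):
    """Accumulate per-Unit partial products separately for each sub-chain."""
    left = sum(partials[:split_at])    # exits at the left end
    right = sum(partials[split_at:])   # exits at the right end
    return left, right


directions, lengths = split_chain(64, 16)
left_sum, right_sum = chain_outputs([1] * 64, 16)
```

Setting `split_at` to 32, 16, or 64 gives the 32/32, 16/48, and 64/0 configurations from the text. The hard part, choosing the split per layer, would fall to the compiler and is not modeled here.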
A Standard Systolic Array uses input fall through to partially reuse the input values. The default VSA Unit keeps the current input in the Unit until it's done for partial reuse.
However, the VSA Unit can only process depths up to VDepth; if a greater depth is needed, the inputs must be duplicated.
By allowing the VSA Unit to also let values fall through to the next Unit, it can pass Inputs without them being explicitly duplicated.
However, this creates some additional challenges. If no alternate input path is provided to the fall through Unit, then it sits idle whenever lower depths are needed. If an alternate input path is provided, that path adds complexity. The same functionality can be approximated with input duplication.
Note: In some edge cases and implementations, using fall through can reduce the latency, but it would be comparable to Input Duplication.
If combined with Variable VDepth, then using fall through is more beneficial, but without Variable VDepth, only a small number of fall throughs are practical. For example, a VDepth=32 network with 4 fall throughs would have a true VDepth of 128 (same as TPU) but, without Variable VDepth, would lose performance on any layer with fewer than 128 outputs.
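The arithmetic in that example can be written out as a small Python sketch. The effective depth follows the text (VDepth times the number of fall throughs); the utilization formula is a rough assumption that only models the case discussed here, where a layer has fewer outputs than the effective depth and Variable VDepth is not available. The function names are hypothetical.

```python
def effective_depth(vdepth, fall_throughs):
    """True depth when inputs fall through a group of Units."""
    return vdepth * fall_throughs


def utilization(layer_outputs, vdepth, fall_throughs):
    """Rough fraction of useful work without Variable VDepth (assumed model).

    Layers with at least `effective_depth` outputs are treated as fully
    utilized; this is a simplification, not a measured result.
    """
    depth = effective_depth(vdepth, fall_throughs)
    return min(layer_outputs, depth) / depth


depth = effective_depth(32, 4)
half_used = utilization(64, 32, 4)
```

Under this simplified model, a 64-output layer on the VDepth=32, 4 fall through configuration uses only half of the available depth, which illustrates why a small number of fall throughs is the practical limit without Variable VDepth.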
Since each Unit Accumulates Values from the previous Unit, the first Unit could receive the Bias as an input for the Accumulator. The biases would be stored in a looping buffer the same as the weights.
This adds additional complexity, and if you check the Compiler Tricks Document, it's possible to use an extra Unit to approximate this. Additionally, because external accumulators are going to be needed, those could (probably) take care of the Bias easily.
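The two approaches can be compared with a short Python sketch. The first function seeds the accumulator of the first Unit with the Bias, as described above. The second uses an extra Unit fed a constant input of 1 with the Bias as its weight; this is one common way to fold a bias into a chain and is an assumption here, since it may or may not be the trick described in the Compiler Tricks Document. Both function names are hypothetical.

```python
def chain_with_bias_seed(inputs, weights, bias):
    """First Unit receives the Bias as its incoming accumulator value."""
    acc = bias
    for x, w in zip(inputs, weights):
        acc += x * w
    return acc


def chain_with_extra_unit(inputs, weights, bias):
    """Approximate the Bias with one extra Unit: constant input 1, weight = bias."""
    inputs = [1] + list(inputs)
    weights = [bias] + list(weights)
    acc = 0
    for x, w in zip(inputs, weights):
        acc += x * w
    return acc


seeded = chain_with_bias_seed([2, 3], [4, 5], bias=7)
extra = chain_with_extra_unit([2, 3], [4, 5], bias=7)
```

Both give the same result in this sketch; the difference is whether the cost is paid in extra hardware (a bias loop and accumulator input) or in one extra Unit per chain.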