Let's do:
|
float A{}, B{}, C{};
float D = A * B + C;
|
What would it look like on the FPU (approximately)?
1. Load 32-bit A into an 80-bit register
2. Multiply the register by 32-bit B
3. Truncate the register to a 32-bit value
4. Add 32-bit C to the register
5. Truncate & store the register into 32-bit D
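In C++ terms, the strict sequence behaves as if the product were forced back to a 32-bit float before the addition. A minimal sketch (the helper name strict_mul_add is made up, and volatile is used here only as an illustrative way to stop the compiler from carrying extra precision through the intermediate):
|
// Strict IEEE-754 behaviour: the product is rounded to a
// 32-bit float before C is added.
float strict_mul_add(float A, float B, float C) {
    volatile float product = A * B;  // rounded to 32 bits here (step 3)
    return product + C;              // add, then round again on the store
}
|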
How about FPU + fastmath?
1. Load 32-bit A into an 80-bit register
2. Multiply the register by 32-bit B
3. Add 32-bit C to the register
4. Truncate & store the register into 32-bit D
The difference is that now we add C to the 80-bit result of A*B, rather than to the lower-precision 32-bit value that IEEE-754 semantics would dictate.
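Keeping the wide intermediate is close in spirit to a fused multiply-add, which rounds only once. std::fma from <cmath> gives that single-rounding behaviour portably; this is an analogy, not necessarily what fastmath emits (the helper name fused is made up):
|
#include <cmath>

// One rounding in total: A*B is not rounded to 32 bits
// before C is added, similar to keeping the 80-bit intermediate.
float fused(float A, float B, float C) {
    return std::fma(A, B, C);
}
|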
Plan B: let's use the SSE vector instruction set:
1. Load 32-bit A into an SSE register (the register has 128 bits, room for four 32-bit floats)
2. Load 32-bit B into another SSE register
3. Multiply the two registers. The result is a 32-bit value in one lane of the register (whether the compiler uses the scalar MULSS or the packed MULPS, only this one lane holds our data)
4. Load 32-bit C into another SSE register
5. Add the two registers. Again, the result is a 32-bit value in one lane of the register
6. Store that lane into 32-bit D
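The same six steps can be spelled out with SSE intrinsics. A sketch of what the compiler does for you when targeting SSE (the helper name sse_scalar is made up):
|
#include <xmmintrin.h>

float sse_scalar(float A, float B, float C) {
    __m128 a  = _mm_load_ss(&A);    // step 1: A into the lowest lane
    __m128 b  = _mm_load_ss(&B);    // step 2
    __m128 ab = _mm_mul_ss(a, b);   // step 3: 32-bit multiply, lowest lane
    __m128 c  = _mm_load_ss(&C);    // step 4
    __m128 r  = _mm_add_ss(ab, c);  // step 5: 32-bit add, lowest lane
    float D;
    _mm_store_ss(&D, r);            // step 6
    return D;
}
|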
Just like with the plain (strict) FPU, every operation was on 32-bit values.
The benefit of SSE is elsewhere, in "auto-vectorization", which is not on by default (with GCC and Clang it typically requires an optimization flag such as -O3).
Let's take:
|
float A[4]{};
float B[4]{};
float C[4]{};
float D[4]{};
for ( int i=0; i<4; ++i ) {
D[i] = A[i] * B[i] + C[i];
}
|
Our pseudo-code FPU sequence had 5 steps, so the body of the loop takes 20 steps, with the loop bookkeeping (increment, compare, branch) on top of that.
If the compiler is told to vectorize the loop (and if it can), then all it takes is those 6 SSE steps, with no loop bookkeeping at all, to perform the operation for all four floats simultaneously.
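What the vectorized loop boils down to, sketched with packed SSE intrinsics (roughly what the compiler generates, not something you would normally write by hand; the helper name vectorized is made up):
|
#include <xmmintrin.h>

void vectorized(const float A[4], const float B[4],
                const float C[4], float D[4]) {
    __m128 a  = _mm_loadu_ps(A);    // load all four A values at once
    __m128 b  = _mm_loadu_ps(B);
    __m128 ab = _mm_mul_ps(a, b);   // four 32-bit multiplies in one instruction
    __m128 c  = _mm_loadu_ps(C);
    __m128 r  = _mm_add_ps(ab, c);  // four 32-bit adds in one instruction
    _mm_storeu_ps(D, r);            // store all four D values
}
|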