So, I'd like to take a second and talk about optimizing around one PowerPC instruction that causes a world of problems: FCMP
A floating-point compare instruction (fcmp) can introduce an execution delay of 30 CPU cycles if it is immediately followed by a conditional branch instruction.
The execution penalty occurs because the fcmp takes more than one cycle to execute through the CPU pipeline. The CPU detects this and flushes the most recent instructions from the pipelines, and then reissues them while the conditional branch waits for the result. It may have to do this several times before the fcmp result is available. Once the pipeline is empty, execution slows as the pipeline is reloaded through the execution of subsequent instructions.
When FCMP needs to return a float: use FSEL instead
In many cases fcmp can be avoided by using fsel instead. For instance, the floating-point versions of fmin and fmax shown in the following code are branchless, avoid pipeline flushes, and give the compiler more scheduling flexibility:
#define fpmax(a,b) __fsel((a)-(b), (a), (b))
#define fpmin(a,b) __fsel((a)-(b), (b), (a))
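If you don't have a PowerPC toolchain handy, the behaviour of fsel can be sketched in portable C++. The names fsel_ref, fpmax_ref and fpmin_ref below are made up for illustration (they are not real intrinsics), and the stand-in uses an ordinary branch, so it only shows what the instruction computes, not how fast it is: the result is the second operand when the first is greater than or equal to zero, and the third operand otherwise (a NaN in the first operand also picks the third).

```cpp
#include <cassert>

// Hypothetical portable stand-in for the PowerPC __fsel intrinsic.
// Returns b when a >= 0, otherwise c (a NaN in a also selects c).
inline double fsel_ref(double a, double b, double c) {
    return (a >= 0.0) ? b : c;
}

// Same shape as the fpmax/fpmin macros above, written as functions
// so each argument is evaluated exactly once.
inline double fpmax_ref(double a, double b) { return fsel_ref(a - b, a, b); }
inline double fpmin_ref(double a, double b) { return fsel_ref(a - b, b, a); }

// Small usage check.
inline void fsel_ref_demo() {
    assert(fpmax_ref(1.0, 2.0) == 2.0);
    assert(fpmin_ref(1.0, 2.0) == 1.0);
    assert(fpmax_ref(3.0, 3.0) == 3.0);
}
```

On the real hardware you would keep the __fsel versions; the stand-in is only useful for checking the selection logic on a desktop machine.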
So for a real-world example, here's some code that enforces a minimum size on a bounding box that could otherwise be too small:
Before:
if((maxCorner.x - minCorner.x) < cMinDelta)
maxCorner.x = minCorner.x+cMinDelta;
if((maxCorner.y - minCorner.y) < cMinDelta)
maxCorner.y = minCorner.y+cMinDelta;
if((maxCorner.z - minCorner.z) < cMinDelta)
maxCorner.z = minCorner.z+cMinDelta;
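After:
// adjusted max values, computed up front whether or not they are needed
const float maxX = minCorner.x+cMinDelta;
const float maxY = minCorner.y+cMinDelta;
const float maxZ = minCorner.z+cMinDelta;
// each test is >= 0 when the box is thinner than cMinDelta on that axis
const float xTest = cMinDelta - _fabs(maxCorner.x - minCorner.x);
const float yTest = cMinDelta - _fabs(maxCorner.y - minCorner.y);
const float zTest = cMinDelta - _fabs(maxCorner.z - minCorner.z);
// fsel returns its second argument when the first is >= 0, else its third
maxCorner.x = __fsel(xTest,maxX,maxCorner.x);
maxCorner.y = __fsel(yTest,maxY,maxCorner.y);
maxCorner.z = __fsel(zTest,maxZ,maxCorner.z);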
In the above situation, it was much more efficient to compute both candidate values up front, turn the float comparison into a subtraction, and let fsel choose between the original max value and the adjusted max value.
The basic concept here is to turn the fcmp into a mathematical operation and then use fsel to decide between two paths.
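To make that rewrite rule concrete, here is a minimal sketch of how a couple of common comparisons map onto a subtract-then-select. It reuses a portable stand-in for fsel so it can run anywhere; fsel_ref, select_ge, select_lt and clamp_ref are all hypothetical names for this example, and on PowerPC you would call __fsel directly. One caveat worth keeping in mind: the subtraction can produce NaN (for example inf - inf), in which case the third operand is selected.

```cpp
#include <cassert>

// Hypothetical portable stand-in for __fsel, repeated here so the
// snippet stands alone: returns b when a >= 0, otherwise c.
inline double fsel_ref(double a, double b, double c) {
    return (a >= 0.0) ? b : c;
}

// (x >= y) ? p : q   becomes   fsel(x - y, p, q)
inline double select_ge(double x, double y, double p, double q) {
    return fsel_ref(x - y, p, q);
}

// (x < y) ? p : q    becomes   fsel(x - y, q, p)   (operands swapped)
inline double select_lt(double x, double y, double p, double q) {
    return fsel_ref(x - y, q, p);
}

// Clamp v into [lo, hi] with two selects and no branches in the
// intrinsic version.
inline double clamp_ref(double v, double lo, double hi) {
    v = fsel_ref(v - lo, v, lo);     // v >= lo ? v : lo
    return fsel_ref(hi - v, v, hi);  // hi >= v ? v : hi
}

// Small usage check.
inline void select_demo() {
    assert(select_lt(1.0, 2.0, 10.0, 20.0) == 10.0);
    assert(select_ge(2.0, 2.0, 10.0, 20.0) == 10.0);
    assert(clamp_ref(5.0, 0.0, 3.0) == 3.0);
}
```

Both outcomes are always computed, so this only pays off when the two values are cheap compared to a pipeline flush.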
When FCMP needs to return a bool: use bitmasking
The main problem with fsel is that it returns a floating-point value. So it doesn't help you when you're attempting comparisons that return boolean results (assume a & b are floats):
Before:
if(a < 0) continue;
After:
const int pT0 = reinterpret_cast < const int* > (&a)[0];
if(pT0 & 0x80000000) continue; // sign bit is set when a is negative (also true for -0.0f)
Before:
if(a == b) continue;
After:
const float test0 = a - b;
const int pT0 = reinterpret_cast < const int* > (&test0)[0];
if(pT0 == 0) continue; // all bits are zero if a == b
Before:
if(a > b) continue;
After:
const float test0 = b - a;
const int pT0 = reinterpret_cast < const int* > (&test0)[0];
if(pT0 & 0x80000000) continue; // sign bit is set (b - a is negative) if a > b
reinterpret_cast lets us view the binary representation of the float value without converting it to an int. This helps us eliminate any potential load-hit-store penalties that could occur in the process. Note that these tricks assume a 32-bit IEEE 754 float and a 32-bit int.
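One thing to be aware of: reading a float through an int pointer is technically undefined behaviour under the standard C++ aliasing rules, even though it works on the compilers this trick is normally used with. As a hedged alternative, here is a sketch that gets at the same bits through std::memcpy. The helper names (float_bits, sign_bit_set, greater_than_bits) are hypothetical, and I'm not making any claim about what your particular PowerPC compiler generates for it, so check the disassembly before swapping it in.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Hypothetical helper: copy the raw bits of a float into an integer.
inline std::uint32_t float_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// True when the IEEE 754 sign bit is set (negative values and -0.0f).
inline bool sign_bit_set(float f) {
    return (float_bits(f) & 0x80000000u) != 0;
}

// a > b, tested through the sign of (b - a) as in the example above.
inline bool greater_than_bits(float a, float b) {
    return sign_bit_set(b - a);
}

// Small usage check.
inline void float_bits_demo() {
    assert(sign_bit_set(-1.0f));
    assert(!sign_bit_set(1.0f));
    assert(greater_than_bits(2.0f, 1.0f));
    assert(!greater_than_bits(1.0f, 1.0f));
}
```

As with the pointer version, NaN and infinite inputs don't behave like a real float compare here, so only use this where the inputs are known to be ordinary finite values.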
~Main