4.30.2008

Low Level Performance : FCMP

The 5th rule of performance optimization is "Know thy compiler." Specifically, know what your C++ code is generating at the machine level, and understand that there may be things going on further down that could cause performance concerns that do not seem apparent in your higher level language. This is a difficult concept for any novice programmer, as it usually means you have to give a 2nd look at your code for performance that occurs at the syntax level, which is the most difficult for programmers to address. (We tend to like program flow and how we decide things...)

So, I'd like to take a second and talk about one optomization for PowerPC platforms that causes a world of problems : FCMP

A floating-point compare instruction (fcmp) can introduce an execution delay of 30 CPU cycles, if it is immediately followed by a conditional branch instruction.

The execution penalty occurs because the fcmp takes more than one cycle to execute through the CPU pipeline. The CPU detects this and flushes the most recent instructions from the pipelines, and then reissues them while the conditional branch waits for the result. It may have to do this several times before the fcmp result is available. Once the pipeline is empty, execution slows as the pipeline is reloaded through the execution of subsequent instructions.

When FCMP needs to return a float : use FSEL instead
In many cases fcmp can be avoided by using fsel instead. For instance, the floating-point versions of fmin and fmax shown in the following code, are branchless, avoid pipeline flushes, and also give the compiler more scheduling flexibility:



#define fpmax(a,b) __fsel((a)-(b), a,b)
#define fpmin(a,b) __fsel((a)-(b), b,a)



So for a real world example, here's some code that will add a minimum distance to a bounding box when it can be potentially small:

Before :


if((maxCorner.x - minCorner.x) < cMinDelta)
   x = minCorner.x+cMinDelta;
if((maxCorner.y - minCorner.y) < cMinDelta)
   y = minCorner.y+cMinDelta;
if((maxCorner.z - minCorner.z) < cMinDelta)
   z = minCorner.z+cMinDelta;



After :

const float maxY = minCorner.y+cMinDelta;
const float maxZ = minCorner.z+cMinDelta;

const float xTest = cMinDelta - _fabs(maxCorner.x - minCorner.x);
const float yTest = cMinDelta - _fabs(maxCorner.y - minCorner.y);
const float zTest = cMinDelta - _fabs(maxCorner.z - minCorner.z);

maxCorner.x = __fsel(xTest,maxX,maxCorner.x);
maxCorner.y = __fsel(yTest,maxY,maxCorner.y);
maxCorner.z = __fsel(zTest,maxZ,maxCorner.z);


In the above situation, it was much more efficient to compute the result of the float comparison, and instead use the fsel to decide between the original max value, and the adjusted max value.

The basic concept here is to turn the fcmp into a mathematical operation and then use fsel to decide between two paths.


When FCMP needs to return a bool : use bitmasking
The main problem with fsel is that it returns a floating point value. So it doesn't help you when you're attempting comparisons that return boolean results (assume a & b are floats):

Before:

if(a > 0) continue;

After:

const int pT0 = reinterpret_cast < const int* > (&a)[0];
if(pT0 & 0x80000000)continue; // check the sign bit




Before:

if(a == b) continue;

After:

const float test0 = a - b;
const int pT0 = reinterpret_cast < const int* > (&test0)[0];
if(pT0 == 0)continue; //will be null if a==b




Before:

if(a > b) continue;

After:

const float test0 = b - a;
const int pT0 = reinterpret_cast < const int* > (&test0)[0];
if(pT0 & 0x80000000)continue; //sign bit will be negative if a > b


The reinterpret_cast function allows us to view the binary representation of the float value w/o having to do a conversion to int. This helps us eliminate any potential load-hit-store penalties that could occur in the process.


~Main

4.20.2008

Can you pull that out? Thanks...

Abstracting code out to proper, generic, reusable containers is not something that the novice programmer excels at. Most often, code produced is highly specific, very static, and often, not to easy to modify and maintain over time. Early on, I devised a way to fix most of these issues, as I was writing the code. As time went on, I learned to identify the things that could be extracted out into generic containers early on, and avoided having to do multiple passes over my code to generate them.

"Where is this useful?" I've often found that data managers can, on the whole, be turned into generic functions. For example, a streaming terrain system could be abstracted to several generic containers : Generic Cache, LRU Cache, Streaming Class, Streamed Cache, Streamed Cache populator etc etc. These all could be rolled into one manager for the process, but then if you'd like to duplicate everything for streaming meshes, well, you'd almost have to re-write the entire damn thing!

In general, here's the process I hold to :

  1. Write the code as it functions and works in your head. Make sure you're including all the bells and whistles you need. As a general rule here, you want to make the interfaces as slim and generic as possible. Not just because you're planning on genericising it, but because that's just good programming. Note, this also includes allowing massive initialization parameters to be bunched up into classes so that if your function is overridden, the parameters can also be overriden to be expanded by other implementing classes.
  2. Do a quick check to see if any of the operating functionality of your class can be either replaced, or better represented with common generic containers. I'm not saying that your custom, slim, name-value hash paring isn't good, but if it can be replaced by an STL container, or another one that is used elsewhere in the code, you can bet it may be more reliable, and have more general functions. Yea, and as stupid as this sounds, you'd be amazed how many times you'll re-write basic functionality for a system that already exists in a generic form elsewhere. Don't laugh, you're guilty too!
  3. Take a look at your data and how your using it. Could there be any part of your data that could be considered abstracted? Can how you're using it be abstracted? In some cases, can your data be replaced by generic dynamic containers? (Ie can a statically allocated array be replaced by a statically templated array?)
  4. Create new containers that are 'above' your current class. Always approach your current class as the result of inheritance of several container classes; IE it's the leaf of the tree. Start with baby steps in this process. If you get hung up in an area that can't be genericied, then consider it as part of your leaf class, and not part of your generic container. Standard warnings apply about ensuring that your generic classes are memory safe, potentially thread safe, and well documented about it's differences and operations against other containers.
  5. Repeat these steps until you can get at least one generic container out of your data manager. If you can get more, even better.

Yea, it's a bit slower than you'd like, but atleast you're getting some good, reusable stuff out of it. And really, when I do code reviews, this is one of the things I look at, just how generic is your code, son?

~Main