So here's my personal checklist of tasks when looking at slow code.
- Anything that can be moved into the Async Phase should be.
  - This is the smallest change footprint in terms of code flow, and yields the biggest perf gains. IMHO we should be taking a serious look at the sim for this purpose right now.
- Can anything be ignored / not processed?
  - The fastest processing is the processing we don't have to do. (You'd be surprised how often this is a big factor…)
- Can anything be batched / deferred?
  - LHS (load-hit-store) and pointer-chasing issues can sometimes be addressed by this process.
- Inner loop fixes
  - Can things be bubbled up, out of the inner loop?
  - Is it faster to compute elements of the loop separately?
    - E.g. it may be faster to compute all the VMX operations in one loop before the data processing loop. Analysis shows that VMX hits a sweet spot at high instruction throughput, so computing one at a time inside the loop could actually be slower, especially if you're just moving the data from the VMX registers to the float or int registers.
  - Address slow code inside the loop
    - LHS, L2 cache misses, etc.
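
To make the middle of that list concrete, here's a minimal sketch of filtering out unneeded work, bubbling an invariant up out of the loop, and splitting one loop into two passes. Everything in it is made up for illustration: `Item`, `sumScaledNaive` and `sumScaledReworked` are hypothetical names, and plain scalar floats stand in for what would be VMX work on the real hardware.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical data layout, for illustration only.
struct Item {
    float value;
    bool  active;  // inactive items need no processing at all
};

// Baseline: the filter, the invariant math and the per-item math
// all live inside the same loop.
float sumScaledNaive(const std::vector<Item>& items, float angle) {
    float total = 0.0f;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (!items[i].active) continue;
        total += items[i].value * std::cos(angle);  // recomputed every iteration
    }
    return total;
}

// Reworked: invariant bubbled up, then a filter pass, then a tight math pass.
float sumScaledReworked(const std::vector<Item>& items, float angle) {
    const float scale = std::cos(angle);  // computed once, outside the loops

    // Pass 1: drop the work we don't have to do and pack the rest
    // into a contiguous buffer.
    std::vector<float> packed;
    packed.reserve(items.size());
    for (const Item& item : items) {
        if (item.active) packed.push_back(item.value);
    }
    assert(packed.size() <= items.size());

    // Pass 2: nothing but math over contiguous data, no branches.
    float total = 0.0f;
    for (float v : packed) total += v * scale;
    return total;
}
```

Whether the two-pass version actually wins depends on the data and the platform, since the extra buffer costs memory traffic of its own, so profile both before committing to it.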
Some have told me that I'm doing it backwards, that I should be looking to optimize the inner loop first. Truth be told, optimizing the inner loop rarely gets you the massive speed increases you'd expect. And often, you'll find that the slowness of inner-loop code just can't be fixed.
What then, smart ass?
In my experience, looking at multithreading as a first step gets you the best performance improvement for the fewest code changes. For instance, if you've got a 6-thread thread pool and work that splits cleanly across it, you can expect something close to a 6x speedup of your processing right away. Not bad for a day's work... Now, if your code's not thread safe, or just can't be computed in parallel, then you need to start looking at serialized fixes, i.e. bubbling slow code up out of loops and filtering out operations that don't need computation. You'll rarely get that type of performance increase from simply filtering out unneeded work, but every bit helps, and once you've cut the chaff, you can start looking at optimizing each path (or combining paths) to get even greater gains.
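
As a rough sketch of the "split it across the workers" idea, here's one way it could look with plain `std::thread`. The function name `parallelSum` and the chunking scheme are my own illustration, not code from any particular engine, and it spawns threads per call for brevity where a real thread pool would reuse them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` by giving each worker its own contiguous chunk.
// Each worker writes only to its own slot, so no locking is needed.
long long parallelSum(const std::vector<int>& data, unsigned workerCount) {
    assert(workerCount > 0);
    std::vector<long long> partial(workerCount, 0);
    std::vector<std::thread> workers;
    workers.reserve(workerCount);

    // Round up so every element lands in some chunk.
    const std::size_t chunk = (data.size() + workerCount - 1) / workerCount;
    for (unsigned w = 0; w < workerCount; ++w) {
        const std::size_t begin = std::min(data.size(), w * chunk);
        const std::size_t end   = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, w, begin, end] {
            long long local = 0;  // accumulate locally, publish once
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[w] = local;
        });
    }
    for (std::thread& t : workers) t.join();

    // The serial part: combine the per-worker results.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

The near-6x figure only holds when the chunks are independent and big enough to swamp the cost of handing out and joining the work; anything that stays serial (like the final combine step here) eats into it.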
At the end of the list are the inner loop optimizations, which include low-level performance issues. I wait until the end of the perf pass for these, mainly because it takes quite a bit longer to track down and address these issues than it does for the other items above.
~Main