There is no royal road to optimized code. Help wanted.
#1
Over the last few days I decided to step away from GLSL shader support in ZZogl (the new shaders are at a working stage, though a few bugs still need to be eliminated) and redo one of the weirdest pieces of code in ZZogl, untouched since the days of ZeroGS: Mem.cpp.

The main goal of Mem.cpp (and the Mem.h header) is to provide easy access to pixel data stored in memory. At this point GSDX and ZZogl look similar, simply because the data storage layout is covered by the PS2 GS manual.

So we have functions that convert a pixel's x, y into a memory address (getPixelAddress##psm), plus two functions, writePixel##psm and ReadPixel##psm. The most troublesome part is ##psm: we have two copies of each function for every psm (and there are 11 of them in total). That gives us 66 functions, most of them copy-pasted with small changes. So it seems appropriate to get rid of the psm part. By the way, psm is the pixel storage format, which describes how data is laid out in memory; the PS2 manual covers it completely.

How are these functions used? They are called from switch blocks that make a call for every psm we have. Such switches are big and hidden everywhere. Could we use one function for every getPixelAddress? Oh yes. It is not too hard either. Pixels in PS2 memory are stored in blocks of 2K size, and within each block the pixels are swizzled in a specific manner. The size of the block, W * H, is psm-specific: it can be 64*32, 64*64, 128*64 or 128*128 (depending on the pixel size: 32, 16, 8 or 4 bits).

How do we find the number of the storage block for pixel x, y? Easy: x / W + (y / H) * (width / W).
To obtain the pixel address inside the block we have a table gPageTable[psm][W][H]. So our result should be getAddress = baseAddress + (x / W + (y / H) * (width / W)) * 4 + gPageTable[psm][x % W][y % H] * PixelSize.
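
To make the indexing concrete, here is a minimal sketch of the block lookup with run-time parameters. The struct and function names are made up for illustration and are not the ones in ZZogl; the swizzle table itself is left out, only its two indices are computed. It assumes the buffer width is a multiple of W.

```cpp
#include <cstdint>

// Hypothetical description of one storage format (not a ZZogl type).
struct PsmInfo {
    uint32_t W;  // block width in pixels
    uint32_t H;  // block height in pixels
};

// Number of the storage block that holds pixel (x, y):
// x / W + (y / H) * (width / W), exactly as in the text.
inline uint32_t blockIndex(const PsmInfo& f, uint32_t x, uint32_t y,
                           uint32_t width) {
    return x / f.W + (y / f.H) * (width / f.W);
}

// Coordinates of the pixel inside its block. These would be the two
// indices into a swizzle table such as gPageTable[psm][..][..].
inline uint32_t inBlockX(const PsmInfo& f, uint32_t x) { return x % f.W; }
inline uint32_t inBlockY(const PsmInfo& f, uint32_t y) { return y % f.H; }
```

For example, with W = 64, H = 32 and a 640-pixel-wide buffer, pixel (70, 40) sits in the second block of the second block row, at offset (6, 8) inside it.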

Well, we implemented this code and tried it on a real game: Breath of Fire. This game uses pixel transfers a lot, so we could see that performance dropped by 40%! Why? Aren't our new functions doing exactly the same thing as the old ones?

The answer is simple, and it lies in the compiler's optimization routines. The old functions have constant values of W, H and PixelSize; the new one does not. So when the compiler does inlining, it does not know that this function should be split into 11 different ones, one per psm! It emits a "global" variant of the code, with these variables fetched from memory. The x / W operations are also not optimized well in the global variant: the old code gets a nice x >> CONST for integer division, the new one only x >> VAR.
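
The difference can be sketched with two versions of the same block-number calculation. This is only an illustration with made-up names, not the ZZogl code: the first takes W and H at run time, the second takes them as template parameters, so each instantiation sees them as constants and an optimizing compiler is free to turn the unsigned divisions by powers of two into shifts.

```cpp
#include <cstdint>

// Run-time variant: W and H are ordinary values, so the divisions
// stay generic in the compiled code.
inline uint32_t blockIndexRuntime(uint32_t x, uint32_t y, uint32_t width,
                                  uint32_t W, uint32_t H) {
    return x / W + (y / H) * (width / W);
}

// Compile-time variant: one instantiation per (W, H) pair, each with
// the block size baked in as a constant.
template <uint32_t W, uint32_t H>
inline uint32_t blockIndexConst(uint32_t x, uint32_t y, uint32_t width) {
    static_assert(W != 0 && (W & (W - 1)) == 0, "W must be a power of two");
    static_assert(H != 0 && (H & (H - 1)) == 0, "H must be a power of two");
    return x / W + (y / H) * (width / W);
}
```

Both return the same value for the same inputs; only the generated code is expected to differ, and how much depends on the compiler and its flags.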

Well, to solve this mystery I wrote a funny define that turns the function call into a switch with fixed psms. It keeps performance at almost the same level. But it seems that I lose 5% of performance somewhere, and that's bad.
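
A rough sketch of what such a define could look like follows. It is a guess at the general shape, not the actual macro: the enum, its four entries and all names are hypothetical, and the block sizes are assumed to be 64*32, 64*64, 128*64 and 128*128 for 32, 16, 8 and 4 bit pixels. The switch is written once and forwards to a version of the function whose W and H are compile-time constants.

```cpp
#include <cstdint>

// Hypothetical format ids; the real code has 11 psm values.
enum Psm : uint32_t { kPsm32, kPsm16, kPsm8, kPsm4 };

// Block-number calculation with W and H fixed at compile time.
template <uint32_t W, uint32_t H>
inline uint32_t blockIndex(uint32_t x, uint32_t y, uint32_t width) {
    return x / W + (y / H) * (width / W);
}

// One case of the switch: call the instantiation for this format.
#define PSM_CASE(psm, W, H) \
    case psm: return blockIndex<W, H>(x, y, width)

// Single entry point for callers; the switch lives here and nowhere else.
inline uint32_t blockIndexFor(Psm psm, uint32_t x, uint32_t y,
                              uint32_t width) {
    switch (psm) {
        PSM_CASE(kPsm32, 64, 32);
        PSM_CASE(kPsm16, 64, 64);
        PSM_CASE(kPsm8, 128, 64);
        PSM_CASE(kPsm4, 128, 128);
    }
    return 0;  // unknown format
}
#undef PSM_CASE
```

The calling code then contains a plain blockIndexFor(psm, x, y, width) instead of its own switch. Whether the remaining cost of the dispatch matters will depend on how often it runs inside the transfer loops.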

So what's the point? I don't know the right way to solve this mystery. I want to get rid of the switches in the main code, which make it unreadable (and in the process I found a few silly bugs). But I don't know how to keep the code at the very same performance.