Myth #1: More registers would surely help Pcsx2!
No. I mean, yeah, sure, it might help a couple of specific scenes in some games, but most of Pcsx2's problem is simply the nuances of recompilation and PS2 hardware emulation (events, IRQs, etc.). These things all require register flushes, and they happen often enough that Pcsx2's recompilers rarely get a chance to max out the current XMM register allocation as it is. x64's extra 8 registers would help startup times and maybe some FMV playback, but would be nearly useless for the bulk of VU1 work (which has pipeline problems in addition to ~64 registers that make register allocation impossible at any level, unless we had about 128 registers handy). Note: VU1 is currently the slowest part of the emulator, so if it doesn't help there, it's not going to be worthwhile. Furthermore, Pcsx2's system of MIPS register mapping is quite fast, with the whole table persisting in the L1 cache most of the time, which is basically as fast as actual CPU register use. Optimizing these writebacks away by using additional registers helps reduce code size, which is good for L1/L2 cache pollution, but in the end doesn't speed up the code by much, because:
Myth #2: x64 doesn't inherently suck in design!
It sure does. x64 brings a lot of baggage with it, namely in the form of code/data bloat. Using the extra x64 registers requires a prefix byte. Lookup tables become 8 bytes per entry instead of 4. Other LUTs must be converted to base+offset form instead of direct 32-bit address form because they're too large to afford to double (going from 80MB to 160MB for our primary recompiler LUT would suck, for example). In addition, x64's addressing system is flawed and often requires a pair of instructions, a register allocated specifically for the purpose of fetching/storing data, or both. All of these increase code and data size, pollute the L1/L2 caches, and negate most of the (limited) benefit of additional XMM registers cleaning up the caches.
For these reasons, the x64 memory addressing model would incur severe penalties both in our memory emulation layer (vtlb) and our recompiled block lookup table, due to increased table size, and/or increased number of operations to handle them. These are the most frequently executed code parts of the EE, thus they matter a lot.
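To put rough numbers on the table-size problem, here's a back-of-the-envelope sketch. It is not Pcsx2's actual code: it assumes the 80MB recompiler LUT is made up entirely of 4-byte entries (an assumption for illustration only), and the names `lut_size_mb`, `lookup_direct`, and `lookup_base_offset` are hypothetical.

```python
MB = 1024 * 1024

def lut_size_mb(entries, bytes_per_entry):
    """Size in MB of a flat lookup table."""
    return entries * bytes_per_entry / MB

# Assumption: the 80MB table consists entirely of 4-byte entries.
ENTRIES = 80 * MB // 4

size_x32 = lut_size_mb(ENTRIES, 4)       # direct 32-bit addresses
size_x64_ptrs = lut_size_mb(ENTRIES, 8)  # full 64-bit pointers
size_x64_offs = lut_size_mb(ENTRIES, 4)  # 32-bit offsets from a base

def lookup_direct(table, index):
    """x32 style: the entry is the address itself (one load)."""
    return table[index]

def lookup_base_offset(table, index, base):
    """x64 base+offset style: one load plus an extra add."""
    return base + table[index]

print(size_x32, size_x64_ptrs, size_x64_offs)
```

Base+offset form keeps the table at its original size, but every lookup then pays for an extra add and for keeping the base somewhere handy, which is the "increased number of operations" side of the trade-off.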
Myth #3: The PS2's EE is 64 bits, so surely the 64 bit functions of x64 would be a big speedup!
Not really. Most of the EE's 64-bit functions are pseudo-64. They're actually 32-bit arithmetic where the result is sign-extended into the upper 32 bits of the 64-bit register. x86/32 includes a special instruction for such sign extension (CDQ), and it's quite fast. Additionally, the recompiler performs liveness checks on the upper 32 bits, and a good portion of the time the upper 32 bits are known to be constant or preserved. In those cases the current rec simply uses 32-bit math regardless, so x64 would be no gain at all for those. Moreover, a handful of the EE's 64-bit operations can be done using MMX or XMM registers and instructions, so those too would see no change when going from x32 to x64. So yeah, x64 would be a minor benefit, but none too significant, and easily outweighed by the extra overhead incurred by the recompiler and vtlb LUTs (which are both more frequently executed code anyway).
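A minimal sketch of what "pseudo-64" means in practice, using a MIPS ADDU-style add as the example. This models the arithmetic only and is not taken from the Pcsx2 recompiler; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def sign_extend_32_to_64(value):
    """Copy bit 31 into the upper 32 bits, as CDQ does for EAX -> EDX:EAX."""
    value &= MASK32
    if value & 0x80000000:
        return value | 0xFFFFFFFF00000000
    return value

def addu_pseudo64(rs, rt):
    """32-bit add of the low halves, result sign-extended to 64 bits."""
    return sign_extend_32_to_64((rs & MASK32) + (rt & MASK32))

def split_halves(value64):
    """Return (low 32 bits, high 32 bits) of a 64-bit value."""
    return value64 & MASK32, (value64 >> 32) & MASK32

result = addu_pseudo64(0x7FFFFFFF, 1)
lo, hi = split_halves(result)
print(hex(lo), hex(hi))
```

The upper half only ever comes out as all zeros or all ones, and it is fully determined by the low half. That's why a 32-bit add plus one cheap sign extension covers it, and why the extension can be skipped entirely when the upper half is known not to matter.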
Myth #4: Clearly no one has done proper experimentation or benchmarked Pcsx2 under x64 recompiler conditions!
Whoops? The 0.9.4 and 0.9.5 SVN versions had working x64 builds. I guess someone forgot to do their research. Those builds were also roughly 10-15% slower than the x32 build on most in-game scenes (although rumor has it the BIOS ran slightly faster, for what that's worth!). Admittedly they were not at all well optimized, but there wasn't much point in continuing the work anyway, since it was clear from an educated standpoint that it wasn't going to reap many benefits (if any).
Moral of the Story: Compiled programs have tremendous optimization advantages over recompiled code in the land of x64, so looking at a 3DMark compiled for x64 has no bearing on Pcsx2's performance scale. The PS2 doesn't need much actual 64-bit arithmetic. Add to that an MMX unit available for certain types of 64-bit arithmetic and memory operations even on 32-bit (something most compilers fail to make much use of in compiled code, but Pcsx2 uses quite effectively), and you start to see the gap closing quickly. Additionally, few apps work in such an "inefficient" manner as an emulator: the emulator usually cannot risk rearranging instructions, cannot "forceinline" its emulated program's function calls, cannot optimize registers across block boundaries, and must rely heavily on memory remapping techniques. None of these things bode well for the 'advantages' of x64. And that's why we opted to stick with x32.
Jake Stine (Air) - Programmer - PCSX2 Dev Team