Is it possible to compile PCSX2 in 64-bit?
#11
Myth #1: More registers would surely help Pcsx2!

No. I mean, yeah, sure, it might help a couple of specific scenes in some games, but most of Pcsx2's problem is simply the nuances of recompilation and PS2 hardware emulation (events, IRQs, etc) -- these things all require register flushes, and they happen often enough that Pcsx2's recompilers rarely get a chance to max out the current XMM register allocation as it is. x64's extra 8 registers would help startup times and maybe some FMV playback, but would be nearly useless for the bulk of VU1 work (which has pipeline problems in addition to ~64 registers that make register allocation impossible at any level, unless we had about 128 registers handy). Note: VU1 is currently the slowest part of the emulator, so if it doesn't help there, it's not going to be worthwhile. Furthermore, Pcsx2's system of MIPS register mapping is quite fast, with the whole table persisting in the L1 cache most of the time, which is basically as fast as actual CPU register use. Optimizing these writebacks to additional registers helps reduce code size, which reduces L1/L2 cache pollution -- but in the end doesn't speed up the code by much, because:

Myth #2: x64 doesn't inherently suck in design!

It sure does. x64 brings a lot of baggage with it, namely in the form of code/data bloat. Using x64's extra registers requires a REX prefix byte. Lookup tables become 8 bytes per entry instead of 4. Other LUTs must be converted to base+offset form instead of direct 32-bit address form because they're too large to afford to double (going from 80MB to 160MB for our primary recompiler LUT would suck, for example). In addition, x64's addressing system is flawed and often requires a pair of instructions, a register allocated specifically for fetching/storing data, or both. All of these increase code and data size, pollute the L1/L2 caches, and negate most of the (limited) benefit of additional XMM registers cleaning up the caches.

For these reasons, the x64 memory addressing model would incur severe penalties both in our memory emulation layer (vtlb) and in our recompiled block lookup table, due to increased table size and/or an increased number of operations to handle them. These are the most frequently executed parts of the EE code, so they matter a lot. :(
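
To put rough numbers on the table-size argument, here is a minimal sketch, not Pcsx2 code. It assumes the 80MB figure means 80 MiB of 4-byte entries (the post does not give the entry count), and the helper names are hypothetical. It compares widening every entry to 8 bytes against keeping 4-byte offsets relative to a single 64-bit base.

```python
# Back-of-the-envelope model of the lookup-table sizes discussed above.
# Assumption: "80MB" is read as 80 MiB of 4-byte entries; the real entry
# count and layout of the recompiler's table are not given in the post.

MIB = 1024 * 1024


def table_bytes(entries: int, entry_size: int) -> int:
    """Total size of a flat table with `entries` slots of `entry_size` bytes."""
    return entries * entry_size


def resolve(base: int, offset: int) -> int:
    """Base+offset form: one extra add on every lookup to rebuild the address."""
    return base + offset


entries = (80 * MIB) // 4            # slots in the assumed 32-bit table

direct_32 = table_bytes(entries, 4)  # 32-bit absolute addresses
direct_64 = table_bytes(entries, 8)  # 64-bit absolute addresses
offset_64 = table_bytes(entries, 4)  # 32-bit offsets from one 64-bit base

print(direct_32 // MIB, direct_64 // MIB, offset_64 // MIB)  # 80 160 80
```

The offset form keeps the table at its 32-bit size, but pays for it with the extra add (and usually a register holding the base) on each lookup, which is the trade-off described above.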

Myth #3: The PS2's EE is 64 bits, so surely the 64 bit functions of x64 would be a big speedup!

Not really. Most of the EE's 64-bit functions are pseudo-64. They're actually 32-bit arithmetic where the result is sign-extended into the upper 32 bits of the 64-bit register. x86/32 includes a special instruction for such sign extension (CDQ), and it's quite fast. Additionally, the recompiler performs liveness checks on the upper 32 bits, and a good portion of the time the upper 32 bits are known to be constant or preserved. In those cases the current rec simply uses 32-bit math regardless, so x64 would be no gain at all for those. Moreover, a handful of the EE's 64-bit operations can be done using MMX or XMM registers and instructions, so those too would see no change when going from x32 to x64. So yeah, x64 would be a minor benefit, but none too significant, and easily outweighed by the extra overhead incurred by the recompiler and vtlb LUTs (which are both more frequently executed code anyway).
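
As an illustration of what "pseudo-64" means here, the following is a minimal sketch, not the recompiler's actual code. The function names are hypothetical, and plain integers stand in for registers. It shows a 32-bit add whose result is sign-extended to 64 bits next to a true 64-bit add.

```python
# Model of a "pseudo-64" operation: 32-bit arithmetic whose result is
# sign-extended to fill a 64-bit register. Function names are hypothetical.

MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend_32_to_64(value: int) -> int:
    """Copy bit 31 into bits 32-63, as CDQ does for EAX -> EDX:EAX."""
    low = value & MASK32
    high = MASK32 if low & 0x80000000 else 0
    return (high << 32) | low


def add32_sign_extended(a: int, b: int) -> int:
    """Add the low 32 bits of a and b, then sign-extend the 32-bit result."""
    return sign_extend_32_to_64((a + b) & MASK32)


def add64(a: int, b: int) -> int:
    """A true 64-bit add, for comparison."""
    return (a + b) & MASK64


print(hex(add32_sign_extended(0x7FFFFFFF, 1)))  # 0xffffffff80000000
print(hex(add64(0x7FFFFFFF, 1)))                # 0x80000000
```

The upper half of the pseudo-64 result carries no information beyond the sign bit of the lower half, which is why 32-bit math plus a cheap sign extension covers these cases.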

Myth #4: Clearly no one has done proper experimentation or benchmarked Pcsx2 under x64 recompiler conditions!

Whoops? Versions 0.9.4 and 0.9.5 SVNs had working x64 builds. I guess someone forgot to do their research. Those builds were also roughly 10-15% slower than the x32 build on most in-game scenes (although rumor has it the BIOS ran slightly faster, for what that's worth!). And indeed they were not at all well optimized, but there wasn't much point in continuing the work anyway, since it was clear from an educated standpoint that it wasn't going to reap many benefits (if any).



Moral of the Story: Compiled programs have tremendous optimization advantages over recompiled code in the land of x64, so looking at a 3Dmark compiled in x64 has no bearing on Pcsx2's performance scale. The PS2 doesn't need much actual 64-bit arithmetic, and add to that an MMX unit available for certain types of 64-bit arithmetic and memory operations even on 32-bit (something most compilers fail to make much use of in compiled code, but Pcsx2 uses quite effectively), and you start to see the gap closing quickly. Additionally, few apps work in such an "inefficient" manner as an emulator -- in that the emulator usually cannot risk rearranging instructions, cannot "forceinline" its emulated program's function calls, cannot optimize registers across block boundaries, and must rely heavily on memory remapping techniques. None of these things bode well for the 'advantages' of x64. And that's why we opted to stick with x32.
Jake Stine (Air) - Programmer - PCSX2 Dev Team
#12
Thank you, Jake (Air), for that post. You have answered all my outstanding questions about why x64 would not be beneficial, in a very informative way that I very much appreciate. It all just seemed counter-intuitive before (surely all those extra registers would be beneficial), whereas now I see that the limiting factors are elsewhere and that the larger size of x64 instructions could slow things down more than speed them up. Yes, I was aware of the increased instruction size and some other associated downsides to x64 mode, but didn't know it had already been tested in earlier versions of the emulator. Sorry.

Anyway, thank you very much for that comprehensive reply. It has more than answered all my questions.

Interesting about VU1 being the most demanding chip to emulate. I might have another look to see if the game I'm particularly interested in benefits much from the quickest form of FP clamping.

Would there be any benefit from having separate VU0 and VU1 clamping options, given that VU1 is the most demanding, and VU0 the most prone to problems from what I've read?
CPU: Athlon 64 X2 4400+ (2.2GHz @ solid 2.53GHz)
GPU: nVidia GeForce 8800GTS 640MB (not currently O/C)
Memory: 2GB DDR400 (2x 1GB @ DDR422 2.5-3-2)
#13
The VU1's demands are heavily dependent on the game -- some games use it little if at all, and others use it a lot. The games with the slowest performance are always VU1-heavy though, while games that people talk of as running fast even on P4s use the VU1 little or not at all.

On the AMD the VU1 tends to be especially slow, because the unit is *so* totally reliant on SSE and SSE4.1 (the latter not being available on AMD yet). Furthermore, some games do some particularly evil types of SSE operations that require a lot of denormalizing, and spin on them a lot just to compound the issue, and the AMDs don't like that very much (the DaZ option on Core2 Duos ends up being a huge speed boost on such games, but the AMD doesn't have an optimized path for DaZ).
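
For readers unfamiliar with DaZ (denormals-are-zero), here is a rough software model, not Pcsx2 code. Denormals are nonzero floats smaller in magnitude than the smallest normal value. With DaZ enabled, the CPU reads such inputs as zero instead of handling them on a slow path. The sketch assumes single-precision limits (smallest normal value 2^-126), and the function names are hypothetical.

```python
import math

# Smallest positive *normal* single-precision value (2**-126).
FLT_MIN_NORMAL = 2.0 ** -126


def is_denormal32(x: float) -> bool:
    """True if x is nonzero but too small to be a normal 32-bit float."""
    return x != 0.0 and abs(x) < FLT_MIN_NORMAL


def daz(x: float) -> float:
    """Software model of DaZ: denormal inputs are read as signed zero."""
    return math.copysign(0.0, x) if is_denormal32(x) else x


tiny = FLT_MIN_NORMAL / 4      # 2**-128, a denormal in single precision
print(is_denormal32(tiny))     # True
print(daz(tiny), daz(-tiny))   # 0.0 -0.0
print(daz(1.5))                # 1.5
```

In hardware this is a control-register flag rather than a per-value check, so the model only shows the effect on operands, not the cost.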

It's possible that variant clamping options for the VUs could help to some extent, but I wouldn't bank on it. The problem the core emulator has at the moment is that the COP2 (the VU0 in 'macro' mode) uses the FPU's clamp and roundmode settings. This is done because the COP2's instructions are currently recompiled into the same code block as the EE's other instructions -- including FPU, EE MMI, EE core, etc. And to switch roundmodes every time we flipped/flopped from a VU0 to an FPU instruction would be a serious performance hindrance. But yeah, that's why the VU0 tends to be more prone to errors: when games use it in macro/COP2 mode, its roundmode and clamping are wrong. :(
Jake Stine (Air) - Programmer - PCSX2 Dev Team
#14
Thank you a lot. Now I see how ridiculous and tiresome it is when people complain that games are slow or have errors, and keep yelling "can you fix this, can you fix that". Just one more question and I won't bother you anymore. PC processors have 32 registers (x86). Do dual cores still have only 32 registers, or can they use 2 x 32 registers (= 64)? I believe you have already considered this, but I would like to know, because:
Air: x64's extra 8 registers would help startup times and maybe some FMV playback, but would be nearly useless for the bulk of VU1 work (which has pipeline problems in addition to ~64 registers that make register allocation impossible at any level, unless we had about 128 registers handy).
Now, if the duals and quads are not still limited to 32 registers, can quads theoretically use 4x32 in x86, or 4x64 in x64? From what I have read on this topic, it sounds like the worse scenario is more likely. Thank you again.
#15
If you have no idea what you're talking about (And you clearly don't, registers are not shared across CPUs), don't spend your time trying to be "helpful" coming up with bizarre ideas. If the PCSX2 coders haven't thought of it, it's not a workable idea (Again, only applies to those who don't know what they're talking about. If you couldn't take the PCSX2 source and implement your idea, given enough time, odds are you don't).
#16
(04-11-2009, 11:35 PM)Gundark Wrote: PC processors have 32 registers (x86). Do dual cores still have only 32 registers, or can they use 2 x 32 registers (= 64)? I believe you have already considered this, but I would like to know, because:

Wait, 32?

It's been a bit since I've programmed low enough to need to know this, but aren't there only 8 general-purpose registers in x86? EAX, EBX, ECX, EDX, String Source/Destination, Stack pointer and Stack base?
There may be 32 if you include the FPU and SIMD (MMX, SSE, etc) but as far as I know they don't really count. Anyone that codes SIMD instructions by hand will know better (and is certifiably insane) so I'll defer that to someone smarter/crazier than I.

In any case, a true dual core processor has two distinct cores, each can only access their own registers. They can pass information between themselves, but (again, as far as I know) they don't share registers.
"This thread should be closed immediately, it causes parallel imagination and multiprocess hallucination" --ardhi
#17
32 and 64-bit refer to the width of general purpose registers. Nothing to do with their number.

Every "core" has its own set of registers. They are pretty much independent CPUs with shared caches.
Quote: EAX, EBX, ECX, EDX, String Source/Destination, Stack pointer and Stack base

In 64-bit mode you also get RAX, RBX... 64-bit registers whose lower 32 bits are EAX, EBX... (like how the lower 16 bits of EAX are AX). You also get a few more brand new 64-bit registers in 64-bit mode (R8-R15).
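
The aliasing can be pictured with a minimal sketch that treats a register as a plain integer and the narrower names as masks over it. The helper names are made up for illustration.

```python
# Model of how the narrower x86 register names alias the wider ones.
# These helpers are illustrative only; they just mask an integer.


def eax_of(rax: int) -> int:
    """EAX is the low 32 bits of RAX."""
    return rax & 0xFFFFFFFF


def ax_of(rax: int) -> int:
    """AX is the low 16 bits of EAX (and therefore of RAX)."""
    return rax & 0xFFFF


rax = 0x1122334455667788
print(hex(eax_of(rax)))  # 0x55667788
print(hex(ax_of(rax)))   # 0x7788
```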
#18
(04-12-2009, 12:12 AM)cyberfish Wrote: 32 and 64-bit refer to the width of general purpose registers. Nothing to do with their number.

Every "core" has its own set of registers. They are pretty much independent CPUs with shared caches.
Quote: EAX, EBX, ECX, EDX, String Source/Destination, Stack pointer and Stack base

In 64-bit mode you also get RAX, RBX... 64-bit registers whose lower 32 bits are EAX, EBX... (like how the lower 16 bits of EAX are AX). You also get a few more brand new 64-bit registers in 64-bit mode (R8-R15).

Oh good, so I hadn't missed a rather major technical innovation then. It's good to know AMD continued x86's fine tradition of ridiculous naming.

Going to leave now, before this thread derails any more lol.
"This thread should be closed immediately, it causes parallel imagination and multiprocess hallucination" --ardhi
#19
OK. No more posts from me either. Thanks.
#20
So why does the GameCube emulator Dolphin get a strong boost on 64-bit?

I've tested Mario Galaxy on Win 7 x64.

In game, the 32-bit build reaches 17 fps while the 64-bit build reaches 21 fps.

I'm just curious.