[blog] Nightmare on Floating-Point Street
#11
(12-04-2010, 11:53 PM)Air Wrote: SSE2 introduced doubles support for almost everything -- but the other almost-as-simple answer is because it's roughly 5x-6x slower than using floats. It'd require a near-complete rewrite of the VU recompilers, and expect games that run 60-70fps currently to run 20. :)
Is it possible somehow to treat every 2 floats as a double? Or maybe just use a 'shadow' float for better round/clamp results? Like shifting the shadow a few bits/bytes to the right and performing the same action on it, which would allow more precise results before it reaches INF? Or something along those lines?

Also, maybe don't enter the clamp code on cases where it's clear it won't be needed (-> possible speed increase)

[i7-3630qm/gt650m-2G/Win-7] [i7-4500u/R.HD8850m/Win-8.1] [2010-MBA/OSX-10.9.x]. Scroll smoothly with SmoothWheel for Firefox.

#12
(12-05-2010, 12:32 AM)avih Wrote: Is it possible somehow to treat every 2 floats as a double? Or maybe just use a 'shadow' float for better round/clamp results? Like shifting the shadow a few bits/bytes to the right and performing the same action on it, which would allow more precise results before it reaches INF? Or something along those lines?

Floats can't really be bit-shifted. Their bits encode a sign, an exponent and a mantissa, so shifting them doesn't scale the value the way it does for an integer. The computations required for a 'shadow' float would be more complicated and slower than using doubles, and wouldn't be especially more accurate than the current clamping anyway.
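
To make the bit-shifting point concrete, here is a minimal C sketch. The helper name `shift_float_bits` is made up for this illustration and is not PCSX2 code. It shifts the raw IEEE 754 bit pattern of a float and reinterprets the result, which shows that exponent bits slide into the mantissa field instead of the value being halved.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: shift the raw 32-bit pattern of a float right by
   n bits and reinterpret the result as a float again. */
static float shift_float_bits(float value, unsigned n)
{
    uint32_t bits;
    float result;

    memcpy(&bits, &value, sizeof bits);     /* read the bit pattern safely */
    bits >>= n;                             /* exponent bits move into the mantissa */
    memcpy(&result, &bits, sizeof result);  /* reinterpret as a float */
    return result;
}
```

For 1.0f (bit pattern 0x3F800000), a shift by one gives 0x1FC00000, which decodes to 1.5 × 2^-64 rather than 0.5.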

Quote:Also, maybe don't enter the clamp code on cases where it's clear it won't be needed (-> possible speed increase)

Already done. That's why we have several clamping modes, since most games work pretty well with no clamping or minimal clamping. And condition checking for clamping would be much slower than just executing the clamping, in case that's what you meant. Conditionals, even on x86 CPUs, are not terribly fast and have to be avoided at almost any cost in performance-critical code. One conditional branch is roughly equivalent to 3-5 regular instructions, depending on the CPU pipeline state and branch prediction reliability. Worse, SSE doesn't have a compare instruction built for simple value tests (it's meant for doing comparison tests over a large array of data), so an SSE conditional on a single 4-float vector incurs even more overhead than the average branch comparison.
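
As a rough illustration of the difference, here is a minimal sketch using SSE intrinsics. The function names `clamp_vec` and `clamp_vec_if_needed` are invented for this example and the code is not taken from the PCSX2 recompilers. The first function clamps unconditionally with a min and a max; the second shows the extra compare, mask extraction and branch that a "only clamp when needed" check requires.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also provides the SSE ones) */
#include <float.h>

/* Branchless clamp: every lane is forced into [-FLT_MAX, +FLT_MAX],
   so +/-Inf becomes the largest finite float of the same sign. */
static __m128 clamp_vec(__m128 v)
{
    const __m128 hi = _mm_set1_ps(FLT_MAX);
    const __m128 lo = _mm_set1_ps(-FLT_MAX);
    return _mm_max_ps(_mm_min_ps(v, hi), lo);
}

/* Conditional variant: the compare only yields a per-lane mask, so a
   movemask is needed before the CPU can branch on the result. */
static __m128 clamp_vec_if_needed(__m128 v)
{
    const __m128 hi = _mm_set1_ps(FLT_MAX);
    const __m128 abs_mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
    __m128 too_big = _mm_cmpgt_ps(_mm_and_ps(v, abs_mask), hi);

    if (_mm_movemask_ps(too_big) != 0)  /* any lane out of range? */
        return clamp_vec(v);
    return v;
}
```

The conditional version does strictly more work on the path where clamping is needed and still pays for the compare, the mask move and the branch when it is not.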
Jake Stine (Air) - Programmer - PCSX2 Dev Team
#13
(12-04-2010, 11:53 PM)Air Wrote: SSE2 introduced doubles support for almost everything -- but the other almost-as-simple answer is because it's roughly 5x-6x slower than using floats. It'd require a near-complete rewrite of the VU recompilers, and expect games that run 60-70fps currently to run 20. :)

Are you sure?
#14
(12-05-2010, 07:03 AM)darkdancer Wrote: Are you sure?

Well, considering he helped write those recompilers... I'd say so. It would definitely require a rewrite of the recs, since the PS2's FPU and vector units operate only on single-precision floats, so the recs were designed to use floats.
#15
(12-05-2010, 07:03 AM)darkdancer Wrote: Are you sure?

Yes, using doubles will be a lot slower and more complex than the current implementation, and it still won't solve all the incompatibility problems and could possibly introduce new ones. One thing that comes to mind: if the float is already a NaN/Inf and you convert it to double, then you probably get a NaN/Inf double result. And if you convert a double to float, and the double is holding a value a float can't represent, it might just generate a NaN/Inf in the float (I'm not sure about these things since I've never tried them). But my point is that you might even have to write your own custom conversion operations, which would be a lot slower than the built-in SSE instructions.
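
A small C sketch of the conversion concern, assuming IEEE 754 hardware with default rounding. The function names are invented for this example. The first function widens, computes and narrows with the built-in conversions; the second is one possible hand-written narrowing step that saturates instead, which is the kind of custom conversion the paragraph above is worried about.

```c
#include <float.h>

/* Widen to double, do the arithmetic there, then narrow back to float. */
static float square_via_double(float x)
{
    double wide = (double)x;       /* float -> double is always exact */
    double product = wide * wide;  /* may exceed the float range */
    return (float)product;         /* out-of-range values narrow to Inf
                                      on IEEE 754 hardware */
}

/* Hypothetical custom narrowing: saturate to +/-FLT_MAX instead of
   producing Inf. NaN fails both comparisons and passes through. */
static float narrow_saturating(double d)
{
    if (d > FLT_MAX)
        return FLT_MAX;
    if (d < -FLT_MAX)
        return -FLT_MAX;
    return (float)d;
}
```

Squaring 1e30f this way produces a finite double (about 1e60) that becomes Inf on the way back to float, while `narrow_saturating` would return FLT_MAX at the cost of two extra comparisons per value.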

Every time a float is loaded it would have to be converted to its double equivalent.
But not only that: since the VUs are vector-based processors, you don't usually load 1 float, but 4 floats at the same time. So that's some noticeable overhead right there.
When you go and save the registers, you would either have to convert back to float (especially when writing to VU memory), or you would have to manage separate double and float reg caches (which would probably be messy, complex, and not worth it).

The next problem is that you can fit at most 2 doubles in an XMM reg, whereas with 32-bit floats you can fit 4 floats in an XMM reg (just like the VUs fit 4 floats in their VU regs). This kills any good chance of regalloc, and will require massive changes to the VU recompilers.
Using doubles will complicate the emulated instruction calculations a lot and require more SSE instructions to accomplish the same task as it currently does with 32-bit floats.
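
To give a feel for the extra instructions, here is a minimal sketch of a 4-float vector add carried out in double precision with SSE2 intrinsics. The function name `add4_via_double` is made up for this example, and this is only an illustration of the overhead, not how PCSX2 implements anything.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Add two 4-float vectors using double-precision arithmetic internally. */
static __m128 add4_via_double(__m128 a, __m128 b)
{
    /* Widen: each conversion only handles the two low floats, so the
       upper pair has to be moved down first. */
    __m128d a_lo = _mm_cvtps_pd(a);
    __m128d a_hi = _mm_cvtps_pd(_mm_movehl_ps(a, a));
    __m128d b_lo = _mm_cvtps_pd(b);
    __m128d b_hi = _mm_cvtps_pd(_mm_movehl_ps(b, b));

    /* Two double adds replace the single float add. */
    __m128d sum_lo = _mm_add_pd(a_lo, b_lo);
    __m128d sum_hi = _mm_add_pd(a_hi, b_hi);

    /* Narrow both halves and merge them back into one 4-float register. */
    return _mm_movelh_ps(_mm_cvtpd_ps(sum_lo), _mm_cvtpd_ps(sum_hi));
}
```

The float version of this is a single `_mm_add_ps(a, b)`. Here one emulated instruction needs four widening conversions, two shuffles, two adds, two narrowing conversions and a merge, and it occupies twice as many XMM registers along the way.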

Lastly, processors are just horribly slow at performing double arithmetic. I don't think Intel is as bad, but AMD is very bad at it.
We use doubles in the min/max instructions of the VUs, and back on my AMD CPU I would get a noticeable ~1.5% speed hit in games from just those 2 instructions; now imagine if we were to convert 100 more instructions to doubles. It would just make PCSX2 so slow it wouldn't be worth it.

Edit:
The last point is that in pcsx2 nneeve already made the EE's FPU use doubles for its calculations (this is what FPU Clamp mode FULL does). There are a few game problems which this fixes, and I believe it breaks at least one game (can't remember).
But my point is how often do you have to use FPU's FULL clamp mode?
Very rarely; so if doing the same thing with the VUs will require a crap-load of work and be very slow, it really doesn't seem worth it.
Check out my blog: Trashcan of Code
#16
(12-05-2010, 06:29 AM)Air Wrote: Floats can't really bitshift...
...
Already done. ...
Thanks. Does the interpreter fully emulate the PS2 in this regard? Or is it also susceptible to the same round/clamp inaccuracies?

(12-05-2010, 08:43 AM)cottonvibes Wrote: ...
Edit:
The last point is that in pcsx2 nneeve already made the EE's FPU use doubles for its calculations (this is what FPU Clamp mode FULL does). There are a few game problems which this fixes, and I believe it breaks at least one game (can't remember).
But my point is how often do you have to use FPU's FULL clamp mode?
Very rarely; so if doing the same thing with the VUs will require a crap-load of work and be very slow, it really doesn't seem worth it.
I just quickly tested how slow it really is (None compared to Full/Extra+sign). It's slower, but nowhere close to an order of magnitude.

Using the (not absolutely accurate) pcsx2bench on a single 15s replay section from Gran Turismo 4 NTSC (very close to 100% EE all of the time, GS=~50%), with a few speed hacks enabled, I got the following average emulation speeds on my system (consistent to about 1% for each config over 4 runs each):

EE clamp = None, VU clamp = None: 122%
EE clamp = Full, VU clamp = None: 117% (-~4%)
EE clamp = None, VU clamp = Extra + preserve sign: 108% (-~11%)
EE clamp = Full, VU clamp = Extra + preserve sign: 104.5% (-~14%)
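
The percentages in parentheses are relative to the 122% baseline. A tiny C helper (the name `percent_change` is just for this illustration) shows the arithmetic:

```c
/* Relative change in percent when going from a baseline speed to a
   measured speed; negative values mean a slowdown. */
static double percent_change(double baseline, double measured)
{
    return (measured - baseline) / baseline * 100.0;
}
```

For example, `percent_change(122.0, 117.0)` is about -4.1, `percent_change(122.0, 108.0)` is about -11.5, and `percent_change(122.0, 104.5)` is about -14.3.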

So we're talking about 15% on a system such as mine (as far as this short test is representative). It's not impossible that on an even faster system, with some speed hacks, 15% slower will still be 100% speed at all times. On such a system I'd probably leave it at Full/Extra+sign just for the sake of not having to try different settings each time, unless I need the extra speed.

Computers are getting faster, and even today the more accurate clamp modes can still be plenty fast on modern systems within the current PCSX2.

I guess what I'm saying is that if a path to an even more accurate emulation is known, it might be worth exploring even if it is known to produce a non-negligible speed hit.
#17
Just take another game with different needs.
I simply tried a random game and got a 15% speed hit with Full EE clamp against 'None', 5% with Extra+preserve, and 2% with Normal. This was measured with Fraps, and as you can see, one game doesn't tell the whole story: here the hit between 'Extra+preserve' and 'Full' is 10%, while between 'None' and 'Extra+preserve' it was half of that, and you got different results with another game. :P
Core i5 3570k -- Geforce GTX 670  --  Windows 7 x64
#18
Obviously, my tests weren't general, and I wasn't claiming my results are representative, but I'm guessing they give a rough idea of what the effect is, partially due to the properties of the specific section I chose. Naturally, there will always be some games that give dramatically different results.

Also, I never suggested removing 'None' as a possible clamp mode. ;)
#19
(12-05-2010, 11:47 AM)avih Wrote: I guess what I'm saying is, that if a path to an even more accurate emulation is known, it might be worth exploring even if it is known to produce a non-negligible speed hit.

My point is that in practice it won't be significantly better than what we have now in terms of game compatibility; that is, games will still have problems, both from the issues I already mentioned and from the extra precision of double arithmetic.

If this was guaranteed to solve all the problems I would have implemented it, but I considered the option and figured it would at most fix a handful of games' graphical glitches and most likely break a few random games as well.

The only way to guarantee good results is to have a software FPU which mimics the floating-point operations the PS2's VU performs at a bitwise level, and then use this software FPU code in every floating-point operation in PCSX2.
(It's possible that, for example, bad floating-point data is transferred to the VUs due to floating-point errors somewhere else in PCSX2.)
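
To hint at what "bitwise level" means, here is a heavily simplified C sketch of just the decoding step of such a software FPU. It rests on an assumption not spelled out in this thread: that the PS2 has no Inf/NaN encodings (the top exponent value is treated as an ordinary exponent) and no denormals (a zero exponent means zero). The function name `ps2_decode` is hypothetical, and a real software FPU would also need bit-exact arithmetic and rounding, which this does not attempt.

```c
#include <stdint.h>
#include <string.h>

/* Decode a 32-bit pattern under the ASSUMED PS2 rules described above,
   returning the value as a double (which has range to spare). */
static double ps2_decode(uint32_t bits)
{
    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFF;
    uint32_t mantissa = bits & 0x7FFFFF;
    uint64_t scale_bits;
    double scale, magnitude;

    if (exponent == 0)               /* assumed: denormals flush to zero */
        return sign ? -0.0 : 0.0;

    /* Build 2^(exponent - 150) as a double bit pattern; 873 = 1023 - 150,
       where 150 = 127 (bias) + 23 (mantissa width). */
    scale_bits = (uint64_t)(exponent + 873) << 52;
    memcpy(&scale, &scale_bits, sizeof scale);

    /* 24-bit significand with the implicit leading 1, scaled exactly. */
    magnitude = (double)(mantissa | 0x800000) * scale;
    return sign ? -magnitude : magnitude;
}
```

Under these assumed rules the pattern 0x7F800000, which IEEE 754 reads as +Inf, decodes to the ordinary number 2^128. That mismatch with the host's floats is what the clamping modes try to paper over.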

(12-05-2010, 02:08 PM)avih Wrote: Obviously, my tests weren't general, and I wasn't claiming my results are representative, but I'm guessing they give a rough idea of what the effect is, partially due to the properties of the specific section I chose. Naturally, there will always be some games that give dramatically different results.

Also, I never suggested removing 'None' as a possible clamp mode. ;)

In terms of speed, the FPU's version is nowhere near comparable to the VUs'.
The FPU doesn't have vector operations, so it doesn't have the 4-float problem I already talked about.

The second issue is that the VUs are run significantly more than the FPU.
You can run the FPU in interpreter mode and not notice much of a speed hit in games.
If you run the VUs in interpreter mode you will probably get 1-20fps in games that originally ran at 60fps+.

The overhead and speed penalty involved in accurately using double operations in the VUs is similar to running the VUs in interpreter mode.
#20

Maybe I have no business saying this, but don't video cards handle complicated math like floating point and stuff like that? Instead of using the processor's instruction set, why not use the video card's ability to do this math?

The reason I say this is that there are programs which use people's video cards to do complex calculations that the processor wasn't made to do.

When you interpret graphics, in the end it's all a bunch of math.

But I'm just a baka and you guys have probably already thought of this.