dogen · 11-19-2016, 06:56 PM

(11-19-2016, 06:49 PM)refraction Wrote: If we can do that for all of it, it would be brilliant to cut down to just 1 plugin. Won't all them if statements kill the speed though?

Won't you just check once at the beginning?

**gregory** · (This post was last modified: 11-19-2016, 06:59 PM by gregory.)

(11-19-2016, 06:49 PM)refraction Wrote: If we can do that for all of it, it would be brilliant to cut down to just 1 plugin. Won't all them if statements kill the speed though?

I only changed the JIT recompiler of the SW renderer (so the generated code is better). All standard C paths still use ifdef. The current cost is 0; however, the change doesn't cover the full code (such as the texture conversion horror).

The real question is: what is the speed difference between all those plugins?

A quick grep shows various uses of SSE41 but few uses of SSSE3/AVX. If the latter aren't in the critical loop, the perf change will be small.

Note: AVX2 is still separated as it is really different from SSE/AVX.

**gregory** · 11-19-2016, 07:18 PM

(11-19-2016, 06:56 PM)dogen Wrote: Won't you just check once at the beginning?

No. Well, it depends; it is complicated. Some things can be checked at the beginning. Others are more dynamic.
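For the part that can be checked at the beginning, a minimal sketch might look like the following. It is only an illustration, not GSdx code: `CpuFeatures`, `select_kernel` and the `kernel_*` functions are hypothetical names, and the feature flags are assumed to be filled once at startup (for example from CPUID).

```cpp
// Hypothetical feature flags, assumed to be detected once at startup.
struct CpuFeatures {
    bool sse41 = false;
    bool avx2 = false;
};

// Stand-ins for per-instruction-set implementations of the same routine.
using KernelFn = const char* (*)();

const char* kernel_sse2()  { return "sse2"; }
const char* kernel_sse41() { return "sse41"; }
const char* kernel_avx2()  { return "avx2"; }

// Pick the best implementation once; callers keep the returned pointer,
// so no feature test is repeated inside the hot loop.
KernelFn select_kernel(const CpuFeatures& f) {
    if (f.avx2)  return kernel_avx2;
    if (f.sse41) return kernel_sse41;
    return kernel_sse2;  // baseline assumed to be always available
}
```

The caller would store the result of `select_kernel` and call through it afterwards. The more dynamic cases mentioned above would not fit this one-shot pattern.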

Steam stats:
SSE2: 100%
SSSE3: 92.30%
SSE4.1: 87.20%
AVX1: 72.56%

An SSSE3 build is only useful for about 5% of users (92.30 - 87.20 = 5.10). Is the speed difference between SSSE3 and SSE2 really worth the extra build?
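The same subtraction can be applied to every tier. The sketch below assumes the tiers are nested (every CPU with a newer instruction set also supports the older ones) and uses only the Steam numbers quoted above; `IsaShare`, `kSteamTiers` and `exclusive_share` are hypothetical names.

```cpp
#include <cassert>

struct IsaShare {
    const char* name;
    double percent;  // share of users whose CPU supports this tier
};

// Tiers ordered from oldest to newest, with the figures quoted above.
const IsaShare kSteamTiers[] = {
    {"SSE2", 100.0}, {"SSSE3", 92.30}, {"SSE4.1", 87.20}, {"AVX1", 72.56},
};
const int kSteamTierCount = 4;

// Share of users for whom tiers[i] is the best available build:
// support for tier i minus support for the next tier (0 after the last).
double exclusive_share(const IsaShare* tiers, int count, int i) {
    assert(i >= 0 && i < count);
    double next = (i + 1 < count) ? tiers[i + 1].percent : 0.0;
    return tiers[i].percent - next;
}
```

With these figures, `exclusive_share(kSteamTiers, kSteamTierCount, 1)` is the 5.10 points served by an SSSE3 build, and index 2 gives 14.64 points for SSE4.1. The last tier includes AVX2 users, since the stats above do not list AVX2 separately.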

Hum, actually AVX is completely useless now. Speed will be the same as SSE4.1. So at least we can remove a build.

Note: PR was merged in master.

Ōkami Amaterasu · 11-20-2016, 08:31 PM

Is it considered cheating if I read the github thread beforehand?

**Blyss Sarania** · (This post was last modified: 11-20-2016, 10:40 PM by Blyss Sarania.)

(11-19-2016, 05:55 PM)gregory Wrote: It would be nice to do some benchmark of SSE2 vs SSSE3 vs SSE4 vs AVX GSdx build on both HW/SW renderer.
SSSE3 seems rather worthless, and likely AVX too.

There's this comparison Nobbs and I did a while back, but it covered just one game and it was a while ago: http://forums.pcsx2.net/Thread-Comparing...1-AVX-AVX2

Either way, this was the conclusion we reached (actual numbers are in the thread):

Quote:Conclusion:

Well, this has provided some interesting data. We have learned that for hardware mode, SSE4.1 is the fastest by far, but on the Intel side, AVX2 is basically the same. For software, the more advanced instruction sets provide a boost, but that boost is negated when using extra rendering threads except in the case of AVX2. AVX2 provides big benefits to PCSX2 in both hardware and software.

So what should you use? It depends on what your chip supports. Generally for software mode you should use the highest instruction set your chip supports. That's not really startling information. But for hardware mode, SSE4.1 is the fastest.

So it's like this:

Intel chip that supports AVX2: Use AVX2 in all cases. It's fastest in software mode, and the same as SSE4.1 in hardware.

AMD chip that supports AVX: Use AVX for software, but use SSE 4.1 for hardware.

**gregory** · 11-20-2016, 11:54 PM

Who said I can't push more optimization? :P

Unfortunately it is still slow.

The 64-bit code is now more compact (example on a SotC dump). I don't understand why it is slow. I need to check the speed of the HW renderer too.

32 bits:

Quote:GSDrawScanline generated 107744 bytes of instruction
GSSetupPrim generated 10815 bytes of instruction

64 bits:

Quote:GSDrawScanline generated 98968 bytes of instruction
GSSetupPrim generated 10503 bytes of instruction

**gregory** · 11-21-2016, 12:20 AM

@blyss,

Quote: AMD chip that supports AVX: Use AVX for software, but use SSE 4.1 for hardware.

The equation is different now, as the SW renderer will always use AVX. However, I'm afraid of a speed penalty due to the SSE/AVX mix.

I don't understand those numbers. GSdx doesn't use special code for the AVX build. Intrinsics are upgraded to AVX, but that ought to be faster as the opcodes are smaller. Maybe the compiler did *****.

***refraction*** · 11-21-2016, 02:40 AM

(11-21-2016, 12:20 AM)gregory Wrote: @blyss,
The equation is different now, as the SW renderer will always use AVX. However, I'm afraid of a speed penalty due to the SSE/AVX mix.

I don't understand those numbers. GSdx doesn't use special code for the AVX build. Intrinsics are upgraded to AVX, but that ought to be faster as the opcodes are smaller. Maybe the compiler did *****.

Quite probably. The plugin is built with AVX optimisations enabled in the compiler when you compile the AVX builds, so that has probably improved the performance of other functions which aren't intrinsics.

**gregory** · 11-21-2016, 01:06 PM

Yes, but I don't understand how the compiler manages so badly. AVX is composed of 2 parts:
* 256-bit registers, which we can call pure AVX
* a re-encoding of SSE operations that uses only the lower 128 bits; the upper 128 bits are zeroed. It is interesting because it makes instructions shorter and allows a non-destructive form of SSE operations (i.e. a = b + c instead of a += b)
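To illustrate the non-destructive form, here is a small sketch with 128-bit intrinsics. It is not GSdx code; `add_then_scale` and the `HAVE_SSE2` macro are hypothetical, and the assembly in the comments is only what a compiler would typically emit, since register allocation is up to the compiler. A scalar fallback keeps it buildable on non-x86 targets.

```cpp
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// Computes a = b + c, then returns b * a, so b must stay alive after the add.
float add_then_scale(float b, float c) {
#if HAVE_SSE2
    __m128 vb = _mm_set1_ps(b);
    __m128 vc = _mm_set1_ps(c);
    // Legacy SSE encoding (two operands, destination is overwritten):
    //     movaps xmm2, xmm0      ; copy b to preserve it
    //     addps  xmm2, xmm1      ; xmm2 += c
    // VEX encoding used by AVX builds (three operands, no copy needed):
    //     vaddps xmm2, xmm0, xmm1
    __m128 va = _mm_add_ps(vb, vc);
    __m128 vr = _mm_mul_ps(vb, va);
    return _mm_cvtss_f32(vr);  // lane 0 holds b * (b + c)
#else
    float a = b + c;
    return b * a;
#endif
}
```

The source is identical for both builds. Only the encoding chosen by the compiler changes, which matches the point that the intrinsics are simply upgraded to AVX.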

Note1: GSdx AVX1 SW renderer is actually the latter (AKA pseudo AVX on 128 bits reg).
Note2: GSdx uses intrinsics for texture conversion. If you look at the code, you have special code for SSE41 and AVX2 (hence the real speed boost of those 2 builds).

It would be worth running a new test of the SW renderer on:
1/ AVX1 build
2/ SSE4.1 new build (will use AVX internally)
3/ SSE4.1 old build (will use SSE41 internally)

If it doesn't suffer from the SSE/AVX transition penalty, we can drop AVX1 and SSSE3.

On the 64-bit topic, I made a quick test this morning (SotC dump). The JIT implements the equivalents of both a vertex shader (VS) and a pixel shader (PS). I replaced all PS code with a nop.
full emulation: 20-27 fps (off the top of my head)
x86: 93 fps
x64: 103 fps

It is interesting because it means the C + VS code is faster on x64, but the PS kills all the perf.
Note: I need to check, but I think only the PS can run in separate threads.
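A rough way to read those numbers is to convert fps to frame time and see how much of the frame disappears when the PS is stubbed out. This sketch assumes frame time is a simple sum of stages, which ignores threading and memory effects; `frame_ms` and `stubbed_stage_share` are hypothetical helper names.

```cpp
#include <cassert>

// Milliseconds spent per frame at a given frame rate.
double frame_ms(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Fraction of the full frame time that vanished when one stage was
// replaced by a nop: 1 - (time without the stage) / (time with it).
double stubbed_stage_share(double full_fps, double stubbed_fps) {
    return 1.0 - frame_ms(stubbed_fps) / frame_ms(full_fps);
}
```

Plugging in the figures above, for example `stubbed_stage_share(20.0, 93.0)` or `stubbed_stage_share(27.0, 103.0)`, puts the PS at roughly 70 to 80% of the frame time, which is consistent with the PS dominating. The full-emulation range was quoted from memory, so this is only an order of magnitude.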

The remaining question is: why is PS perf so bad? The code is quite close to the x86 one and it is more compact. Some potential hints:
* The relative addressing of x64 is costlier than x86 addressing
* Texture allocations are spread all around the virtual address space, which creates more TLB misses
* It could be faster to spill barely used registers to the stack with mov, so they can be used to reduce register dependencies in heavy code (aka linear filtering).

Note: by the way, it would be possible to add a trade-off between rendering quality and speed in the SW renderer. The hot spots in the SW renderer are likely mipmapping and bilinear filtering.

**FlatOut** · 11-21-2016, 08:51 PM

I just tried a quick benchmark of the ICO intro AVX x86 vs x64 with the same settings and I got very similar results, but the x64 seems slightly faster. During the most taxing moment the x86 version ran at 6.6 FPS, and the x64 at 6.8 FPS. Other scenes also had the x64 marginally ahead or matching x86 performance.

Poll: AVX1 64 bits vs AVX1 32 bits
slower (-10%): 2 votes (5.56%)
same (+/- 5%): 10 votes (27.78%)
faster (+10%): 13 votes (36.11%)
much faster (+20%): 5 votes (13.89%)
on fire (+50%): 6 votes (16.67%)
Total: 36 votes (100%)