Is it considered cheating if I read the GitHub thread beforehand?
Yes, but I don't understand how the compiler manages so badly. AVX is composed of 2 parts:
* 256-bit registers, which we can call pure AVX
* a re-encoding of the SSE operations that uses only the lower 128 bits; the upper 128 bits are zeroed. It is interesting because it makes instructions shorter and allows a non-destructive form of SSE (i.e. a = b + c instead of a += b)
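To illustrate the second point, here is a minimal sketch (not GSdx code) of a 128-bit add written with intrinsics. The function names `add4` and `add4_arrays` are hypothetical. The instruction forms mentioned in the comments are what compilers typically emit; the exact output depends on the compiler and flags.

```cpp
#include <cassert>
#include <immintrin.h>  // SSE/AVX intrinsics

// Adds two groups of four floats held in 128-bit registers.
// Built with SSE flags only, compilers typically emit the destructive
// two-operand form:       addps  xmm0, xmm1         (xmm0 += xmm1)
// Built with -mavx or /arch:AVX, the same source typically becomes the
// VEX-encoded three-operand form: vaddps xmm0, xmm1, xmm2 (xmm0 = xmm1 + xmm2)
inline __m128 add4(__m128 b, __m128 c) {
    return _mm_add_ps(b, c);
}

// Hypothetical helper: runs add4 on plain arrays so the result can be checked.
inline void add4_arrays(const float* b, const float* c, float* out) {
    assert(b && c && out);
    _mm_storeu_ps(out, add4(_mm_loadu_ps(b), _mm_loadu_ps(c)));
}
```

The source is identical in both cases; only the build flags decide whether the legacy SSE encoding or the 128-bit AVX encoding is used, which is why a "pseudo AVX" build needs no separate code path.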
Note 1: the GSdx AVX1 SW renderer is actually the latter (AKA pseudo-AVX on 128-bit registers).
Note 2: GSdx uses intrinsics for texture conversion. If you look at the code, there are special paths for SSE4.1 and AVX2 (hence the real speed boost of those 2 builds).
It would be worth running a new test of the SW renderer on:
1/ AVX1 build
2/ SSE4.1 new build (will use AVX internally)
3/ SSE4.1 old build (will use SSE4.1 internally)
If it doesn't suffer from the SSE/AVX transition penalty, we can drop AVX1 and SSSE3.
On the 64-bit topic, I made a quick test this morning (SotC dump). The JIT implements the equivalents of both a vertex shader (VS) and a pixel shader (PS). I replaced all PS with a nop.
full emulation: 20-27 fps (off the top of my head)
x86: 93 fps
x64: 103 fps
This is interesting because it means the C + VS code is faster on x64, but the PS kills all the perf.
Note: I need to check, but I think only the PS can run in separate threads.
The remaining question is: why is PS perf so bad? The code is quite close to the x86 version and it is more compact. Some potential hints:
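To put those figures in perspective, it helps to convert them to time per frame. The sketch below uses two hypothetical helpers, `frame_ms` and `speedup`, which are plain arithmetic and not part of GSdx.

```cpp
#include <cassert>

// Converts a frame rate into milliseconds spent per frame.
inline double frame_ms(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Relative speedup of fast_fps over slow_fps, as a fraction (0.10 means 10%).
inline double speedup(double slow_fps, double fast_fps) {
    assert(slow_fps > 0.0);
    return fast_fps / slow_fps - 1.0;
}
```

With the numbers above, 93 fps is about 10.8 ms per frame and 103 fps about 9.7 ms, so x64 is roughly 11% faster once the PS is removed. The 20-27 fps of full emulation corresponds to about 37-50 ms per frame, which suggests the PS accounts for most of the frame time, assuming the full-emulation figure was measured under comparable conditions.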
* The relative addressing of x64 is costlier than on x86
* Texture allocations are scattered all over the virtual address space, which creates more TLB misses
* It could be faster to spill barely used registers to the stack with extra movs, so that those registers can be used to reduce register dependencies in heavy code (e.g. linear filtering).
Note: by the way, it would be possible to offer a trade-off between rendering quality and speed in the SW renderer. The hot spots in the SW renderer are likely mipmapping and bilinear filtering.
I just tried a quick benchmark of the ICO intro, AVX x86 vs x64 with the same settings, and I got very similar results, but x64 seems slightly faster. During the most taxing moment the x86 version ran at 6.6 FPS and the x64 version at 6.8 FPS. In other scenes x64 was also marginally ahead of or matching x86 performance.