[blog] Introduction to Dynamic Recompilation
#31
Okay, I have a question: what's actually going on with the AVX instructions? Couldn't they be used to speed up the VUs?
I know that GSdx uses them, but I don't notice any difference compared to SSE4.

Main Rig: i7-3770k @4.5ghz | 16GB DDR3 | Nvidia GTX 980 TI | Win 10 X64
Laptop: MSI GT62VR | i7-6700HQ | 16GB DDR4 | Nvidia GTX 1060 | Win 10 X64

#32
At the moment the AVX instruction set is very limited, but yes, being able to do a multiply and an add within the same instruction could provide some nice little speedups on the VUs and the EE core (as there are MADD instructions on that too).
#33
SSE+FMA is faster on BD than AVX+FMA according to limited developer tests.
#34
So is FMA on BD fast at all? :P

The problem we're facing is that someone will have to code and test FMA instructions in PCSX2, and judging by the recent BD release it doesn't look like anyone on the team will buy a BD CPU.
The first chance then to actually try out FMA would be with Ivy Bridge in (hopefully) Q1 2012.
#35
Isn't FMA coming with Haswell? Or did they change it? That would be the beginning of 2013 :( Or maybe the next AMD chip in Q3 2012 :P
#36
We'll see if we can get gains from it when someone can test it, okay? (Actually it looks like a bit of a pain on the emitter side. But there's not much point doing anything about it right now.)
#37
Quote: If you look at the discussion here:
http://www.planet3dnow.de/vbulletin/show...ost4501020

"
2011-09-28 01:33:41 < Dark_Shikari> AVX mbtree propagate is slower than sse2
2011-09-28 01:33:49 < Dark_Shikari> FMA only barely manages to get it fast again.
2011-09-28 01:33:49 < kemuri-_9> lol
2011-09-28 01:33:52 < Sean_McG> hahah
2011-09-28 01:33:59 < Dark_Shikari> SSE2: 342 cycles
2011-09-28 01:34:00 < Dark_Shikari> AVX: 374
2011-09-28 01:34:05 < Dark_Shikari> FMA4: 340
[...]
2011-09-28 01:35:18 < Dark_Shikari> Hmm. I wonder if FMA4 supports sse registers?
2011-09-28 01:35:37 < Dark_Shikari> Oh. It *does*...
2011-09-28 01:35:38 < Dark_Shikari> Let me try that.
2011-09-28 01:37:45 * codestr0m ears perk up
2011-09-28 01:49:29 < Dark_Shikari> FMA4: 314 cycles. Much better
"

These guys noticed a slowdown when going from legacy SSE (128-bit) to 256-bit AVX (342 to 374 cycles), then got back to the baseline score with 256-bit AVX + FMA4 (340 cycles), and eventually got a sizable speedup with 128-bit AVX + FMA4 (314 cycles, roughly 8% fewer than SSE2).

Based on these observations I'd say that Bulldozer supports AVX-256 just for compatibility's sake, and it is probably better (TBC) not to enable AVX-256 for Bulldozer targets. This gives a refreshing new perspective on the issue of the Intel compiler enabling SSEx optimizations only on Intel CPUs, since in this case it may well be a *legit optimization to disable AVX-256 for Bulldozer*, i.e. to rely not only on the feature flags but also to look at the manufacturer string ("GenuineIntel", "AuthenticAMD").

Not as fast as had been hoped.
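
The idea in the quoted post of looking at the manufacturer string as well as the feature flags could be sketched roughly like this in C. This is only an illustration under stated assumptions: `cpu_vendor` and `prefer_256bit_avx` are hypothetical names, the vendor lookup relies on `__get_cpuid` from the GCC/Clang `<cpuid.h>` header (x86 only), and the policy simply encodes the guess from the benchmark above. A real check would probably look at the CPU family rather than treating every AMD chip the same way.

```c
#include <stdbool.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_CPUID 1
#endif

/* Hypothetical helper: writes the 12-character CPUID vendor string into out,
 * or an empty string when CPUID is not available on this build target. */
void cpu_vendor(char out[13]) {
    out[0] = '\0';
#ifdef HAVE_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
        /* Leaf 0 returns the vendor string in EBX, EDX, ECX order. */
        memcpy(out + 0, &ebx, 4);
        memcpy(out + 4, &edx, 4);
        memcpy(out + 8, &ecx, 4);
        out[12] = '\0';
    }
#endif
}

/* Hypothetical policy: use 256-bit AVX only if the feature flag is set AND
 * the vendor is not AMD (assumption taken from the Bulldozer numbers quoted above). */
bool prefer_256bit_avx(const char *vendor, bool has_avx) {
    if (!has_avx) {
        return false;
    }
    return strcmp(vendor, "AuthenticAMD") != 0;
}
```

A caller would fill a 13-byte buffer with `cpu_vendor` and pass it to `prefer_256bit_avx` together with the AVX feature bit obtained separately; when the policy says no, it would fall back to the 128-bit code path.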
#38
FMA instructions aren't really a miracle speedup; they'll just remove one instruction from certain VU opcodes.
#39
But those certain VU opcodes might get a massive boost in certain games :P
#40
(10-14-2011, 12:38 PM)Squall Leonhart Wrote: SSE+FMA is faster on BD than AVX+FMA according to limited developer tests.

Bulldozer does not have true 256-bit processing units like Sandy Bridge; AVX support is just there for compatibility with the new instructions, so the throughput should be similar to SSE. FMA could be useful in a few cases, but I'm still waiting for integer AVX. Also, XOP looks interesting; I'm going to get a Bulldozer as soon as I can to try that.