[blog] Introduction to Dynamic Recompilation

Nexxxus Offline

Nerd


#31

10-14-2011, 09:58 AM

Okay, I have a question: what about the AVX instructions? Couldn't they be used to speed up the VUs?
I know that GSdx uses them, but I don't notice any difference compared to SSE4.


refraction Offline

PCSX2 Coder


#32

10-14-2011, 10:25 AM

At the moment the AVX instruction set is very limited, but yes, the idea of being able to do multiplies and adds within the same instruction could provide some nice little speedups on the VUs and the EE core (as there are MADD instructions on that too).
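To make "multiplies and adds within the same instruction" concrete, here is a minimal sketch in portable C++ rather than in PCSX2's emitter. The names `Vec4`, `madd_separate` and `madd_fused` are hypothetical and only illustrate the idea; `std::fma` is the standard library's fused multiply-add, which a compiler may lower to a hardware FMA instruction when the target supports one.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical 4-lane vector standing in for a VU register (x, y, z, w).
using Vec4 = std::array<float, 4>;

// MADD done as two steps per lane: the product is rounded to float,
// then the sum is rounded again.
Vec4 madd_separate(const Vec4& acc, const Vec4& a, const Vec4& b) {
    Vec4 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        float product = a[i] * b[i];
        out[i] = acc[i] + product;
    }
    return out;
}

// MADD done as one fused step per lane: acc + a * b with a single rounding.
Vec4 madd_fused(const Vec4& acc, const Vec4& a, const Vec4& b) {
    Vec4 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = std::fma(a[i], b[i], acc[i]);
    }
    return out;
}
```

Because the fused form rounds once instead of twice, the two versions can differ in the last bit for some inputs, so an emulator would presumably have to check that against the behaviour it is trying to reproduce.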


Squall Leonhart Offline

Jarrett Killer.


#33

10-14-2011, 12:38 PM

SSE+FMA is faster on BD than AVX+FMA according to limited developer tests.


rama

PCSX2 coder


#34

10-14-2011, 01:08 PM (This post was last modified: 10-14-2011, 01:10 PM by rama.)

So is FMA on BD fast at all? :P

The problem we're facing is that someone will have to code and test FMA instructions in PCSX2, and judging by the recent BD release it doesn't look like anyone on the team will buy a BD CPU.
The first chance to actually try out FMA would then be with Ivy Bridge in (hopefully) Q1 2012.


gregory Offline

Linux PCSX2 coder


#35

10-14-2011, 08:07 PM

Isn't FMA coming with Haswell? Or did they change that? That would be the beginning of 2013 :(

Or maybe with the next AMD chip in Q3 2012 :P


pseudonym Offline

PCSX2 coder


#36

10-14-2011, 08:28 PM

We'll see if we can get gains from it when someone can test it, okay? (Actually it looks like a bit of a pain on the emitter side. But there's not much point doing anything about it right now.)


Squall Leonhart Offline

Jarrett Killer.


#37

10-15-2011, 12:25 AM

Quote: if you look at the discussion here:
http://www.planet3dnow.de/vbulletin/show...ost4501020

"
2011-09-28 01:33:41 < Dark_Shikari> AVX mbtree propagate is slower than sse2
2011-09-28 01:33:49 < Dark_Shikari> FMA only barely manages to get it fast again.
2011-09-28 01:33:49 < kemuri-_9> lol
2011-09-28 01:33:52 < Sean_McG> hahah
2011-09-28 01:33:59 < Dark_Shikari> SSE2: 342 cycles
2011-09-28 01:34:00 < Dark_Shikari> AVX: 374
2011-09-28 01:34:05 < Dark_Shikari> FMA4: 340
[...]
2011-09-28 01:35:18 < Dark_Shikari> Hmm. I wonder if FMA4 supports sse registers?
2011-09-28 01:35:37 < Dark_Shikari> Oh. It *does*...
2011-09-28 01:35:38 < Dark_Shikari> Let me try that.
2011-09-28 01:37:45 * codestr0m ears perk up
2011-09-28 01:49:29 < Dark_Shikari> FMA4: 314 cycles. Much better
"

These guys noticed a slowdown when going from legacy SSE (128-bit) to 256-bit AVX (342 to 374 cycles), got back to roughly the baseline with 256-bit AVX + FMA4 (340 cycles), and finally got a sizable speedup with 128-bit AVX + FMA4 (314 cycles).

Based on these observations, I'd say that Bulldozer supports AVX-256 just for compatibility's sake, and that it is probably better (TBC) not to enable AVX-256 for Bulldozer targets. This gives a refreshing new perspective on the issue of the Intel compiler enabling SSEx optimizations only on Intel CPUs, since in this case it may well be a *legit optimization to disable AVX-256 for Bulldozer*, i.e. to rely not only on the feature flags but also on the manufacturer string ("GenuineIntel", "AuthenticAMD").
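The idea of looking at the manufacturer string as well as the feature flags could be sketched roughly as follows. Everything here is an assumption for illustration: `CpuInfo`, `VectorPath` and `choose_vector_path` are hypothetical names, the struct would have to be filled in from CPUID by real code, and the policy only mirrors the single benchmark quoted above rather than any tested rule.

```cpp
#include <string>

// Hypothetical summary of what CPUID reports; real code would fill this in
// from the CPUID instruction instead of by hand.
struct CpuInfo {
    std::string vendor;  // e.g. "GenuineIntel" or "AuthenticAMD"
    bool has_avx;        // AVX feature flag
    bool has_fma4;       // FMA4 feature flag
};

enum class VectorPath { Sse128, Avx128Fma4, Avx256 };

// Assumed policy based only on the quoted numbers: on AMD, skip 256-bit AVX
// even though the flag is set, and use 128-bit AVX + FMA4 when available.
VectorPath choose_vector_path(const CpuInfo& cpu) {
    if (!cpu.has_avx) {
        return VectorPath::Sse128;
    }
    if (cpu.vendor == "AuthenticAMD") {
        return cpu.has_fma4 ? VectorPath::Avx128Fma4 : VectorPath::Sse128;
    }
    return VectorPath::Avx256;
}
```

A more careful version would probably key on the CPU family rather than the vendor alone, since the vendor string says nothing about how wide the execution units actually are.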

Not as fast as had been hoped.


cottonvibes Offline

Pencil Sharpener


#38

10-15-2011, 05:45 AM

FMA instructions aren't really a miracle speedup; they'll just remove one instruction from certain VU opcodes.


Squall Leonhart Offline

Jarrett Killer.


#39

10-15-2011, 05:54 AM

But those certain VU opcodes might get a massive boost in certain games :P


Gabest Offline

Plugin Author


#40

10-15-2011, 01:56 PM (This post was last modified: 10-15-2011, 01:58 PM by Gabest.)

(10-14-2011, 12:38 PM)Squall Leonhart Wrote: SSE+FMA is faster on BD than AVX+FMA according to limited developer tests.

Bulldozer does not have true 256-bit processing units like Sandy Bridge; AVX support is just there to be compatible with the new AVX instructions, and the throughput should be similar to SSE. FMA could be useful in a few cases, but I'm still waiting for integer AVX. Also, XOP looks interesting; I'm going to get a Bulldozer as soon as I can to try that.
