[blog] Benchmarking Multithreaded PCSX2
#1
As most people probably know, PCSX2 is primarily a dual-threaded application. Its two main threads are:
  • EE/Core thread emulates the PS2's EmotionEngine (including VIF, SIF, GIF, and VUs) and the IOP (including SPU2, CDVD, and PAD)
  • GS thread emulates the PS2's Graphics Synthesizer (including texture swizzling, texture filtering, upscaling, and frame rendering)
Each thread relies on the other thread in some way -- the GS thread cannot swizzle texture data until the EE thread has uploaded said data, for example. Meanwhile, the EE thread cannot upload texture data to the GS thread if the GS thread is currently bogged down rendering last week's frame to video. During these periods, either thread will sleep, only to be woken up once the other thread has caught up in its workload.
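The sleep/wake relationship described above behaves like a bounded producer/consumer queue. Below is a minimal C++ sketch, assuming a hypothetical `BoundedFifo` that stands in for the EE-to-GS transfer path (it is not PCSX2's actual mechanism): the producer (EE side) sleeps while the queue is full, and the consumer (GS side) sleeps while it is empty.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical bounded FIFO standing in for the EE->GS packet queue.
template <typename T>
class BoundedFifo {
public:
    explicit BoundedFifo(std::size_t capacity) : capacity_(capacity) {}

    // Producer (EE side): sleeps while the FIFO is full.
    void push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [&] { return items_.size() < capacity_; });
        items_.push_back(std::move(item));
        not_empty_.notify_one();  // wake a consumer that may be sleeping
    }

    // Consumer (GS side): sleeps while the FIFO is empty.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [&] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop_front();
        not_full_.notify_one();  // wake a producer that may be sleeping
        return item;
    }

private:
    std::size_t capacity_;
    std::deque<T> items_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};

// Runs an EE-like producer and a GS-like consumer; returns what the consumer received.
std::vector<int> run_pipeline(int packet_count, std::size_t capacity) {
    BoundedFifo<int> fifo(capacity);
    std::vector<int> received;
    std::thread gs([&] {
        for (int i = 0; i < packet_count; ++i) received.push_back(fifo.pop());
    });
    std::thread ee([&] {
        for (int i = 0; i < packet_count; ++i) fifo.push(i);
    });
    ee.join();
    gs.join();
    return received;
}
```

With a small capacity, whichever side is faster spends much of its time asleep inside `wait`, which is exactly the idle time an OS thread monitor reports.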

In theory the act of sleeping the EE/GS threads should make benchmarking the CPU load registered by each thread pretty easy: all modern operating systems have built-in APIs for reading the busy/idle time of any thread on the system -- this is the same API used by your tried and true task/process manager, for example:

[Image: attachment.php?aid=25629]
(Air shows off his personal favorite, Process Explorer, part of the Sysinternals Suite)
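On Windows this per-thread readout comes from calls such as GetThreadTimes (kernel and user CPU time per thread); the POSIX counterpart is `clock_gettime` with `CLOCK_THREAD_CPUTIME_ID`. The sketch below uses the POSIX variant so it runs on Linux, and shows the property the whole approach relies on: a sleeping thread accrues almost no CPU time, while a spinning one accrues roughly its wall time. The helper names are hypothetical.

```cpp
#include <chrono>
#include <thread>
#include <time.h>

// Reads the calling thread's consumed CPU time in nanoseconds (POSIX).
long long thread_cpu_ns() {
    timespec ts{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return static_cast<long long>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Busy-waits for `ms` of wall time, burning CPU the whole time.
void spin_for(std::chrono::milliseconds ms) {
    auto end = std::chrono::steady_clock::now() + ms;
    while (std::chrono::steady_clock::now() < end) {
    }
}

// Runs `work` on a fresh thread and returns the CPU time that thread consumed.
template <typename Fn>
long long cpu_ns_of(Fn work) {
    long long result = 0;
    std::thread t([&] {
        long long start = thread_cpu_ns();
        work();
        result = thread_cpu_ns() - start;
    });
    t.join();
    return result;
}
```

For example, `cpu_ns_of([] { spin_for(std::chrono::milliseconds(50)); })` reports tens of milliseconds, while the same call wrapping `std::this_thread::sleep_for` reports close to zero.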

This readout is simple, efficient, and seemingly reliable. It also avoids a lot of the annoying pitfalls one runs into trying to use common alternatives such as rdtsc and QueryPerformanceCounter.

... and this is precisely the method I decided to use for PCSX2 0.9.7.r3113 (and it is still in use as of r3878). Simple theory, really: if the GS thread is sleeping a lot (low load), then the game is bottlenecked by EE/Core thread activity. If the EE thread is sleeping a lot and the GS thread reports 90+%, then the GS thread is the bottleneck (a problem often correctable by using a lower internal resolution, for example).
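Written out as code, the heuristic looks roughly like the sketch below. The 90% figure comes from the text; the "sleeping a lot" cutoff of 50% is a hypothetical threshold, and the loads are assumed to be OS-reported fractions in the 0.0 to 1.0 range. This is the naive rule that, as explained next, turns out to be unreliable.

```cpp
// Hypothetical thresholds: the post gives "90+%" for a busy GS thread;
// "sleeping a lot" is assumed here to mean under 50% load.
constexpr double kBusyThreshold = 0.90;
constexpr double kIdleThreshold = 0.50;

enum class Bottleneck { EECore, GS, Unclear };

// Classifies OS-reported per-thread loads (0.0 to 1.0).
Bottleneck classify(double ee_load, double gs_load) {
    // GS thread mostly asleep: it is waiting on the EE/Core thread.
    if (gs_load < kIdleThreshold) return Bottleneck::EECore;
    // EE thread mostly asleep while GS is pegged: GS is the bottleneck.
    if (ee_load < kIdleThreshold && gs_load >= kBusyThreshold) return Bottleneck::GS;
    return Bottleneck::Unclear;
}
```

The flaw is in the first rule: a low GS load is taken to mean "waiting on the EE", when it can equally mean "blocked inside the graphics driver".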

But as I've recently found out, it doesn't work as expected. -_-

It's filled with... threads!

The immediate problem faced by this simple method of load detection is that the latest wave of Windows Vista/7 GPU drivers are themselves multithreaded. It should come as little surprise that one of the primary goals of the new DWM/Aero/DX11 systems implemented in Vista/7 is scalable parallel processing that takes better advantage of modern multi-core CPUs. Why this breaks the OS's built-in thread load detection may be less obvious, so I'll explain with an example:

When the GPU driver receives a directive to render the current scene (aka 'Present' in DirectX lingo), it hands the job off to a thread dedicated to the task. That thread has a Present Queue, typically 1 or 2 frames deep, that automatically handles triple-buffered, vsync'd page updates. If the queue is full when the PCSX2 GS thread issues its next Present request, the GPU driver puts the GS thread to sleep until a slot in the Present Queue becomes available. End result: the GS thread reports idle time to the operating system (and to PCSX2's GS window), yet the GPU is still quite overloaded, bottlenecked by work that is being processed on a different thread altogether.
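To make the effect concrete, here is a deterministic toy model in C++. It is a simplification, not how any real driver schedules work: the GS thread spends `gs_cost_ms` preparing each frame, then calls Present; at most `queue_depth` presented frames may be unfinished, otherwise Present blocks; the driver/GPU side renders frames in order at `gpu_cost_ms` each. All names and costs are hypothetical inputs.

```cpp
#include <algorithm>
#include <vector>

struct SimResult {
    double gs_reported_load;  // fraction of GS wall time spent working (what the OS sees)
    double gpu_load;          // fraction of total time the driver/GPU side was busy
};

// frames must be > 0.
SimResult simulate(int frames, double gs_cost_ms, double gpu_cost_ms, int queue_depth) {
    std::vector<double> finish(frames, 0.0);  // time each frame is retired by the GPU
    double t = 0.0;                           // GS thread clock
    double gs_busy = 0.0;
    for (int i = 0; i < frames; ++i) {
        t += gs_cost_ms;  // GS thread does real work on this frame
        gs_busy += gs_cost_ms;
        // Present blocks (GS thread sleeps) until frame i - queue_depth is retired.
        if (i >= queue_depth) t = std::max(t, finish[i - queue_depth]);
        // GPU starts this frame once it is submitted and the previous frame is done.
        double gpu_start = (i == 0) ? t : std::max(t, finish[i - 1]);
        finish[i] = gpu_start + gpu_cost_ms;
    }
    SimResult r;
    r.gs_reported_load = gs_busy / t;
    r.gpu_load = (gpu_cost_ms * frames) / finish[frames - 1];
    return r;
}
```

With 4 ms of GS work and 16 ms of GPU work per frame and a queue depth of 2, the model's GS thread reports only about 25% load while the GPU side is busy nearly all the time; swapping the two costs gives a GS thread that reports close to 100%.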

In essence, it is nearly the same sort of inter-thread dependence that the EE/Core and GS threads have between each other, only now the EE/Core thread's dependency chain extends to include GS and GPU driver threads (of which there could be one or many).

The solution to this problem is a more traditional method of manual load checking: timing various sections of code executed in-thread via either the aforementioned rdtsc (timestamp counter) or QueryPerformanceCounter, read at key points in the GS thread's execution flow. This wasn't such a great idea a few years ago, because K8/Athlon and P4 generation CPUs lacked a stable internal clock counter. Fortunately, all modern CPUs have a consistent counter suitable for benchmarking, so the pitfalls long associated with using Intel/AMD timestamps are finally obsolete enough not to be a concern for us here.
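A minimal sketch of in-thread section timing follows. It uses C++'s `std::chrono::steady_clock` as a portable stand-in for rdtsc or QueryPerformanceCounter, and the GS loop is hypothetical: "emulate" burns CPU, while "present" models a blocking driver call with `sleep_for`. Because the time spent blocked in Present is measured separately, a large `present_ms` share points at the driver/GPU, which is precisely the case the OS readout reports as idle.

```cpp
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Accumulates wall time spent inside one section of the GS loop.
struct SectionTimer {
    Clock::duration total{};
};

// RAII helper: adds the elapsed time of its scope to a SectionTimer.
class ScopedSample {
public:
    explicit ScopedSample(SectionTimer& timer) : timer_(timer), start_(Clock::now()) {}
    ~ScopedSample() { timer_.total += Clock::now() - start_; }
    ScopedSample(const ScopedSample&) = delete;
    ScopedSample& operator=(const ScopedSample&) = delete;

private:
    SectionTimer& timer_;
    Clock::time_point start_;
};

struct FrameStats {
    double emulate_ms;
    double present_ms;
};

// Hypothetical GS loop with separately timed work and Present sections.
FrameStats run_frames(int frames, std::chrono::milliseconds work,
                      std::chrono::milliseconds present_wait) {
    SectionTimer emulate, present;
    for (int i = 0; i < frames; ++i) {
        {
            ScopedSample s(emulate);
            auto end = Clock::now() + work;
            while (Clock::now() < end) {
            }  // simulated GS work
        }
        {
            ScopedSample s(present);
            std::this_thread::sleep_for(present_wait);  // simulated blocking Present
        }
    }
    using ms = std::chrono::duration<double, std::milli>;
    return {ms(emulate.total).count(), ms(present.total).count()};
}
```

For instance, `run_frames(10, std::chrono::milliseconds(2), std::chrono::milliseconds(6))` attributes roughly a quarter of the loop to emulation and the rest to waiting in Present.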


Jake Stine (Air) - Programmer - PCSX2 Dev Team

#2
Oh, and for even more fun: with the advent of DX11, graphics drivers are supposed to be multithreaded as well. So far, though, it seems that neither Nvidia nor AMD has implemented it yet.
#3
Do graphics plugins like GSdx fill the role of the Graphics Synthesizer? So does that mean one thread is used for the emulator itself and the other for the graphics plugin?
#4
dralor:
DirectX has been multithreaded for a while now. That was introduced with DX10 already, and yeah, GSdx uses it ;)
Game:
Any graphics plugin uses at least one thread to do its processing.
#5
How about making the EE and IOP run on two different threads instead of one? Or is that just a stupid idea? :)
Or splitting the graphics thread into two threads.
#6
(10-12-2010, 05:41 PM) Game wrote: How about making the EE and IOP run on two different threads instead of one? Or is that just a stupid idea? :)
Or splitting the graphics thread into two threads.

In the case of the EE/IOP, I'm guessing the extra code needed to keep the two threads in sync would make it slower than a single-threaded instance.

GSdx already has an option to use more than one thread.
#7
Yeah, we'd also have to do the sync via pretty much perfect SIF emulation.
SIF is problematic even today with everything on the same thread, so a threaded IOP will take a while.
It'd be a nice change though (if it ever works), for another reason:
the sound could be made more stable if it weren't "directed" by the EE emulation anymore :)
#8
Are VU0 and VU1 on the GS thread?
#9
No. Seriously, if you have no idea what you're talking about, please do some research first or stop posting totally wrong/useless questions.
#10
I'm no expert, but I know what I'm talking about. Currently it is a lot of work to separate the EE and IOP, at least until the new DMA controller is in place, but I was just guessing.