[blog] Thread Counting...
#1
One thing is for sure: the new 0.9.7 betas will use a lot more threads than the current 0.9.6 releases. This doesn't necessarily mean the emulator will take advantage of quad-core CPUs better than 0.9.6 does, at least not in a gameplay sense. As I explained in my previous blog, threading is as much about improving responsiveness and recoverability as it is about sharing a workload across multi-core CPUs, and so far most of the threading implemented in 0.9.7 is the scalable/responsive sort.

One of the major changes in 0.9.7 will be the removal of what I call an "aggressive spinwait" in the EmotionEngine (EEcore) emulation unit. A spinwait is a simple loop that waits on a variable to change, like so:

Code:
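volatile bool IsRunning = true;  // set to false by the threaded action when it finishes
StartThreadedAction();           // launches the work on another thread
while( IsRunning );              // spin: re-test the flag until it changes
// When the above while() exits, the ThreadedAction is done.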

This is a very simple threading design, but it has many drawbacks and few advantages: the waiting thread burns CPU time the whole while it waits and accomplishes nothing. We've continued to use it up to now in PCSX2's EEcore because there wasn't much reason to do away with it; and with the EEcore being the main orchestrator of everything in PCSX2 (GUI included!), having the high-resolution responsiveness of a spinwait made some sense.
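For anyone who wants to try the pattern, here is a minimal self-contained sketch in modern C++. It is not PCSX2 code: the function name and the sleeping "worker" are hypothetical stand-ins, and it uses std::atomic rather than volatile because that is what the C++ standard guarantees for flags shared between threads.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the "threaded action"; not PCSX2 code.
// Returns how many times the waiting thread re-tested the flag.
long SpinUntilWorkerDone(std::chrono::milliseconds workTime)
{
    std::atomic<bool> isRunning{true};

    std::thread worker([&] {
        std::this_thread::sleep_for(workTime); // pretend to do some work
        isRunning.store(false, std::memory_order_release);
    });

    long polls = 0;
    while (isRunning.load(std::memory_order_acquire))
        ++polls; // each pass re-tests the same variable and does nothing else

    worker.join();
    return polls;
}
```

Calling SpinUntilWorkerDone(std::chrono::milliseconds(20)) will typically report a very large poll count, which gives a rough feel for how much CPU time the waiting thread spends re-testing the flag.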

In the current 0.9.6 design, the EEcore and the GUI share time on the same thread, and when the GS thread is busy, the EEcore splits its time between waiting on the GS (via the spinwait) and processing GUI messages. This transition from EEcore emulation to GUI message processing was typically costly, but was necessary to handle input from the user. In the new 0.9.7 design, the EEcore has its own thread separate from the GUI. This allows us to remove the overhead of having to switch to/from GUI processing code, but it came with a somewhat unexpected drawback: the EEcore's aggressive spinwait is now suddenly very aggressive when a game becomes GS-limited. In 0.9.6 the spinwait breaks from time to time to splice in some GUI messages and make time for the GS to do its thing too. This kept everything pretty happy. But with the splicing gone, the EEcore is allowed to run free, soaking up tons of resources simply re-testing the same variable over and over.

The full impact became obvious when we realized that setting 2 software threads in GSdx caused PCSX2 to slow to a crawl (sub 1fps!) on dual-core systems. That's what happens when you have three threads using spinwaits on a dual-core system -- they completely starve out everything else, and to some extent each other as well. (Yes, GSdx software uses spinwaits also!)

The primary solution is to get rid of the spinwait in the EEcore. In its place we'll put the EEcore to sleep and have it wake up only once the MTGS ringbuffer has emptied to a satisfactory percentage. With the EEcore asleep, the GSdx thread(s) will have free rein over all the resources of the CPU, which will allow them to play "catch up" more efficiently than they can even in 0.9.6. This model will be an obvious win for software rendering, and possibly for DX11's multithreaded pipeline in the future.
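The sleep-and-wake idea can be sketched with a standard C++ condition variable. This is only an illustration under assumptions: WakeSignal and SleepUntilWorkerDone are hypothetical names, and nothing here is taken from the actual PCSX2 source.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical names throughout; a sketch of the idea, not PCSX2's implementation.
struct WakeSignal
{
    std::mutex mtx;
    std::condition_variable cv;
    bool ready = false;

    // Called by the waiting side (the EEcore's role): sleeps until signalled.
    void Wait()
    {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [this] { return ready; }); // predicate guards against spurious wakeups
        ready = false;
    }

    // Called by the working side (the GS thread's role).
    void Signal()
    {
        {
            std::lock_guard<std::mutex> lock(mtx);
            ready = true;
        }
        cv.notify_one();
    }
};

// Demo: returns the value the worker produced before signalling.
int SleepUntilWorkerDone()
{
    WakeSignal signal;
    int result = 0;
    std::thread worker([&] {
        result = 42;     // the "work"
        signal.Signal(); // wake the sleeper
    });
    signal.Wait();       // blocks in the OS instead of re-testing a flag
    worker.join();
    return result;
}
```

Unlike the spinwait, the waiting thread here is descheduled by the operating system, so the core it would have occupied is free for other threads until Signal() is called.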
Jake Stine (Air) - Programmer - PCSX2 Dev Team

#2
I really don't know what a lot of that means, but I'm guessing that means you've found ways to make PCSX2 run games faster and better. You can rebuild it, and slowly but surely we've gained the technology.

This makes me a happy individual. I appreciate all the hard work you guys have put into this. PCSX2 rocks!
AMD Phenom II 965BE @ 3.4Ghz
8 GB DDR3 1333 RAM
AMD Radeon HD 6750
Windows 7 64 bit
#3
Well, maybe. Because the GUI runs on a separate thread now, it's possible for the core emulator to "starve" it out a bit more when under heavy load, which, in a way, can be better for performance "when it matters" (when load is high)... but probably not in a significant way.
Jake Stine (Air) - Programmer - PCSX2 Dev Team
#4
Why is it that when I use Task Manager to check the thread count for PCSX2, it says there are about 8 threads? Is that an accurate way to check how many threads a game uses? Most normal PC video games I run show well over 20, even though most games only benefit from 2 cores.
#5
Yes, it's accurate. For more in-depth information about the specific threads running under a process, you can use Process Explorer (disclaimer: it's filled with technical jargonie-goop!)

The thread count of PCSX2 can vary depending on what version of Windows you run and what SPU2 driver you're using. For example, under Windows XP the XAudio2 engine runs 3-4 threads all by itself. These threads don't really improve performance or anything... they're just a way the XA2 engine encapsulates its KMixer sub-systems. I'm not sure if it's as thread-happy on Vista/Win7.

Commercial games typically have higher thread counts because the 3D engines they're based on are fairly advanced. There are three things most commercial 3D engines provide to the programmer:

* Fully threaded audio, which is another thread on top of whatever threads XA2/KMixer creates.

* Garbage Collectors (even if they aren't Java or C#/.NET based). GCs typically run 1 thread in the background for every private heap, and a "smart" engine typically has a couple of private heaps to avoid allocation interlocking and/or memory resource contention.

* A generic Thread Pool Service -- typically 8 or 16 threads that are available for any "odd jobs" that can be executed in parallel with the main thread.

In most cases games hardly use the Thread Pool at all. There are only a few things it's useful for in games, and typically those are managerial or GUI-related -- nothing affecting game performance when it matters. So those 8-16 threads just sit there and sleep their days away, waiting for a task to be queued. The reason for the Thread Pool concept is that creating and destroying threads is very slow, while sleeping threads have nearly no overhead at all, so you're always better off creating the Pool once when the program/game starts and just letting the threads sleep in the background until needed (if needed).
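As a rough illustration of the concept, here is a bare-bones pool in standard C++. It is a sketch, not any particular engine's implementation; ThreadPool, Enqueue, and RunPoolDemo are hypothetical names chosen for this example.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: threads are created once and sleep until a job arrives.
class ThreadPool
{
public:
    explicit ThreadPool(std::size_t count)
    {
        for (std::size_t i = 0; i < count; ++i)
            workers.emplace_back([this] { WorkerLoop(); });
    }

    // Lets the workers finish any queued jobs, then joins them.
    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stopping = true;
        }
        cv.notify_all();
        for (auto& t : workers)
            t.join();
    }

    void Enqueue(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(mtx);
            jobs.push(std::move(job));
        }
        cv.notify_one(); // wake one sleeping worker
    }

private:
    void WorkerLoop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mtx);
                // Sleeps here, using no CPU time, until there is work or we are shutting down.
                cv.wait(lock, [this] { return stopping || !jobs.empty(); });
                if (jobs.empty())
                    return; // shutting down and nothing left to run
                job = std::move(jobs.front());
                jobs.pop();
            }
            job(); // run outside the lock so other workers can pick up jobs
        }
    }

    std::mutex mtx;
    std::condition_variable cv;
    std::queue<std::function<void()>> jobs;
    bool stopping = false;
    std::vector<std::thread> workers;
};

// Demo: sums 1..10 using ten small "odd jobs" on a pool of 4 threads.
int RunPoolDemo()
{
    std::atomic<int> sum{0};
    {
        ThreadPool pool(4);
        for (int i = 1; i <= 10; ++i)
            pool.Enqueue([&sum, i] { sum += i; });
    } // the destructor waits for all queued jobs to finish
    return sum.load();
}
```

The four threads show up in the process's thread count for as long as the pool exists, whether or not any job is ever queued, which is the effect described above.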
Jake Stine (Air) - Programmer - PCSX2 Dev Team
#6
Are you losing any responsiveness in the trade from spinwaits to event wakes? I'd assume any loss is made up by the CPU cycles saved by not using spin-locks, though.
#7
Thanks to the MTGS design, probably not losing any responsiveness at all, actually.

The only time the EEcore will enter a sleeping state is when a game is GPU-limited, and for that to happen the MTGS's 8 meg ringbuffer needs to be completely filled, which is the point when the EE says "whoops! I'm stuck!" The MTGS will in turn send a "wake up!" signal to the EEcore once it has processed roughly 2-4 megs of that (actual value pending benchmarks), giving the EE plenty of time to wake up and start re-filling the MTGS buffer long before the MTGS runs out of things to process.

So yeah, it should be a big win, overall. The only added overhead will be in the MTGS loop, which will need an additional conditional check to see if it's time to wake up the EE or not, and that should be well outweighed by the benefit of not having the EE burning cycles mindlessly.
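The full-then-sleep, drain-then-wake scheme can be modelled in a few dozen lines. This is a toy sketch under stated assumptions, not the MTGS code: WatermarkQueue and RunWatermarkDemo are hypothetical names, the queue only counts units instead of holding real packets, and the demo uses a capacity of 8 and a wake level of 4 in place of the megabyte figures above. It also assumes each push is no larger than the gap between capacity and wake level.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>

// Hypothetical model: tracks only how many units are queued, not the data itself.
class WatermarkQueue
{
public:
    WatermarkQueue(std::size_t capacity, std::size_t wakeLevel)
        : capacity(capacity), wakeLevel(wakeLevel) {}

    // Producer side (the EEcore's role). Sleeps only when the buffer is full,
    // and then stays asleep until the consumer has drained down to wakeLevel.
    void Push(std::size_t units)
    {
        std::unique_lock<std::mutex> lock(mtx);
        if (used + units > capacity)
        {
            producerAsleep = true;
            wake.wait(lock, [this] { return !producerAsleep; });
        }
        used += units;
        dataReady.notify_one();
    }

    // Consumer side (the MTGS's role). Returns false once closed and empty.
    bool Pop(std::size_t units)
    {
        std::unique_lock<std::mutex> lock(mtx);
        dataReady.wait(lock, [this] { return used > 0 || closed; });
        if (used == 0)
            return false;
        used -= (units < used ? units : used);
        // The one extra conditional check in the consumer loop:
        if (producerAsleep && used <= wakeLevel)
        {
            producerAsleep = false;
            wake.notify_one();
        }
        return true;
    }

    // Tells the consumer that no more data is coming.
    void Close()
    {
        std::lock_guard<std::mutex> lock(mtx);
        closed = true;
        dataReady.notify_one();
    }

private:
    std::mutex mtx;
    std::condition_variable wake;      // producer sleeps on this
    std::condition_variable dataReady; // consumer sleeps on this
    std::size_t capacity;
    std::size_t wakeLevel;
    std::size_t used = 0;
    bool producerAsleep = false;
    bool closed = false;
};

// Demo: push 64 single-unit packets through a queue of capacity 8 that wakes
// the producer at 4 units or fewer. Returns the number of packets consumed.
std::size_t RunWatermarkDemo()
{
    WatermarkQueue q(8, 4);
    std::size_t consumed = 0;
    std::thread consumer([&] {
        while (q.Pop(1))
            ++consumed;
    });
    for (int i = 0; i < 64; ++i)
        q.Push(1);
    q.Close();
    consumer.join();
    return consumed;
}
```

Because the producer is only woken at the lower level, it always resumes with room for several packets rather than just one, so it does a worthwhile chunk of work each time it wakes.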
Jake Stine (Air) - Programmer - PCSX2 Dev Team
#8
Oh, and it'll be a performance win in another way too -- it should help improve the "burst" performance of the EEcore thread. Bursting is a method of improving thread/CPU cache efficiency by queuing up a good solid chunk of work for a thread before waking it up. Currently the aggressively spinning EEcore gets back to work filling the MTGS with data the first instant it can. Usually this means only 1-3 packets of data (sometimes emulating as few as a dozen MIPS CPU cycles) before the EE stalls again. That makes for really inefficient processing of those few cycles.

So in the new model, the EE will have lots more work it can do before it stalls again, so the L1/L2 caches of the CPU will be much more effective over that duration.
Jake Stine (Air) - Programmer - PCSX2 Dev Team
#9
What will this likely mean for users with dual-core machines like myself, in your professional opinion? Better performance? Worse? Or no noticeable difference either way?
#10
This thread resulted in me opening 8 wiki tabs, finding some new programming blogs to bookmark, and learning some cool things about threading models. Thanks!
Endlessly inspired by emulation