Some questions about the technical side of PCSX2
#1
For a while now, PCSX2 seems to have hit a snag in utilizing PC hardware. This is most probably because the clock frequency barrier is not likely to go away, and applications must now rely on threading to gain higher performance. I'm sure PCSX2 is no exception to this unwritten rule.

I know of a few methods that could make PCSX2 work more effectively, and I'd like to know which can be done and which are impractical for emulation purposes. I'm certain you already know what these are; I'm just adding detail to clarify, in case I get some terminology wrong.

Hierarchical worker threads - One master thread (the PCSX2 program) controls multiple director threads to do certain tasks. One director thread would be responsible for overseeing translation threads, while another would oversee plugin threads. Yes, I know this is a simplistic view.
Pros: More threading allows for more CPU utilization on multi-core systems
Cons: Not as efficient on CPUs with only two cores
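For what it's worth, the director/worker layering described above can be sketched in a few dozen lines of C++. Everything here (the `Director` class, `submit`) is an illustrative name, not anything from the PCSX2 source:

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A "director" owns a queue of tasks and a pool of worker threads.
class Director {
public:
    explicit Director(int workers) {
        for (int i = 0; i < workers; ++i)
            pool.emplace_back([this] { run(); });
    }
    ~Director() {
        { std::lock_guard<std::mutex> g(m); done = true; }
        cv.notify_all();
        for (auto& t : pool) t.join();   // drain remaining tasks, then stop
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> g(m); tasks.push(std::move(task)); }
        cv.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return done || !tasks.empty(); });
                if (done && tasks.empty()) return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();
        }
    }
    std::vector<std::thread> pool;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};
```

The catch, as discussed below, is that the tasks handed out this way have to be large and independent for the hand-off cost to pay for itself.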

Pre-Processing or Pre-Rendering - The output of the emulation is delayed by a certain amount of "frames" to take advantage of cycles that would have been discarded for the sake of the frame limiter, resulting in smoother performance.
Pros: Increased performance, better frame limiter control
Cons: Input lag, more resource handling

There are a couple other methods which might help, but they escape me at the moment. However, there are some questions I have regarding the emulation itself.

As I recall (I don't recall when), it was stated that EE, IOP, and VUs were all on one thread, and splitting them would require a total re-write of the emulator. Is this still the case?

Also, does an HLE solution appear to be within reach?


Please excuse these questions, as I am sure you guys are tired of seeing them. Anyway, I'm quite impressed with the progress that has been made. Keep up the good work. :)

#2
(09-21-2010, 11:24 PM)Scuddie Wrote: Hierarchical worker threads - One master thread (The PCSX2 program) controls multiple director threads to do certain tasks. One director thread would be responsible for overseeing translation threads, while another director thread would be responsible for overseeing plugin threads. Yes, I know this is a simplistic view.
Pros: More threading allows for more CPU utilization on multi-core systems
Cons: Not as efficient on CPUs with only two cores

Likely not doable. Most tasks in PCSX2 are inherently non-parallel, meaning that one task must be completed before the next can be started (the next task relies on results generated by the previous one). The only components that can be threaded with any real benefit are the VPU (VIF interface + VU1 co-processor), and possibly the IOP. For the IOP to be threaded efficiently, however, it may need high-level PS2 kernel replacements/handlers for the SIF.

Quote:Pre-Processing or Pre-Rendering - The output of the emulation is delayed by a certain amount of "frames" to take advantage of cycles that would have been discarded for the sake of the frame limiter, resulting in smoother performance.
Pros: Increased performance, better frame limiter control
Cons: Input lag, more resource handling

Already done. MTGS queues 2-3 frames by default. Queuing is somewhat glitched in 0.9.7 beta, and causes some bottlenecks on FMVs and some games that run at 30fps internally; this will be fixed in the next release.

Quote:As I recall (I don't recall when), it was stated that EE, IOP, and VUs were all on one thread, and splitting them would require a total re-write of the emulator. Is this still the case?

Yes. VUs can be threaded without too much trouble, but the actual gains depend on a lot of stuff. The PS2 is designed such that games can send large batches of VU micro-programs to the VUs via the VIF -- if the VUs are threaded but the VIF is still running on the EE thread, then the EE thread will still be mostly busy sending a long series of programs to the VUs. So in order for the threading to be effective, both VIF and VU (both together are referred to as the VPU, for Vector Processing Unit) must be threaded.

We tried threading VUs earlier, quite unsuccessfully. I will try threading VUs and VIFs together later on, I think, once the new unified DMA controller is implemented.

Threading the IOP is much more difficult, because the SIF interconnection is very fast on real hardware but very slow to emulate. Emulating all of the kernel semaphores and handshakes pretty well kills any parallel code execution. So for IOP to thread effectively we would need to have an HLE replacement for the SIF manager in the kernel (both EE and IOP sides). That way we could replace the emulated semaphores with native ones -- which would be dozens of times faster.

Quote:Also, does an HLE solution appear to be within reach?

HLE is doable. It's just going to take a lot of time, reverse engineering, and study of the various available information. It's not something any of the active devs are interested in at this time. There are still too many other known bugs we're trying to address, and a list of unimplemented features as well. :/
Jake Stine (Air) - Programmer - PCSX2 Dev Team
#3
Might want to add the other big performance problem that we'll need to tackle one day:
GSdx and its texture cache.
There are several games that require ungodly amounts of VGA memory just to do one seemingly simple effect. With the current system we'd need 2GB of VGA RAM just to cover most of these; I guess that at this point some 32-bit address space issues come into play as well.

According to Pseudonym, who knows the most about GSdx's problems right now, a new system would have to get information about resources (textures, primitives) from PCSX2 itself.
#4
(09-22-2010, 03:00 AM)rama Wrote: According to Pseudonym who knows best about GSdx's problems right now, a new system would have to get
information about resources (textures, primitives) from PCSX2 itself..

Wouldn't this be a bad way to go? As far as I understand it, the VUs don't only handle graphics ops, so there would need to be some kind of system to check for certain ops, likely through compares, and then send the information to the GPU if required. Doing multiple compares per op seems like it would completely bog down the system -- unless, of course, you are already doing the compares anyway to handle the opcodes. Then it might be a gain, depending on the GPU's ability to process the code. GSdx becomes a lot more complex, though that might be a speed win if complex effects on the PS2 can be pared down to something simpler in shader code. However, the PS2 abused state changes, and unless you can eliminate that, it's really going to hurt a modern GPU.

In all, it seems like, since 2GB GPUs are starting to appear (although they are rare), it might be a better option to wait for more RAM -- even though processing on the GPU might eventually allow more quality and speed gains through replacement.
#5
The texture cache isn't about geometry, and is generally not related to VU processing.

The trick is understanding how the PS2 was designed: the GS has very little video memory but has tons of memory bandwidth with both main memory and the VUs. So PS2 games have to do a lot of texture swapping, where they continuously re-upload textures during the process of rendering a scene. Some games transfer upwards of 400 megabytes of texture data per second.

Furthermore, most of that texture data is in a bizarre GS-native configuration that no modern GPU supports, so the data has to be converted into something our modern GPUs can cope with. That's a lot of data to have to process every second.

The idea we developed would try to eliminate overhead by detecting texture re-uploads and ignoring them. Basically, when texture data is sent to the GS we'll check the source address (which resides in PS2 main memory), and if we already uploaded it earlier we'll skip it (avoiding both lots of memcpys and lots of format conversions). The virtual machine can track any memory modifications made, so we'll know whatever data is stored at that address is the same as the stuff we uploaded earlier -- no mem compares needed.
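In sketch form, that re-upload detection might look like the following. `TextureCache`, `notifyWrite`, and the integer handles are all hypothetical stand-ins for the real GSdx/VM plumbing:

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Hypothetical cache: skip re-uploads of texture data that hasn't changed.
struct TextureCache {
    // source address in PS2 main memory -> GPU-side texture handle
    std::unordered_map<uint32_t, int> uploaded;
    // addresses written by the VM since their last upload
    std::unordered_set<uint32_t> dirty;
    int uploadCount = 0;   // counts actual convert+upload operations

    int upload(uint32_t srcAddr) {
        auto it = uploaded.find(srcAddr);
        if (it != uploaded.end() && dirty.count(srcAddr) == 0)
            return it->second;       // identical data already on the GPU: skip
        int handle = ++uploadCount;  // stand-in for format-convert + GPU upload
        uploaded[srcAddr] = handle;
        dirty.erase(srcAddr);
        return handle;
    }

    // Hooked into the VM's memory-write tracking, so no mem compares needed.
    void notifyWrite(uint32_t addr) { dirty.insert(addr); }
};
```

The key design point is the last comment: because the VM already tracks writes to guest memory, cache validity is a set lookup, not a byte-for-byte comparison of the texture data.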

Quote:In all, it seems like, since 2GB GPUs are starting to appear (although they are rare), it might be a better option to wait for more RAM -- even though processing on the GPU might eventually allow more quality and speed gains through replacement.

No dice. The only reason most of these problematic games "max out" at 2GB is because that's where DirectX runs out of memory and forces GSdx to try and cope (or crash). Give it a card with 2GB of texture RAM and you'll just see your memory use go a bit higher before the system starts to lock up. ;)
Jake Stine (Air) - Programmer - PCSX2 Dev Team
#6
Ahh, so it tends to act more like a memory leak than an actual physical limit, from what I'm getting. Either that, or the limit ends up being high enough that it won't be reached by video cards anytime soon. And that's good; the fewer compares, the better. It also seems like it would become a better solution as video RAM grows, since you could enlarge the VRAM buffer progressively over time until you find the point of diminishing returns.
#7
Thanks for the detailed explanations. :)

(09-22-2010, 02:44 AM)Air Wrote: Likely not doable. Most tasks in PCSX2 are inherently non-parallel; meaning that one task must be completed in order before the next can be started (the next task relies on results generated by the previous one).
I can see this as an absolute truth for translating individual operations from PS2 to x86 instructions, but what about the PS2 instructions themselves? I would imagine there are several operations running in parallel on PS2 hardware; otherwise it would constantly be starved. If the VM can identify which memory addresses are to be changed, it could be plausible to translate different instructions on different threads, synchronized by timestamp or execution order.

One really bad not-even-pseudocode-worthy example:
1 addr.381 = addr.381 * addr.471
2 addr.920 = addr.721 / addr.381
3 addr.471 = addr.503 + addr.870
4 addr.619 = max(addr.381 - addr.370, 0)
5 addr.745 = addr.619 / addr.471

Instruction 1 is required for instruction 2 and 4.
Instruction 2 depends on instruction 1, and is not required for any instruction.
Instruction 3 has no dependency, but is required for instruction 5.
Instruction 4 depends on instruction 1, and is required for instruction 5.
Instruction 5 depends on instruction 3 and 4.

By looking at the above, the main thread would have to translate instruction 1 first, then a worker thread can translate instruction 2 via RPC, while the main thread calculates 3, 4, and 5. Once the worker thread is done translating instruction 2, it's sent back to the primary thread.

While only one instruction is done in parallel in this example, it can reduce the time to emulate by as much as 20%. If there were a way to intelligently schedule tasks in this manner, I think it would be a decent benefit. Granted, I know next to nothing about low-level machine emulation, but I'm fairly sure this might be an avenue worth exploring.
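The dependency analysis in the example amounts to computing, for each instruction, the earliest "wave" it could execute in: one more than the deepest of its inputs. A small sketch using the five instructions above (illustrative code, not an emulator component; `schedule` is a made-up name):

```cpp
#include <algorithm>
#include <map>
#include <vector>

// deps maps instruction number -> the instructions it reads from.
// Returns instruction number -> earliest parallel "wave" (0-based).
// Instructions are numbered in program order, so every dependency has
// already been assigned a level by the time we visit its consumer.
std::map<int, int> schedule(const std::map<int, std::vector<int>>& deps) {
    std::map<int, int> level;
    for (const auto& kv : deps) {
        int lv = 0;
        for (int p : kv.second)
            lv = std::max(lv, level[p] + 1);
        level[kv.first] = lv;
    }
    return level;
}
```

For the example, instructions 1 and 3 land in wave 0, 2 and 4 in wave 1, and 5 in wave 2, so the critical path is 3 waves deep versus 5 sequential steps. As the reply below explains, though, the per-instruction hand-off cost on commodity CPUs dwarfs any saving at this granularity.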



As for the texture memory, would it not be an option to save DDS files in a texture cache directory once they are converted from native PS2 format? That way, it could at least ease up on the overhead of translation.
#8
Nah, you're almost losing perspective. There is no such thing as "translating an instruction". Well, interpreters do it, but they're so slow that users aren't supposed to use them. The recompiler does something different: it transforms a whole PS2 program block into a PC block. The most wonderful thing is that we can reuse already-translated blocks without retranslation. So, could we do translation while the compiled code is running? Sorry, no. Until the previous code stops, we simply have no idea what code needs translating next; we don't know what address will be calculated on the PS2 when the next block is needed.
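The block-reuse point can be illustrated with a toy dispatcher loop: translated blocks are cached by guest PC, and each block, when executed, yields the next PC -- which is only known after it runs. Everything here (`BlockCache`, the dummy blocks) is hypothetical:

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>

// Toy block cache: on a miss the block is "compiled" once, then reused
// forever. A real recompiler emits native code; a std::function stands in.
struct BlockCache {
    using Block = std::function<uint32_t()>;   // executes, returns next PC
    std::unordered_map<uint32_t, Block> cache;
    int compiles = 0;

    Block& fetch(uint32_t pc) {
        auto it = cache.find(pc);
        if (it == cache.end()) {
            ++compiles;   // stand-in for actual recompilation work
            // Dummy block: "executing" it just falls through to pc + 8.
            it = cache.emplace(pc, [pc] { return pc + 8; }).first;
        }
        return it->second;
    }
};
```

The dispatcher loop `pc = cache.fetch(pc)();` is inherently serial: you cannot translate the next block on another thread, because its address is the *output* of the current block.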

The second question is: could several translated code blocks be run in parallel? Theoretically, yes, and GS plugins do run in separate threads. But! You should understand that recompiled code is (first) time-critical, (second) not written by a programmer but produced automatically by the translator, and (third) in need of exact synchronization of all "virtual" PS2 devices. So to isolate, for example, VU1, you need to push synchronization instructions into the translated code and keep them error-free; otherwise race conditions become our best friends forever, and a race is the worst kind of error: an unpredictable, hard-to-reproduce crash. And these code blocks would be written in almost pure assembler.

Texture cache... Man, have you ever tried to perform fragment transformations outside the video card? The software GS does all this kind of work, and it's slow. We need to do not only texture mapping (into video card memory), but also resolving (getting the texture back from the video card and sending it back into PS2 memory) for further transformation.
#9
@Scuddie:

Furthermore, the whole concept of micro-program execution on Intel-style CPUs generally fails. Intel's multicore design is not at all suited for executing short tasks in parallel, because there is no programmable DMA to use for setting up batch operations, and there are very few synchronization tools for fast cross-thread communication. For example, the overhead of packaging and transporting an RPC to another thread far outweighs the cost of simply translating/executing the instruction on the current thread. In order to multithread tasks with any actual speed gain, you have to be able to package and transport rather large tasks to other threads. As a general rule, the tasks being sent to other threads need to take at least 25,000 cycles or so in order to outweigh the cost of RPC overhead.

(One alternative is to dedicate a single core to be a thread manager that constantly spins on a set of condition variables, rather than relying on CPU- and operating-system-level sleep/wakeup interrupts. You can have much more responsive RPC that way, but then you lose a core and, on the latest batches of CPUs, could cause the CPU to downstep to slower overall clock rates.)
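A bare-bones version of that dedicated-core idea, spinning on an atomic flag so the hand-off never sleeps (illustrative only; `spinWorker` and the doubling "task" are stand-ins for real work):

```cpp
#include <atomic>
#include <thread>

// A worker core that spins on an atomic request slot instead of sleeping on a
// condition variable: wakeup latency is a cache-line transfer, not an OS
// scheduler round-trip -- at the price of burning a whole core.
std::atomic<int> request{0};   // 0 = idle, -1 = shutdown, else = a task
std::atomic<int> result{0};

void spinWorker() {
    for (;;) {
        int r = request.load(std::memory_order_acquire);
        if (r == -1) return;                 // shutdown sentinel
        if (r != 0) {
            // "Do the task" (here: double the input), publish the result,
            // then mark the slot idle so the caller sees completion.
            result.store(r * 2, std::memory_order_release);
            request.store(0, std::memory_order_release);
        }
        // No sleep or yield: the busy spin is what buys the fast wakeup.
    }
}
```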

Meanwhile, it takes only a couple dozen cycles to translate an instruction, so there's just no way it can be spread across multiple cores, even if you did use the dedicated spinning-core strategy.

Interestingly, on something like the Cell CPU you could employ the SPEs to process instructions in parallel, using the analysis you described. One could build and cache SPE execution chains that translate and execute series of instructions in parallel. Unfortunately, it would still pale in comparison to the speed of binary translation (aka recompilers), which itself cannot be effectively multi-threaded at any level anyway. So yeah, emulators rarely scale well to modern multi-core CPU designs.
Jake Stine (Air) - Programmer - PCSX2 Dev Team
#10
(09-22-2010, 02:44 AM)Air Wrote: Already done. MTGS queues 2-3 frames by default. Queuing is somewhat glitched in 0.9.7 beta, and causes some bottlenecks on FMVs and some games that run at 30fps internally; this will be fixed in the next release.

Is there any way to decrease or at all eliminate the input lag, or is that not an intended behavior? I'm using r3878 on a Core i5 and Windows 7 and have no problems with performance in general, other than occasional freezes or some such, but noticeable input lag makes certain games (DoDonPachi DaiOuJou among others) very frustrating to play. Note that the behavior I'm talking about is present regardless of Vsync (I never use it anyway; Aero is also disabled) and speedhack settings, and pretty much exactly corresponds to the 2-3 frames you're talking about. For dynamic games with automatic progression that's apparently a lot.

Also, where do I report compatibility status for games not present in the list?

Thanks in advance.