[blog] Events o' Plenty
#1
One of the less obvious things that has plagued Pcsx2's compatibility over the years is its event handling system. The system in place as of 0.9.6 is adequate for interpreter-based emulation but is not well-equipped to handle the methods that a recompiler uses to manage cpu cycle updates. This is something we aim to fix in the coming weeks.

Cycle-based Timing Explained

All cpus have a cycle rate, which is the MHz/GHz value you're most familiar with when talking about any cpu. An i7 clocked at 2.83GHz has a 2.83GHz cycle rate. Now, the actual throughput of instructions can vary greatly, since the cpu pushes each instruction through several stages and multiple pipelines, each of which can have dependency stalls and its own rules for when such stalls occur. The cycle rate, however, is always 2.83GHz. Because cycle rates are a known constant, they make a good barometer for synchronizing the activities of a multi-processor design like the Playstation 2.

Why do Recompilers Complicate Event Testing?

Recompilers work as a significant speedup over interpreters by doing two things:
  • Recompile the machine code of an emulated CPU (in our case MIPS instructions) into code native to the host machine (x86 instructions).

  • Prefetch and pre-decode emulated instructions, and inline them into blocks.

The thing recompilers are most well-known for -- recompiling to native machine code -- is actually the less effective of the two things recompilers do for speeding up emulation. The primary speedup typically comes from the prefetching and inlining of instructions, which in addition to eliminating the instruction fetch/decode stage (by far the slowest part of any interpreter), also allows for cross-instruction optimizations such as constant propagation and register caching/mapping. In other words, a recompiler is effectively executing emulated instructions in pre-compiled bursts. This is so important to performance that a recompiler without block-level execution would hardly be any faster than an interpreter.

As part of the design of block-level execution, the recompiled code only updates cpu cycle counts and tests for scheduled events at block boundaries. Blocks typically span 5 to 35 cycles, but in some cases can span a hundred cycles or more. When the subsequent Event Test is performed, several scheduled events may be pending execution. This is where problems can occur: the current event system in Pcsx2 executes all pending events in no particular order, so events run out of order when several of them time out during a single block. Most events don't have dependencies on each other, or games don't use them in a way where execution order matters. But sometimes they do, and in those cases behavior can be unpredictable, or the game can fail outright. To make matters worse, the pending events typically don't know how late they are, and will re-schedule subsequent events in increasingly belated fashion. The current implementations of the EE and IOP counters contain tons of complicated code meant to compensate for this limitation, code that is both slow and was nearly impossible to get right.
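
To make the ordering problem concrete, here is a minimal sketch of a block-boundary event test that walks a fixed table of events in slot order. It is not the emulator's actual code: the Event struct, the slot layout and the cycle numbers are all made up for illustration, and cycle counter wraparound is ignored.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical event slot in a fixed table (illustration only).
struct Event {
    std::string name;
    uint32_t dueCycle;  // absolute cycle at which the event should fire
    bool pending;
};

// Event test run at a block boundary. It dispatches every expired event
// in table order, which is not necessarily the order they were due in.
// Returns the event names in the order they were dispatched.
std::vector<std::string> eventTestSlotOrder(std::vector<Event>& slots,
                                            uint32_t nowCycle) {
    std::vector<std::string> fired;
    for (Event& e : slots) {
        if (e.pending && nowCycle >= e.dueCycle) {
            e.pending = false;
            fired.push_back(e.name);
        }
    }
    return fired;
}

// One long block runs from cycle 100 to cycle 180, so two events
// expire before the next event test gets a chance to run.
std::vector<std::string> demoLateBlock() {
    std::vector<Event> slots = {
        {"timer", 130, true},
        {"dma", 110, true},
        {"vblank", 400, true},
    };
    return eventTestSlotOrder(slots, 180);
}
```

Here demoLateBlock() dispatches "timer" (due at cycle 130) before "dma" (due at cycle 110) purely because of table position, and neither handler is told how late it is.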

The fix for this is to use an event system I'll call decremental delta time. It has three advantages:
  • Makes it easy to execute events in scheduled order regardless of the amount of time which has passed since the last Event Test.

  • Maintains relative cycle scheduling at a high level so that none of the events being re-scheduled "lose time" due to belated block-boundary event testing.

  • Simplifies event handling on all levels, and provides significant speedups for event testing and event dispatching.
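
One way such a system could be structured, assuming it resembles the classic delta list used for timer queues: each event stores its delay relative to the event before it, so passing time only ever decrements the head of the list. The sketch below is written under that assumption. DeltaQueue, Fired and every other name in it are hypothetical, and it is not the code that will go into the emulator.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <string>
#include <vector>

// Hypothetical names throughout (illustration only).
struct DeltaEvent {
    std::string name;
    int32_t delta;  // cycles after the previous entry (after "now" for the head)
};

struct Fired {
    std::string name;
    int32_t late;  // cycles past the scheduled time at dispatch
};

class DeltaQueue {
public:
    // Schedule `name` to fire `delay` cycles from the current time.
    void schedule(const std::string& name, int32_t delay) {
        assert(delay >= 0);
        auto it = events_.begin();
        // Walk past every event due at or before the new one,
        // converting the delay into a delta relative to its predecessor.
        while (it != events_.end() && it->delta <= delay) {
            delay -= it->delta;
            ++it;
        }
        // The event that now follows must wait that much less after us.
        if (it != events_.end()) it->delta -= delay;
        events_.insert(it, DeltaEvent{name, delay});
    }

    // Advance time by `elapsed` cycles and return the expired events
    // in scheduled order, each with its lateness.
    std::vector<Fired> advance(int32_t elapsed) {
        assert(elapsed >= 0);
        std::vector<Fired> fired;
        if (events_.empty()) return fired;
        events_.front().delta -= elapsed;  // only the head is decremented
        while (!events_.empty() && events_.front().delta <= 0) {
            DeltaEvent e = events_.front();
            events_.pop_front();
            // Carry the overshoot into the next event's delta.
            if (!events_.empty()) events_.front().delta += e.delta;
            fired.push_back(Fired{e.name, -e.delta});
        }
        return fired;
    }

private:
    std::list<DeltaEvent> events_;
};

// Same scenario as a long block: 80 cycles pass between event tests.
std::vector<Fired> demoDeltaQueue() {
    DeltaQueue q;
    q.schedule("timer", 30);
    q.schedule("dma", 10);
    q.schedule("vblank", 300);
    return q.advance(80);
}
```

In this sketch demoDeltaQueue() returns "dma" (70 cycles late) followed by "timer" (50 cycles late), regardless of the order they were scheduled in. Because each handler learns its lateness, a periodic event could re-schedule itself with its period minus that lateness and so avoid losing time.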

It's hard to know beforehand just how beneficial in-order execution of events will be. I'm anticipating that it might actually fix a few emulation problems on the IOP recompiler in particular, since it has a slow cycle rate and also has a handful of events which can have inter-dependencies. For that reason I'll be implementing the system on the IOP first, and then when all the kinks are worked out we'll port the EE side of the emulator over to it.
Jake Stine (Air) - Programmer - PCSX2 Dev Team
#2

(06-02-2009, 12:25 PM)Air Wrote: One of the less obvious things that has plagued ...

Hi, I just have some questions if you don't mind.
1. You wrote "the recompiled code only updates cpu cycle counts and tests for scheduled events at block boundaries, ... When the subsequent Event Test is performed, several scheduled events may be pending execution. ... if events have dependencies on each other, behavior can be unpredictable", so you propose a method called "decremental delta time". I found that when a DMA channel raises an event, it calls:
Code:
__fi void CPU_INT( EE_EventType n, s32 ecycle) {
    cpuRegs.interrupt |= 1 << n;
    cpuRegs.sCycle[n] = cpuRegs.cycle;
    cpuRegs.eCycle[n] = ecycle;
    ...
    // Pull the next event test closer if this event is due sooner.
    if (g_nextEventCycle - cpuRegs.cycle > cpuRegs.eCycle[n]) {
        g_nextEventCycle = cpuRegs.cycle + cpuRegs.eCycle[n];
    }
}
CPU_INT accepts the event type and the number of cycles to delay, and it lowers the global g_nextEventCycle if the new event is due sooner. g_nextEventCycle is used when the event test happens, which tests every DMA event:
Code:
static __fi void TESTINT( u8 n, void (*callback)() )  {
    if( !(cpuRegs.interrupt & (1 << n)) ) return;
    if( cpuRegs.cycle  -  cpuRegs.sCycle[n] >= cpuRegs.eCycle[n] )    
    {
        cpuClearInt( n );
        callback();
    }
    else
        cpuSetNextEvent( cpuRegs.sCycle[n], cpuRegs.eCycle[n] );
}
So if the number of cycles passed is at least the ecycle we set in CPU_INT, the event is handled; otherwise g_nextEventCycle is updated.
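
A stripped-down, self-contained model of these two functions shows why the elapsed-time test is written as a subtraction: with unsigned 32-bit counters it keeps working when the cycle counter wraps around. This is only a sketch with made-up names (CpuState, scheduleInt, testInt) and it assumes 32-bit unsigned cycle counters; it is not the emulator's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical miniature of the cpuRegs fields used above.
struct CpuState {
    uint32_t cycle = 0;           // current cpu cycle counter
    uint32_t interrupt = 0;       // bitmask of pending events
    uint32_t sCycle[32] = {};     // cycle at which each event was scheduled
    uint32_t eCycle[32] = {};     // delay requested for each event
    uint32_t nextEventCycle = 0;  // when the next event test should run
};

// Schedule event n to fire `delay` cycles from now.
void scheduleInt(CpuState& cpu, unsigned n, uint32_t delay) {
    assert(n < 32);
    cpu.interrupt |= 1u << n;
    cpu.sCycle[n] = cpu.cycle;
    cpu.eCycle[n] = delay;
    // Pull the next event test closer if this event is due sooner.
    if (cpu.nextEventCycle - cpu.cycle > delay)
        cpu.nextEventCycle = cpu.cycle + delay;
}

// Returns true (and clears the pending bit) once event n is due.
bool testInt(CpuState& cpu, unsigned n) {
    assert(n < 32);
    if (!(cpu.interrupt & (1u << n))) return false;
    // Unsigned subtraction gives the elapsed cycles even across a wrap.
    if (cpu.cycle - cpu.sCycle[n] >= cpu.eCycle[n]) {
        cpu.interrupt &= ~(1u << n);
        return true;
    }
    return false;
}
```

For example, an event scheduled 32 cycles ahead at cycle 0xFFFFFFF0 is not due at cycle 0x0000000F (31 cycles later) but is due at 0x00000010, even though the counter has wrapped in between.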
====================================
1. DMA events have different ecycle values, ranging from 1 to 128, e.g. CPU_INT(DMAC_MFIFO_GIF, 16). Why use different ecycle values for them? I mean, how do you decide those ecycle values?
====================================
2. If the recompiled block spans hundreds of cycles, I think all the events raised will be handled in the subsequent event test (possible, right?). So why did you say
Quote:Makes it easy to execute events in scheduled order regardless of the amount of time which has passed since the last Event Test.

3. Actually, I'm confused by your decremental delta time method. Would you mind explaining it in detail?

Thanks
#3
1. The ecycle values used on the DMAs, especially the smaller ones, are either fixed because we know how much processing needs to be done, or are just there to trigger more DMA processing after a short time. All this does is update g_nextEventCycle so the recompiler knows it has to jump out and let the interpreted stuff run.

2. I believe there is a check in the recompiler to see whether the event test cycle has been passed; if it hasn't, it carries on processing recompiled blocks. The problem with recompilers is that you can't just jump out in the middle of a block, as it's direct assembly running on the cpu. The older recompiler (and many simpler ones) ran for a fixed amount of time, then came out of the recompiler and checked other stuff. Our rec has been designed so you don't have to do this, as entering and leaving the recompiler is very expensive.

3. What I'm assuming he is referring to is that the recompiler figures out how much time a block will take and checks the amount of time left before the next scheduled event is due, so we can trigger it early, which can be better than triggering it late. Don't quote me on this though, I'm not quite sure that's what he is referring to.
#4
Also, Jake has moved on and can't reply anymore, very unfortunately for everyone here.
#5
What a pity! I liked reading Air's articles; he was always willing to share his experience with PCSX2 fans.
#6
Thanks Refraction. Back to question 1: you said "values are either fixed due to us knowing how much processing needs to be done" (sorry for quoting you). I think the ecycle value is set for the eeCore, because when the current DMA operation triggers an event, it just returns to the eeCore and wants to re-enter the DMA operation after ecycle cycles have passed. So what do you mean by the developers knowing how much processing needs to be done: processing the cpu needs to do, or the DMA? IMO, maybe some DMA operations need to do a lot of work, especially with several DMA tags to handle, so the emulator handles just a few and then triggers an event. The remaining DMA tags are handled at the subsequent event test, so the eeCore executes its own code for at least ecycle cycles before entering that event test. For large ecycle values, I wonder why we don't just set ecycle = 0 for every event, so they are all handled in the subsequent event test. If the reason is "some events have dependencies on other events, so we set different ecycle values", then how do you find or understand the dependencies between all the events, and quantify their timing dependencies?
#7
(08-22-2013, 04:38 AM)Eison Wrote: Thanks Refraction. Back to question 1: you said "values are either fixed due to us knowing how much processing needs to be done" (sorry for quoting you). I think the ecycle value is set for the eeCore, because when the current DMA operation triggers an event, it just returns to the eeCore and wants to re-enter the DMA operation after ecycle cycles have passed. So what do you mean by the developers knowing how much processing needs to be done: processing the cpu needs to do, or the DMA?

Basically there are some functions we know are going to take a set amount of time, be it a DMA tag process or waiting in a tight loop for another DMA to finish. With normal processing we work out how long it will take based on the size of the DMA packet (the tag tells us this) and we can set the number of eCycles accordingly so we don't constantly tell the recompiler to stop.
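
As a rough sketch of that idea, the delay can be derived from the quadword count (QWC) carried in the DMA tag. The function name and both constants below are made up for illustration; they are not the values the emulator uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical cost model (illustration only, not measured values).
constexpr uint32_t kCyclesPerQword = 2;  // assumed cost per 128-bit quadword
constexpr uint32_t kMinDelay = 1;        // assumed floor so a delay is never zero

// Returns how many cycles to wait before the DMA event should fire,
// given the quadword count read from the DMA tag.
uint32_t dmaDelayCycles(uint32_t qwc) {
    assert(qwc <= 0xFFFF);  // QWC is a 16-bit field
    return std::max(kMinDelay, qwc * kCyclesPerQword);
}
```

With these assumed constants, an 8-quadword packet would be scheduled 16 cycles out, so the recompiler gets to run a block or so before it is asked to stop for that transfer.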

(08-22-2013, 04:38 AM)Eison Wrote: IMO, maybe some DMA operations need to do a lot of work, especially with several DMA tags to handle, so the emulator handles just a few and then triggers an event. The remaining DMA tags are handled at the subsequent event test, so the eeCore executes its own code for at least ecycle cycles before entering that event test. For large ecycle values, I wonder why we don't just set ecycle = 0 for every event, so they are all handled in the subsequent event test. If the reason is "some events have dependencies on other events, so we set different ecycle values", then how do you find or understand the dependencies between all the events, and quantify their timing dependencies?

Timing of the DMA is a big pain in the ass. Some games are very reliant on it (Burnout, Need for Speed: Most Wanted and The Punisher are just 3 examples), so unfortunately we have to simulate how long it would really take. It would be nice if we could set it to zero, but it will never work.

In relation to the dependencies comment, this kinda relates to those games I mentioned. There is a function called Path3 Masking, which essentially relies on synchronising the GIF with the VIF so the packets transfer from each in the correct order. Get this wrong and you get textures in the wrong place, or just a complete freeze, as the program is expecting more DMA stuff to happen in the order it wants and it will refuse to do so.

For further information on Path3 masking, I did a blog about it here: http://forums.pcsx2.net/Thread-blog-Path...ry-Syncing
#8
Hi devs, I have a kinda stupid question: why should the eeCore and IOP be synchronized? Here in the event test:
Code:
    EEsCycle += cpuRegs.cycle - EEoCycle;
    EEoCycle = cpuRegs.cycle;

    if( EEsCycle > 0 )
        iopEventAction = true;

    if( iopEventAction )
    {        
        EEsCycle = psxCpu->ExecuteBlock( EEsCycle );
        iopEventAction = false;
    }
The eeCore and IOP use EEsCycle to synchronize with each other: while the eeCore is executing it adds cycles to EEsCycle, and while the IOP is executing it subtracts cycles from EEsCycle. Here in the event test function, the eeCore decides whether to let the IOP continue. I wonder why the eeCore doesn't call IOP->executeBlock() directly, without checking EEsCycle? I think it may be because the clock frequencies of the eeCore and IOP are totally different (one is 294MHz, the other about 37MHz), but when I tried deleting the EEsCycle check it still worked well (I only tested Final Fantasy X). So why introduce EEsCycle?

Also, when the eeCore updates EEsCycle it usually adds inst.cycle (> 7), while when the IOP updates EEsCycle it usually subtracts 1, which is weird too.

Moreover, why not introduce an EEsCycle-style implementation for VU1? Right now VU1's execution is just like this:
Code:
test_event() {
...
VU1->executeBlock();
...
}
Thanks!
#9
The sync code got a bit convoluted when it didn't work as expected and Air had to fix it.
It's possible not everything works as he had originally planned.

VUs are decoupled from the rest of the system in our design for a couple of reasons.
They are meant to "one shot" the calculations and return control to the EE.
We then add some timing by delaying VIF, for example.

If you want to focus on timing, it's best if you just look at the EE/IOP sync.
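
For anyone following along, here is a stripped-down model of the cycle budget idea behind the EEsCycle snippet quoted above. It is only a sketch: the 8:1 cycle ratio (taken from the clock ratio of the two cpus), the fixed 5-cycle IOP block and all the names are assumptions for illustration, not the emulator's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Assumed cost of one IOP cycle, measured in EE cycles.
constexpr int32_t kEePerIopCycle = 8;

// Hypothetical IOP model that runs fixed-length blocks.
struct Iop {
    uint32_t cycle = 0;

    // Run IOP blocks until the EE-cycle budget is used up. Returns the
    // leftover budget, which is zero or negative when the last block
    // overshot. The caller carries that debt forward.
    int32_t executeBlock(int32_t eeBudget) {
        assert(eeBudget > 0);
        const int32_t blockIopCycles = 5;  // hypothetical block length
        while (eeBudget > 0) {
            cycle += blockIopCycles;
            eeBudget -= blockIopCycles * kEePerIopCycle;
        }
        return eeBudget;
    }
};

struct SyncState {
    int32_t eeBudget = 0;      // plays the role of EEsCycle
    uint32_t lastEeCycle = 0;  // plays the role of EEoCycle
};

// Called from the EE event test with the EE's current cycle counter.
void eventTest(SyncState& s, uint32_t eeCycle, Iop& iop) {
    s.eeBudget += static_cast<int32_t>(eeCycle - s.lastEeCycle);
    s.lastEeCycle = eeCycle;
    // The IOP only runs once the EE has got ahead of it again.
    if (s.eeBudget > 0) s.eeBudget = iop.executeBlock(s.eeBudget);
}
```

In this model, event tests at EE cycles 100, 110 and 200 leave the IOP at 15, 15 and 25 cycles: the second test does nothing because the IOP still owes 10 EE cycles from overshooting, and by cycle 200 the IOP has run exactly 200 / 8 = 25 cycles. Calling executeBlock() unconditionally would throw away that bookkeeping and let the IOP drift ahead of the EE.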