[blog] Introduction to Dynamic Recompilation
#1
This blog post is an introduction to dynamic recompilers (dynarecs), and hopes to provide some insight on how they work and why pcsx2 uses them to speed up emulation.
It is probably easier to read on our forums, because some of the code didn't wrap nicely on our main blog page....
(Click here to view blog post in forum)

To first understand why dynarecs are useful, you must first be familiar with a basic interpreter emulator.

Assume we are emulating a very simple processor. Processors have instruction sets which are a set of different instructions they can compute.
Lets assume the processor we are emulating is a made-up chip I'll call SL3 (super lame 3), and has only these 3 instructions (and each instruction has fixed width of 4 bytes):

MOV dest_reg, src1_reg // Move source register to destination register
ADD dest_reg, src1_reg, src2_reg // Add source1 and source2 registers, and store the result in destination register
BR relative_address // Branch (jump) to relative address (PC += relative_address * 4)

Processors generally have what we call registers which can hold data, and the processor's instructions perform the operations on these registers.
For our example, we will assume that our SL3 processor has 8 registers, and the size of these registers is 32 bits (so each register holds 32 bits of data).

Now to program for this processor, we can have the following code:

Code:
MOV reg1, reg0
ADD reg4, reg2, reg3
BR 5

What this code does is:
1) It moves register 0 to register 1 (so now register 1 holds a copy of register 0's data).
2) It adds register 2 and register 3 together, and stores the result in register 4.
3) It branches 5 instructions further away (so now it jumps to some code that is further down (not shown in above example))

So that is how we can program for the SL3 processor in assembly code. But how do we emulate it?

To actually emulate this processor we can use an interpreter. An interpreter simply fetches each instruction opcode and executes them accordingly (e.g. by calling emulated methods for each different instruction). The rest of the emulator (emulating other processors/peripherals of our system) can then get updated sometime in between instructions or after a group of cpu instructions are run. Interpreters are a simple and complete way to emulate a system.

(Click here to see a C++ code example of a simple interpreter)

Using interpreters we constantly have to be fetching and executing instructions one-by-one. There is a lot of overhead in this, and minimal room for optimization since most special case optimizations will have the overhead of checking for them (so it will for example add extra if-statements and conditionals... reducing the gain from the optimization). But there's a faster way to do processor emulation which doesn't have these draw-backs... using dynamic recompilation!

The basic idea of dynamic recompilation is to translate emulated instructions once, cache the emitted translated instructions, and then run the emitted native instructions as much times as needed.

Since the instructions you read are not changing (lets leave out self-modifying code for this example), you can translate the emulated instructions into native cpu instructions (in our case x86-32 asm instructions) and then cache the translated instructions into 'blocks' of code, then just execute these blocks of native code instead of having to do the translation over and over again whenever the code needs to be read again.

So for instance remember our above SL3 program:

Code:
MOV reg1, reg0
ADD reg4, reg2, reg3
BR 5

Lets assume this code is part of some function and gets called 100's of times a second (this could sound crazy, but games/applications commonly call the same code hundreds or thousands of times a second).

Now our runCPU() interpreter function above will have to translate every instruction before it can actually compute the result.

That is, it needs to fetch the opcode of every instruction, call the emulated function based on the opcode number, then actually compute the instruction result.
But dynarecs allow us to skip the "fetch opcode" and the "call the emulated function" part, by only doing this once, and then caching the translated code into native asm blocks.

To make a good dynarec, we first need a code emitter.
An emitter is a series of functions we call that write native asm to some memory block we give it.
So we use an x86-32 emitter to write native x86-32 asm code to blocks of memory, and then later we can execute these blocks as if they were normal c++ generated functions!

PCSX2 has a very cool emitter that looks very similar to x86-32 assembly, except the instructions have an 'x' before them.
So for example:
mov eax, ecx;
is
xMOV(eax, ecx);
with the pcsx2 emitter.

Now the idea behind the dynarec we are going to write now, is that we will end blocks whenever a branch instruction is detected (which will jump to other blocks).

The code for actually recompiling these blocks looks something like this:

Code:
// This is our emulated MOV instruction
void MOV() {
    u8 dest = fetch(); // Get destination register number
    u8 reg1 = fetch(); // Get source 1 register number

    xMOV(eax, ptr[&cpuRegs[reg1]]); // Move reg1's data to eax
    xMOV(ptr[&cpuRegs[dest]], eax); // Move eax to dest register

    fetch(); // This fetch is needed because every instruction in our SL3 processor is 4 bytes
}

// This is our emulated ADD instruction
void ADD() {
    u8 dest = fetch(); // Get destination register number
    u8 reg1 = fetch(); // Get source 1 register number
    u8 reg2 = fetch(); // Get source 2 register number

    xMOV(eax, ptr[&cpuRegs[reg1]]); // Move reg1's data to eax
    xADD(eax, ptr[&cpuRegs[reg2]]); // Add eax with reg2's data
    xMOV(ptr[&cpuRegs[dest]], eax); // Move eax to dest register
}

// This is our emulated BR (jump) instruction
void BR() {
    s8 addr = fetch(); // Gets a number by which we will increment (or decrement if negative) PC by
    PC = (PC - 2) + (addr * 4);
    
    // Get a pointer to a block of x86-32 compiled code
    // that was recompiled by the recompileSL3() function
    u8* block_pointer = getBlock(PC);
                        
    xJMP(block_pointer); // Jump to the pointer returned by getBlock()
}

// This goes through instructions and recompiles them
// It recompiles instructions until it reaches a BR() instruction.
u8* recompileSL3(u32 startPC) {
    u8* startPtr = xGetPtr(); // Gets pointer to where the emitter is currently pointing to (the start pointer of the block)
    PC = startPC; // Set PC to the start PC of this block
    bool do_recompile = true;
    while (do_recompile) {
        u8 opcode = fetch();
        switch (opcode) {
            case 0: MOV(); break;
            case 1: ADD(); break;
            case 2: // We stop recompiling on branches
                BR();
                do_recompile = false;
                break;
        }
    }
    return startPtr; // Returns the pointer to where our block of x86 generated code starts at
}

// This holds all the pointers to our blocks that were recompiled based on
// starting PC address. We will assume that the instruction memory for
// this processor is 16kb, which means that it can hold at-most 1024*16 bytes
// worth of instructions. And therefor we we have at-most 1024*16 block pointers.
static u8* blockArray[1024*16];

// This returns a pointer to our recompiled block
// If it hasn't been compiled, it'll recompile the block and then return that pointer.
// We use __fastcall because it lets us pass the startPC parameter in the ecx register
// instead of having to use the x86 stack...
u8* __fastcall getBlock(u32 startPC) {
    if (blockArray[startPC] == null) {
        blockArray[startPC] = recompileSL3(startPC);
    }
    return blockArray[startPC];
}

// Basic cpu emulator using dynamic recompilation
void runCPU() {
    // This sets our emitter to start emitting instructions to rec_cache
    // which is a large block of memory where we can write lots of
    // x86 asm instructions to...
    x86setPtr(rec_cache);    
    
    __asm {
        pushad; // Save all our registers
        mov ecx, PC; // Move PC parameter into ecx register (for __fastcall)
        call getBlock; // Call the getBlock function
        jmp eax; // The block to jump to is returned in eax
    }
}

Note the above code doesn't have any logic to successfully exit once it starts executing recompiled blocks... I left this stuff out in order to not complicate things... so assume that somehow execution ends and we can get back to running the other parts of the emulator...
Also note we need to restore registers when we exit execution, and we also need to set rec_cache to the new x86 emitter address (so it can continue where it left off instead of writing over already recompiled memory blocks).
(Click here for a more complete sl3 recompiler code with cycle counting and exit logic)

Now how does this work?
Well we basically call runCPU() which calls the getBlock() function with the original PC value.
getBlock() then checks if we have already recompiled a block with that start PC value, which we havn't yet, so it will go on to call recompileSL3() and give it the startPC value.
recompileSL3() will loop through the opcodes, fetching them and then calling the appropriate emulated functions which will write to memory x86-32 asm instructions computing the correct results.
recompileSL3() will continue looping until it reaches a BR() instruction (which will recursively call getBlock() until all the addresses have been recompiled and no more recompilation needs to happen).

Once everything has been recompiled we jump to the starting block's execution address, and that's how we actually run the execution.
The code that ends up being executed after we recompiled is only the emitted code, which were the functions prefixed with 'x' (xMOV, xADD, etc...).
Notice that that's a lot of code omitted as we don't have to fetch anything anymore, but instead just run a series of fast generated asm functions...
This means that although we have a lot of extra overhead on the actual recompilation, we end up generating some really fast code, that could be called 1000's of times a second, therefor making the initial overhead well worth it!

We can also perform many optimizations while recompiling such as regalloc, constant propagation, optimized asm generation for special opcode cases, etc... (we didn't do this stuff in our example to avoid complexities).


Hopefully this article has helped you gain a bit of understanding on how dynamic recompilation works, and why we use it in pcsx2 to gain a lot more speed over interpreters!
Also note that dynarecs can be used for more than processor emulation, we also used dynamic recompilation in pcsx2 to cache asm optimized unpack routines that the ps2's vif does.

I should also add that currently pcsx2 has the following dynarecs:
EE recompiler (for MIPS R5900 ee-core processor)
IOP recompiler (for MIPS R3000A i/o processor)
microVU recompiler (for VU0/VU1, and COP2 instructions)
Super VU recompiler (can be used instead of microVU for VU0/VU1)
newVIF unpack recompiler (recompiles vif unpack routines)
r3000air (not yet finished, but should one day supersede the IOP recompiler)
Check out my blog: Trashcan of Code
Reply

Sponsored links

#2
This is a pseudo-sample of an SL3 Interpreter, as described above. It emulates the cpu by performing the same logic of the cpu, one instruction at a time.

Code:
// What this does is grab a byte (8 bits) from cpu memory
// and increments PC by 1.
// PC stands for "program counter" and is something processors use to
// know where in memory they should fetch the next instructions from!
u8 fetch() {
    return cpuMemory[PC++];
}


// This is our emulated MOV instruction
void MOV() {
    u8 dest = fetch(); // Get destination register number
    u8 reg1 = fetch(); // Get source 1 register number

    cpuRegs[dest] = cpuRegs[reg1]; // Perform add operation

    // This fetch is needed because every instruction in our
    // SL3 processor is 4 bytes...
    fetch();
}

// This is our emulated ADD instruction
void ADD() {
    u8 dest = fetch(); // Get destination register number
    u8 reg1 = fetch(); // Get source 1 register number
    u8 reg2 = fetch(); // Get source 2 register number

    cpuRegs[dest] = cpuRegs[reg1] + cpuRegs[reg2]; // Perform add operation
}

// This is our emulated BR (jump) instruction
void BR() {
    s8 addr = fetch(); // Gets a number by which we will increment (or decrement if negative) PC by
    PC = (PC - 2) + (addr * 4);    
    // The reason we subtract 2 from PC is because we have incremented PC
    // twice by calling fetch() once to get the opcode number (in runCPU())
    // and another time above to get the addr value.
    // The reason we multiply address by 4 is because each SL3 instruction is 4 bytes
    // so we need to multiply the addr value by 4 to move in full 'instruction' steps.
    // (If we didn't do this it would be possible to jump in the middle of an instruction!)
}

// Basic interpreter cpu emulator
// We first fetch an opcode number, and then based on that opcode
// number we run the correct instruction by calling the correct function.
// We do this in a loop 100 times, because it is faster to emulate a lot of instructions at once
// instead of calling 1 instruction at a time, and then updating the rest of the emulator every instruction...
void runCPU() {
    for (int i = 0; i < 100; i++) {
        u8 opcode = fetch();
        switch (opcode) {
            case 0: MOV(); break;
            case 1: ADD(); break;
            case 2; BR(); break;
        }
    }
}

// Program starts running here.
// We update the CPU, GPU, SPU, and read input in this loop.
// We exit the loop if 'isRunning' becomes false
// (which for example runInput() can set to false if it detects escape key is pressed)

bool isRunning = true; // Global variable to continue running emulation

void main() {
    while (isRunning) {
        runCPU(); // Run the cpu interpreter code above
        runGPU(); // Run graphic processing code (not shown)
        runSPU(); // Run sound processing code (not shown)
        runInput(); // Get Input (keyboard presses/gamepad...) (not shown)
    }
}
Reply
#3
Good article! Thanks cotton.
I have some question:
1. how does a block recompiled code organized in the cache? for example, consider
MOV reg1, reg0
ADD reg4, reg2, reg3
BR 5
the recompiled code is the continuous binary executable code or function pointer?
2. how does the recomiled block compute the executed cycles, to let other module to execute?
3. how does the interrupt mechanism work?
Core 2 Duo E5200 @ 2.5Ghz /ATI HD5770 /2GB RAM 800 @ 1066Mhz /Asus P5QL /Windows XP SP3
Reply
#4
(03-16-2010, 03:59 AM)whgwbn Wrote: Good article! Thanks cotton.
I have some question:
1. how does a block recompiled code organized in the cache? for example, consider
MOV reg1, reg0
ADD reg4, reg2, reg3
BR 5
the recompiled code is the continuous binary executable code or function pointer?
2. how does the recomiled block compute the executed cycles, to let other module to execute?
3. how does the interrupt mechanism work?

Good questions:

1) The recompiled code is just a bunch of asm instructions, you 'jump' to the starting address and then it runs your recompiled opcode instructions and jumps to other asm blocks.
There should be some exit-condition which will jump to an 'exit-cleanup' function that gets run before you fully exit the recompiled code, but I didn't show that code because it would've made the pseudo-code bigger and more complex.

the recompiled code looks like this for the mentioned sl3 code:
Code:
// Start of MOV
mov eax, cpuRegs[0];
mov cpuRegs[1], eax;
// Start of ADD
mov eax, cpuRegs[3];
add eax, cpuRegs[2];
mov cpuRegs[4], eax;
// Start of BR
jmp 0x10380488; // Address of another Recompiled Block!
.... More recompile code here (for other blocks) ...

2) The current pseudo-code doesn't compute executed cycles, but a real emulator recompiler should! (It was just simpler to leave out that logic from the example.. maybe I'll modify the code later to be more complete)

The basic idea is that you can compute the execution cycles of the block at recompile-time, then add the total cycles at the end of a block.
So for example, assume every sl3 instruction takes 2 cycles. Then for the above sl3 code, the block will be a total of 6 cycles (1 mov * 2 cycles + 1 add * 2 cycles + 1 br * 2 cycles = 6 cycles).
Now that you know the total amount of cycles for the block, you just ADD them to a cpu->cycles variable with an x86 asm emitted ADD like so:
xADD(ptr[&cpu->cycles], blockCycles); // blockCycles is equal to '6' here...

In a real recompiler what you can do is after you add the cycles for the block, you can check how much cycles have passed, and then exit the executing code after a certain amount of cycles have passed.
Like this:
Code:
xADD(ptr[&cpu->cycles], blockCycles); // blockCycles is equal to '6' here...
xCMP(ptr[&cpu->cycles], 512); // compare to 512 cycles
JGE32(exit_cleanup_method); // exit recompiled code if over 512 cycles have passed
... More recompiled code here ...

3) If you're talking about cpu interrupts, then you can check for them on branch instructions... which brings up the point that recompilers generally don't allow for cycle accuracy as you're limited to block granularity.
You can instead add checks every instruction if you want instruction-granularity, but this would of course be slower.

pcsx2 is currently very asynchronous with very bad cycle accuracy. The assumption is that any game that is coded well will not rely on cycle accuracy as its very unpredictable on the real ps2.
Check out my blog: Trashcan of Code
Reply
#5
Why the recompiled code is just a bunch of asm instructions, not the machine instructions? Does the cpu can execute the asm instructions directly?
Core 2 Duo E5200 @ 2.5Ghz /ATI HD5770 /2GB RAM 800 @ 1066Mhz /Asus P5QL /Windows XP SP3
Reply
#6
Asm instructions are machine instructions made readable. The names and syntax are there purely for humans reading the code, to the machine it's all the same.

Inside the emulator it isn't in assembly form, it's just a mechanism to understand what's going on. It's a lot easier to write assembly ( mov[source, destination] ) than machine code ( 0010 0000 0000 0000 0000 0000 0000 0000 )
"This thread should be closed immediately, it causes parallel imagination and multiprocess hallucination" --ardhi
Reply
#7
Yeh as echosierra said, the asm is just human readable machine language.

When you use pcsx2's emitter to generate x86-32 asm code, its really generating machine-code... which is a bunch of binary numbers...
Although your cpu is good at reading the 1's and 0's, its not so easy for a human, so we use asm to make it human readable Laugh

Also to avoid confusion, I should note that pcsx2's emitter uses the operand order:

opcode dest, source; // Destination first
instead of:
opcode source, dest; // Source first

So the same as Intel assembly syntax.
Check out my blog: Trashcan of Code
Reply
#8
thanks for the explanation.
Hope to see more document inside the pcsx2 Smile
Core 2 Duo E5200 @ 2.5Ghz /ATI HD5770 /2GB RAM 800 @ 1066Mhz /Asus P5QL /Windows XP SP3
Reply
#9
Since the above pseudo-code wasn't complete, I took the time to code a more complete recompiler for the sl3 processor.

Code:
// This is our emulated MOV instruction
void MOV() {
    u8 dest = fetch(); // Get destination register number
    u8 reg1 = fetch(); // Get source 1 register number

    xMOV(eax, ptr[&cpuRegs[reg1]]); // Move reg1's data to eax
    xMOV(ptr[&cpuRegs[dest]], eax); // Move eax to dest register

    fetch(); // This fetch is needed because every instruction in our SL3 processor is 4 bytes
}

// This is our emulated ADD instruction
void ADD() {
    u8 dest = fetch(); // Get destination register number
    u8 reg1 = fetch(); // Get source 1 register number
    u8 reg2 = fetch(); // Get source 2 register number

    xMOV(eax, ptr[&cpuRegs[reg1]]); // Move reg1's data to eax
    xADD(eax, ptr[&cpuRegs[reg2]]); // Add eax with reg2's data
    xMOV(ptr[&cpuRegs[dest]], eax); // Move eax to dest register
}

// This is our emulated BR (jump) instruction
void BR() {
    s8 addr = fetch(); // Gets a number by which we will increment (or decrement if negative) PC by
    PC = (PC - 2) + (addr * 4);
    
    // Get a pointer to a block of x86-32 compiled code
    // that was recompiled by the recompileSL3() function
    u8* block_pointer = getBlock(PC);
                        
    xJMP(block_pointer); // Jump to the pointer returned by getBlock()
}

// This adds the amount of cycles the current block
// has taken to cpu->cycles
void addCycles(u32 cycles) {
    xADD(ptr[&cpu->cycles], cycles);
}

// This emits code for exiting out of recompiled block
// execution, when 512+ cycles have passed...
void testCycles(u32 startPC) {

    // compare cpu->cycles to 512 cycles
    xCMP(ptr[&cpu->cycles], 512);
    
    // The code inside this jump statement will be called
    // if over 512 cycles have passed...    
    u8* jmp8 = JL8(0);

        // Save PC value before exiting
        // so that next time we call runCPU()
        // it can continue where it left off.
        xMOV(PC, startPC);

        // We can exit the recompiled blocks by
        // a RET instruction because we CALLed
        // the initial recompiled block...
        xRET();

    x86SetJ8(jmp8)

    // We will jump here if under 512 cycles have passed
}

// This goes through instructions and recompiles them
// It recompiles instructions until it reaches a BR() instruction.
u8* recompileSL3(u32 startPC) {

    // Gets pointer to where the emitter is currently
    // pointing to (the start pointer of the block)
    u8* startPtr = xGetPtr();
    
    // Set PC to the start PC of this block
    PC = startPC;

    // This is a counter for how much cpu cycles
    // the current block takes
    u32 blockCycles = 0;

    // At the beginning of every block we call
    // testCycles() which emits code to check if
    // over 512 cycles have passed.. and if so
    // it exits the recompiled block execution...
    testCycles();


    // Keep recompiling until do_recompiler is false
    bool do_recompile = true;
    while (do_recompile) {

        // Every sl3 instruction is 2 cycles, so add them here...
        blockCycles += 2;

        // Fetch next opcode
        u8 opcode = fetch();
        switch (opcode) {
            case 0: MOV(); break;
            case 1: ADD(); break;
            case 2: // We stop recompiling on branches
                addCycles();
                BR();
                do_recompile = false;
                break;
        }
    }

    // Returns the pointer to where our block of x86 generated code starts at
    return startPtr;
}

// This holds all the pointers to our blocks that were recompiled based on
// starting PC address. We will assume that the instruction memory for
// this processor is 16kb, which means that it can hold at-most 1024*16 bytes
// worth of instructions. And therefor we we have at-most 1024*16 block pointers.
static u8* blockArray[1024*16];

// This returns a pointer to our recompiled block
// If it hasn't been compiled, it'll recompile the block and then return that pointer.
// We use __fastcall because it lets us pass the startPC parameter in the ecx register
// instead of having to use the x86 stack...
u8* __fastcall getBlock(u32 startPC) {
    if (blockArray[startPC] == null) {
        blockArray[startPC] = recompileSL3(startPC);
    }
    return blockArray[startPC];
}

// Basic cpu emulator using dynamic recompilation
// This is the start function of our recompiler
void runCPU() {
    // This sets our emitter to start emitting instructions to rec_cache
    // which is a pointer to a large block of memory where we can write lots of
    // x86 asm instructions to...
    x86setPtr(rec_cache);

    // We initialize the cycles variable to 0
    // our recompiler will exit when cycles
    // is 512 or higher...
    cpu->cycles = 0;
    
    // We refer to the code below as
    // the recompiler dispatcher!
    __asm {
        // Backup non-volatile registers
        push ebp;
        push ebx;
        push esi;
        push edi;

        // Move PC parameter into ecx register (for __fastcall)
        mov ecx, PC;

        // Call the getBlock function
        call getBlock;
        
        // The block to jump to is returned in eax
        // so call eax which will start executing
        // the recompiled code...
        call eax;

        // The recompiled code will return here when
        // it reaches a "ret" instruction.

        // Restore registers
        pop edi;
        pop esi;
        pop ebx;
        pop ebp;
    }

    // This sets the rec_cache pointer to the end
    // of the last recompiled block... so that we can resume
    // emitting code there when runCPU() is called...
    rec_cache = xGetPtr();
}

The code above has the necessary logic to successfully exit block execution every 512 cycles.
That is, it runs the rec for 512 cycles, then exits so that the rest of the emulator can be updated.
Then it can be started again by calling runCPU();
Check out my blog: Trashcan of Code
Reply
#10
very nice code example, but one thing i have always wondered, is how would you handle the cpu flags? i mean i guess you could do them as you would in an interpreter but i assume there is a much better way in a well written dynarec?
Reply




Users browsing this thread: 1 Guest(s)