[blog] Introduction to Dynamic Recompilation
#11
(03-19-2010, 02:09 AM)hatorijr Wrote: Very nice code example, but one thing I have always wondered is: how would you handle the cpu flags? I mean, I guess you could do them as you would in an interpreter, but I assume there is a much better way in a well-written dynarec?

Well, there is a big benefit to dynarecs regarding cpu flags, now that you mention it.
Since you already know the instruction ordering for the block, you only need to update the cpu flags when some instruction is actually going to read them!

Detecting this and coding a nice system for it can actually get complex (especially if the cpu you're emulating is pipelined and can have multiple live flag instances)... This was also the most complex part of my microVU dynarec for pcsx2...

Basically, the idea is that to do it nicely you need multiple passes in your dynarec.
Pass 1 will analyze the dynarec block and compute some extra information for each instruction: stuff like whether or not its flags need to be updated, pipeline information, constant propagation information (if using const prop), etc...

Pass 2 will then actually emit the x86 asm code using the information from pass 1.
So for example, assume the sl3 processor only has 1 flag, and that flag is set if the result of the last calculation was negative (i.e. its sign bit was set).

You can then have your opcode look like this:
Code:
void ADD() {
    u8 dest = fetch(); // Get destination register number
    u8 reg1 = fetch(); // Get source 1 register number
    u8 reg2 = fetch(); // Get source 2 register number

    xMOV(eax, ptr[&cpuRegs[reg1]]); // Move reg1's data to eax
    xADD(eax, ptr[&cpuRegs[reg2]]); // Add eax with reg2's data
    xMOV(ptr[&cpuRegs[dest]], eax); // Move eax to dest register

    if (IRinfo->updateFlags) {
        xSHR(eax, 31); // Shift right by 31 to get sign bit
        xMOV(ptr[&cpu->flags], eax); // Move Flag to cpu->flags variable
    }
}

IRinfo is a pointer to a struct with your computed opcode-information from pass 1.
IR stands for intermediate representation, which is what we call it when a dynarec computes extra information with multiple passes.

Since the total instruction memory for the sl3 processor was 16kb, we can implement an IR system like so:

Code:
struct recIR {
    bool updateFlags; // This instruction needs to update flags
    bool isBranch;    // This instruction is a branch instruction
    bool moreInfo1;
    bool moreInfo2;
};

// Our instruction memory is 16kb.
// For this example we are also assuming that the sl3 processor
// requires 4-byte alignment (because each sl3 instruction is 4 bytes)
// So we can divide 16kb by 4 to get the total number of
// recIR structs we need...
recIR IR[(1024*16) / 4];

// This is a pointer to a recIR struct for the current sl3 opcode-instruction
// Before recompiling an sl3 instruction, we should do something like:
// IRinfo = &IR[PC / 4];
// To set the IRinfo pointer to the correct instruction's recIR struct...
recIR* IRinfo;
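
To give an idea of what pass 1 might look like, here's a minimal sketch that fills in updateFlags by walking the block backwards. It assumes the sl3 code lives in an instrMem array and that BR (say, opcode 2) is the only instruction that reads the flag; those details are made up for this toy example. The point is just that an instruction only needs to update the flag if something later actually reads it before it gets overwritten:

Code:
// Hypothetical pass-1 flag analysis for the toy sl3 cpu.
// Assumes: instrMem[] holds the instruction memory, the first byte of
// each 4-byte instruction is the opcode, and opcode 2 (BR) is the only
// reader of the flag.
void analyzeBlock(u32 startPC, u32 endPC) {
    bool flagIsLive = true; // conservative: the next block might read the flag
    for (s32 pc = endPC; pc >= (s32)startPC; pc -= 4) {
        u8 opcode = instrMem[pc];
        IR[pc / 4].updateFlags = flagIsLive; // only emit flag code when needed
        if (opcode == 2) flagIsLive = true;  // BR reads the flag
        else             flagIsLive = false; // every other op overwrites it
    }
}

Note the "conservative" part: it covers the flags-at-the-end-of-a-block issue I mention further down.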

When making a dynarec for a complicated processor, it is very nice to have an IR system that gives you extra information that you can use for optimizations, pipeline logic, etc...

Some dynarecs go one step further with IRs and also develop an IL (Intermediate Language).
I don't have experience with doing so, but the main benefit of an IL is that you don't have to change much of your dynarec in order to have it compile native code for other architectures... e.g. supporting both x86-32 and x86-64...
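
Just to give a rough flavor of it (this is a hypothetical sketch, not how any particular dynarec does it), an IL layer might look something like this:

Code:
// Hypothetical IL: decode sl3 opcodes into a generic three-address
// form first, then lower that form to whatever native target you want.
enum ILOp { IL_MOV, IL_ADD, IL_BR };

struct ILInst {
    ILOp op;
    u8   dest, src1, src2;
};

// pass A: sl3 code -> ILInst[]  (target-independent)
// pass B: ILInst[] -> x86-32 or x86-64 (only this backend changes per target)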

IMO though, an IL can make a dynarec's code a lot more confusing for other people to read, since your dynarec will have another layer of code before it emits the native code... it can easily bloat your dynarec and get messy fast...
Unless portability is a high concern, or you find some great benefit to an IL, it's probably best not to have one, for simplicity's sake.

---------

Oh, also, to get back on the subject of flag computations... I forgot to mention that if you do implement logic to skip flag calculations, you should still be sure to update the flags before the end of a block, because it's possible that the beginning of one block will read the flags from the end of another block.
You can also have logic to detect this case (my microVU rec does...), but it makes the dynarec more complex, and depending on the cpu you're emulating it might not be worth it.
Check out my blog: Trashcan of Code
Reply


#12
How are the x-functions implemented? Such as xGetPtr(),
xADD, x86SetPtr()...
Core 2 Duo E5200 @ 2.5Ghz /ATI HD5770 /2GB RAM 800 @ 1066Mhz /Asus P5QL /Windows XP SP3
Reply
#13
(03-22-2010, 05:41 AM)whgwbn Wrote: How are the x-functions implemented? Such as xGetPtr(),
xADD, x86SetPtr()...

They're just writes to memory.
Code:
// The emitter has an internal pointer to memory:
u8* x86ptr;

u8* xGetPtr() { return x86ptr; }
void xSetPtr(u8* ptr) { x86ptr = ptr; }

void xWrite8(u8 data) { *x86ptr++ = data; }

Stuff like xADD writes the x86-32 opcode bytes to the memory at x86ptr.
So something like:
Code:
void xADD(u8 gpr0, u8 gpr1) {
    xWrite8(0x84); // opcode number
    xWrite8(gpr0);
    xWrite8(gpr1);
}

The xADD() example above is totally made up, but it should get the idea across.
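
For reference, a real register-to-register ADD on x86-32 is opcode 0x01 ("ADD r/m32, r32") followed by a ModRM byte, so a version that emits working code could look like this (still a simplified sketch that only handles register-direct operands):

Code:
void xADD(u8 dest, u8 src) {
    xWrite8(0x01);                     // ADD r/m32, r32
    xWrite8(0xC0 | (src << 3) | dest); // ModRM: mod=11 (reg-direct), reg=src, rm=dest
}

// e.g. xADD(1, 0) emits the bytes 01 C1, which is "add ecx, eax"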

pcsx2's emitter is pretty sophisticated as far as emitters go, since it makes use of operator overloading to make the emitter functions look more like inline assembly syntax...
that's not shown in the above example, but you can look at pcsx2's source for the real implementations ;p
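
Very roughly (a made-up sketch, not pcsx2's actual code), the operator-overloading trick is what lets calls like xMOV(eax, ptr[&cpuRegs[reg1]]) work: ptr is an object whose operator[] wraps an address, and xMOV is overloaded on the operand types:

Code:
struct xRegister { u8 id; };       // register operand, e.g. eax
struct xIndirect { void* addr; };  // memory operand, e.g. ptr[&var]

struct xPtrMaker {
    xIndirect operator[](void* addr) const { return { addr }; }
};
static const xPtrMaker ptr = {};
static const xRegister eax = { 0 };

void xWrite32(u32 data) { for (int i = 0; i < 4; i++) xWrite8(data >> (i * 8)); }

// Overload resolution picks the right encoding for each operand order.
// (x86-32 only: we cast the pointer to a 32-bit absolute address)
void xMOV(xRegister to, xIndirect from) {
    xWrite8(0x8B);                // MOV r32, r/m32
    xWrite8((to.id << 3) | 0x05); // ModRM: mod=00, rm=101 -> disp32 address
    xWrite32((u32)from.addr);
}
void xMOV(xIndirect to, xRegister from) {
    xWrite8(0x89);                // MOV r/m32, r32
    xWrite8((from.id << 3) | 0x05);
    xWrite32((u32)to.addr);
}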
Check out my blog: Trashcan of Code
Reply
#14
Great intro. Thanks! This definitely clears up a whole lot of documents I've read about dynarecs. I might as well take a sec to thank the devs for such hard work. PCSX2 is what sparked my interest in programming, and now I've written my own 6502 interpreter (that won me 20 bucks at a science fair).

I have a question. I don't understand how getBlock() works. startPC (I think, unless I missed something) is just the PC, so if the PC was greater than 1024*16, startPC would be outside the blockArray bounds. Did I miss something somewhere? Or does the PC just not go above 1024*16? Or am I completely wrong?
Reply
#15
(03-28-2010, 10:32 PM)Urisma Wrote: Great intro. Thanks! This definitely clears up a whole lot of documents I've read about dynarecs. I might as well take a sec to thank the devs for such hard work. PCSX2 is what sparked my interest in programming, and now I've written my own 6502 interpreter (that won me 20 bucks at a science fair).

I have a question. I don't understand how getBlock() works. startPC (I think, unless I missed something) is just the PC, so if the PC was greater than 1024*16, startPC would be outside the blockArray bounds. Did I miss something somewhere? Or does the PC just not go above 1024*16? Or am I completely wrong?

Nah, you're correct: we're assuming that the instruction memory is only 16kb, so PC should always be between 0 and 1024*16-1.
In a real emu, somewhere in your code you should probably AND the PC value to guarantee that it's in that range, though.
e.g. PC &= 0x3fff;

Also, maybe to give a more concrete example:
the NES uses a 6502 cpu, and NES Mapper0 ROMs have at most 32kb of program memory.
If you were doing a dynarec for the NES with no mappers (Mapper0), you could then make the array size 1024*32.
(I've written a NES emu... I was also pondering doing a dynarec for it but haven't had the time)


On the dynarec I did for pcsx2's VU processors, I was able to take advantage of the fact that instructions are 8-byte aligned.
So even though VU1 has 16kb of instruction memory, my blockArray only needed to be of size (1024*16)/8.

that means however that getBlock() would have to shift the PC value like so:
Code:
u8* __fastcall getBlock(u32 startPC) {
    if (blockArray[startPC/8] == NULL) {
        blockArray[startPC/8] = recompileSL3(startPC);
    }
    return blockArray[startPC/8];
}
Check out my blog: Trashcan of Code
Reply
#16
cottonvibes,

Thank you for the detailed post explaining the stuff going on under the hood. For your hypothetical ps2 code:

(03-13-2010, 07:44 AM)cottonvibes Wrote:
Code:
MOV reg1, reg0
ADD reg4, reg2, reg3
BR 5

Would you post what the translated emulated instructions will end up looking like in native x86-32 cpu instructions if you used a regular interpreter? And, for comparison, could you also post the dynamically recompiled version?

Conceptually I understand the speedup from recompiling but it would be nice to visually see the difference.
Reply
#17
her34:
I know you asked cotton, but I might be able to help. How an interpreter will actually execute it depends on the compiler, but unoptimized it'd probably look a bit like this (I think my syntax is probably way off; I'm not good at assembler). In the interpreter, the emulated registers will most likely be in memory, so they would have to be loaded in each function, stored back to memory, and moved back into registers again for each instruction. Lots of overhead.
Code:
mov eax, [reg0]
mov [reg1], eax
; store all emulated registers
; fetch another instruction here, decode it, etc
adc reg2, reg3
mov reg4, reg2
; fetch another instruction
sub regPC, 2 ; regPC is the emulated PC. This is the address fetch
mov regSomething, addr ; move the address of addr to a register (more in the BR)
mov regSomethingElse, [regSomething] ; move the actual address into a different register
mul regSomethingElse, 4 ; multiply the address by 4
adc regPC, regSomethingElse ; create the new address to branch to in regPC
; push things in order to call the C function getBlock, which recompiles if the value of regPC in the blockpointer array points to a null
; assuming the return value is in ecx...
call ecx ; or jmp ecx, either way

Here's the dynarec version:
Code:
mov reg1, reg0
adc reg2, reg3
mov reg4, reg2 ; I don't think x86 can do rd, rs, rt format
sub regPC, 2
mov regFree, [addr]
mul regFree, 4
adc regPC, regFree
; call the getBlock function
jmp ecx ; assuming ecx is the return value of getBlock()

Another thing about the interpreter is that it calls lots and lots of functions. Lots. Which is bad, since each call includes about 4 pushes for the registers to be used in the function, pushes for the parameters, pops for the original values when exiting the function, and jumps to and from the function, and the function has to be generic. That means it has to handle any possible use of the instruction. Take, for example, target code (the stuff being emulated; MIPS R4000 for my purposes, it's fairly easy to understand) that looks like this:

Code:
lwi r1, 0xbeef ; load the 16-bit immediate (sign extended to 32 bits) into register 1
lwi r2, 0xc0de ; load the 16-bit immediate (sign extended to 32 bits) into register 2
adc r3, r2, r1 ; add register 2 with register 1 and store the result in register 3
addi r0, r3, 0x1234 ; add the contents of register 3 with a 16-bit immediate and store the result in register 0

Now, in an interpreter, each of these would be executed, with all the function calls and whatnot. That's because it's generic. A dynarec, though, offers lots of case-specific stuff, which is very good, since these cases will stay the same whenever the same code is called.

The first optimization that can be made is some constant propagation (there's a rough sketch of it after the snippets below). The 3rd instruction will always load 0xbeef + 0xc0de, so why not just recompile that into something that moves (0xbeef + 0xc0de) into register 3? That's a huge optimization.

The last instruction doesn't even have to be emulated in many cases (and the same constant propagation can be used). Why? Well, in MIPS, register 0 is hardwired to 0, and it can be targeted to compute branching. In an interpreter, this would always be executed. In a dynarec, if the next instruction (or 2 instructions later, I forget how delays work in MIPS) is not a branch, this doesn't have to be emulated at all, and the rest of the code can be executed. If it is a branch, then you can just compare ((0xbeef + 0xc0de) + 0x1234) with the branch condition (using the x86 cmp instruction) and then use a simple conditional jump. The code will look like this:

Without branching:
Code:
mov r3, (0xbeef + 0xc0de) ; this is computed at recompile time, so it's an immediate value

That's it. Really.

With branching:
Code:
mov r3, (0xbeef + 0xc0de) + 0x1234 ; same as above, computed at recompile time
; if the branching condition is beq (branch equal)
cmp r3, someOtherThingToBeComparedWith
jz whatever

Still pretty optimized.
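
Here's the constant propagation sketch I mentioned (completely hypothetical names; the idea is just to track which emulated registers hold known values at recompile time and fold operations on them instead of emitting code):

Code:
// Hypothetical const-prop bookkeeping for a MIPS-style recompiler.
struct ConstState {
    bool isConst[32]; // MIPS has 32 GPRs
    u32  value[32];   // known value, only valid when isConst is set
};

void recLWI(ConstState& cs, u8 rd, u16 imm) {
    cs.isConst[rd] = true;
    cs.value[rd]   = (u32)(s32)(s16)imm; // sign-extend like the real op
    // no native code emitted -- the constant might never need to hit memory
}

void recADD(ConstState& cs, u8 rd, u8 rs, u8 rt) {
    if (cs.isConst[rs] && cs.isConst[rt]) {
        cs.isConst[rd] = true; // both operands known: fold at recompile time
        cs.value[rd]   = cs.value[rs] + cs.value[rt];
    } else {
        // otherwise flush any const operands to memory and emit a real add
        cs.isConst[rd] = false;
    }
}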


Obviously, an interpreter can do it any number of ways. The only way to know is to look at the code executed in your debugger when running any emulated instruction.

The speed is in not having to fetch, decode, and re-access registers and memory so much. I hope I helped a little. Someone more experienced will probably be able to give you a much better idea.
Reply
#18
(06-05-2010, 05:58 AM)her34 Wrote: cottonvibes,

Thank you for the detailed post explaining the stuff going on under the hood. For your hypothetical ps2 code:

(03-13-2010, 07:44 AM)cottonvibes Wrote:
Code:
MOV reg1, reg0
ADD reg4, reg2, reg3
BR 5

Would you post what the translated emulated instructions will end up looking like in native x86-32 cpu instructions if you used a regular interpreter? And, for comparison, could you also post the dynamically recompiled version?

Conceptually I understand the speedup from recompiling but it would be nice to visually see the difference.


for the interpreter it basically does this:

Code:
call fetch;  // fetch the first opcode by calling fetch()
cmp eax, 0;  // result of fetch() is returned in eax, see if it's '0'
je LABEL00;  // jump to LABEL00 if equal to 0
cmp eax, 1;  // see if eax is '1'
je LABEL01;  // jump to LABEL01 if equal to 1
cmp eax, 2;  // see if eax is '2'
je LABEL02;  // jump to LABEL02 if equal to 2
jmp FIN0;    // jump to FIN0 if none of the above

LABEL00:
    call MOV;  // call MOV() function
    jmp FIN0;  // jump to FIN0

LABEL01:
    call ADD;  // call ADD() function
    jmp FIN0;  // jump to FIN0

LABEL02:
    call BR;   // call BR() function

FIN0:

That is the code run for just 1 instruction; it has to loop through that code 3 times to interpret all 3 instructions.

Also notice that the above code is calling fetch(), MOV(), ADD(), and BR(), so that is even more code that the interpreter needs to run. I didn't type those out above because that would take too long.
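
For reference, the C++ that compiles into dispatch code like that is basically just a switch over the fetched opcode (a minimal sketch, using the fetch()/MOV()/ADD()/BR() functions from above):

Code:
// Minimal sketch of the interpreter's dispatch in C++; the compiler
// turns the switch into compare-and-jump code much like the asm above.
void interpretOne() {
    switch (fetch()) {
        case 0: MOV(); break; // opcode 0 = MOV
        case 1: ADD(); break; // opcode 1 = ADD
        case 2: BR();  break; // opcode 2 = BR
    }
}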



For comparison this is the complete code that gets executed by the recompiler:

Code:
mov eax, cpuRegs[0];
mov cpuRegs[1], eax;

mov eax, cpuRegs[2];
add eax, cpuRegs[3];
mov cpuRegs[4], eax;

jmp 0x22383922; // Address to next block

It's short and simple; just those 6 instructions for the whole block.


The C++ compiler may be able to optimize the interpreter code better than what I showed in my example, but it still ends up having a lot more overhead than the recompiled code.

Remember that the interpreter is always decoding the instructions 1-by-1, and then once it detects which opcode it is, it calls the appropriate function.
The recompiler, OTOH, just decodes the information once, generates the x86-32 code to memory, and then only the generated x86-32 code needs to be executed in the future.

The first time the recompiler runs it is indeed slower than an interpreter, but the generated code it creates is then run thousands of times, so that makes up for the time the recompilation process takes.
Check out my blog: Trashcan of Code
Reply
#19
Because I feel like filling in some blanks, here's what a couple of functions in the interpreter might look like in asm:

Code:
fetch:
mov eax, dword ptr [PC]
movzx eax, byte ptr [cpuMemory + eax]
add [PC], 4
ret

ADD:
mov eax, dword ptr [PC]
movzx ecx, byte ptr [cpuMemory + eax - 2]
mov edx, dword ptr [cpuRegs + ecx*4]
movzx ecx, byte ptr [cpuMemory + eax - 1]
add edx, dword ptr [cpuRegs + ecx*4]
movzx ecx, byte ptr [cpuMemory + eax - 3]
mov dword ptr [cpuRegs + ecx*4], edx
ret
Reply
#20
Actually, since this is a simple machine, let's do the entire interpreter with sensible inlining:

...no, wait a minute, SL3 isn't Turing complete. Let's modify it slightly first.

SL4:

32-bit word
256 registers, each one word initialised to 0 and interpreted as two's complement for all operations
Unspecified amount of instruction ROM, no data ROM or RAM

mnemonic - operation - encoding ("/" separates bytes)
ADD reg1, reg2, reg3 - reg1 <- reg2 + reg3 - 00 / reg1 / reg2 / reg3
BEQ reg1, reg2, offset - if reg1 = reg2: PC <- PC + offset * 4 (after PC increment from decoding, so offset is relative to the next instruction) - 01 / reg1 / reg2 / offset (signed byte, units of one word)

Code:
mov esi, 0 ; esi is used as the program counter

MAINLOOP:
add esi, 1

cmp byte ptr [cpuMemory + esi*4 - 4], 1
ja INVALID
je BEQ
; fallthrough

ADD:
movzx eax, byte ptr [cpuMemory + esi*4 - 2]
mov ecx, dword ptr [cpuRegs + eax*4]
movzx eax, byte ptr [cpuMemory + esi*4 - 1]
add ecx, dword ptr [cpuRegs + eax*4]
movzx eax, byte ptr [cpuMemory + esi*4 - 3]
mov dword ptr [cpuRegs + eax*4], ecx
jmp MAINLOOP

BEQ:
mov edx, esi
movsx eax, byte ptr [cpuMemory + esi*4 - 1]
add edx, eax
movzx eax, byte ptr [cpuMemory + esi*4 - 3]
mov ecx, dword ptr [cpuRegs + eax*4]
movzx eax, byte ptr [cpuMemory + esi*4 - 2]
cmp ecx, dword ptr [cpuRegs + eax*4]
cmove esi, edx
jmp MAINLOOP

INVALID:
; todo
jmp MAINLOOP

As you can see, inlining is very important to an interpreter.
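
If the asm is hard to follow, here's roughly the same inlined interpreter written in C++ (a sketch; cpuMemory and cpuRegs are the same arrays the asm uses, and a decent compiler will turn the switch into dispatch code much like the above):

Code:
u8  cpuMemory[1024 * 16]; // instruction ROM (amount unspecified, 16kb here)
u32 cpuRegs[256];         // 256 one-word registers, initialised to 0

void runSL4() {
    u32 pc = 0; // program counter in words, like esi above
    for (;;) {
        const u8* i = &cpuMemory[pc * 4]; // the current 4-byte instruction
        pc++;                             // increment before decoding, like the asm
        switch (i[0]) {
            case 0x00: // ADD reg1, reg2, reg3
                cpuRegs[i[1]] = cpuRegs[i[2]] + cpuRegs[i[3]];
                break;
            case 0x01: // BEQ reg1, reg2, offset
                if (cpuRegs[i[1]] == cpuRegs[i[2]])
                    pc += (s8)i[3]; // signed offset in words, relative to the next instruction
                break;
            default:
                break; // invalid opcode: todo
        }
    }
}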
Reply



