Game rebuilding (Or: High-level emulation)
#1
Hello. Within less than two weeks, I'll publish a bare-bones smart Emotion Engine disassembler written in Haskell. The goal is to develop an automated or semi-automated game rebuilder that creates an IR and feeds it into another IR, such as LLVM IR or GCC's IR (thus making a game locked to the PS2 platform available for building on other platforms).

Since the developers of PCSX2 are far more experienced in this than I am, I would like to hear your opinions on the feasibility of BIOS-less, high-level, ahead-of-time game rebuilding. I think it's feasible (for the reasons below), but under certain circumstances it might not be (those circumstances are listed below as well).

Why I think it's feasible: the original developers of PS2 games did not go as close to the bare metal as they could have. Although I have found a tutorial (from a homebrewer) explaining how to develop IRX modules, apparently this wasn't required for developers who had access to Sony's SDK (which has multiple versions, but the number of versions is finite). Developing games is hard as it is, and it looks counterproductive to have third-party developers writing their own IRX modules or IOP images.

So the existing games (which are likely all the games that will ever be available for the PS2) contain boilerplate binary code (IOP images, IRX modules, code statically linked from Sony's libraries). Rather than trying to emulate this boilerplate code, it's possible to create a map where the hash of a procedure's binary code maps to a high-level procedure. Does the current PCSX2 have an approach like this?
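
A minimal sketch of what I have in mind (the type names and handlers below are made up for illustration; a hash of the procedure bytes would work just as well as the raw bytes used here):

Code:
-- Sketch: map known boilerplate procedures to host-side (HLE) implementations.
-- EmulatorState, HleHandler and the table entries are hypothetical.
import qualified Data.ByteString as BS
import qualified Data.Map.Strict as Map

type EmulatorState = ()                      -- placeholder for whatever the handler needs
type HleHandler    = EmulatorState -> IO EmulatorState

-- Table keyed on the procedure's raw machine code.
hleTable :: Map.Map BS.ByteString HleHandler
hleTable = Map.fromList
  [ -- (bytesOfSdkMemcpy, hleMemcpy)         -- entries would come from the SDK blobs
  ]

-- When the disassembler recognizes a procedure, ask whether we already know it.
lookupHle :: BS.ByteString -> Maybe HleHandler
lookupHle procedureBytes = Map.lookup procedureBytes hleTable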

Apparently, even the syscalls were wrapped in procedures. So the original developers saw and wrote:
Exit()

Rather than:
li $v1,4
syscall

And, if indeed all IOP images and IRX modules are boilerplate code, would there be a real need to emulate the other processor of the PS2? The original developers read Sony's SDK API without knowing what every single instruction of the BIOS does (if PCSX2 needs the BIOS, isn't it because the BIOS contains code that PCSX2 does not? And, once given the BIOS, doesn't PCSX2 recompile every single reachable instruction in it?).

There are remote procedure calls (called from EE, done on IOP), but these could be faked by tracking the IOP state without actually making it run the low-level code it's supposed to have loaded on itself, right?
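
To illustrate what I mean by "tracking the IOP state" (the state record, call ids and handler below are entirely hypothetical; this is only a sketch of the idea):

Code:
-- Hypothetical sketch: instead of running IOP code, dispatch each EE->IOP RPC
-- to a host-side handler that updates a model of the IOP state.
import qualified Data.Map.Strict as Map

-- A made-up model of the IOP state that would need to be tracked.
data IopState = IopState
  { loadedModules :: [String]
  , openHandles   :: Map.Map Int FilePath
  }

-- A made-up RPC: a numeric call id plus argument words, returning a result word.
fakeRpc :: Int -> [Int] -> IopState -> (Int, IopState)
fakeRpc callId _args st = case callId of
  1 -> (0, st { loadedModules = "some.irx" : loadedModules st })  -- a "load module" call
  2 -> (Map.size (openHandles st), st)                            -- a "query"-style call
  _ -> (-1, st)                                                   -- unknown call: fail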

Why it might not be feasible: games that rely on JIT-generated code would mean an immediate defeat. Anything dynamic that is very hard to fake ahead of time would also mean an immediate defeat.

As stated at the beginning of the topic, I would like to read what you have to say.
#2
Quote:And, if indeed all IOP images and IRX modules are boilerplate code, would there be a real need to emulate the other processor of the PS2? The original developers read Sony's SDK API without knowing what every single instruction of the BIOS does (if PCSX2 needs the BIOS, isn't it because the BIOS contains code that PCSX2 does not? And, once given the BIOS, doesn't PCSX2 recompile every single reachable instruction in it?).
The cost of emulating IOP images and IRX modules is zero; complete IOP emulation is essentially free. Even if the IOP were emulated in 0 cycles, you would gain at best 5-10%.

I'm not even sure that EE emulation is that costly; the EE recompiler misses various optimizations. However, I do know the impact of replacing the EE kernel with an HLE implementation: it is zero. You only save the overhead of the kernel, which is designed to be close to zero.

Quote: There are remote procedure calls (called from EE, done on IOP), but these could be faked by tracking the IOP state without actually making it run the low-level code it's supposed to have loaded on itself, right?
The issue, and the cost, is timing. It isn't costly to execute the IOP procedure call; the issue is that the EE needs to wait for the reply. Sometimes that is just an idle loop burning CPU cycles, sometimes the EE switches to other threads and executes other tasks.

An HLE IOP would be nice to reduce (or remove) the dependency on the BIOS (I have some code for the EE kernel). Otherwise, the performance impact will be zero.
#3
I would like to reiterate that I'm not trying to burden any of you with work. I'm just trying to gauge the feasibility of a few things.

There is another point in favor of making a game IOP-, IRX- and BIOS-free: theoretically, since it would not use any of Sony's blobs, a third-party developer that made a game for the PS2 would be able to republish it on other platforms without worrying about possible litigation.

Is it possible to defer the recompilation process (including the IOP modules) to a third-party library rather than using PCSX2's custom recompiler (i.e., to use LLVM)?

And what about the vector units? To use an analogy with OpenGL, they sometimes work like a vertex shader (they apply a function to vertices and, via the XGKICK instruction, start a vertex stream that produces triangle fans, triangle strips, lines and other primitives, behaviour very similar to OpenGL's vertex shader). But VU memory is indirectly reachable from the EE, so it isn't exactly a vertex shader.

If the VU takes too long to send the vertices to the GS, the drawing process will lag. And apparently the PS2's GS starts drawing as soon as it can (once it has three vertices of a triangle strip, there is a triangle to be drawn). The approach with OpenGL is different: data is gathered into large batches so that the expensive draw calls are issued fewer times.
#4
Sure, yes, I know about the BIOS issue. It would be nice to have; I started a couple of things but had higher priorities. We are all in favor of an HLE BIOS because of the copyright issue.

Yes, it would be possible to generate LLVM IR. Various people have told me that LLVM would be much slower than a custom emitter. I didn't challenge them, but you can find information on Google saying that LLVM isn't good for JIT work.

Forget about the VU and OpenGL. The VUs are vector units, so yes, they are quite close to a vertex shader, but the architecture is quite different. However, the speed issue isn't performing the vector operations; SSE2 can do those just as fast. The issue is getting the same result as the PS2 when you do an addition or a multiplication.

The GS == rasterization unit + FS + (end of the pipeline). There is no draw call on the PS2 (at least not at the GS level). And you will see that textures are uploaded very often, too. Note that there are several ways (paths) to send data to the GS.
#5
For anyone who might be interested, the source code for the bare-bones disassembler can be cloned from here:

https://github.com/frederico-miranda/smart-ee-disasm

There isn't a cabal file (or Makefile) yet, so it has to be built manually like this:

Code:
cd "src/"
ghc Main.hs -o smart-ee-disasm
#6
"A disassembler with restore labels for the Emotion Engine processor. "

I can imagine pseudonym being interested.
#7
I have not contacted "pseudonym"; I'm just looking for hints (since you have all seen the real games in action and know what they are really doing). I'll keep maintaining the code and keep it public. If this disassembler code (and related projects) turns out to be useful to PCSX2, feel free to use it.

Weekly report.

More code has been added, and now, rather than starting at the ELF's entry point, the executable disassembles all base procedures (procedures which do not call other procedures), which are very important when restoring types. By analyzing how registers are used, it's possible to infer what they originally were. An example:

Code:
LABEL_003444B0:
       DADDU $v0, $zero, $zero
       LW $v1, 0($a0)
       BEQ $v1, $zero, LABEL_003444D0
       SLL $zero, $zero, #0
       SLL $v0, $a1, #2
       ADDU $v0, $v1, $v0
       LW $v0, 0($v0)
       SLL $zero, $zero, #0
LABEL_003444D0:
       JR $ra
       SLL $zero, $zero, #0

$a0 (which is guaranteed to be an argument, since its value is used before any assignment) is guaranteed to be a pointer (likely an "int32_t *", but that is not guaranteed) because of the LW instruction. So, when this procedure is called, we know something about the types of the arguments and can propagate that upwards (the caller learns the types of the called function).

But the example above also shows the shortcomings of this approach. Is $a1 (an argument) a pointer, or is $v1 the pointer (or, very strangely, are both pointers)? It cannot be decided by looking at this procedure alone. Further analysis of the whole program might bring more information about the types. (Cases where the types cannot be definitively decided still have hope, just not an optimal solution.)
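
A minimal sketch of this kind of inference (the instruction type and rule below are simplified illustrations, not the actual code from the repository):

Code:
-- Sketch: collect the registers that are used as the base of a load or store;
-- those must hold pointers. The Instr type is a simplification, not the real AST.
import qualified Data.Set as Set

type Reg = String

data Instr
  = LW Reg Int Reg      -- LW rt, offset(base)
  | SW Reg Int Reg      -- SW rt, offset(base)
  | Other               -- everything irrelevant to this rule
  deriving Show

pointerRegs :: [Instr] -> Set.Set Reg
pointerRegs = foldr step Set.empty
  where
    step (LW _ _ base) acc = Set.insert base acc
    step (SW _ _ base) acc = Set.insert base acc
    step Other         acc = acc

-- For the procedure above this reports $a0 (and the computed $v0) as pointers:
-- pointerRegs [LW "$v1" 0 "$a0", LW "$v0" 0 "$v0"]  ==  Set.fromList ["$a0", "$v0"]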

The reason I'm trying to tell apart what is a pointer and what is not is the following:

Apparently, the IO emulation is done like this: the TLB dictates how virtual memory addresses are accessed. The emulator takes advantage of this and uses "mmap". When a game tries to access the virtual address "0x00422198", that host memory address is accessed (which looks good, since apparently it couldn't have less overhead than this). This approach avoids a call to a procedure that would translate a PS2 virtual address into a host address.

However, there might be another way of making this translation from virtual to host cheaper: the use of a pointer is likely consistent (a pointer used for peripheral IO wouldn't be used for accessing a data array or data structure, and vice versa). So, by identifying the pointers and propagating them, only the original pointer would need the conversion from virtual to host (the remaining accesses would simply be pointer arithmetic). I do not have technical reasons for this other approach; I just dislike using "mmap" like this. If I had to think of a technical reason to dislike it, it would be portability.
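
A trivial sketch of this "translate once" idea (hostBase stands in for wherever the emulator keeps guest RAM; none of this reflects PCSX2's actual memory code):

Code:
-- Per-access translation: every load/store pays for a conversion like this.
-- Translate-once: convert the base pointer a single time; derived accesses
-- (array indexing, struct fields) are then plain host-pointer arithmetic.
import Foreign.Ptr (Ptr, plusPtr)
import Data.Word (Word32, Word8)

translate :: Ptr Word8 -> Word32 -> Ptr Word8
translate hostBase guestAddr = hostBase `plusPtr` fromIntegral guestAddr

elementAt :: Ptr Word8 -> Int -> Ptr Word8
elementAt basePtr index = basePtr `plusPtr` (index * 4)   -- e.g. an int32_t array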

On another topic, according to Gregory,
Quote:Const propagation is limited to a block "from a jump entry to the next jump"

Some base procedures could use a more aggressive constant propagation (example on pastebin; it does not matter which branch is taken, the $sp register does not change). It might even be possible to make that code completely branchless (thus avoiding any branch misprediction penalty). That example, by the way, has a lot of IO (many stores and many loads), and all accesses using $sp as the base are within the allocated stack space. Could this improve performance? (Not trying to burden you.)

Code:
       SW $a0, 16($sp)
       SW $a1, 32($sp)
       SW $a2, 48($sp)
       SW $a3, 64($sp)
       SW $t0, 80($sp)
       LW $v1, 16($sp)   ; Dispensable IO. Why not "ADDU $v1, $a0, $zero"?

RAM accesses are likely very cheap on the PS2 (if not, why are they so abundant in the code?), but this isn't the case on desktop computers, where the CPU is much faster than RAM, right? Or were the compilers back then just poor at optimization?
#8
Hum, avoid pastebin (I don't have access to it from work).

Quote:RAM accesses are likely very cheap on the PS2 (if not, why are they so abundant in the code?), but this isn't the case on desktop computers, where the CPU is much faster than RAM, right? Or were the compilers back then just poor at optimization?
The EE is a RISC CPU with 32 registers precisely to avoid memory accesses. RAM accesses aren't cheap on the PS2, but in reality you still need to load/store the data. And the compilers (GCC 2) are quite bad at optimization. Register allocation is bad, so there are useless pushes/pops on the stack. I even saw a register moved to the FPU coprocessor and then written to memory...

The cost isn't the LW/SW itself, but the extra work: loading SP, translating the address, updating the register. Currently there are lots of RAM accesses in the emulator to load/store the register contents (on every instruction). That is costlier than the LW/SW instruction emulation itself.

Quote:SW $a0, 16($sp)
is translated as
Code:
flush all x86/XMM registers (between 0 and 8 moves from CPU to RAM)
load sp into ecx (sp is normally 64 bits, but 32 bits are enough)
load a0 into edx
add 16 to ecx
various computation to translate ecx into an x86 address
finally copy edx's content to the translated address

So yes, replacing the LW with a "move" would save some computation. However, it would only affect the final performance if it is executed often.

Quote:However, there might be another way of making this translation from virtual to host cheaper: the use of a pointer is likely consistent (a pointer used for peripheral IO wouldn't be used for accessing a data array or data structure, and vice versa). So, by identifying the pointers and propagating them, only the original pointer would need the conversion from virtual to host (the remaining accesses would simply be pointer arithmetic). I do not have technical reasons for this other approach; I just dislike using "mmap" like this. If I had to think of a technical reason to dislike it, it would be portability.
You saw my branch on GitHub. We can only detect 99% of the RAM accesses and translate them into a move with an immediate. You still need mmap to avoid useless pointer computation to wrap the address (for example, you can access address 0x0 through 0x0, 0x2000_0000, 0x3000_0000, 0x8000_0000, 0xA000_0000 (0 isn't mapped, but it is only an example)). In the future I think we can do this kind of optimization:

1/ true direct memory access
2/ keep the often-used MIPS registers (sp, v0, v1, s0, s1/a0?) in static (callee-saved) x86 registers
This way we can have this kind of code:
Code:
load A0 on x86_reg
mov ptr[sp + RAM + 16], X86_reg

Anyway, I think the first step is to clean up and improve the recompiler. Then it would be nice to use your work to annotate the recompilation process with some kind of metadata.

As you said above: replace the opcode (LW) at address A with an opcode (mov), or mark that the block starting at address N is guaranteed to have register A2 constant, to help constant propagation. However, I have the feeling that the performance limitation is on the VUrec side, not on the EErec. The EErec isn't properly optimized and its code is already faster than the VUrec.
#9
How costly would it be for you to compare the two ways of doing const propagation?

Currently the recompiler generates linear code segments. So if you have this kind of code:
Code:
v = 0x100;
c = 0x10;
b;
do {
    lw reg, (v)
    b = (reg >> 10) & 3;
    c--;
} while (c > 0);
The current recompiler will split this code into 2 segments: "before the loop" and "core of the loop". Due to the split, v won't be propagated as a constant. In your case you have the full information, so you can have the "guarantee" (except for special cases) that v is constant. It would help if you could count the number of instructions that can be simplified:
N instructions become NOPs
M instructions get 1 extra parameter that is a constant.

The goal is to have some value to quantify the improvement.
#10
(02-08-2016, 03:36 PM)gregory Wrote: How costly would it be for you to compare the two ways of doing const propagation? (...) The goal is to have some value to quantify the improvement.

Yes, data collected from experiments says a lot more than speculation. However, to be very honest, I won't even be able to make that comparison. About a year ago (or more), I attempted to compile PCSX2 and installed a natively 32-bit distro just to get around the cross-compilation issues, but got stuck on an annoying dependency ("nvidia-cg-toolkit"; I didn't even have an NVIDIA video card). Someone else managed to compile it and made an installable RPM, so I just installed the readily available binary. And it was awful. The input lag was very noticeable (a button is pressed and two seconds later the invoked menu appears). I recently tried to compile it again, but cmake's output saying "this or that is missing" brings back all the bad memories from those days and defeats my willingness to keep going.

Some entries on the developers' blog ("C++ exceptions can be an optimization"), game-specific hacks in the source files, and the Windows-centric and NVIDIA-centric development style made me deeply distrust the quality of PCSX2's source code. So if the emulator isn't working as well as I expect, I'm biased towards blaming the software rather than something else, like the hardware. And even if I blame the software, the question remains: "Where in the software?" "Is the EE recompilation just fine, and the VU recompilation the true problem?" "Or is it because the JIT compiler has to be called again and again because the instruction memory is constantly changing?" (In which case a dynarec might make more sense than anything that needs to read the bytecode thoroughly before actually translating anything.)

(And please forgive the apples-and-oranges comparison here, but a very affordable motherboard with an integrated Intel Celeron 1037U can run games on Dolphin with lag-free video playback, on Linux. Hardware meeting PCSX2's recommended requirements would cost more than ten times that.)

I am shy and careful, and that makes me reveal things at a slow pace. Personally, I think I'm being hastier than usual. The source code I'm working on is currently in ugly shape, but I wanted to start a conversation (and give evidence of commitment and helpfulness), so I rushed the evidence. Tidy things ("data/patterns/all-instructions.txt") are easier to read, yet I presented a mess.

I do not know which improvements are most relevant for the emulator (and my goal is also different: creating a tool that will enable publishing PS2 games on other platforms), but I dislike repeating mistakes. If I'm going to make a mistake, at least it should be something new (hence I want to understand PCSX2's flaws).

Quote:You saw my branch on github.

Yes, I've been reading your commit messages to grasp what is going on, but I'm keeping my distance from PCSX2's source code. When I was looking through the source folders to find the recompiler, in order to understand how much work it would take to move PCSX2 from x86 to x64 (or "noarch"), I found the "pcsx2/x86" folder. The transformation of "EE bytecode" into "x86 bytecode" is so tightly coupled that decoupling the two would amount to starting from scratch.

I'll keep working on the disassembler code. I have to expand it and tidy it up. If I were to do that comparison, it would be outside PCSX2.

And, related to the loop problem you have presented:

I have seen a similar problem somewhere else. Static single assignment also handles loops. But when you think about the name, something may feel weird: how can you make a loop if every variable is assigned only once?

See this paper (written by a caring teacher), section 3, "Loops": https://www.cs.cmu.edu/~fp/courses/15411...06-ssa.pdf

This is machine code with a problem similar to yours:
Code:
pow(b,e):
 r <- 1
loop:
 if (e <= 0) goto done
 r <- r * b
 e <- e - 1
 goto loop
done:
 return r

This is what SSA does (the labels receive parameters, for the reasons explained in the paper):

Code:
pow(b,e):
 r <- 1
 goto loop(b,e,r)

loop(b,e,r):
 if (e <= 0) goto done(r)
 r <- r * b
 e <- e - 1
 goto loop(b,e,r)

done(r):
 return r

And, going further, TAA-DAH! You get the magic you were looking for:

Code:
pow(b0,e0):
  r0 <- 1
  goto loop(b0,e0,r0)

loop(b1,e1,r1):
  if (e1 <= 0) goto done(r1)
  r2 <- r1 * b1
  e2 <- e1 - 1
  goto loop(b1,e2,r2)

done(r3):
  return r3

Notice how every variable in the source code above is assigned only once. There is no "core of the loop" or "before the loop" in static single assignment; there are only assignments and functions calling other functions. Haskell, by the way, is effectively a static-single-assignment language (there is no mutability, so a variable can be bound only once). Yet it's possible to develop programs with it.

Have you developed anything with a functional programming language, Gregory? Haskell has no do-while or for loop. That looks crazy to someone used to imperative languages, where loops are everywhere, and all of a sudden the words you would use to express your thoughts are gone. If you have only used imperative languages, functional languages will turn your world upside down.
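
For instance, the pow example above, written directly in Haskell, has the same shape as its SSA form; the "loop" label simply becomes a recursive function (my own illustration, not from the paper):

Code:
-- The loop label becomes a recursive helper; its parameters play the role of
-- the label's parameters in the SSA form.
pow :: Integer -> Integer -> Integer
pow b e0 = loop e0 1
  where
    loop e r
      | e <= 0    = r                      -- corresponds to "goto done(r)"
      | otherwise = loop (e - 1) (r * b)   -- corresponds to "goto loop(b, e2, r2)"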