OnePad analog problem on certain games and WINE with PCSX2.
#21
I've redone the PR now.
I will concentrate on evdev now. Whether this is merged or not, he can at least begin to review whenever he has time; it is not critical for the next step.
The new GUI is only for onepad for now; PCSX2 in general could come one day, but we'd need to motivate him, so let's go slowly haha. Anyway, if the whole GUI were really to be rewritten, then maybe going the Qt road would be best. It would avoid the jokes with SDL, broken ABIs and all the funny wxWidgets surprises one can have.

As for the perf gain in 64-bit: to be short, overall and if it is done well, yes. To be a bit more verbose: the PS2 essentially uses a 4-way SIMD unit with 32-bit floating point, i.e. in one "cycle" (in fact it could be longer, but well) it can process 4 different operands (do 4 additions or whatever). Actually an x86_32 CPU can do that too, so the gain is not entirely there (though there might be some with tricks). The main problem is that our dear PS2 has 32 registers to do that, whereas x86_32 has... 8. x86_64 has 16, which could relieve some of that pressure (especially with a good register allocator, otherwise it's probably useless) and thus gain perf. So if the bonus registers are used, and used well, it might very well give a nice perf boost. There are other spots where you can gain: the EE is a 64-bit CPU, so its integer operations could be done faster too. Overall, Dolphin's boost was around 30% if my memory does not fail me; you could expect roughly the same here, I guess. It would also ease the work of whoever does the optimising phase, which is always nice... Whether you use linear scan or graph coloring for register allocation, reducing spilling by having more registers is a good thing, really, as choosing what to spill is NP-hard and so it is usually approximated with heuristics. And heuristics are, well, never perfect; linear scan is quite good for JIT/DBT, but even then you sometimes have to shrink it down even more when translation speed is really of the essence... Anyway, there ends the compilation course haha. I have enough with my students already, I don't need to make even more teaching hours :) though if you ever want to learn computer science, don't hesitate to ask. As long as you don't ask me "Is it going to be part of the exam?" every 5 minutes :)

#22
Noticed the reworked code and the PR - thanks. :) And yeah, totally agreed that Qt would be a nice complement to PCSX2, beyond wxWidgets. I always struggle trying to click around the current interface with its tiny drop-downs and various other issues; I put up with it because the end result is great, but the experience could of course be much better. I hate to keep comparing to Dolphin, but its UI is eons better, and even that one could use a shiny coat of paint (as the devs have recognized). That said, other priorities (see below) combined with limited PCSX2 dev resources make the current situation entirely understandable. Patience is a virtue, as with anything.

Regarding your other discussion: 32 registers vs. 8 seems like a major strain; ANY relief from that would be nice. I did observe a major performance increase with 64-bit Dolphin (30% sounds about right, iirc), and while the GC/Wii are entirely different architectures and it's probably not entirely fair to compare, I do see where performance gains could be made regardless. Thanks so much for the mini-course, btw! It's really fascinating stuff, and worth some more discovery through further reading/research. I might take you up on more later, though no worries about exam questions. ;)
#23
The register topic is a bit more complex.
1/ EE code barely uses all 32 registers. I'm sure 90% of the operations are done with half of them.
2/ 32-bit x86 also has the MMX registers.
3/ I suspect that internally the x86 CPU has a few more registers.
For sure it will help, but not that much. However, improving memory management, as Dolphin did, could yield a very nice speed boost. You replace the 5-6 instructions needed to emulate a PS2 memory access with the 1-2 instructions needed to virtualize it.

Porting the code to 64 bits will be interesting for me, because I would like to really learn the internals of the recompiler.
#24
(10-15-2015, 08:23 PM)gregory Wrote: The register topic is a bit more complex.
1/ EE code barely uses all 32 registers. I'm sure 90% of the operations are done with half of them.
2/ 32-bit x86 also has the MMX registers.
3/ I suspect that internally the x86 CPU has a few more registers.
For sure it will help, but not that much. However, improving memory management, as Dolphin did, could yield a very nice speed boost. You replace the 5-6 instructions needed to emulate a PS2 memory access with the 1-2 instructions needed to virtualize it.

Porting the code to 64 bits will be interesting for me, because I would like to really learn the internals of the recompiler.

1/ Ah yeah, I've heard that before actually, maybe from you. But the VU also has its own set of 32, no? And since it's a vector CPU, I would guess that it uses them... x86 probably has hundreds internally, that plus all the write buffers and niceties that hide memory accesses in a very impressive way. You can try the experiment of walking a linked list and then an array, first on x86 and then on ARM; it's quite fun. Same for comparing Java on an interpreter and with dynamic recompilation. They clearly hide a ***** ton of stuff. They also got some new branch prediction magic now (since Haswell or Ivy Bridge). Intel, black magic inside!
2-3/ Using MMX is good and all, it does tackle the SIMD part in the final code. However, tackling SIMD in DBT is clearly not simple; though it has been done in "good" ways in research, it requires a few tricks. You can read http://www.researchgate.net/profile/Nico...000000.pdf and http://adt.cs.upb.de/quf/quf11/quf2011_12.pdf
The main point is that, whatever you do, and whatever backend Intel has, the frontend is 8 registers. So your register allocator runs with a limit of 8 colors. The complexity is thus the same whatever the backend is, and so the sacrifices you make are the same too. Especially since you don't use a good IR (SSA or whatever) and thus don't have access to fast and efficient algorithms (unless you really want to hurt yourself).
Hmm, if you want to learn, I guess you can search Usenix for the founding QEMU article by Fabrice Bellard: https://www.usenix.org/legacy/event/usen...lard_html/. It's pretty much the state of the art even nowadays (except for the optimizers of course, but those are another topic).
That said, I don't expect a 100% improvement; 30% would be very good. And obviously memory accesses will be one of the dominant gains. Then again, I don't really know the core of the implementation done here (cache management and such; I do know that there is some kind of code cache).
Though I do think that if you spend the time to create a new dynamic translator, and bring it up to the state of the art, you will see a gain not only in speed but also in ease of development and, obviously, portability.
I can give you a nice, big bibliography before you launch yourself into the code if you want ;) enough to keep you from ever sleeping again. Btw, I remember reading you say you work in hardware; are you in research or industry? (If research, I guess I would have spotted you in Grenoble ;) )
#25
Thanks for the links. Actually, I think it is only used for 64-bit comparisons; otherwise most of the EE code is 32-bit with sign extension on the upper 32 bits. But I don't know the internals very well.
The VU registers are another story. However, the EE can potentially be optimized (I know the current dynarec is limited to a jump/branch, but I'm not sure that is mandatory). EE operations are often destructive, so there are lots of moves to duplicate register contents. And I suspect that gcc2 doesn't optimize the code that much.

I used to work on smartphone chips (verification and DFT). Nowadays I'm on supercomputer chips (accelerated networks between CPUs).
#26
Hmm, for the moves and all, I guess the best is to have a good folding pass and remove the dead variables generated through the moves (I do guess there are a few). However, I'm not sure you can do it in your code cache. You probably can, but it's painful (you will have to post-modify addresses...).
Hmm, what do you mean by whether jump/branch is mandatory? I don't want to make claims about what is done in there. Otherwise, that part seems normal to me: you fetch target code until you find a branch, then you generate code, execute it, and see where you jump next; if the target has already been seen and is in the cache, you make your jump, otherwise you translate a new basic block, optimize it, execute it, and so on. The idea is to avoid having to translate/optimize what you have already seen. Then, when the code cache is full, or you have self-modifying code you cannot patch up, the simplest is to flush and redo. But considering the PS2's memory quantity, and considering self-modifying code does not seem so likely there (I don't think game engines used JITs yet), the cache should run quite smoothly with the simple scheme.
#27
Quote:you fetch target code until you find a branch
Yes, this part. It would be nice to do blocks that execute a complete function. It would allow optimization of useless moves/memory addressing. But I never looked at game ASM code, only the EE kernel, so dunno.

Quote: But considering the PS2's memory quantity, and considering self-modifying code does not seem so likely there (I don't think game engines used JITs yet), the cache should run quite smoothly with the simple scheme.
On the EE, potentially I can see modification of a "pseudo" linker table, but I don't know about other self-modifying code. However, I'm sure that if it can be done, at least one game does it. Somehow, I was always curious whether it would be possible to precompile the main ELF and fall back to JIT in case of self-modifying code. (The PS2 is a devil arch that loves to prove you wrong.)
#28
(10-16-2015, 05:35 PM)gregory Wrote: Yes, this part. It would be nice to do blocks that execute a complete function. It would allow optimization of useless moves/memory addressing. But I never looked at game ASM code, only the EE kernel, so dunno.
Hmm, a complete function is harder; actually, you pretty much can't do it. The main problem is that you have absolutely no idea a priori of where you will jump, so you must do it on the fly. And so it will not always be functions, far from it. You pretty much get a basic block, though, and that's already good for a first optimizing pass. The second part is to define hot paths, detect them, and perform optimizations on the whole. However, in the case of DBT it is hard. Real hard. It's open research. Not so much the detecting and optimizing, in fact, but because handling a code cache that supports that can easily kill ALL the perf gain you made.
(10-16-2015, 05:35 PM)gregory Wrote: On the EE, potentially I can see modification of a "pseudo" linker table, but I don't know about other self-modifying code. However, I'm sure that if it can be done, at least one game does it. Somehow, I was always curious whether it would be possible to precompile the main ELF and fall back to JIT in case of self-modifying code. (The PS2 is a devil arch that loves to prove you wrong.)
Precompiling would certainly be a good idea. It's not easy either (as you don't know the targets of all branches), but it is doable to precompile every BB. There will always be self-modifying code (any kernel has some). It's not the arch that matters here (at all); it's what the software does. The point is that I don't think it's ultra common. E.g. they are probably not running a research JVM doing crazy optimizations, or QEMU on top (or a JVM on top of QEMU ;) ). Anything less than that can be handled without much pain. With some TLB tricks, a few more flushes can even be avoided.
#29
Oh, I didn't know that TIMA does some QEMU stuff. I vaguely remember doing some VHDL there for a school lab :P

I think all branches are relative or absolute; only some jumps depend on register content, and most of those are the return address of a function. I'm nearly sure that branches remain inside a function. But yeah, I understand what you mean.

The EE kernel is rather slim: around 85K, i.e. roughly 20K instructions. The kernel updates some global variables, but I don't recall any part that does self-modifying code. Which part would need SMC? Sure, it is likely uncommon, but I'm nearly sure you can find some games that do bad stuff. Game devs sometimes have crazy ideas.
#30
Actually, since these crazy people do tend to use some interpreted code for the artists and such, they got the idea of using bytecode, which they use for on-stack replacement... like Java. So yes, they DO have SMC. But not a crazy lot. It's enough, though, that a naive approach will fail with very, very strange bugs, yeah ;) Same crazy stuff as always.
Yes, we do actually, on the 4th floor (SLS). We also have native simulation parts. For QEMU, we use RABBITS, which is a version of QEMU hacked to use SystemC models, with heterogeneous systems being possible. Ah? When did you come here? (Probably before me ;) ). I do work on vanilla QEMU though, since I am directly in the TCG (optimizing stuff).
Kernels mainly have it for TLB stuff, and for loading themselves: in general, after the first phase of boot, a kernel copies itself to the higher addresses for ease-of-dev reasons (not all, but most).



