(11-11-2010, 02:36 AM)tallbender Wrote: @Air
How about "Bulldozer", the new microarchitecture/codename from AMD to be released in 2011?
I read on the wiki that it will add Intel's modern SSE extensions like 4.1 and 4.2, but the TDP is definitely that high.
http://en.wikipedia.org/wiki/Bulldozer_(processor)
It's a completely new design. It's dangerous to make assumptions until we get to see one running a variety of code, and some obsessed optimizer like Agner Fog does per-instruction and low-level pipeline analysis on the new design. However, I will make some notes on the marketing analysis. First is AMD's Cluster Multi-Threading (CMT):
Quote:- Two tightly coupled, "conventional" x86 out-of-order processing engines which AMD internally calls modules.
- duplicating integer schedulers and execution pipelines offers dedicated hardware to each of two threads which significantly increase performance in multithreaded integer applications
CMT is AMD's version of Intel's HyperThreading. In HyperThreading (Intel), each core tries to maximize its use of the 4 to 6 internal execution units by executing a second thread's instructions in unused execution units. The second thread will see action whenever the main thread's instruction chain stalls waiting for one of the following:
- memory access
- especially slow instruction (a division, for example)
- a register dependency that prevents the CPU's "out-of-order" execution engine from being able to run a full 4-5 instructions in parallel.
Most of those conditions happen a lot in most apps, and so HT typically has lots of places where it can fill empty execution units of the CPU with tasks from other threads.
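To make the slot-filling idea concrete, here is a toy sketch in Python. It is not a model of any real Intel pipeline: the 4-wide issue width, the function name `slots_used`, and the per-cycle "ready" counts are all made up for illustration. A ready count of 0 stands in for a stall (memory wait, slow divide, or register dependency).

```python
ISSUE_WIDTH = 4  # ASSUMPTION: execution slots available per cycle


def slots_used(main_ready, second_ready=None, width=ISSUE_WIDTH):
    """Count execution slots filled over a run of cycles.

    main_ready[i] is how many instructions the main thread could issue in
    cycle i. If second_ready is given, the second thread only gets the
    slots the main thread left empty in that cycle.
    """
    used = 0
    for i, ready in enumerate(main_ready):
        main = min(ready, width)
        extra = 0
        if second_ready is not None:
            extra = min(second_ready[i], width - main)
        used += main + extra
    return used


# Hypothetical per-cycle ready counts for two threads (illustrative only).
main = [4, 0, 0, 1, 2, 0, 4, 1]
second = [2, 3, 4, 2, 1, 4, 2, 3]

total = ISSUE_WIDTH * len(main)
print("main thread alone:", slots_used(main), "of", total)
print("with second thread:", slots_used(main, second), "of", total)
```

With these invented numbers the main thread alone leaves most slots idle, and the second thread soaks up nearly all of them. Real gains depend entirely on how often the main thread actually stalls.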
AMD's Cluster Multi-Threading (CMT) works differently, and is broken down into two parts. The first part is that cores are initially bound together as modules, which allows the core count of the CPU to be a lot higher without having to enlarge the die. Each module of a Bulldozer chip is similar to the original dual-core Athlons in a sense: two very tightly coupled CPU cores that share almost everything except the instruction decoder and execution units. I'm unclear on the L1 cache situation, though -- there appear to be two levels of L1 cache on Bulldozer (one for each core and one for the module as a whole) and I'm not sure how those interoperate.
DECEPTION ALERT: This means that an octa-core Bulldozer chip will not have eight true cores!! The chip will in fact have four modules, with the pair of cores in each module competing for a lot of shared resources. Some of you might remember how lousy the original Athlon X2s were at running 2 threads in parallel. This may not be significantly different. I can't predict the exact per-core performance rating, but I can be certain that it will be somewhat below the current generation of multi-core CPUs.
The second part of CMT is that there are actually two ALU/AGU pairs per core. The ALU/AGU handles integer and address operations only, which means that this portion of Bulldozer will do little to improve the performance of SIMD-heavy activities (such as encoding videos, image processing, most threaded game logic, etc). The AGU is at least of some minimal use to such tasks, but it will also be limited by the system's memory bandwidth (which should be pretty well hosed anyway once you have 8+ threads all trying to access your system's RAM -- a few extra execution units aren't much help when half of the threads are sitting around waiting for their turn to access RAM). Furthermore, if you're running fewer threads than the number of cores on your CPU (which will happen like 98% of the time on an octa-core Bulldozer), the extra ALU/AGU pair of each core will be completely unused.
End result: With the exception of good old integer math, expect Bulldozer to provide diminishing returns as you add more threads of work to a job. An octa-core Bulldozer should do exceptionally well running quad-core tasks; be markedly less stellar with 8-thread tasks; and will probably be sorely disappointing for anything except a few specific integer-heavy tasks when running 12+ threads.
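The shape of that prediction can be sketched with a few lines of Python. Everything numeric here is an assumption: the 0.8 sharing penalty is a number I made up, not a measurement, and the sketch assumes the OS spreads threads one per module before doubling up. The function name `relative_throughput` is hypothetical too.

```python
MODULES = 4            # an "octa-core" chip as described above: four modules
CORES_PER_MODULE = 2
SHARED_PENALTY = 0.8   # ASSUMPTION: invented per-thread speed when both
                       # cores of a module are busy (1.0 = no contention)


def relative_throughput(threads):
    """Total throughput in units of 'one thread alone on one module'.

    Assumes threads are spread one per module first, then doubled up.
    Threads beyond the hardware core count add nothing.
    """
    hw_threads = min(threads, MODULES * CORES_PER_MODULE)
    paired_modules = max(0, hw_threads - MODULES)      # running two threads
    solo_modules = min(hw_threads, MODULES) - paired_modules
    return solo_modules * 1.0 + paired_modules * 2 * SHARED_PENALTY


for n in (1, 4, 6, 8, 12):
    total = relative_throughput(n)
    print(f"{n:2d} threads: total {total:.1f}, per thread {total / n:.2f}")
```

Under these made-up numbers, per-thread throughput is flat up to 4 threads, sags between 5 and 8, and falls off sharply past 8 because the extra threads just wait their turn. Change `SHARED_PENALTY` and the curve shifts, but the general shape stays the same.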
Next is the new AVX and old SSE units:
Quote:- Two symmetrical 128-bit FMAC (fused multiply-add (FMA) capability) Floating Point Pipelines per module that can be unified into one large 256-bit wide unit if one of integer cores dispatch AVX instruction and two symmetrical x87/MMX/SSE capable FPPs for backward compatibility with SSE2 non-optimized software.
OK, what this means is that the new AVX unit is being designed with 128-bit FMAC (fused multiply-accumulate) in mind, an instruction that does not yet exist on any x86 CPU currently on the market. Using that instruction liberally will be highly beneficial. Speed improvements on existing SSE-based apps are not likely to be very stellar, though that depends on how the FPPs are implemented (there's no indication of how the FPPs will compare to AMD's existing ones, but I doubt they'll bother optimizing them much).
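For anyone unfamiliar with what a fused multiply-add buys you besides speed: it computes a*b + c with a single rounding at the end, instead of rounding after the multiply and again after the add. Here is a small Python sketch that emulates that behavior with exact rational arithmetic. It says nothing about how AMD's FMAC hardware works internally; the helper names are mine and the input values are hand-picked to show the difference.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate a*b + c with one rounding: do the math exactly, round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def separate_multiply_add(a, b, c):
    """Ordinary multiply then add: rounds after each operation."""
    return a * b + c


# Hand-picked doubles: a*b is exactly 1 - 2**-54, which is not representable
# and rounds to 1.0 when the multiply is rounded on its own.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

print("separate:", separate_multiply_add(a, b, c))
print("fused:   ", fused_multiply_add(a, b, c))
```

The separate version loses the tiny product term entirely, while the fused version keeps it. That accuracy difference comes on top of doing two operations in one instruction.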
This also means that using 256-bit instructions will not be notably faster than using 128-bit instructions, since 256-bit instructions can only run one at a time while 128-bit ops can run two in parallel. This means Bulldozer will be a lot like the original Athlon and P3/early P4 chips that typically ran 64-bit MMX instructions much faster than 128-bit SSE instructions. Hopefully AMD will fix that in a later revision of Bulldozer, and give us real 256-bit SIMD support.
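The arithmetic behind that claim, as a quick Python sketch. It assumes each 128-bit pipe can start one op per cycle and that a 256-bit op ties up both pipes of a module, which is my reading of the quoted description rather than a confirmed hardware detail. The function name `bits_per_cycle` is just for illustration.

```python
PIPES = 2          # two FMAC pipes per module, per the quoted description
PIPE_WIDTH = 128   # bits per pipe


def bits_per_cycle(op_width):
    """Peak SIMD bits processed per cycle per module for a given op width.

    ASSUMPTION: one op per pipe per cycle; a 256-bit op occupies both pipes.
    """
    pipes_needed = op_width // PIPE_WIDTH
    ops_per_cycle = PIPES // pipes_needed
    return ops_per_cycle * op_width


for width in (128, 256):
    print(f"{width}-bit ops: {bits_per_cycle(width)} bits/cycle per module")
```

Both widths come out to the same peak figure, so under these assumptions going to 256-bit instructions only saves instruction count, not execution bandwidth.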