Here's the problem: Bulldozer's native architecture is far too different from anything we have seen before on the x86 platform. For it to succeed, applications would have to be built around it. Considering how slow the masses are to adopt new technology, it seems very unlikely that this will happen in Bulldozer's lifetime. Moreover, AMD simply does not have the clout that Intel has to bring about a change in the way software is designed. It's a big chip, packs in a lot of transistors, is power hungry and highly inefficient (this could be remedied in future iterations, though), and sacrifices floating-point performance far too much.



Lionel212008: Just got word that AMD is contemplating releasing a revision (or B3 stepping) of Bulldozer that should address several issues with the architecture. It was the same with the first Phenom, so I suppose there may be some hope yet. For that matter, Phenom II was a vast improvement over its predecessor, so we will have to see how it all goes.

I really hope that the CPU market becomes competitive again as things don't look so great at the moment.
wodmarach: Contemplating, no; releasing, yes. But then that's SOP. C0 is due in January; we've known that for a month or so.
As for your earlier statement about watered-down HT, it's actually the opposite: in multithreaded programs it's more efficient, as shown by the gain in heavily threaded code (the one area where it beats the 2600K and even goes up against $1k Intel chips).
Post edited October 21, 2011 by Lionel212008
Lionel212008: Here's the problem: Bulldozer's native architecture is far too different from anything we have seen before on the x86 platform. For it to succeed, applications would have to be built around it. Considering how slow the masses are to adopt new technology, it seems very unlikely that this will happen in Bulldozer's lifetime. Moreover, AMD simply does not have the clout that Intel has to bring about a change in the way software is designed. It's a big chip, packs in a lot of transistors, is power hungry and highly inefficient (this could be remedied in future iterations, though), and sacrifices floating-point performance far too much.
CMT is not new; it's been done for decades. What is changing is that, as processors stay at the same speed, multithreading is becoming a necessity. AMD's main problem is that they thought most people would be designing multithreaded code by now. What's happened instead is that many coders are so focused on the consoles that at most you'll see 3 threads. In business code the opposite is true: threads are used constantly to accelerate certain functions.
As for the FPU being slow, the exact opposite is true: if the scheduler (one of the known broken parts) were working properly and the FPU (as well as the rest of the chip) weren't starved, it would be a killer feature, with much faster FP math than Intel.
So in all seriousness, did you actually look at the uarch, or are you just spouting Intel propaganda from the likes of Anand?
wodmarach: As for the FPU being slow, the exact opposite is true: if the scheduler (one of the known broken parts) were working properly and the FPU (as well as the rest of the chip) weren't starved, it would be a killer feature, with much faster FP math than Intel.
So in all seriousness, did you actually look at the uarch, or are you just spouting Intel propaganda from the likes of Anand?
Except it isn't. Each K10 core had three 128-bit floating point units. These could perform x87 scalar floating point, 128-bit SSE vector floating point, 64-bit MMX vector integer, and 128-bit SSE vector integer operations. Bulldozer has four units in its floating point pipeline. Two are for integer operations (64-bit MMX and 128-bit SSE); the other two are for floating point. In addition to the scalar x87 and vector SSE instructions, the two floating point units can be ganged together, to perform new 256-bit Advanced Vector Extensions (AVX) floating point instructions. Given that this pipeline is now shared between two threads, it's a big reduction in per-thread execution resources.

A K10 core has three ALUs and three AGUs. Bulldozer discards one ALU and one AGU, having just two of each in each of its integer pipelines. AMD claims that the K10's third AGU was superfluous, only there to make laying out the chip easier (by increasing the commonality between each AGU/ALU pair), but the same is not true of the ALU; K10 could execute up to three integer instructions per thread per cycle. Bulldozer tops out at two.

Similarly, the cache and main memory latencies are longer than they are for K10 (four cycles compared to three for level 1 cache; 21 cycles compared to 14 or 15 for level 2; 65 compared to 55 or 59 for level 3; and 195 versus 182 or 157 cycles for main memory). K10's latencies were already worse overall than Sandy Bridge's (which boasts 4, 11, 25, and 148 cycle latencies, from level 1 through to main memory), and Bulldozer makes them worse still.
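
To put rough numbers on those latencies, here is a small Python sketch that turns the cycle counts quoted above into ratios. The figures are copied from the paragraph; the variable and function names are made up for illustration, and where two K10 values are given the comparison uses the slower one, which is an assumption that flatters Bulldozer.

```python
# Latencies in CPU cycles, taken from the figures quoted in the post.
# K10 entries are (best, worst) because the post gives two values for some levels.
LEVELS = ["L1", "L2", "L3", "RAM"]
K10 = [(3, 3), (14, 15), (55, 59), (157, 182)]
BULLDOZER = [4, 21, 65, 195]
SANDY_BRIDGE = [4, 11, 25, 148]


def latency_table():
    """Return (level, Bulldozer/K10 ratio, Bulldozer/Sandy Bridge ratio) rows."""
    rows = []
    for level, k10, bd, sb in zip(LEVELS, K10, BULLDOZER, SANDY_BRIDGE):
        worst_k10 = max(k10)  # assumption: compare against K10's slower figure
        rows.append((level, bd / worst_k10, bd / sb))
    return rows


for level, vs_k10, vs_sb in latency_table():
    print(f"{level:>3}: {vs_k10:.2f}x K10, {vs_sb:.2f}x Sandy Bridge")
```

Even against K10's slower figures, every ratio comes out above 1, and the level 3 cache gap to Sandy Bridge is the widest of the four.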

And this is just comparing it against its predecessor.

The few things Bulldozer has going for it are the larger instruction buffers used for out-of-order execution, AVX, and some AMD-specific instructions like FMA, which can make floating-point operations run up to twice as fast for code that supports that instruction. But the big catch is that code doesn't really support that right now.
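
The "twice as fast" figure for FMA comes from simple instruction counting: a fused multiply-add does a multiply and an add in one instruction. The Python sketch below only models that count for a dot product. The `fma` helper is a hypothetical stand-in (a real FMA rounds once instead of twice), and nothing here measures actual hardware.

```python
def fma(a, b, c):
    """Stand-in for a hardware fused multiply-add: returns a * b + c.

    Assumption: real FMA hardware rounds only once; this helper does not
    reproduce that, it only lets us count one instruction per call.
    """
    return a * b + c


def dot_separate(xs, ys):
    """Dot product using a separate multiply and add per element."""
    acc, ops = 0.0, 0
    for x, y in zip(xs, ys):
        product = x * y      # one multiply instruction
        acc = acc + product  # one add instruction
        ops += 2
    return acc, ops


def dot_fused(xs, ys):
    """Dot product using one fused multiply-add per element."""
    acc, ops = 0.0, 0
    for x, y in zip(xs, ys):
        acc = fma(x, y, acc)  # one FMA instruction
        ops += 1
    return acc, ops


xs, ys = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(dot_separate(xs, ys))  # same result, 6 floating-point instructions
print(dot_fused(xs, ys))     # same result, 3 floating-point instructions
```

The halved count only helps code that was compiled to use the instruction, which is the catch mentioned above.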

It's also thoroughly amusing to call Orochi a server processor considering that it has only one HyperTransport link enabled out of the four Zambezi has.

And also a huge lol at the Intel propaganda from Anand remark.
AndrewC: Except it isn't. Each K10 core had three 128-bit floating point units. These could perform x87 scalar floating point, 128-bit SSE vector floating point, 64-bit MMX vector integer, and 128-bit SSE vector integer operations. Bulldozer has four units in its floating point pipeline. Two are for integer operations (64-bit MMX and 128-bit SSE); the other two are for floating point. In addition to the scalar x87 and vector SSE instructions, the two floating point units can be ganged together, to perform new 256-bit Advanced Vector Extensions (AVX) floating point instructions. Given that this pipeline is now shared between two threads, it's a big reduction in per-thread execution resources.
Compare that to SB, where the FPU is shared between 2 threads but can only do an operation from a single thread at a time. There are some advantages to the Bulldozer FPU that should have seen it romping home in FPU-based tests, but the scheduler screws it up.

A K10 core has three ALUs and three AGUs. Bulldozer discards one ALU and one AGU, having just two of each in each of its integer pipelines. AMD claims that the K10's third AGU was superfluous, only there to make laying out the chip easier (by increasing the commonality between each AGU/ALU pair), but the same is not true of the ALU; K10 could execute up to three integer instructions per thread per cycle. Bulldozer tops out at two.
Now you're into the rest of the CPU, which is beyond the scope of what I was talking about, but yes, there is a slowdown in the individual cores (about 80% of the performance is the estimate).

And also a huge lol at the Intel propaganda from Anand remark.
Anand annoys me. He'll use programs compiled with the Intel compiler (which still does its GenuineIntel check) and benchmarks that use the same tricks, then say "look, it runs 20x faster on Intel chips!!!1one". He'll test graphics cards at stupidly low resolutions just to get better results for his fave card of the moment. His site is plastered with Intel ads, but he's completely unbiased -.- Same goes for many sites though, tbf. I'm sure he's a good guy really, but yeah... Fuad is worse though; he seems to have a problem saying anything nice about AMD. When 590s were exploding at every test site, his comment was "at least it's still faster in many tests than the 6990" >.<
The key question here is whether Bulldozer is eventually going to turn out to be a K10 or a P4 (NetBurst). I think Dirk Meyer realized where this was headed, and that's the reason Orochi's 45nm predecessor wasn't brought to market.

Frankly, I would reiterate that at its best, and even with all optimizations being utilized by software, it would still be a watered-down form of Hyper-Threading.





Post edited October 21, 2011 by Lionel212008
Lionel212008: The key question here is whether Bulldozer is eventually going to turn out to be a K10 or a P4 (NetBurst). I think Dirk Meyer realized where this was headed, and that's the reason Orochi's 45nm predecessor wasn't brought to market.

Frankly, I would reiterate that at its best, and even with all optimizations being utilized by software, it would still be a watered-down form of Hyper-Threading.
And yet again you'd be wrong; it's the other way round: HT is a watered-down version of CMT.
The 45nm version was cancelled due to size, heat and engineering problems; this was known about a year ago. Nobody really expected this to be a current-gen Intel beater, but the uarch as a whole is expected to be good once the bugs are sorted (of which there are many). C0 should see an improvement in speed, and Piledriver, which began design 2 years after BD (but is due out in the next few months), is thought to have the repaired cache and scheduler.
I wish you would offer some reasoning instead of making fallacious accusations and dismissing statements without any explanation that would suggest you understand either architecture.

Adding HT to a physical core requires only a small amount of extra logic. So with only a few extra transistors, the CPU can run twice as many threads and use its execution units more efficiently, since it has two completely independent streams of instructions to feed them.

I see Bulldozer more like a half-HT processor. It's more or less a regular multicore CPU, with HT applied only to the FPU.
As a result, the chip is much larger than a real HT-CPU.
For example, the smallest BD has 4 modules, capable of running 8 threads. Its die size is 315 mm^2.
An Intel 4-core Sandy Bridge (also 8 threads) is only 216 mm^2. An Intel 6-core Gulftown (12 threads) is only 239 mm^2.
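
For anyone who wants the die sizes above as area per hardware thread, here is a quick Python sketch. The die sizes and thread counts are the ones I just quoted; the names in the code are only for illustration, and area per thread is a crude yardstick, since it ignores everything else on the die.

```python
# (die size in mm^2, hardware threads), figures as quoted in the post.
CHIPS = {
    "Bulldozer, 4 modules": (315, 8),
    "Sandy Bridge, 4 cores + HT": (216, 8),
    "Gulftown, 6 cores + HT": (239, 12),
}


def area_per_thread(die_mm2, threads):
    """Die area divided by the number of hardware threads."""
    return die_mm2 / threads


for name, (die_mm2, threads) in CHIPS.items():
    print(f"{name}: {area_per_thread(die_mm2, threads):.1f} mm^2 per thread")
```

By this measure Bulldozer spends roughly 39 mm^2 per thread, against 27 for Sandy Bridge and about 20 for Gulftown.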

So given BD's considerably larger die size, it would be rather unfair to say "Okay, it's a strength that it has less slowdown when each module runs two threads".
Intel would be able to build an 8-core chip, which can run 16 threads, which would only be marginally larger than BD's 4-module design. In which case it is a clear win when you run 8 threads on both chips, since Intel won't even require HT to do that, while the BD will have to share some of its resources.

It is also not surprising that even Intel's 6-core with HT is able to outperform BD in every multithreaded benchmark. After all, it has more FPUs (one per core, where BD has one per module, so 6 vs 4), more ALUs (Intel has 3 per core, BD 4 per module, so 18 vs 16), and it can run 12 threads instead of 8.
(Currently we already see that Intel's 4-core CPUs outperform AMD's 6-core CPUs on the desktop, and Intel's 6-core CPUs are a good match for AMD's 12-core CPUs in the server/workstation market).
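
The unit counts in the comparison above are just multiplication, sketched here in Python so the totals can be checked. The per-core and per-module figures are the ones stated in this post; the function and dictionary names are made up for the example.

```python
def totals(blocks, fpus_per_block, alus_per_block, threads_per_block):
    """Total FPUs, ALUs and hardware threads for a chip built from identical blocks."""
    return {
        "fpu": blocks * fpus_per_block,
        "alu": blocks * alus_per_block,
        "threads": blocks * threads_per_block,
    }


# Intel 6-core with HT: per core, 1 FPU, 3 ALUs, 2 threads (as stated above).
INTEL_6C = totals(blocks=6, fpus_per_block=1, alus_per_block=3, threads_per_block=2)
# Bulldozer: per module, 1 FPU, 4 ALUs, 2 threads (as stated above).
BULLDOZER_4M = totals(blocks=4, fpus_per_block=1, alus_per_block=4, threads_per_block=2)

for unit in ("fpu", "alu", "threads"):
    print(f"{unit}: Intel {INTEL_6C[unit]} vs Bulldozer {BULLDOZER_4M[unit]}")
```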

And THAT is the strength of HT. Because the resources are shared, you waste less. So you get smaller, more power-efficient chips. They are cheaper to build, yields are better, and clock speeds can go higher (with nice turbo modes for even better single-threaded performance). That is why I say that HT is far more efficient than Bulldozer's approach.


We have already established that Bulldozer has far fewer FPUs, so no amount of scheduler fixing or cache repair is going to fix problems that are inherent to the architecture. Since it has fewer FPUs, it would depend heavily on the GPU to remedy that, and this isn't a good thing.

It would be great if AMD were more competitive, as that would be the best situation for a consumer like me, but that doesn't seem to be happening.



Post edited October 22, 2011 by Lionel212008