I'm very excited about this, as it's at least 2 decades overdue. When Pentiums were getting popular in the mid 90s, I remember thinking that their deep pipelines for branch prediction and large on-chip caches meant that fabs were encountering difficulties with Moore's law and it was time to move to multicore.
At the time, functional programming was not exactly mainstream and many of the concurrency concepts we take for granted today from web programming were just research. So of course nobody listened to ranters like me and the world plowed its resources into GPUs and other limited use cases.
My take is that artificial general intelligence (AGI) has always been a hardware problem (which really means a cost problem), because the enormous wastefulness of today's chips can't be overcome with more-of-the-same thinking. Somewhere we forgot that, no, it doesn't take a billion transistors to make an ALU, and no matter how many billion more you add, it's just not going to go any faster. Why are we doing this to ourselves when we have SO much chip area available now and could scale performance linearly with cost? A picture is worth a thousand words:
http://www.extremetech.com/wp-content/uploads/2014/08/IBM_Sy...
I can understand how skeptics might think this will be difficult to program, etc., but what these new designs are really offering is reprogrammable hardware. Sure, we only have ideas right now about what network topologies could saturate a chip like this, but just watch: very soon we'll see some whiz-bang stuff that throws the network out altogether and uses content-addressable storage or some other hash-based scheme, so we can get back to thinking about data, relationships and transformations.
What’s really exciting to me is that this chip will eventually become a coprocessor and networks of these will be connected very cheaply, each specializing in what are often thought of as difficult tasks. Computers are about to become orders of magnitude smarter because we can begin throwing big dumb programs at them like genetic algorithms and study the way that solutions evolve. Whole swaths of computer science have been ignored simply due to their inefficiencies, but soon that just won’t matter anymore.
>>> I remember thinking that their deep pipelines for branch prediction and large on-chip caches meant that fabs were encountering difficulties with Moore's law
It's really a combination of memory latency and pipelining.
Memory latency is absolutely terrible compared to processor speed, and that has nothing to do with Moore's law. It's about 60ns to access main memory, which is ballpark 150 cycles at 2.5GHz. If you have no caches, your 2.5GHz processor is basically throttled to 16MHz. You can buy some of that back with high memory bandwidth and a buffer (read many instructions at a time). But if you have no branch predictor, every taken branch flushes the buffer and costs an extra 150 cycles; in heavily branched code your performance approaches 8MHz.
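The arithmetic behind that claim can be sketched as a toy model. The 2.5 GHz clock and 60 ns latency are the numbers from above; the miss-rate parameter is my addition to show what caches buy you:

```python
# Back-of-envelope model of memory-stall throttling (illustrative, not a
# simulation). Assumes: 2.5 GHz clock, 60 ns DRAM access, 1 cycle per
# instruction when no memory stall occurs.
clock_hz = 2.5e9
dram_latency_s = 60e-9
penalty_cycles = clock_hz * dram_latency_s  # ~150 cycles per memory access

def effective_mhz(miss_rate):
    # Each instruction costs 1 cycle, plus the full DRAM penalty on the
    # fraction of instructions that actually go to main memory.
    cycles_per_instr = 1.0 + miss_rate * penalty_cycles
    return clock_hz / cycles_per_instr / 1e6

print(effective_mhz(1.0))   # every instruction hits DRAM: ~16.6 MHz
print(effective_mhz(0.01))  # 1% miss rate thanks to caches: ~1000 MHz
```

This also illustrates the reply below: the 16MHz figure only holds when every single instruction touches memory.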
Then think about pipelining. We don't pipeline because Moore's law has ended. We pipeline because a two-stage pipeline is 200% as fast as an otherwise identical unpipelined chip. A sixteen-stage pipeline is 1600% as fast. Why the hell wouldn't you pipeline? Now, of course, in the real world branched code can tank a deep pipeline, which is where the branch predictor comes in, buying back performance.
>>> If you have no caches, your 2.5GHz processor is basically throttled to 16MHz.
No. This is only true if every instruction tries to access memory.
>>> We pipeline because a two-stage pipeline is 200% as fast as an otherwise identical unpipelined chip. A sixteen-stage pipeline is 1600% as fast.
No. First of all, the cycle time is set by the slowest stage, so any imbalance between stages is wasted time. Second, there is significant overhead in passing data through pipeline registers, and in the control logic for those registers.
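A toy model makes the tradeoff concrete. All the constants here are illustrative assumptions (latch overhead, branch frequency, misprediction rate), not measurements from any real chip:

```python
# Realistic-ish pipeline speedup vs. the naive "N stages = N times faster"
# claim. Delay of the unpipelined datapath is normalized to 1.0.
def pipeline_speedup(stages, latch_overhead=0.02, branch_freq=0.2,
                     mispredict_rate=0.1):
    # Each stage boundary adds a fixed pipeline-register (latch) overhead
    # to the cycle time.
    cycle = 1.0 / stages + latch_overhead
    # A mispredicted branch flushes roughly the whole pipe, so deeper
    # pipelines pay a bigger penalty per misprediction.
    cpi = 1.0 + branch_freq * mispredict_rate * (stages - 1)
    return 1.0 / (cycle * cpi)

print(pipeline_speedup(2))   # ~1.9x: close to the ideal 2x
print(pipeline_speedup(16))  # ~9.3x: well short of the ideal 16x
```

The speedup is still large, which supports the follow-up point below that the overhead is nowhere near canceling the gain, but it clearly isn't linear in the stage count.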
The reason we saw 31-stage pipelines in the P4 (Prescott) was mostly marketing: the "megahertz race" between AMD and Intel.
You are right, there is appreciable overhead in pipelining, and the benefit is not quite as powerful as I claimed. I am guilty of an age-old crime: simplifying a complex subject for the layman and skipping real details in the process.
But you can be certain that AMD and Intel do not design deep pipelines for some measly 10% performance uplift. The overhead of the pipeline infrastructure is nowhere near the performance gain. Consider that Haswell sustains an IPC of around 2. With a pipeline in the 14- to 19-stage range, it is indeed far outstripping the performance of "Haswell minus pipelining".
As for the super-deep pipeline in the P4, the consensus I hear is that Intel expected frequency to keep scaling, and as such the P4 was a future-looking architecture designed to scale to 10GHz and beyond.
I remember the first time the von Neumann architecture was laid out for me: I thought "whoa, that's bottlenecked," and immediately figured it would make more sense to do the computation where the memory was, or to replace "memory" with a huge pile of registers, or something other than what I was looking at.
This is really exciting stuff. I can't help but think a marriage of this approach with HP's memristor technology would bring us screaming along an amazing architecture path for the next several decades.
But then again, I'm concerned that the limited use cases for this being presented are basically already performed by various custom (and cheap and power efficient) DSPs. Is all that's really being envisioned here just a lower power alternative to DSPs? I think the vision can be much bolder.
Of course it would make sense to do the computation where the memory is. Trouble is the memory area is dramatically larger than the computation area.
Imagine you are a reference librarian, asked for facts like some kind of ancient Google. Suppose your library is the size of your bedroom: you can find facts very quickly, because you only have to cross the room. Now suppose you are smack dab in the middle of the Library of Congress. You are right where the memory is, but you will still spend half your time just running around the building due to its sheer size!
The only ways to solve that problem are:
- Make memory smaller. Engineers have been hard at work at this for decades.
- Use less memory. This is slower.
- Use a memory hierarchy. This is what we do today, and is analogous to you sitting in a bedroom-sized library with the Library of Congress just down the street, and a young courier who fetches you books from it.
The other challenge is speed. We can't have a huge pile of registers because fast memory is less dense than slow memory. So 1KB of CPU registers occupies a lot more space than 1KB of DRAM, but DRAM is a poor choice for registers because of how slow it is.
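The payoff of the memory-hierarchy option can be quantified with the standard average-memory-access-time formula. The latencies and miss rates below are ballpark figures I've chosen for illustration, not numbers from any specific chip:

```python
# Average memory access time (AMAT) for a two-level cache hierarchy,
# in cycles. Assumed figures: 4-cycle L1 hit, 5% L1 miss rate,
# 12-cycle L2 hit, 30% L2 miss rate, 150-cycle DRAM access.
def amat(l1_hit=4.0, l1_miss_rate=0.05,
         l2_hit=12.0, l2_miss_rate=0.30, dram=150.0):
    # On an L1 miss, pay the L2 cost; on an L2 miss, pay the DRAM cost.
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * dram)

print(amat())  # ~6.9 cycles on average, vs. 150 cycles with no caches
```

This is the bedroom-library-plus-courier arrangement in numbers: most requests are satisfied locally, so the rare trip "down the street" barely moves the average.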
>>> Why are we doing this to ourselves when we have SO much chip area available now and could scale performance linearly with cost?
Multicore performance doesn't scale linearly because 1) adding more cores has rapidly diminishing returns for most problems (http://en.wikipedia.org/wiki/Amdahl's_law) and 2) the cost of cache coherency grows superlinearly with the number of cores.
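Point 1 is easy to make concrete. Under Amdahl's law, even a program that is 95% parallelizable tops out at a 20x speedup no matter how many cores you throw at it:

```python
# Amdahl's law: speedup of a program whose parallel fraction p runs on
# n cores while the serial fraction (1 - p) is untouched.
def amdahl_speedup(parallel_fraction, cores):
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

print(amdahl_speedup(0.95, 16))    # ~9.1x on 16 cores
print(amdahl_speedup(0.95, 1000))  # ~19.6x: already near the 20x ceiling
```

The ceiling is 1 / (1 - p), so the serial 5% dominates long before core counts get exotic.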
I wonder whether you're dismissing today's GPUs too quickly. The way they work (and are programmed) today is about halfway between CPUs and the linked architecture. They're generally applicable on the order of 10^3 parallel computations, and with 10^5-10^6 threads they can be saturated. Whether your algorithms are compute-bound or memory-bandwidth-bound doesn't really matter (both are faster on a GPU); what matters is a sufficiently long runtime for the parallelizable part of the application, and not too much branching in compute-bound kernels. There is a point where CPUs become faster when there are too many branches: on the Kepler architecture, for example, the 32 threads of a warp execute in lock-step, so divergent branches within a warp are serialized.
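The divergence penalty at the end of that paragraph can be modeled very simply. This is a sketch of the SIMT execution rule, not a simulation of any real GPU; the path costs are arbitrary illustrative cycle counts:

```python
# SIMT branch divergence: within a warp, every distinct control-flow path
# taken by at least one thread executes serially, with the other threads
# masked off. Total time is the sum of the taken paths' costs.
def warp_cycles(path_costs, threads_per_path, warp_size=32):
    assert sum(threads_per_path) == warp_size
    return sum(cost for cost, n in zip(path_costs, threads_per_path) if n > 0)

# Uniform branch: all 32 threads take the same 10-cycle path.
print(warp_cycles([10, 10], [32, 0]))   # 10 cycles
# Divergent branch: even a single straggler forces both paths to run.
print(warp_cycles([10, 10], [31, 1]))   # 20 cycles
```

This is why heavily branched kernels can lose to a CPU: the warp pays for every path anyone in it takes, regardless of how few threads took it.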
The philosophy of today's GPU architecture is basically quite simple: maximize memory throughput by using the fastest RAM that's still cheap enough for consumers, then maximize die space for ALUs by letting bundles of them share a scheduler, register file and cache. I was very skeptical about this at first, but in my experience it has proven quite effective: even parallel algorithms that are not ideal for this architecture still profit from the raw power, and they keep benefiting when you buy new cards, in a fashion that tracks Moore's law much more closely than CPU performance does.
The architecture certainly isn't ideal, and its problems would be addressed by an architecture like the one in your link (which Parallella also comes quite close to, by the way). I can well imagine that this is where we're heading in another 5-10 years (see Parallella and, to some extent, Knights Landing). However, it's also feasible that the GPU's ALU-maximization game will win out, especially once 3D-stacked memory comes into play.
Since 2008 there have been many papers about neural networks implemented on GPUs, and I'd love to know what the current status is there, especially compared to the very powerful POWER8 architecture.