I'm very excited about this, as it's at least 2 decades overdue. When Pentiums were getting popular in the mid 90s, I remember thinking that their deep pipelines for branch prediction and large on-chip caches meant that fabs were encountering difficulties with Moore's law and it was time to move to multicore.
At the time, functional programming was not exactly mainstream and many of the concurrency concepts we take for granted today from web programming were just research. So of course nobody listened to ranters like me and the world plowed its resources into GPUs and other limited use cases.
My take is that artificial general intelligence (AGI) has always been a hardware problem (which really means a cost problem), because the enormous wastefulness of today's chips can't be overcome with more-of-the-same thinking. Somewhere we forgot that, no, it doesn't take a billion transistors to make an ALU, and no matter how many billion more you add, it's just not going to go any faster. Why are we doing this to ourselves when we have SO much chip area available now and could scale performance linearly with cost? A picture is worth a thousand words:
http://www.extremetech.com/wp-content/uploads/2014/08/IBM_Sy...
I can understand how skeptics might think this will be difficult to program, etc., but what these new designs are really offering is reprogrammable hardware. Sure, we only have ideas right now about what network topologies could saturate a chip like this, but just watch: very soon we'll see some whiz-bang stuff that throws the network out altogether and uses content-addressable storage or some other hash-based scheme, so we can get back to thinking about data, relationships and transformations.
What’s really exciting to me is that this chip will eventually become a coprocessor and networks of these will be connected very cheaply, each specializing in what are often thought of as difficult tasks. Computers are about to become orders of magnitude smarter because we can begin throwing big dumb programs at them like genetic algorithms and study the way that solutions evolve. Whole swaths of computer science have been ignored simply due to their inefficiencies, but soon that just won’t matter anymore.
>>> I remember thinking that their deep pipelines for branch prediction and large on-chip caches meant that fabs were encountering difficulties with Moore's law
It's really a combination of memory latency and pipelining.
Memory latency is absolutely terrible compared to processor speed, and that has nothing to do with Moore's law. It's about 60ns to access main memory, which is ballpark 150 cycles at 2.5GHz. If you have no caches, your 2.5GHz processor is basically throttled to 16MHz. You can buy some of that back with high memory bandwidth and a buffer (read many instructions at a time). But if you have no branch predictor, every taken branch flushes the buffer and costs an extra 150 cycles; in heavily branched code your performance approaches 8MHz.
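The arithmetic behind that claim can be sketched as a toy model. The 2.5 GHz clock and 60 ns latency are the numbers from above; the miss-rate parameter is my addition to show what caches buy you:

```python
# Back-of-envelope model of memory-stall throttling (illustrative, not a
# simulation). Assumes: 2.5 GHz clock, 60 ns DRAM access, 1 cycle per
# instruction when no memory stall occurs.
clock_hz = 2.5e9
dram_latency_s = 60e-9
penalty_cycles = clock_hz * dram_latency_s  # ~150 cycles per memory access

def effective_mhz(miss_rate):
    # Each instruction costs 1 cycle, plus the full DRAM penalty on the
    # fraction of instructions that actually go to main memory.
    cycles_per_instr = 1.0 + miss_rate * penalty_cycles
    return clock_hz / cycles_per_instr / 1e6

print(effective_mhz(1.0))   # every instruction hits DRAM: ~16.6 MHz
print(effective_mhz(0.01))  # 1% miss rate thanks to caches: ~1000 MHz
```

This also illustrates the reply below: the 16MHz figure only holds when every single instruction touches memory.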
Then think about pipelining. We don't pipeline because Moore's law has ended. We pipeline because a two-stage pipeline is 200% as fast as an otherwise identical unpipelined chip. A sixteen-stage pipeline is 1600% as fast. Why the hell wouldn't you pipeline? Now, of course, in the real world branched code can tank a deep pipeline, which is where the branch predictor comes in, buying back performance.
>>> If you have no caches, your 2.5GHz processor is basically throttled to 16MHz.
No. This is only true if every instruction tries to access memory.
>>> We pipeline because a two-stage pipeline is 200% as fast as an otherwise identical unpipelined chip. A sixteen-stage pipeline is 1600% as fast.
No. First of all, the cycle time is set by the slowest stage, so any imbalance between stages is wasted time. Second, there is significant overhead in passing data through pipeline registers, and in the control logic for those registers.
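A toy model makes the tradeoff concrete. All the constants here are illustrative assumptions (latch overhead, branch frequency, misprediction rate), not measurements from any real chip:

```python
# Realistic-ish pipeline speedup vs. the naive "N stages = N times faster"
# claim. Delay of the unpipelined datapath is normalized to 1.0.
def pipeline_speedup(stages, latch_overhead=0.02, branch_freq=0.2,
                     mispredict_rate=0.1):
    # Each stage boundary adds a fixed pipeline-register (latch) overhead
    # to the cycle time.
    cycle = 1.0 / stages + latch_overhead
    # A mispredicted branch flushes roughly the whole pipe, so deeper
    # pipelines pay a bigger penalty per misprediction.
    cpi = 1.0 + branch_freq * mispredict_rate * (stages - 1)
    return 1.0 / (cycle * cpi)

print(pipeline_speedup(2))   # ~1.9x: close to the ideal 2x
print(pipeline_speedup(16))  # ~9.3x: well short of the ideal 16x
```

The speedup is still large, which supports the follow-up point below that the overhead is nowhere near canceling the gain, but it clearly isn't linear in the stage count.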
The reason we saw 31-stage pipelines in the P4 (Prescott) was mostly marketing: the "megahertz race" between AMD and Intel.
You are right, there is appreciable overhead in pipelining, and the benefit is not quite as powerful as I claimed. I am guilty of an age-old crime: simplifying a complex subject for the layman and skipping real details in the process.
But you can be certain that AMD and Intel do not design deep pipelines for some measly 10% performance uplift. The overhead of the pipeline infrastructure is nowhere near the performance gain. Consider that Haswell sustains an IPC of around 2. With a pipeline in the 14- to 19-stage range, it is indeed far outstripping the performance of "Haswell minus pipelining".
As for the super-deep pipeline in the P4, the consensus I hear is that Intel expected frequency to keep scaling, and as such the P4 was a future-looking architecture designed to scale to 10GHz and beyond.
I remember the first time the von Neumann architecture was laid out for me: I thought "whoa, that's bottlenecked," and immediately figured it would make more sense to do the computation where the memory was, or to replace "memory" with a huge pile of registers, or something other than what I was looking at.
This is really exciting stuff. I can't help but think a marriage of this approach with HP's memristor technology would bring us screaming along an amazing architecture path for the next several decades.
But then again, I'm concerned that the limited use cases for this being presented are basically already performed by various custom (and cheap and power efficient) DSPs. Is all that's really being envisioned here just a lower power alternative to DSPs? I think the vision can be much bolder.
Of course it would make sense to do the computation where the memory is. Trouble is the memory area is dramatically larger than the computation area.
Imagine you are a reference librarian, asked for facts like some kind of ancient Google. Suppose your library is the size of your bedroom: you can find facts very quickly, because you only have to cross the room. Now suppose you are smack dab in the middle of the Library of Congress. You are right where the memory is, but you will still spend half your time just running around the building due to its sheer size!
The only ways to solve that problem are:
- Make memory smaller. Engineers have been hard at work at this for decades.
- Use less memory. This is slower.
- Use a memory hierarchy. This is what we do today, and is analogous to you sitting in a bedroom-sized library with the Library of Congress just down the street, and a young courier who fetches you books from it.
The other challenge is speed. We can't have a huge pile of registers because fast memory is less dense than slow memory. So 1KB of CPU registers occupies a lot more space than 1KB of DRAM, but DRAM is a poor choice for registers because of how slow it is.
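The payoff of the memory-hierarchy option can be quantified with the standard average-memory-access-time formula. The latencies and miss rates below are ballpark figures I've chosen for illustration, not numbers from any specific chip:

```python
# Average memory access time (AMAT) for a two-level cache hierarchy,
# in cycles. Assumed figures: 4-cycle L1 hit, 5% L1 miss rate,
# 12-cycle L2 hit, 30% L2 miss rate, 150-cycle DRAM access.
def amat(l1_hit=4.0, l1_miss_rate=0.05,
         l2_hit=12.0, l2_miss_rate=0.30, dram=150.0):
    # On an L1 miss, pay the L2 cost; on an L2 miss, pay the DRAM cost.
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * dram)

print(amat())  # ~6.9 cycles on average, vs. 150 cycles with no caches
```

This is the bedroom-library-plus-courier arrangement in numbers: most requests are satisfied locally, so the rare trip "down the street" barely moves the average.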
>>> Why are we doing this to ourselves when we have SO much chip area available now and could scale performance linearly with cost?
Multicore performance doesn't scale linearly because 1) adding more cores has rapidly diminishing returns for most problems (http://en.wikipedia.org/wiki/Amdahl's_law) and 2) the cost of cache coherency grows superlinearly with the number of cores.
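Point 1 is easy to make concrete. Under Amdahl's law, even a program that is 95% parallelizable tops out at a 20x speedup no matter how many cores you throw at it:

```python
# Amdahl's law: speedup of a program whose parallel fraction p runs on
# n cores while the serial fraction (1 - p) is untouched.
def amdahl_speedup(parallel_fraction, cores):
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

print(amdahl_speedup(0.95, 16))    # ~9.1x on 16 cores
print(amdahl_speedup(0.95, 1000))  # ~19.6x: already near the 20x ceiling
```

The ceiling is 1 / (1 - p), so the serial 5% dominates long before core counts get exotic.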
I wonder whether you're dismissing today's GPUs too quickly. The way they work (and are programmed) today is about halfway between CPUs and the linked architecture. They're generally applicable on the order of 10^3 parallel computations, and with 10^5-10^6 threads they can be saturated. Whether your algorithms are compute-bound or memory-bandwidth-bound doesn't really matter (both are faster on a GPU); what matters is a sufficiently long runtime for the parallelizable part of the application, and not too much branching in compute-bound kernels. There is a point where CPUs become faster when there are too many branches: on the Kepler architecture, for example, the 32 threads of a warp execute in lock-step, so divergent branches within a warp are serialized.
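The divergence penalty at the end of that paragraph can be modeled very simply. This is a sketch of the SIMT execution rule, not a simulation of any real GPU; the path costs are arbitrary illustrative cycle counts:

```python
# SIMT branch divergence: within a warp, every distinct control-flow path
# taken by at least one thread executes serially, with the other threads
# masked off. Total time is the sum of the taken paths' costs.
def warp_cycles(path_costs, threads_per_path, warp_size=32):
    assert sum(threads_per_path) == warp_size
    return sum(cost for cost, n in zip(path_costs, threads_per_path) if n > 0)

# Uniform branch: all 32 threads take the same 10-cycle path.
print(warp_cycles([10, 10], [32, 0]))   # 10 cycles
# Divergent branch: even a single straggler forces both paths to run.
print(warp_cycles([10, 10], [31, 1]))   # 20 cycles
```

This is why heavily branched kernels can lose to a CPU: the warp pays for every path anyone in it takes, regardless of how few threads took it.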
The philosophy of today's GPU architecture is basically quite simple: maximize memory throughput by using the fastest RAM that's still cheap enough for consumers, then maximize die space for ALUs by letting bundles of them share a scheduler, register file and cache. I was very skeptical about this at first, but in my experience it has proven quite effective: even parallel algorithms that are not ideal for this architecture still profit from the raw power, and they keep benefiting when you buy new cards, in a fashion that tracks Moore's law much more closely than CPU performance does.
The architecture certainly isn't ideal, and its problems would be addressed by an architecture like the one in your link (which Parallella also comes quite close to, by the way). I can well imagine that this is where we're heading in another 5-10 years (see Parallella and, to some extent, Knights Landing). However, it's also feasible that the GPU's ALU-maximization game will win out, especially once 3D-stacked memory comes into play.
Since 2008 there have been many papers about neural networks implemented on GPUs, and I'd love to know what the current status is there, especially compared to the very powerful POWER8 architecture.