This article would make more sense if it included the results of a simulated workload showing how much time is lost to interrupt latency and how much processor time a different technique could save.
> We could have a simple cycle timer switch on each core so that after the timer expires there is an interrupt-like jump to a function to see what to do next. That jump would be perfectly synchronous since predicting the next jump can be done with 100% accuracy (or nearly 100%).
In other words, a timer interrupt - with saving of state and appropriate unwinding of pipeline state (abandoning half-done or out-of-order instructions, etc.)
Also the "do system calls by queuing requests to another CPU" is kind of at odds with "we don't need cache coherency"
Not necessarily; it's an interesting idea. Thanks to the branch predictor the CPU already has a virtual view of the instruction stream. If we tolerate a bit of latency, all we have to do is inject a "jump to ISR" magic instruction in the predicted stream. Rather like self-modifying code, except without modifying the code in memory, just at the instruction fetch point. State still has to be saved but that can be done with PUSH instructions in the ISR.
> Also the "do system calls by queuing requests to another CPU" is kind of at odds with "we don't need cache coherency"
It can be done with mailboxes/FIFOs, but yes, this requires a dedicated design. And of course the CPU that makes the call is then idle while it waits, I think?
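A minimal sketch of such a mailbox, assuming one application core pushing requests and one dedicated "syscall core" popping them (single producer, single consumer). The C11 acquire/release atomics stand in for whatever ordering guarantees the real hardware would provide; the `syscall_no`/`arg` layout is made up for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SLOTS 16u  /* power of two */

typedef struct {
    uint32_t syscall_no;
    uint32_t arg;
} request_t;

typedef struct {
    request_t slot[SLOTS];
    _Atomic uint32_t head;  /* advanced only by the consumer core */
    _Atomic uint32_t tail;  /* advanced only by the producer core */
} mailbox_t;

/* Called by the application core: enqueue a syscall request. */
bool mb_push(mailbox_t *mb, request_t r) {
    uint32_t t = atomic_load_explicit(&mb->tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&mb->head, memory_order_acquire);
    if (t - h == SLOTS)  /* full */
        return false;
    mb->slot[t % SLOTS] = r;
    atomic_store_explicit(&mb->tail, t + 1, memory_order_release);
    return true;
}

/* Called by the syscall core: dequeue the next request, if any. */
bool mb_pop(mailbox_t *mb, request_t *out) {
    uint32_t h = atomic_load_explicit(&mb->head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&mb->tail, memory_order_acquire);
    if (h == t)  /* empty */
        return false;
    *out = mb->slot[h % SLOTS];
    atomic_store_explicit(&mb->head, h + 1, memory_order_release);
    return true;
}
```

Because only one core writes each index, no lock and no general cache-coherency protocol is needed - just one-way visibility of the ring slots, which is the point being made above.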
Good explanation of my overly terse note.
The current standard interrupt architecture imposes enormous latency, and the synchronous timer could be more precise. The simplest implementation would just fetch a jmp every N instructions (with N programmable) - just like voluntary switching, but where the processor volunteers on the program's behalf.
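In software terms, the idea looks something like the sketch below: a per-core countdown that the hardware would decrement on every retired instruction, with a compiler-inserted check at each loop back-edge standing in for the fetch-stage jmp substitution. `N` and the `scheduler_hook` name are hypothetical.

```c
#include <stdint.h>

#define N 1000u  /* programmable quantum, in "instructions" */

static uint32_t countdown = N;

/* In hardware this would be the synchronous jump target; state
 * saving happens here, at a known-clean instruction boundary. */
static void scheduler_hook(void) {
    countdown = N;  /* rearm for the next quantum */
}

/* Stand-in for the hardware fetching a jmp every N instructions. */
static inline void maybe_yield(void) {
    if (--countdown == 0)
        scheduler_hook();
}

int work(int iterations) {
    int acc = 0;
    for (int i = 0; i < iterations; i++) {
        acc += i;
        maybe_yield();  /* the "volunteered" switch point */
    }
    return acc;
}
```

The key property is that the switch always lands at an instruction boundary the processor chose in advance, so nothing speculative has to be thrown away - unlike an asynchronous interrupt.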
That does not solve the problem at all. It just increases the number of "hyper-threads": if a new process gets started and all cores are busy, that process might never run.
It solves the problem for environments where problems like interrupt latency and timing criticality usually show up - embedded and real-time systems. In many systems, the set of running tasks is fixed - there are even some very simple real-time operating systems (such as some OSEK configurations in the automotive sector) which require the set of tasks to be statically defined at compile time. After all, you don't suddenly feel the urge to start a game of Doom on your car's ABS controller :) (though, of course, somebody will try to do this...).
The (early) XMOS chips, for example, run at 500 MHz with four threads or, if you needed more threads, you could configure the system to run eight threads at half the speed, IIRC. If you used e.g. three threads, some execution time simply remained unused in four-thread mode; there was no arbitrary division of time by the number of threads.
For real-time-critical systems, you could then still run up to seven critical threads at guaranteed speed and reserve the remaining one for non-timing-critical tasks (which you could then schedule using cooperative multitasking).
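For that one non-critical thread, the usual pattern is a round-robin over run-to-completion step functions - each "task" does a small amount of work and yields simply by returning. A minimal sketch, with made-up task names:

```c
/* Cooperative round-robin for the soft tasks hosted on one
 * hardware thread; the other seven threads stay untouched. */

static int steps;  /* counts task invocations, for illustration */

static void poll_uart(void) { steps++; /* drain a byte, then return */ }
static void blink_led(void) { steps++; /* toggle a status LED */ }
static void log_stats(void) { steps++; /* emit periodic counters */ }

typedef void (*task_fn)(void);
static task_fn tasks[] = { poll_uart, blink_led, log_stats };

void run_soft_tasks(int rounds) {
    int n = (int)(sizeof tasks / sizeof tasks[0]);
    for (int r = 0; r < rounds; r++)
        for (int i = 0; i < n; i++)
            tasks[i]();  /* each task yields by returning */
}
```

No preemption, no context saving - which is exactly why it's fine for the non-critical work and useless for the seven guaranteed threads.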
The RAM was fast on-chip SRAM, so there were none of the refresh and access-latency problems you get with DRAM. However, you were constrained to 64 kB of RAM per core (probably not enough to run Doom...).
The XMOS development toolchain even includes a real-time analyzer for the C/C++ code you throw at it. Unfortunately, most of the XMOS toolchain is closed source.