Production question, then, for those who know about these things: how far ahead would Apple have locked in its RAM prices for this line, at least for the units in the initial release?
> It still has a splash screen and takes quite a long time to load, like an application from the 90s.
Lots of it is single-threaded, which is an endless frustration on a machine with umpteen cores. It's especially frustrating because it means compute happens on the UI thread.
I think you're an order of magnitude out. Motorola shipped 36.6 million handsets total across 2024. They seem to have had 33 handset models available in that period, which averages out to roughly 1.1M handsets per model, and they were in profit, so the break-even point is presumably somewhere below 1.1M handsets per model.
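For anyone who wants to check the back-of-envelope sum, here's a rough sketch. It assumes sales were spread evenly across the 33 models, which they certainly weren't, so treat the result as a ceiling on the average rather than a real per-model figure. The function name is made up for illustration.

```python
def average_units_per_model(total_units: float, model_count: int) -> float:
    """Mean units shipped per model, assuming an even split (a simplification)."""
    return total_units / model_count


# Figures from the comment above: 36.6M handsets across 33 models in 2024.
avg = average_units_per_model(36.6e6, 33)
print(f"{avg / 1e6:.2f}M handsets per model on average")
```

If the line as a whole was profitable, the average break-even per model has to sit somewhere under that number.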
If I'm off for the second group I'm probably also off for the first one. I'd be surprised if a purely privacy focused phone sells more than 200k units per year.
The architecture is also important: there's a trade-off for MoE. There used to be a rough rule of thumb that a 35B-A3B model (35B total parameters, 3B active per token) would be equivalent in smarts to an 11B dense model, give or take, but that's not been accurate for a while.
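One commonly quoted form of that old rule of thumb is the geometric mean of total and active parameters. I'm assuming that's the version meant here; it lands a little under 11B, which fits the "give or take". A quick sketch, with a made-up function name, and not something to rely on for current models:

```python
import math


def dense_equivalent_b(total_b: float, active_b: float) -> float:
    """Old heuristic (assumed form): geometric mean of total and active params, in billions."""
    return math.sqrt(total_b * active_b)


# 35B total, 3B active per token
estimate = dense_equivalent_b(35, 3)
print(f"~{estimate:.1f}B dense-equivalent under the old heuristic")
```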
It's not only "non-Chinese" to think about here. There's nobody really touching Qwen in the single-GPU size class and there hasn't been for a couple of generations.
I've got the unsloth q4_K_XL 35b running in llama.cpp on an i9/64G/4090 machine doing double-digit tokens per second with a 90k+ token context window available. The model's completely in VRAM.
There's also work on ternary models that's quite interesting, because the arithmetic operations are super fast and they're extremely cache efficient. Well worth looking into if that's the sort of thing that interests you.