Maybe RL? Just like similar corrections in reasoning traces. You can train non-'thinking' models the same way (though if you're naive about it then you might end up with responses that are similarly rambly), and I'd expect it to have been
There isn't, though you can run it through wasm. I tried it a while back with a port of the w2c2 transpiler (https://github.com/euclaise/w2c9/), but something like wazero is a more obvious choice.
This is not exactly propaganda in the typical sense, but people clearly do edit Wikipedia successfully to further their own objectives. As an example, the Wikipedia page for Meta-analysis (not even that obscure a topic) currently contains content that plausibly seems intended to promote Suhail Doi's methods, and it appears to have been like this for a number of years. It cites 5 of his papers, more than anyone else's, of which the most-cited has 297 citations. It has a subsection devoted to his method of meta-analysis, despite it being rather obscure and rarely used. Additional subsections have been added over time, also focused on somewhat obscure areas, and frankly those additions are sketchy in similar ways.
In general, it is not uncommon to come across slant like this. Is it completely, 100% clear that Doi came along and maliciously added his own papers? Not quite, but good propaganda wouldn't be, either; it would actually look far less suspicious.
Yes, but the claim is about "unlimited context length." I doubt attention over each segment can be as good at recall as attention over the full input context.
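A toy sketch of the distinction (just plain dot-product attention; none of this comes from the paper in question): a full-context query can place weight on any token directly, while a per-segment query gives zero weight to everything outside its chunk, so recall of early tokens has to survive whatever compressed state bridges the segments.

```python
import torch
import torch.nn.functional as F

# Minimal single-head attention, for illustration only.
def attend(q, k, v):
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
d, n = 16, 8
seq = torch.randn(n, d)
q = seq[-1:]                               # query from the final token

full = attend(q, seq, seq)                 # can weight all n tokens directly

# "Unlimited context" via segments: attend within each chunk separately.
chunks = seq.split(n // 2)
seg_outputs = [attend(q, c, c) for c in chunks]

# The second chunk's output assigns zero weight to tokens 0..3; any recall
# of them must come from whatever state is carried between segments.
print(full.shape, seg_outputs[1].shape)
```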
A lot of embedding models are built on top of T5's encoder; this offers a new option.
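For concreteness, a minimal sketch of the usual recipe: take just the encoder and mean-pool its hidden states into a sentence embedding. The tiny randomly-initialized config is an assumption so the sketch runs without downloads; in practice you'd use `T5EncoderModel.from_pretrained(...)` with a trained checkpoint.

```python
import torch
from transformers import T5Config, T5EncoderModel

# Tiny random config purely for illustration (real models load pretrained weights).
cfg = T5Config(d_model=64, d_ff=128, num_layers=2, num_heads=4, vocab_size=1000)
model = T5EncoderModel(cfg).eval()

def embed(input_ids, attention_mask):
    with torch.no_grad():
        hidden = model(input_ids=input_ids,
                       attention_mask=attention_mask).last_hidden_state  # (B, T, H)
    mask = attention_mask.unsqueeze(-1).float()
    # Mean-pool over real (non-padding) tokens only.
    return (hidden * mask).sum(1) / mask.sum(1)

ids = torch.randint(0, 1000, (2, 8))
mask = torch.ones(2, 8, dtype=torch.long)
emb = embed(ids, mask)
print(emb.shape)  # one d_model-sized vector per input
```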
The modularity of the enc-dec approach is useful: you can insert additional models in between (e.g. a diffusion model), use different encoders for different modalities, etc.
There's a new 7B version that was trained on more tokens, with longer context, and there's now a 14B version that competes with Llama 34B in some benchmarks.