I'd agree that this effect is probably driven mainly by architectural parameters such as the number of attention heads, head dimensions, and hidden dimension, rather than by overall model size (parameter count) or the amount of training.
I also saw something about Sonnet 4.6 having had a greatly increased amount of RL training compared to 4.5.