Either I'm missing something or this is way overstated.
Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.
They also seem to use a loss that aligns the SAE's activations with labelled concepts. However, this is an example of "The Most Forbidden Technique" [1], and could make the model appear interpretable without the attributed concepts actually having a causal effect on the model's decisions.
You are missing a few things, but you got some things right.
1) This is not an SAE in the way you think. It is a combination of a supervised + unsupervised layer that is constrained. An SAE is typically completely unsupervised, and applied post hoc. Here, we supervise 33k of the concepts with labels that we carefully curated. We then have an unsupervised component (similar to a top-k SAE) that we constrain to be independent from the supervised concepts. We don't do any of this post hoc, by the way; this is a key constraint. I'll get back to this. We train that unsupervised layer along with the model during pre-training.
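For concreteness, here's a rough numpy sketch of what a supervised-plus-top-k layer like that could look like. The dimensions, names, and exact sparsification are my guesses for illustration, not the actual architecture (and the independence constraint between the two blocks is omitted):

```python
import numpy as np

def concept_layer(h, W_sup, W_unsup, k):
    """Toy sketch (not the real implementation): hidden states h are mapped to
    supervised concept activations plus a top-k-sparsified unsupervised block."""
    sup = np.maximum(h @ W_sup, 0.0)      # supervised concepts, trained against curated labels
    unsup = np.maximum(h @ W_unsup, 0.0)  # free features, like a top-k SAE
    # keep only the k largest unsupervised activations per example, zero the rest
    drop_idx = np.argsort(unsup, axis=-1)[..., :-k]
    sparse = unsup.copy()
    np.put_along_axis(sparse, drop_idx, 0.0, axis=-1)
    return np.concatenate([sup, sparse], axis=-1)

rng = np.random.default_rng(0)
W_sup = rng.normal(size=(16, 8))    # 8 "supervised" concepts in this toy
W_unsup = rng.normal(size=(16, 32)) # 32 unsupervised features, k kept
acts = concept_layer(rng.normal(size=(2, 16)), W_sup, W_unsup, k=4)
```

The point of the sketch is just the shape of the design: the supervised block carries the curated concepts, while the sparse unsupervised block mops up whatever the labels don't cover, and both are trained jointly with the model rather than fitted afterwards.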
2) Are the concepts or features causally influential for the output? We directly use the combination of the concepts for the lm head, which is a linear transform (with activation), so we can tell you, in closed form, the effect of ANY concept on the output logit for any token (or group of tokens) generated. It is not just causally related; it is constrained to be so.
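Since the head is linear over the concept activations, that decomposition is exact. A toy numpy illustration (shapes and names are mine, not theirs; I'm assuming no bias term for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
n_concepts, vocab = 10, 50
W_head = rng.normal(size=(n_concepts, vocab))            # linear LM head
acts = np.maximum(rng.normal(size=n_concepts), 0.0)      # concept activations

logits = acts @ W_head
# per-concept contribution to every token logit, in closed form:
contrib = acts[:, None] * W_head                          # (n_concepts, vocab)
# the contributions sum exactly to the logits, so the effect of concept c
# on token v is precisely contrib[c, v]; ablating concept c shifts the
# logit of v by exactly -contrib[c, v].
```

That exactness is what's being claimed: no attribution method or approximation is needed, because the linear head makes each concept's share of every logit a single multiply.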
3) Other points: we also make it so that you can trace the model outputs to the training data. This is an underrated interpretability knob. You know where, and from what data, your model learned a particular feature.
This is already a long comment, but I want to close on why our approach sidesteps all the issues with SAEs.
- If you train an SAE twice, on the same data + model, you'll get two different sets of features.
- In fact, there is no reason why the SAE should pick features that are causally influential for the output.
- ALL of these problems stem from the fact that the SAE is trained AFTER you already trained your model. Training from scratch AND with supervision allows you to sidestep these issues, and even learn more disentangled representations.
Happy to more concretely justify the above. Great observations!
So... they can't actually "generate near-verbatim copies of novels"?
If they end a single sentence differently than the original, then the next sentence will be different and so on until you get a very different novel. Sure they could course-correct back towards the original plot, but it's going to be a challenge to stay on target when every third sentence is incorrect.
It's a good way to frame base models that have only been pretrained.
However, modern frontier models have undergone rounds of fine-tuning, RLHF (reinforcement learning from human feedback), and RLVR (RL from verifiable rewards) that turn them into something else. The compressed internet is still in there, but it's wrapped in problem-solving and people-pleasing circuitry.
Hacker News gets a lot less creepy/sad/interesting when you ignore the first-person pronouns and remember they're just biomolecular machines. It's a scaled-up version of E. coli. Useful, sure, but there's no reason to ascribe emotions to it. It's just chemical chain reactions.
The only thing I know for sure is that I exist. Given that I exist, it makes sense to me that others of the same rough form as me also exist. My parents, friends, etc. Extrapolating further, it also makes sense to assume (pre-AI, pre-bots) that most comments have a human consciousness behind them. Yes, humans are machines, but we're not just machines. So kindly sod off with that kind of comment.
But if you weren't one of them, would you be able to tell that they had emotions (and not just simulations of emotions) by looking at them from the outside?
If I wasn’t one of them I wouldn’t care. It’s like caring about trees having branches. They just do. The trees probably care a great deal about their branches though, like I care a great deal about my emotions.
Yes, my point was that people aren't better than machines, but just because I don't exceptionalize humanity doesn't mean I don't appreciate it for what it is (in fact I would argue that the lack of exceptionality makes us more profound).
I wouldn't proclaim a lack of exceptionality until we get human level AI. There could still be some secrets left in these squishy brains we carry around.
The goal of world models like Genie is to be a way for AI and robots to "imagine" things. Then, they could practice tasks inside of the simulated world or reason about actions by simulating their outcome.