Most interpretability methods fail for LLMs because they try to explain outputs ...

adebayoj · 2026-02-24T07:56:35 1771919795

OP here, I mostly agree with your comment! However, our model does more than this. For any chunk the model generates, it can answer: which concept, in the model's representations, was responsible for those token(s)? In fact, we can also answer the question: what training data caused that output to be generated? We force this to be a constraint as part of the architecture and the loss function for how you train the model. In fact, you can get the high-level reasons for a model's answer on complex problems.

codeflo · 2026-02-24T08:14:33 1771920873

All of the examples on the linked page seem to be "good" outputs. Attribution sounds most useful to me in cases where an LLM produces the typical kind of garbage response: wrong information in the training data, hallucinations, sycophancy, over-eagerly pattern matching to unasked but similar, well-known questions. Can you give an example of a bad output, and show what the attribution tells us?

adebayoj · 2026-02-24T08:32:29 1771921949

You got it exactly right. Guilty as charged. Over the coming weeks, we will be showcasing exactly how you can debug all of these examples.

I agree that attribution is most useful for debugging and auditing. This is a prime use case for us. We have a post with exciting results lined up to do this. It should be out in a week; we wanted to at least get the initial model out first :)

Grimblewald · 2026-02-24T11:25:02 1771932302

What I am reading here is that when the model is wrong, it still (at least sometimes) confidently attributes the answer to some knowledge base, is that correct? If that is the case, how is this different from simply predicting the vibe of a given corpus and assigning provenance to it? Much less impressive imo, and something most models can do without explicit training. All precision, no recall, as it were.

gchamonlive · 2026-02-24T11:58:17 1771934297

I think this was answered before, with the constraints of the architecture of the model. You can't expect something fundamentally different from an LLM, because that's how they work. It's different from other models because they were not designed for this. Maybe you were expecting more, but that's not OP's fault or demerit.

Grimblewald · 2026-02-24T12:28:06 1771936086

What you're saying fits my understanding/expectations. However, the post and the user I am replying to seem to imply otherwise. This makes me wonder: is my understanding incomplete, or is this post marketing hype dressed up as insight? So I am asking for transparency.