Show HN: Steerling-8B, a language model that can explain any token it generates (guidelabs.ai)
171 points by adebayoj 10 hours ago | hide | past | favorite | 40 comments



Most interpretability methods fail for LLMs because they try to explain outputs without modeling the intent, constraints, or internal structure that produced them. Token‑level attribution is useful, but without a framework for how the model reasons, you’re still explaining shadows on the wall.

op here, I mostly agree with your comment! However, our model does more than this. For any chunk the model generates, it can answer: which concept, in the model's representations, was responsible for those tokens. In fact, we can also answer the question: what training data caused that output to be generated! We enforce this as a constraint in the architecture and the loss function used to train the model. As a result, you can get the high-level reasons for a model's answer on complex problems.

All of the examples on the linked page seem to be "good" outputs. Attribution sounds most useful to me in cases where an LLM produces the typical kind of garbage response: wrong information in the training data, hallucinations, sycophancy, over-eagerly pattern matching to unasked but similar, well-known questions. Can you give an example of a bad output, and show what the attribution tells us?

You got it exactly right. Guilty as charged. Over the coming weeks, we will be showcasing exactly how you can debug all of these examples.

I agree that attribution is most useful for debugging and auditing. This is a prime use case for us. We have a post with exciting results lined up for exactly this. It should be out in a week; we wanted to get the initial model out first :)


It's a neat party trick, but explainability isn't a solution to any AI safety issue I care about. It's a distraction from the real problems, which are everything else around the model: the inflexible bureaucratic systems that make it hard to exercise rights and that deflect accountability.

op here. Important point, but I disagree. We see explainability/interpretability as a CORE need for AI safety. We believe you can't align/audit/debug/fix a system that you don't understand.

Just to give you some answers for what we can do:

1) We can find the training data that is causing a model to output toxic/unwanted text and correct it. 2) We know which high-level concepts the model is relying on for any group of tokens it generates, so reducing that kind of generation is as simple as toggling that concept's effect on the output.

Most AI safety techniques fall under fine-tuning. Our model allows you to do this without fine-tuning: you can toggle the presence of a concept directly.
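To make "toggling a concept" concrete, here's a minimal NumPy sketch. All names and dimensions here are invented for illustration (the real model is far larger); the only assumption taken from the post is that concept activations are decoded into logits by a linear map, so rescaling one concept's activation changes the output by exactly that concept's linear contribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, vocab = 16, 100                       # toy sizes, not the real model's

W_dec = rng.normal(size=(n_concepts, vocab))      # concept -> logits decoder
c = np.maximum(rng.normal(size=n_concepts), 0.0)  # concept activations (post-ReLU)

TOXICITY = 3  # index of a hypothetical supervised "toxicity" concept

def decode(c, concept=None, scale=1.0):
    """Decode logits, optionally rescaling one concept's activation."""
    c = c.copy()
    if concept is not None:
        c[concept] *= scale   # scale=0.0 removes the concept, >1.0 amplifies it
    return c @ W_dec

base = decode(c)
damped = decode(c, concept=TOXICITY, scale=0.0)
# Because the decoder is linear, the change in logits is exactly the
# toxicity concept's contribution:
delta = base - damped
```

No gradient updates or fine-tuning are involved; the intervention is a single multiplication on the concept vector at inference time.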

For example, wouldn't you like to know why a model is being sycophantic? Or sandbagging? Is a particular kind of training data causing this? Or is it some high-level part of the model's representations? For any of these, our model can tell you exactly why it generated that output. Over the coming weeks, we'll show exactly how you can do this!


This is fantastic to read. LLMs feel like black boxes and for the large ones especially I have a sense they genuinely form concepts. Yet the internals were opaque. I remember reading how LLMs cannot explain their own behaviour when asked.

I feel this would give insight into all that including the degree of true conceptualisation. I’m curious if this can demonstrate what else the model is aware of when answering, too.


Our decomposition allows us to answer questions like: for this answer, 84 percent of the model's representation relies on this particular concept.

We can also trace its behavior to the training data that led to it, so that can show us where some of these concepts are formed from.


Looks very interesting. Is there a published paper/article on your algorithm? I'd like to take a stab at implementing this on my own.

I could find this [0], but not sure if that represents the entire system? (Apologies, I am not that well versed in ML)

[0] - https://www.guidelabs.ai/post/scaling-interpretable-models-8...


Yes, that is the post that has the most up to date details of the model architecture. Take a look at this: https://github.com/guidelabs/steerling. It has the scaffolding for what you need :)

This seems really interesting. While Anthropic tried to use dictionary learning over an existing model to try to extract concepts, this almost feels like training the model alongside the dictionary itself (or rather, the model and the dictionary are intertwined).

You are exactly right, it is guiding the model, during training, with concepts and the dictionary. This is important because dictionary learning for interpretability (post hoc) is not currently reliable: https://www.arxiv.org/abs/2602.14111

Is there a reason people don't use SHAP [1] to interpret language models more often? The in-context attribution of outputs seems very similar.

[1] https://shap.readthedocs.io/en/latest/


SHAP would be absurdly expensive to do for even tiny models (naive SHAP scales exponentially in the number of parameters; you can sample your coalitions to do better but those samples are going to be ridiculously sparse when you're talking about billions of parameters) and provides very little explanatory power for deep neural nets.
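To see why exact Shapley values blow up, here's a toy sketch (not tied to the SHAP library) that computes them by enumerating every coalition. Each "player" could stand in for a parameter or feature; already at n = 10 each player requires 2^9 = 512 coalition evaluations, and the count doubles with every additional player:

```python
from itertools import combinations
from math import factorial

def exact_shapley(players, value_fn):
    """Exact Shapley values by enumerating every coalition.

    Cost: each player requires 2^(n-1) coalition evaluations,
    which is why this is hopeless for billions of parameters."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(len(others) + 1):
            for coalition in combinations(others, k):
                s = len(coalition)
                weight = factorial(s) * factorial(n - s - 1) / factorial(n)
                total += weight * (value_fn(set(coalition) | {p})
                                   - value_fn(set(coalition)))
        phi[p] = total
    return phi

# Toy "model": a coalition's value is its size squared. By symmetry,
# every player's Shapley value is value_fn(all)/n = 100/10 = 10.
players = list(range(10))
value_fn = lambda S: len(S) ** 2
phi = exact_shapley(players, value_fn)
```

The efficiency axiom guarantees the attributions sum to value_fn(all players) minus value_fn(empty set); sampled approximations trade that exactness for tractability, which is the sparsity problem mentioned above.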

SHAP basically does point by point ablation across all possible subsets, which really doesn't make sense for LLMs. This is simultaneously too specific and too general.

It's too specific because interesting LLM behavior often requires talking about what ensembles of neurons do (e.g. "circuits" if you're of the mechanistic interpretability bent), and SHAP's parameter-by-parameter approach is completely incapable of explaining this. This is exacerbated by the fact that not all neurons are "semantically equal" in a deep network. Neurons in the deeper layers often do qualitatively different things than earlier layers, and the ways they compose can completely confuse SHAP.

It's too general because parameters often play many roles at once (one specific hypothesis here is the superposition hypothesis) and so you need some way of splitting up a single parameter into interpretable parts that SHAP doesn't do.

I don't know the specifics of what this particular model's approach is.

But SHAP unfortunately does not work for LLMs at all.


Completely agree with all your points!

Here is what this model does: it `rewrites` the model's activations (during pre-training) into supervised + unsupervised concepts that are then decoded into tokens. So at pre-training, we constrained the model with 33k supervised concepts (e.g., sports, toxicity, alignment, demographic variables), and then have more (101k) unsupervised concepts for the model to learn as well.

Overall, the architecture and loss functions of this model allow you to answer the following questions: 1) Which token in the context caused a chunk (group of tokens) to be generated? 2) Which high-level concept (supervised or unsupervised) caused that chunk to be generated? 3) Perhaps most interestingly, in a single forward pass, we can tell you which training chunk led to the output of the model as well.
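As I understand the description above, a minimal sketch of the forward pass might look like the following. Everything here (dimensions, ReLU, the exact top-k mechanism) is an assumption for illustration; the only structure taken from the comment is supervised + unsupervised concept layers feeding a linear LM head:

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_SUP, N_UNSUP, VOCAB, TOPK = 64, 33, 101, 500, 8  # toy sizes

W_sup = rng.normal(size=(D_MODEL, N_SUP))      # supervised concept encoder
W_unsup = rng.normal(size=(D_MODEL, N_UNSUP))  # unsupervised concept encoder
W_head = rng.normal(size=(N_SUP + N_UNSUP, VOCAB))  # linear LM head over concepts

def forward(h):
    """Rewrite a hidden state h into concept activations, then decode to logits."""
    sup = np.maximum(W_sup.T @ h, 0.0)    # supervised concepts (trained against labels)
    unsup = np.maximum(W_unsup.T @ h, 0.0)
    keep = np.argsort(unsup)[-TOPK:]      # top-k sparsity, as in a top-k SAE
    mask = np.zeros_like(unsup)
    mask[keep] = 1.0
    c = np.concatenate([sup, unsup * mask])
    logits = W_head.T @ c                 # linear head: per-concept attribution is closed form
    return logits, c

logits, c = forward(rng.normal(size=D_MODEL))
```

Because every logit is a linear function of the concept vector `c`, each concept's share of any output token falls directly out of the forward pass, with no post hoc probing.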

We do all of this with a single Steerling model, which is 8B parameters trained on 1.5T tokens. It's the first time any model of this scale has achieved this level of interpretability by design.

Would be happy to answer more questions.


This is very interesting. I don't see much discussion of interpretability in the day-to-day discourse of AI builders. I wonder if everyone assumes it to either be solved, or to be too out of reach to bother stopping and thinking about.

Most interpretability techniques have yet to be shown to be useful for everyday model pipelines. However, the field is working hard to change this.

Maybe I’m not creative enough to see the potential, but what value does this bring?

Given the example I saw about CRISPR, what does this model give over a different, non-explaining model? Does it really make me more confident in the output if I know the data came from arXiv or Wikipedia?

I find that LLM outputs are subtly wrong, not obviously wrong.


It makes the black box slightly more transparent. Knowing more in this regard allows us to be more precise: you go from prompt-tweaking witchcraft and divination to something closer to a science with precise methods.

Can this method be extended to go down to the sentence level?

In the example it shows how much of the reason for an answer is due to data from Wikipedia. Can it drill down to show the paragraph or sentence that influences the answer?


Your question should be "Can it drill down to show the paragraphs or sentences that influence the answer?"

I believe that the plagiarism complaint about llm models comes from the assumption that there is a one-to-one relationship between training and answers. I think the real and delightfully messier situation is that there is a many-to-one relationship.


The example on the website shows this as well: Wikipedia, an arXiv article, etc., along with a ratio of how much each influences the chunk of the answer.

Exactly! We will have a future post that shows this more granularly over the coming weeks. Here is a post we wrote on how this works at smaller scale: https://www.guidelabs.ai/post/prism/

Great questions. We have several posts in the works that will drill down more into these things. The model was actually designed to answer these questions for any sentence (or group of tokens it generates).

It can tell you which specific text (chunk) in the training data that led to the output the model generated. We plan to show more concrete demos of this capability over the coming weeks.

It can tell you where in the model's representation it learned about science, art, religion, etc. And you can trace all of these to either the input context, the training data, or the model's representations.


the practical value here is for regulated domains. in healthcare and finance you often can't deploy a model at all unless you can explain why it made a specific decision. token-level attribution that traces back to training data sources could satisfy audit requirements that currently block LLM adoption entirely.

curious how the performance compares to a standard llama 8b on benchmarks - interpretability usually comes with a quality tax.


Good point. Historically, people have thought that there is an interpretability vs. quality/performance tax. This is not true; at least not in this case.

Here are a bunch of questions you can answer without any quality degradation with interpretable models: 1) what part of the input context led to the output chunk that the model generated? 2) what part of the training data led to the output chunk?

In this case, we go more invasive, and actually constrain the model to also use human understandable concepts in its representations. You might think this leads to quality trade-offs. However, if you allow for the model to discover its own concepts as well (as long as they are not duplicates of the concepts you provided it), you don't see huge degradation.

I agree with the other commenters that this now gives us a huge boost in debugging the model.


the quality tax framing might actually undersell the value in regulated domains. if a hospital system can't deploy without explainability, a model that scores 95% and can trace its reasoning beats one that scores 97% and can't. the baseline isn't 'interpretable model vs better model' -- it's 'interpretable model vs no model at all.'

in the "Performance" section of the post: https://www.guidelabs.ai/post/steerling-8b-base-model-releas..., the authors show the model lags behind llama 8b, but it's worth noting that llama 8b was trained with > 2x more compute (see the FLOPs axis)

Thanks for pointing this out. Llama 3 8B was trained on ~15T tokens. The Qwen models on 15-18T tokens as well. We trained on 1.35T tokens and are within striking distance of these models on benchmarks. We expect to, at the very minimum, match these models' performance when we scale our token budget.

One side effect that we are excited about is that interpretable model training might make for a data efficient training process.


If this decomposition actually holds, it's the first model where you could show a regulator why it produced a given output.

I doubt that a regulator would be satisfied by the kinds of explanations this provides and the interventions it enables.

Suppose somebody put an LLM in charge of an industrial control system and it increased the temperature so much that it caused an accident. The input feature attribution analysis shows that the model was strongly influenced by the tokens describing the temperature control mechanism, concept attribution shows that it output tokens related to temperature, industrial processes and LLM tool-call syntax.

The operator proposes to fix this by rewriting the description and downweighting the temperature concept in the output, and a simulation shows that with these changes the model doesn't make the same decisions in this situation anymore. Should the regulator accept this explanation as sufficient to establish that the system is now safe?

If the controller has just a few parameters and responds approximately linearly to changes in its inputs, you can in principle guarantee that it'll stay within a safe zone. But LLMs have a huge number of parameters and by design highly nonlinear behavior. A simple explanation is unlikely to reflect model behavior accurately enough that you can trust its predictions to hold in arbitrary situations.


It does :) We constrained the model to do exactly this during training: https://www.guidelabs.ai/post/scaling-interpretable-models-8....

thanks for getting back to me, very cool if true :) I have been asked about this many times when talking LLM use cases at the enterprise level. Would love to run some tests, please shoot me a message at the email in my profile.

sounds great! Will follow up via email.

Either I'm missing something or this is way overstated.

Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.

They also use a loss that aligns the SAE's activations with labelled concepts? However, this is an example of "The Most Forbidden Technique" [1], and could make the model appear interpretable without the attributed concepts actually having a causal effect on the model's decisions.

1: https://thezvi.substack.com/p/the-most-forbidden-technique


You are missing a few things, but you got some things right.

1) This is not an SAE in the way you think. It is a combination of a supervised + unsupervised layer that is constrained. An SAE is typically completely unsupervised and applied post hoc. Here, we supervise 33k of the concepts, which we carefully curated. We then have an unsupervised component (similar to a top-k SAE) that we constrain to be independent of the supervised concepts. None of this is done post hoc, by the way; this is a key constraint, and I'll get back to it. We train that unsupervised layer along with the model during pre-training.

2) Are the concepts or features causally influential for the output? We use the combination of the concepts directly for the LM head, which is a linear transform (with activation), so we can tell you, in closed form, the effect of ANY concept on the output logit for any token (or group of tokens) generated. It is not just causally related; it is constrained to be so.
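A quick sketch of why a linear head makes attribution closed form (toy sizes and names, purely illustrative): each concept's contribution to a token's logit is just activation times head weight, so the contributions sum exactly to the logit, and ablating a concept shifts the logit by exactly minus its contribution:

```python
import numpy as np

rng = np.random.default_rng(1)
n_concepts, vocab = 12, 50                       # toy sizes

W_head = rng.normal(size=(n_concepts, vocab))    # linear LM head over concepts
c = np.maximum(rng.normal(size=n_concepts), 0.0) # concept activations (post-ReLU)

token_id = 7
logit = c @ W_head[:, token_id]

# Closed form: concept i contributes exactly c[i] * W_head[i, token_id].
contrib = c * W_head[:, token_id]
assert np.isclose(contrib.sum(), logit)          # contributions sum to the logit

# Ablating concept i shifts the logit by exactly -contrib[i]:
i = int(np.argmax(np.abs(contrib)))
c_ablated = c.copy()
c_ablated[i] = 0.0
assert np.isclose(c_ablated @ W_head[:, token_id], logit - contrib[i])
```

No sampling or approximation is needed, in contrast to the SHAP-style coalition enumeration discussed upthread.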

3) Other points: we also make it so that you can trace the model outputs to the training data. This is an underrated interpretability knob. You know where, and what data, caused your model to learn a particular feature.

This is already a long comment, but I want to close on why our approach sidesteps all the issues with SAEs: - If you train an SAE twice on the same data + model, you'll get two different sets of features. - In fact, there is no reason why the SAE should pick features that are causally influential for the output. - ALL of these problems stem from the fact that the SAE is trained AFTER you already trained your model. Training from scratch AND with supervision allows you to sidestep these issues, and even learn more disentangled representations.

Happy to more concretely justify the above. Great observations!


Can you use this to decrease hallucinations?

It is impossible to completely get rid of hallucinations. However, this can tell you exactly why the model hallucinated.

Now this is something very interesting to see, and it might be the answer to the explainability issue with LLMs, which could unlock a lot more use cases that are currently off limits.

We'll see.


Thanks, it is certainly a first step.


