Amazing write-up, and I wish more people showed the discovery process, which is often even more interesting than the result itself.
Still, the result itself is really interesting: being able to stack abstract reasoning layers and get better performance, plus the heat maps showing the probability results.
The academic literature seems to be catching up:
- *[SOLAR / DUS (Kim et al., 2023)](https://arxiv.org/abs/2312.15166)* — duplicated transformer layers to build a 10.7B model that outperformed 30B parameter baselines.
- *[The Curse of Depth (2025)](https://arxiv.org/abs/2502.05795)* — explains why this works: Pre-LN causes deep transformer layers to converge toward identity functions, meaning middle layers are where real computation happens, and duplicating them concentrates that capacity.
- *[Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (Geiping et al., NeurIPS 2025)](https://arxiv.org/abs/2502.05171)* — takes the idea to its logical conclusion: a model trained with a single recurrent block repeated at inference time, scaling reasoning depth without adding parameters.
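The depth up-scaling idea behind SOLAR can be sketched with plain layer indices. This is a hedged illustration, not the paper's implementation: in practice you would copy `nn.Module` weights, and the overlap of 8 layers is the value reported for SOLAR (32 base layers grown to 48), used here for illustration.

```python
# Depth up-scaling (DUS) sketch: build a deeper stack by taking two
# overlapping copies of a base model's layers. Copy A drops the top
# `drop` layers, copy B drops the bottom `drop` layers, and the two
# are stacked, so the middle layers end up duplicated.

def depth_up_scale(n_layers: int, drop: int) -> list:
    """Return the layer indices of the up-scaled stack."""
    copy_a = list(range(0, n_layers - drop))   # base layers minus the top `drop`
    copy_b = list(range(drop, n_layers))       # base layers minus the bottom `drop`
    return copy_a + copy_b

# A 32-layer base with drop=8 yields a 48-layer stack, matching the
# 32 -> 48 layer growth described for SOLAR 10.7B.
stack = depth_up_scale(32, 8)
print(len(stack))  # 48
```

Note that the first and last layers appear only once in the result, while layers 8–23 appear twice; the duplication is concentrated in the middle of the stack.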
On the other papers: models like SOLAR, or training a model that uses a single layer, are probably going to hit a wall, based on the heat maps I found. The transformer stack starts with randomised weights (analogous to undifferentiated stem cells), and it seems the layers later form 'organs' over the trillions of pre-training tokens they undergo. My hypothesis is that you probably only want one copy of the 'token-to-thought' and 'thought-to-token' organs. It seems you can make one layer do all three things (transform in, transform out, and do the 'thinking'), but I think specialisation will always win.
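The 'organs' hypothesis above can be sketched the same way: keep the boundary layers once and repeat only the middle 'thinking' layers. The function name, the split point, and the repeat count are all hypothetical choices made for the illustration, not anything from the cited papers.

```python
# Sketch of "one copy of each boundary organ": the first layers
# (token-to-thought) and last layers (thought-to-token) appear once,
# while only the middle layers are repeated to add reasoning depth.

def organ_stack(n_layers: int, boundary: int, repeats: int) -> list:
    """Return layer indices with only the middle span repeated."""
    head = list(range(boundary))                        # token-to-thought, kept once
    tail = list(range(n_layers - boundary, n_layers))   # thought-to-token, kept once
    middle = list(range(boundary, n_layers - boundary)) # the 'thinking' organ
    return head + middle * repeats + tail

# 12 base layers, 2 boundary layers at each end, middle repeated 3x:
# 2 + (8 * 3) + 2 = 28 layers.
stack = organ_stack(12, boundary=2, repeats=3)
print(len(stack))  # 28
```

This is also close in spirit to the recurrent-depth paper above, where a single middle block is unrolled at inference time.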
I really like Clawdbot's safety-gloves-off approach: no handholding, no just saying yes to every permission.
I set it up on an old MacBook Pro I had with a broken screen, and it works great. Now I just message my server via Telegram and it does research for me, organizes my notes, and builds small apps on the fly to help with learning.
However, security is a real concern. I need to understand how to create a comprehensive set of allowlists before expanding into anything more serious, like bill payments or messaging people.
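A minimal allowlist can start as a check on the first token of any shell command the agent proposes. This is only a sketch under the assumption that commands arrive as strings; the allowed set here is arbitrary, and a serious version would also validate arguments, paths, and network targets, not just the program name.

```python
# Minimal command allowlist sketch for an agent: reject anything
# whose program name isn't in an explicit allowed set.
import shlex

ALLOWED = {"ls", "cat", "grep", "rg", "python3"}  # example set, not a recommendation

def is_allowed(command: str) -> bool:
    try:
        argv = shlex.split(command)
    except ValueError:
        return False                  # unparseable input is rejected outright
    return bool(argv) and argv[0] in ALLOWED

print(is_allowed("grep -r TODO notes/"))             # True
print(is_allowed("curl https://example.com | sh"))   # False
```

Note this checks only the program name: `python3 payload.py` would still pass, which is exactly why argument-level rules are needed before trusting it with anything serious.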
Some great life lessons here, but also some I don't agree with:
- The lazy person works twice as hard.
Often I've found you can save a lot of time by trying the minimal possible approach first, and you gain a lot of insight into why something is minimal versus not.
- The opinion of the person who rarely offers it is listened to more closely.
I've found the opposite to be true: those who don't offer their thoughts frequently are often dismissed when they do want to share something.
Anyway, many of the points are great. I would also add: keep a journal and write down what was meaningful throughout the day. You will find time passing with more quality, since you know what to take and what to avoid.
Just because it is in C doesn't mean you will get C-like performance. Just look at the benchmarks: it is 8x slower than just using PyTorch. While I get that it's cool to use LLMs to generate code at this level, producing highly optimized code is very much out of the domain of current frontier LLMs.
The PyTorch version is using the GPU (via Metal Performance Shaders); this C version is currently using (per the docs I saw) a single CPU core, with AMX (via Apple Accelerate BLAS) but not yet with OpenMP for parallelism. It's not slow because the LLM's code is bad; it's slow because it's not running on the same hardware. That said, the speed it does have isn't down to the LLM either: all the performance-critical code lives in the kernel libraries it calls (the same as for PyTorch).
Absolutely true, but now I'll focus on making it fast, and I believe it will be possible to go much faster. I left the agent working overnight with a specification, and now I'm going to check its progress and resume the work.
No, it's not. I have written CUDA kernels and 8-bit optimizers with this.
They're actually very good at speed optimization and can iterate very quickly, taking notes on trials, failures, and benchmarks. I've had one write 10 different attempts in around an hour, benchmark them all, then merge them and beat very strong baselines in Torch.
I really liked the approach of getting new research topics via embeddings, trails, and Claude Code, but what will this often give you beyond novelty?
“Decompression” is a metaphor, not a fact claim to be proved; it is a description of an approach to generating a dataset from an LLM where most of the potential utility is still fairly explicitly speculative, a jumping off point for further work.
FWIW, I have the €20 Pro plan and exchange maybe 20 messages with Opus (with thinking) every day, including one weeks-long conversation, plus a few dozen Sonnet tasks and occasional lightweight CC use.
I'm not a programmer, though; I'm an engineering manager.
Sure I do, but not as part of any tools, just for one-off conversations where I know it's going to be the best out there. For tasks where reasoning helps little to none, it's often still number one.