There's also a reward for not overthinking it and letting AI bring the solutions to you. The outcomes are better when it's a question, answer, and execution session.
I admit that I don't know who Tyler Cowen is, but millions (billions?) of people have drunk coffee daily for centuries, and if there were ill effects in the same ballpark as opioids or tobacco, surely we would know by now?
There is even a decent chance that the Industrial Revolution and the phenomenal wealth and progress it's brought was caused by the introduction of coffee to Europe.
A professor of economics has opinions on the health effects of an extremely common substance?
And I have opinions on nuclear energy - but neither of us is worth listening to outside our areas of expertise. Unless you can supply a reason, why would I bother listening to him rather than an actual expert on the subject?
Because some dude with no health or nutrition background said uninformed things, that he isn't qualified to have opinions about, on the internet? Come on, now.
I regularly run the same prompts twice and through different models, particularly when making changes to agent metadata like agent files or skills.
At least weekly I run a set of prompts to compare codex/claude against each other. This is quite easy since the prompt sessions are just text files that get saved.
The problem is doing it enough for statistical significance and judging the output as better or not.
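A minimal sketch of what such a weekly comparison run could look like. The CLI invocations here (`claude -p` and `codex exec`), the `prompts/` directory layout, and the output naming are all assumptions for illustration, not the commenter's actual setup; substitute whatever non-interactive commands your tools provide.

  import subprocess, datetime, pathlib

  PROMPTS = pathlib.Path("prompts")                         # one prompt per .txt file (assumed layout)
  OUT = pathlib.Path("runs") / datetime.date.today().isoformat()

  # Assumed command templates for non-interactive, single-shot runs.
  MODELS = {
      "claude": ["claude", "-p"],
      "codex": ["codex", "exec"],
  }

  OUT.mkdir(parents=True, exist_ok=True)
  for prompt_file in sorted(PROMPTS.glob("*.txt")):
      prompt = prompt_file.read_text()
      for name, cmd in MODELS.items():
          result = subprocess.run(cmd + [prompt], capture_output=True, text=True)
          # Save each session as plain text so later runs can be diffed by hand.
          (OUT / f"{prompt_file.stem}.{name}.txt").write_text(result.stdout)

Saving every session as a dated text file is what makes the later judgment call ("better or not") possible at all, even if it doesn't solve the statistical-significance problem.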
I suspect you may not be writing code regularly...
If I have to ask Claude the same things three times and it keeps saying "You are right, now I've implemented it!" and the code is still missing 1 out of 3 things or worse, then I can definitely say the model has become worse (since this wasn't happening before).
I haven't experienced this with gpt-5.3-codex (xhigh), for example. Opus/Sonnet usually work well when just released, then they degrade quite regularly. I know the prompts are not the same every day, or even across the day, but if the types of problems are always the same (at least in my case) and a model starts doing stupid things, then something is wrong. Everyone I know who uses Claude regularly has the same experience whenever I notice it degrade.
When I use Claude daily (both professionally and personally with a Max subscription), there are things that it does differently between 4.5 and 4.6. It's hard to point to any single conversation, but in aggregate I'm finding that certain tasks don't go as smoothly as they used to. In my view, Opus 4.6 is a lot better at long-standing conversations (which has value), but does worse with critical details within smaller conversations.
A few things I've noticed:
* 4.6 doesn't look at certain files that it used to
* 4.6 tends to jump into writing code before it's fully understood the problem (annoying but promptable)
* 4.6 is less likely to do research, write to artifacts, or make external tool calls unless you specifically ask it to
* 4.6 is much more likely to ask annoying (blocking) questions that it can reasonably figure out on its own
* 4.6 is much more likely to miss a critical detail in a planning document after being explicitly told to plan for that detail
* 4.6 needs to more proactively write its memories to file within a conversation to avoid going off track
* 4.6 is a lot worse about demonstrating critical details. I'm so tired of it explaining something conceptually without thinking through how the details actually get implemented.
Just hit a situation where 4.6 is driving me crazy.
I'm working through a refactor and I explicitly told it to use a block (as in Ruby Blocks) and it completely overlooked that. Totally missed it as something I asked it to do.
Cervical radiculopathy can cause shoulder pain. I have experienced this quite a bit and it's probably also because of my sleeping style. I wouldn't get an MRI unless I was planning to have surgery.
I'll second this. I used opencode + opus 4.6 + ghidra to reverse engineer a seedkey generation algorithm[1] from v850 assembly. I gave it the binary, the known address for the generation function, and a set of known inputs/outputs, and it was able to crack it.
It is really good at highlighting my core flaw: marketing. I can ship stuff great, I feel insanely productive, and then I just hit a wall when it comes to marketing, move on to the next thing, and repeat.
I think this is more aimed at the people who talk to AI like it is a person, or use it to confirm their own biases, which is painfully easy to do, and should be seen as a massive flaw.
For every person who intentionally prompts AI to get unbiased insights and avoid the sycophancy, say by pretending to be someone removed from the issue, who knows how many are unaware that's even a thing to do.
There was no mental health crisis, it was a bank account crisis. As in, "I sold my options on the secondary market, and those numbers on my bank statement are now so large I'm scared to stay at my job!" It was no secret what they were signing up for, so I find it too convenient that Anthropic raises a bunch of money, and suddenly this person has an ethical crisis.