Hacker News | galsapir's comments

One of the most interesting pieces I've read recently. I'm not sure I agree with all the claims (e.g. that without execution the system has no comprehension), but extremely cool.

The part that made me laugh out loud: Dostoyevsky's description of medicine becoming "too specialized" — one doctor for the right nostril and another for the left. That's from a conversation between Ivan and the devil. Written in 1880. The rest of the novel is like that too — the narrator is semi-omniscient but explicitly unreliable and self-conscious about it, the characters' inner lives contradict their stated beliefs, and the psychological model overall is more sophisticated than most of what we use in social science today.

Nice. Especially liked: "our memories, our thoughts, our designs should outlive the software we used to create them"

Weird. My memories and thoughts are not created by software.

This seems interesting. Do you have an example of a use case where it helped? (A red/green pair where, without RUNE, it failed?)


Sure. I ran a direct comparison during development.

Without RUNE: I asked Claude to write a "validate_password" function twice in separate sessions. First time it required 1 special character and returned a bool. Second time it required 2 special characters and returned a tuple with the error message. Same prompt, same model, different behavior. I did the same with "validate_phone". In the first session it accepted dashes in the number, in the second session it rejected them. Completely different parsing logic. Now multiply that by a team where each developer uses a different AI tool.

With RUNE: I wrote specs for the same functions with explicit WHEN/THEN rules, edge cases, and expected error messages. I generated implementations in different sessions and compared them against reference implementations from weeks earlier. Same behavior every time. Not identical code, variable names and style vary, but the same contract. Same inputs produce the same outputs, same errors, same edge case handling.
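RUNE's actual spec syntax isn't shown in the thread, so the WHEN/THEN rules below (in the comment) and the implementation are an illustrative sketch of the kind of contract being pinned down, not RUNE's real format:

```python
# Hypothetical contract for validate_password, in the spirit of the
# WHEN/THEN rules described above (exact RUNE syntax will differ):
#
#   WHEN password has fewer than 8 characters
#   THEN return (False, "password must be at least 8 characters")
#   WHEN password has no special character
#   THEN return (False, "password must contain a special character")
#   WHEN all rules pass
#   THEN return (True, "")

import string

SPECIALS = set(string.punctuation)

def validate_password(password: str) -> tuple[bool, str]:
    """Locked to the spec: same return shape, same error strings,
    same rule order, regardless of which session generated it."""
    if len(password) < 8:
        return (False, "password must be at least 8 characters")
    if not any(ch in SPECIALS for ch in password):
        return (False, "password must contain a special character")
    return (True, "")
```

Variable names and style can still vary between generations; what the spec freezes is the observable behavior.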

What surprised me most was the spec validation step. It caught bugs before any code was written. In one case the BEHAVIOR section defined an error message with quotes around a parameter, but the TESTS section expected it without quotes. That kind of mismatch would normally show up as a flaky test in CI weeks later. The spec caught it at design time, before a single line of code existed.
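The quotes-mismatch class of bug can be caught mechanically by diffing the error strings the two sections mention. A minimal sketch, assuming a section format where expected errors appear as `error: "..."` (the real RUNE validator's parsing is surely different):

```python
import re

def extract_error_messages(section: str) -> set[str]:
    """Pull quoted error strings out of a spec section.
    The 'error: "..."' convention here is hypothetical."""
    return set(re.findall(r'error:\s*"([^"]*)"', section))

def check_spec_consistency(behavior: str, tests: str) -> set[str]:
    """Messages defined in BEHAVIOR but never expected in TESTS,
    or vice versa -- caught at design time, before any code exists."""
    return extract_error_messages(behavior) ^ extract_error_messages(tests)
```

An empty result means the two sections agree; anything else is a mismatch that would otherwise surface as a flaky test weeks later.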

That’s basically the value. Without RUNE the AI makes reasonable but inconsistent choices every time. With RUNE those choices are locked down in a spec, and the AI follows them. The spec becomes the source of truth instead of the model’s temperature.


I've been writing about how I use AI tools as a researcher working in health AI, specifically the tension between leveraging them and staying engaged enough to catch when they're wrong. This post is about a specific version of that problem: the models have gotten good enough that my default is to trust the output, and the threshold for "worth checking" keeps drifting upward.

So I built a simple Claude Code skill that sends high-stakes work to a different model family for a second opinion: one call, not a multi-agent debate.

The honest result: the first real test (reviewing an architecture spec) scored maybe 6/10. It caught one genuine security finding and missed the deeper domain questions entirely. That gap maps onto something I keep running into in evals: tools can check structural form (missing error handling, security anti-patterns) but struggle with essence (does this actually work the way the spec assumes? Are the clinical guardrails robust?).

Still worth it as a lightweight intervention against the drift toward not checking at all. The skill is open source if anyone wants to try or improve it.


haha nice for freelance work!


Opus 4.6 came out yesterday, so I tried it and built two things. I think the model is smoother: it picks up intent faster, asks better questions in interview-style flows, and is more willing to loop for 8+ minutes. The tools: an interview command for Claude Code with depth checkpoints, and a markdown annotator for actually reviewing what comes back instead of staying in the "fix it plz" loop.


There's been a lot of discussion around this lately, especially after the Anthropic study. I don't have answers; this is more an attempt to articulate the problem and some mental frameworks that have been useful. Curious what practices others have found helpful.


We spent a few months building evals for a health agent (and the agent itself!). We tried to apply Anthropic's framework to a real system looking at CGM data + diet. Some of it worked: we got decent at checking form (citations exist, tools were called, numbers trace back). The harder part was essence: is this clinically appropriate? Actually helpful? We didn't really solve that. Curious if others building health/bio agents have found ways around this, or if everyone's just accepting fuzzy metrics for the stuff that matters.
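The "form" half is the part that automates cleanly. A minimal sketch of what those checks might look like; the trace field names (`citations`, `tool_calls`, `answer_numbers`) are assumptions about an agent's trace format, not a real schema:

```python
def check_form(trace: dict) -> list[str]:
    """Mechanical form checks on one agent trace. An empty list means
    the form checks pass -- which says nothing about essence."""
    failures = []
    # Citations exist at all
    if not trace.get("citations"):
        failures.append("no citations")
    # Required tools were actually called (e.g. the CGM data fetch)
    called = {call["tool"] for call in trace.get("tool_calls", [])}
    if "fetch_cgm_data" not in called:
        failures.append("CGM data tool never called")
    # Every number in the answer traces back to some tool result
    tool_numbers = {n for call in trace.get("tool_calls", [])
                    for n in call.get("numbers", [])}
    untraced = [n for n in trace.get("answer_numbers", []) if n not in tool_numbers]
    if untraced:
        failures.append(f"untraced numbers: {untraced}")
    return failures
```

Essence ("is this clinically appropriate?") doesn't reduce to checks like these, which is exactly the gap described above.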


Foundation models in biology still haven't proven they're worth it vs simpler methods (imo). We just published one in Nature, and I feel I spent more time on "how will we know this worked" than on the model itself. The hard part was (mostly) deciding what success even means. Open to thoughts.

