Weird, I've had the opposite experience: Codex is good at doing precisely what I tell it to do, while Opus suggests well-thought-out plans, even when that means pushing back.

This is just the stochastic nature of LLMs at play. I think all of the SOTA models are roughly equivalent, but without enough samples people end up reading too much into it.

There's a certain amount of variance in how people use these agents. Put five people in a room, ask them to compose the same prompt, and you'll get five distinct prompts. Couple this with the fact that models respond better or worse depending on the stylistic composition of the prompt itself. And since each person tends to write in a consistent style, some people will have more luck with one model over another: whichever model happens to align more readily with their prompt style.

Case in point: I've noticed that I tend to prefer Codex's output for planning and review, but Opus for implementation; this is inverted from others at work.


> Couple this with the fact that models respond better/worse to certain prompts depending on the stylistic composition of the prompt itself.

Do we really know this, or is it just gut feeling? Has anyone actually demonstrated it statistically, with any real certainty?
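
A rough sketch of how one could actually test it: give two models (or one model under two prompt styles) the same task set, have blind raters pick the better output per task, and run a paired sign test on the wins. Everything below is hypothetical and illustrative; it assumes you've already collected the ratings.

    # Minimal sketch: paired sign test over blind A/B ratings.
    # wins = tasks where style/model A was rated better; ties dropped.
    from scipy.stats import binomtest

    wins, n = 37, 60  # illustrative numbers
    result = binomtest(wins, n, p=0.5, alternative="two-sided")
    print(f"win rate {wins / n:.2f}, p-value {result.pvalue:.3f}")
    # A small p-value would suggest a real style/model effect;
    # most anecdotes rest on far fewer than 60 rated samples.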


I used to feel the way you do, but I don't agree anymore. I'd just say it's not consistent. For a given codebase and a given goal, sometimes Claude will be the more sensible, creative, thoughtful planner and sometimes Codex will be; sometimes Claude will make a serious oversight that Codex catches, and sometimes the opposite. But the trend, for me and seemingly a lot of people, is that Claude is a more "human-like/human-smart" planner than Codex (in a positive way) but is more likely to make mistakes or forget details when implementing major codebase changes.

This is a total non-issue if they hold their red lines, which they clearly intend to do if you've read the memo.

Also, if you've ever actually chatted with anyone at the company, you'd know that they are not all the same and that Anthropic genuinely does stand apart here.


"which they clearly intend to if you've read the memo"

Do memos have special magic properties or something? What am I missing here?


Non-issue if they keep their red lines.

First, I have to say I loved your thoughtful & detailed comment. You have clearly considered this from the financial side; let me add some color from the perspective of someone working with frontier researchers.

> As the "alignment" folks on the AI industry are likely to learn

I will push back here. Dario & co. are not the starry-eyed naive idealists implied here. This is a calculated decision to maximize their goal (safe AGI/ASI).

You have the right philosophy on the balance sheet side of things, but what you're missing is that researchers are more valuable than any military spend or any datacenter.

It does not matter how many hundreds of billions you have - if the 500-1000 top researchers don't want to work for you, you're fucked; and if they do, you will win because these are the people that come up with the step-change improvements in capability.

There is no substitute for sheer IQ:

- You can't buy it (God knows Zuck has tried, and failed, to earn their respect).

- You can't build it (yet).

- And collaboration amongst less intelligent people does not reliably achieve the requisite "Eureka" realizations.

Had Anthropic gone ahead with the DoD contract, they would have lost this top crowd, crippling the firm. On the other hand, by rejecting the contract, Anthropic's recruiting just got much easier (and OAI's much harder).

Generally, the defense crowd has a somewhat inflated sense of self-worth. Yes, there's a lot of money, but very few highly intelligent people want to work for them. (Almost no top talent wants to work for Palantir, despite the pay.) So, naturally:

- If OpenAI becomes a glorified military contractor, they will bleed talent.

- Top talent's low trust in the government means Manhattan Project-style collaborations are dead in the water.

As such, AGI will likely emerge from a private enterprise effort that is not heavily militarized.

Finally, the Anthropic restrictions will last, what, 2.5 more years? They are being locked out of a narrow subset of use cases (DoD contract work only; vendors can still use it for all other work; Hegseth's reading of SCR is incorrect) and have farmed massive reputation gains with both top talent and the next administration.


This is an interesting perspective. What happens if there is a large global war? Do researchers who were previously against working with the DoD end up flipping out of duty? Does the war budget go up? Does the DoD decide to lift any ban on Anthropic for the sake of getting the best model and does Anthropic warm its stance on not working with autonomous weapons systems?

I don’t know the answers to these questions, but if the answer is “yes” to at least 1 or 2, then I think the equation flips quite a bit. This is what I’m seeing in the world right now, and it’s disconcerting:

1. Ukraine and Russia have been in a conflict that has dragged on much longer than most people would have guessed. This has created a divide in political allegiance within the United States and Europe.

2. We captured the leader of Venezuela. Cuba is now scared they are next.

3. We just bombed Iran and killed their supreme leader.

4. China and the US are, of course, in a massive economic race for world power supremacy. The tensions have been steadily rising, and they are now feeling the pressure of oil exports from Iran grinding to a halt.

5. The past couple days Macron has been trying to quell tension between Israel and Lebanon.

I really hope we are not headed into war. I hope the fact that we all have nukes and rely on each other's supply chains deters one. But man does it feel like the odds are increasing in favor of one, and man does that seem to throw a wrench in this whole thing with Anthropic vs. OpenAI.


> 3. We just bombed Iran and killed their supreme leader.

To be accurate, by all reporting it was Israel that killed Iran's leadership.

Yes, likely enabled by US intelligence, but the one who pulls the trigger does matter.


"We" here clearly means USA+israel. There isn't a distinction between the two when they're working towards the same goals, bombing everything in sight, together.

The one who pulled the trigger is irrelevant here, because both have pulled the trigger hundreds or thousands of times in the past few days, dividing up targets between them for the joint operation.


Given that direct assassination is still prohibited by EO 11905 / 12036 / 12333, it matters a great deal whether or not the US president ordered the strike.

I'm aware that internet forums like to play fast and loose with insinuations, but facts are facts.


> Given that direct assassination is still prohibited by EO 11905 / 12036 / 12333

It sounds like you think this means something?

Obviously it doesn't when we're talking about an administration that openly breaks laws, never mind EOs, and issues whatever EOs it wants saying whatever it wants, even in violation of previous EOs. There aren't even any repercussions for the president "violating an EO".

So, the pedantry here is irrelevant. The two parties are on the same team, working towards the same goal, doing the same things, divvying up the list of targets to strike.


> It sounds like you think this means something?

If you'd rather talk with yourself, I'll see myself out of this convo. No time for folks who would rather indulge in hyperbole than messy reality.


Given that you totally ignored the substance of my post, and instead focused on attacking me personally, it does seem like you're not interested in a discussion, and not a good fit for the HN culture and guidelines. So yeah, maybe you are right and it would be better if you left.

But! That's not who you always have to be! I'm confident you can coherently articulate your point without resorting to that. Feel free to come back if you're willing to share why you feel the president not complying with a presidential executive order is significant here, rather than insignificant.

Anyway, happy Friday!


That's assuming there will be elections, which many people don't believe will happen.

Reminder that Trump has been flirting with simply staying in power (the 2028 hats, talk of a third term) and is responsible for attempting a coup the last time he lost.

Personally, I think there's a possibility he'll just declare martial law and stay in power at the end of his term.


> researchers are more valuable than any military spend or any datacenter. It does not matter how many hundreds of billions you have - if the 500-1000 top researchers don't want to work for you, you're fucked; and if they do, you will win because these are the people that come up with the step-change improvements in capability.

This is massive cope, imo. The reason the AI industry is so incestuous is simply that there are only a handful of frontier labs with the compute/capital to run large training clusters.

Most of the improvements we've seen in the past 3 years are due to significantly better hardware and software: boring, straightforward engineering work, not brilliant model-architecture improvements. We are running transformers from 2017. The brilliant researchers at the frontier labs have not produced a successor architecture in nearly a decade of trying. That's not what winning on research looks like.

Have there been some step-change improvements? Sure. But by far the biggest improvement can be attributed to training bigger models on more badass hardware, and to hardware availability that lets you serve them cheaply. To act like the DoD isn't going to be able to stand up PyTorch or vLLM and get a decent result is hilarious: the reason you use Slurm, MPI, and OpenSHMEM is that national labs and the DoD were using them first. NCCL is just GPU-accelerated, scope-reduced MPI. NVSHMEM is just GPU-accelerated, scope-reduced OpenSHMEM.
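
To make the kinship concrete, here's a sketch of the same sum all-reduce in both worlds: mpi4py on CPU buffers, torch.distributed's NCCL backend on GPU tensors. The two halves would live in separate scripts launched with mpirun and torchrun respectively; launch and rendezvous details are elided.

    # MPI flavor (mpi4py): the classic HPC collective, CPU buffers.
    from mpi4py import MPI
    import numpy as np
    comm = MPI.COMM_WORLD
    send = np.ones(4, dtype=np.float32)
    recv = np.empty_like(send)
    comm.Allreduce(send, recv, op=MPI.SUM)

    # NCCL flavor (torch.distributed): same semantics, GPU buffers.
    import torch
    import torch.distributed as dist
    dist.init_process_group(backend="nccl")  # env:// rendezvous assumed
    rank = dist.get_rank()
    t = torch.ones(4, device=f"cuda:{rank % torch.cuda.device_count()}")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # in-place all-reduce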

If anything, DoD doesn’t have the inference throughput requirements that the unicorns have and might just be able to immediately outperform them by training a massive dense model without optimizing for time to first token or throughput. They don’t have to worry about if the $/1M tokens makes it economically feasible to serve, which is a primary consideration of the unicorns today when they’re choosing their parameter counts. They can just rate limit the endpoint and share it, with a 2 hour queue time.

The government invented HPC, it’s their world and you’re just playing in it.

> Generally, the defense crowd have a somewhat inflated sense of self worth.

/eyeroll but nobody can do what you do!


Sure, the architecture is from 2017. But the gap between GPT-1 and today's frontier models is not simply "more FLOPs", nor as simple as "standing up PyTorch and vLLM": there are thousands of undocumented decisions about data, alignment, reward modeling, training stability, and inference-time strategies, plus a lot of tribal knowledge held by a small group of people who overwhelmingly do not want to work on weapons systems.

The dense-model argument is self-defeating long term. Sparsity (MoE, etc.) lets you build a smarter model at the same compute budget, so going dense because you can afford to waste FLOPs is how you fall behind, because you never come up with the step-function improvements needed.
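
Back-of-envelope version of that tradeoff (illustrative numbers only, ignoring attention/shared layers and routing overhead):

    # Rough arithmetic: dense vs. MoE at the same per-token compute.
    dense_params = 70e9                 # every parameter active per token
    flops_per_token = 2 * dense_params  # ~2 FLOPs per active param (forward)

    experts, top_k, per_expert = 16, 2, 35e9
    moe_total = experts * per_expert    # 560B parameters of capacity
    moe_active = top_k * per_expert     # 70B active per token
    assert 2 * moe_active == flops_per_token
    # 8x the capacity at identical per-token FLOPs: that's the lever
    # a "just go dense" strategy permanently forgoes.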

Sure, the DoD invented HPC, but it also invented the internet, and then the private sector made it actually useful.


I have noticed this too: despite the close benchmark results, Claude just works better. It knows when to push back; it has a certain "agency"... there is something there that I don't see with Gemini or OpenAI's best paid models.

They engage with Palantir for non-domestic purposes.

"Non-domestic purposes" specifically includes wiretapping US citizens and residents, and has for at least 25 years:

https://en.wikipedia.org/wiki/NSA_warrantless_surveillance_(...

I suspect the 2007 in the title refers to the fact that bills were passed to ban this stuff in 2007, which is when the PRISM program (also illegal domestic surveillance) got started.

(The title makes it sound like warrantless surveillance lasted from 2001-2007, but I think it means the article only covers that date range.)


Could you please elaborate on why the pedestal is "undeserved" when they are willing to stick up for their principles at the expense of being designated an SCR?

Could you point me to one other $300B+ company that would be willing to do this?


https://time.com/7380854/exclusive-anthropic-drops-flagship-...

https://news.ycombinator.com/item?id=47145963

Just trying to make sure folks aren't getting ahead of themselves without having put some of their own thought into it.

If you want to put them on a pedestal for reasons that make sense to you, all good.

If others are encouraged to form their own opinions by taking some pause for thought, then all the better.

If Anthropic still ends up on the pedestal, it must be for the right reasons, as opposed to "just because they're not the currently discussed villain".


What a cute statement, given that they orchestrated this with a $25M donation to Trump and started negotiations well before all this blew up: https://garymarcus.substack.com/p/the-whole-thing-was-scam


I guarantee you that most OAI employees are well into the multiple millions at this point.

There is no "financial safety net" they need to care about. That is just an excuse.


Haven’t most been hired in the last year?


It's not sycophantic and has a much better "voice."


Not that I'm a friend of OpenAI, but ChatGPT has relatively fine-grained "personalization" options, and with the "efficient" tone it was never sycophantic for me. Rather the opposite: sometimes it seemed slightly indignant when I criticized it.


It definitely is sycophantic, but it uses a LOT fewer emoji, lists, and header-paragraph-header structures.

