
On the other hand, one techie with a few million dollars...

And you could train something like GPT-3 for less than the cost of a Super Bowl commercial. That would get you a lot of publicity.



Any larger crypto enthusiast with a bunch of 3090s and a solar farm could do it for nearly free (assuming the fixed expenses were already paid for by Eth mining...)


You can do it once, but probably not every day.


Why would you want to retrain it from scratch every day? Stable Diffusion doesn't do that either.


Well maybe not every day, but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development. If you actually want to keep developing the model, you need the funding to be able to train it more than once.


To summarize this discussion, we went from "this might mean we don't need a fleet of $10k+ GPUs to even run a LLM" to "yeah but an individual couldn't train one every day though". These goalposts are breaking the sound barrier.


>but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development

This is not "software development" in general, this is LLM training.

It's not like you're building some regular app, api, or backend.


If you are claiming that training an LLM literally only one time is enough, and that there is no need to train it more than once, you are wrong. The researchers who created OPT didn't go into a basement for 12 months, then come out, train their model once, hit publish, and go out for coffee. That is a fantasy. Likewise, if a CS student wants to dabble in this research, they need the ability to train more than once.

I'm not gonna engage in a rhetorical argument about whether this should be called "software development" or "LLM development" or something else. That's unrelated to the question of how much training is required.


>If you are claiming that training a LLM literally only one time is enough and there is no need to train it more than once, you are wrong.

No, I'm rather claiming that what you claimed is wrong in the context of LLM training: "Well maybe not every day, but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development".

LLM training is not the same as writing a program and "running your code with different variations". For an LLM you don't need to quickly rerun everything with some new corpus - it would be nice, but it's neither a prerequisite nor even crucial for any current use.

Hell, it's not even a "prerequisite" in programming, just good to have. Tons of great programs have been written with very slow build times, without quick edit/compile/build/run cycles.


I wasn't talking about running the same code with a new corpus. For that kind of use case one can simply fine tune the pretrained model. The example I gave was "if a CS student wants to dabble in this research".

You said "LLM training is not the same as writing a program and running your code with different variations". How do you think these LLMs were made, seriously? Do you think Facebook researchers sat down for 12 months and wrote code non-stop without compiling it once, until the program was finished and was used to train the LLM literally only one time?


I would expect them to use small sizes for almost all the testing.


Yes. There _is_ a need to train LLMs more than once, and training is prohibitively expensive, so you need workarounds such as training on a small subset of data, or a smaller version of the model. We're not yet at the point where a CS student on consumer hardware could afford to do this kind of research.
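The "small subset / smaller model" workaround amounts to keeping the same training code but shrinking the data until iteration is cheap. As a purely illustrative sketch (the bigram counter below is a toy stand-in for a real training run, and all names are made up):

```python
# Hypothetical sketch of the "debug at small scale" workflow: iterate on a
# tiny model and a small data subset before paying for a full-size run.
# train_bigram and the sample corpus are illustrative, not a real LLM.
from collections import defaultdict

def train_bigram(corpus: str) -> dict:
    """Count character-bigram frequencies -- a toy stand-in for training."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    # Normalize counts into next-character probabilities.
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

# The short feedback loop: train on a small subset, inspect, adjust, repeat.
small_subset = "the quick brown fox jumps over the lazy dog" * 10
model = train_bigram(small_subset)
print(model["t"]["h"])  # -> 1.0: every "t" in this sample is followed by "h"
```

Only once the pipeline behaves at this scale would you commit the budget to the full corpus and model size.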


> We're not yet at the point where a CS student on consumer hardware could afford to do this kind of research.

Okay. But I was saying someone with millions of dollars to spend could do it. And then another poster was arguing that millions of dollars was not enough to be viable because you need lots of repeated runs.

Nobody was saying a student could train one of these models from scratch. The cool potential is for a student to run one, maybe fine tune it.


Here is the upthread comment I was responding to:

> Why would you want to retrain it from scratch every day?

I was explaining why someone might want to retrain it more than once (although not literally every day).


Because things happen every day. If ChatGPT wants to compete with Google, staying up to date with recent events is the minimum bar.


You wouldn't need to re-train from scratch for that, just fine-tune on the new data sources. I don't think constant re-training is the optimal strategy for that use-case anyway. Bing does it by letting the LLM search a more traditional web index to find the information it needs.


Okay, but someone has to do the fine tuning. The code has to be updated. Parts of the training have to be redone. All of this has costs. It isn't the "do it once and forget about it" task it's being touted as in this thread.


>The code has to be updated

I'm pretty sure this is not how an LLM works.

>It isn't a "do it once and forget about it" task that it is being touted as in this thread.

That's neither here nor there. Training the LLM itself is not a "do it multiple times per day if you want to compete with Google" thing, as has been stated in this subthread.


> > The code has to be updated

> I'm pretty sure this is not how an LLM works.

You can say that about any software. "You can use this software perfectly well without ever updating it." Sure, you can do that, but typically people have lots of reasons to update software. An LLM isn't magic in this sense. An LLM does not mysteriously update its own code if you just wish hard enough. If you want to continue the development of the LLM then you need to make changes to the code, just like with any other software.


That's not what the training is about.

Things happen every day, but languages and words and their associations don't change in any measurable way every day...

This is not like web crawling...


That's not necessary. Look at how Bing works: it's an LLM which can trigger searches, and then gets the search results fed back to it as part of the prompt.

I wrote about one way to implement that pattern here: https://simonwillison.net/2023/Jan/13/semantic-search-answer...
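That pattern boils down to: run a search first, then paste the top results into the prompt. A minimal sketch, where the keyword-overlap scorer is a toy stand-in for a real search index and `call_llm` is a hypothetical stub for whatever model API you use:

```python
# Sketch of the retrieval-augmented pattern described above: search first,
# then feed the results to the LLM as part of the prompt. The keyword-overlap
# scoring is a toy substitute for a real index; call_llm is hypothetical.

def search(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by how many query words they contain."""
    words = set(query.lower().split())
    return sorted(documents,
                  key=lambda d: len(words & set(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    context = "\n".join(search(query, documents))
    return (f"Context:\n{context}\n\n"
            f"Answer using only the context above.\nQuestion: {query}")

docs = [
    "LLaMA weights were released to researchers in 2023.",
    "Bing augments its LLM with live web search results.",
    "Stable Diffusion generates images from text prompts.",
]
prompt = build_prompt("How does Bing keep its LLM up to date?", docs)
# call_llm(prompt)  # hypothetical: send the assembled prompt to the model
```

The point is that freshness comes from the index, which is cheap to update, rather than from the model weights, which are not.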


Is there information out there about how much it cost (in time or human-hours) to do the additional training necessary to make ChatGPT? I am genuinely curious what the scale of the effort was.


VCs are funding $100,000,000 AI compute efforts now, so it might be something like that.


I would hope publicity isn’t the motivation for doing it though.


What motivation would be sufficiently noble?


Probably one where there isn't an intrinsic conflict of interest with AI risk. Or from a more traditional angle, one where the author's vanity isn't required to be appeased in order for users/customers to be happy. I'm of the opinion that you should do something with game-changing technology because the world needs it, not because you need an ego boost. All technology brings side effects, and there is no greater example of that than "democratized" AI...


People often (usually) do objectively useful things because it's in their selfish interests to do so, ego or otherwise. The surest road to failure is expecting people to act virtuously. Generally systems that assume virtue fail, and systems that assume selfish action and steer that selfish action towards the greater good succeed.

In other words, I don't care why people do things, only that they do.


That’s fine, as long as publicity isn’t the motivation. It’s safe to assume that isn’t optimal for a project’s success (Satoshi understood this). Not sure where you got the idea that the inverse of that was beneficial to such a project. I’ve seen first hand where it becomes a problem.

I’m not aware of many examples of starry-eyed divas achieving great results. Usually you hear about them but only because they are exceptional cases, not the norm. It’s a matter of practicality and not virtue (to say otherwise is purely a straw man argument).


> to say otherwise is purely a straw man argument

This is really overconfident.

The claim that publicity isn't causally connected to success is belied by the existence of the advertising industry. While that refutation holds across industries generally, it is worth noting that the most dominant AI company - Google - happens to be in this industry. They are explicitly known for - have publicity for - their generous compensation packages. This is because of a causal model of talent attraction.

Success is obviously causally connected to publicity, and the idea that it isn't is not well supported by the evidence. Contrary to your assertion, it was not a safe assumption. Your appeal to Satoshi is an appeal to authority, not a causal model of how he shielded the project from publicity's impacts.


> That publicity isn't causally connected to success is belied by the existence of the advertising industry.

The argument was about publicity as a reward motivator, not publicity itself, as a causal relation to success.

To phrase it plainly: which first-time founders do you think Paul Graham or Keith Rabois would be more likely to fund: those who aspire to solve a problem in the world that they care passionately about, or those looking for money or fame? Last time I checked, the latter case would be seen as a strong negative. The appeal-to-authority argument doesn’t apply in this situation, because a VC's portfolio performance is causally related to how accurately they predict the future success of a company.

On the scale of a smaller project like this, a common failure mode is for a maintainer to stop caring about the project and go to the next thing that motivates them. Someone else may attempt to use the code or project without understanding the theory behind it. And even worse: every time this happens is a signal that this is acceptable.

AI is a different beast. Software bugs with big AI systems will become more costly, and eventually deadly. Unfortunately I’m not sure what can be done about it without a global totalitarian regime to ban its use entirely (which is not an idea I support anyway). Eventually the broken clock will be right and some profit-driven AI project will succeed in making the world a not better place, if we are even around to notice :).

I would advise deeper thought into these topics when convenient. Read Nick Bostrom’s Superintelligence book or watch his talks, at least one of which was at Google HQ.

I think someone should train ChatGPT or similar to argue or teach traditional AGI Philosophy/Ethics and hopefully that will move the needle somewhat more than the OpenAI nannyism we have now.


> The argument was about publicity as a reward motivator, not publicity itself, as a causal relation to success.

That the causal model supports publicity seeking leads us to ensemble models. When models are good for different reasons, the ensemble of the models ends up better than any individual model. Reinforcement learning research has shown you can successfully train an agent from decomposed reward signals by building an ensemble model atop them.

The fact that the causality says publicity matters means that agents who recognize the importance of the publicity they contribute to the solution really can expect to be part of the solution.

It is very common to see this talked about in terms of diversity improving solution quality when talking about it in the context of companies and it is generally considered a good idea to have a diverse team as a consequence.

Anyway, I'm mostly responding because I disagree with the a priori declaration that all who disagree are attacking a straw man.

I think that was overconfident, because the causal structure of publicity and its relation to outcomes disagrees with that.


> Which first-time founders do you think Paul Graham or Keith Rabois would more likely fund: Those who aspire to solve a problem with the world that they care passionately about?

It is worth reflecting on the fact that the founder of OpenAI has had the strongest possible endorsement from Paul Graham. He was claimed to be among the greats before his successes: Paul Graham put him among Steve Jobs and Elon Musk. When Paul Graham stepped down from Y Combinator, he was so convinced of Sam's skills that he put Sam in his place. Later Sam started OpenAI.

> I would advise deeper thought into these topics when convenient. Read Nick Bostrom’s Superintelligence book or watch his talks, at least one of which was at Google HQ.

I've read Superintelligence, the Sequences, PAIP, AIMA, Deep Learning, Reinforcement Learning, and Theory of Games and Economic Behavior, taken a course in control theory, and read a book about evolutionary algorithms. I've also built systems after having understood the techniques, for literally each of these things I've mentioned, with the exception of all of Superintelligence and much of the Sequences. The parts of the Sequences which dealt with Bayesian reasoning I did implement and like, though I disagree with that community about its optimality, because the conditions of ledger arguments aren't true in the real world. In practice, Bayesian approaches are like trying to build a sports car for a race - you get beaten even though you are doing the fastest thing, because the fastest thing isn't as fast as the slower methods.

Anyway, the combinatorics of multi-step multi-agent decision problems imply a lot of problems for Bostrom's and Yudkowsky's positions on the limits of what intelligence can hope to achieve. I don't find them to be the most formidable thinkers on this subject. In the case of Yudkowsky, he admits this, saying that he finds Norvig to be more formidable than he is. And Norvig disagreed with him on AI risk in exactly the context where I also disagree, and for the same reason I disagree. To ensure you get the point, I'll speak in terms of Bostrom's analogies: notice that there is, in fact, a speed limit - the speed of light. Well, what Norvig notices, and what I also notice, and what Bellman noticed when he coined the term combinatorial explosion, is that intractability is an actual issue that you need to confront. It isn't something you can hand-wave away with analogy. We don't have enough atoms in our universe.

This is why we get dual-mode systems, by the way. Not just in humans: notice, it happens in chess engines too. The general solver provides the heuristic, which must have error; then the specific solver uses the heuristic to improve, because it is in a more specific situation. Most of the people in the AI risk camp are pretty Yudkowskian. They dwell for long periods of time on overcoming the biased heuristic. For sure, this makes them more intelligent, but it misinforms them when they try to make inferences about general intelligence based on the tractability of specific situations. It is because of, not despite, the intractability that they find such evidence of tractability.


I'll have to check out Bellman's work, thanks!


BTW, Bellman actually coined the term curse of dimensionality [1]; I got that confused with combinatorial explosion since they are synonyms in the contexts where I typically encounter them [2].

[1]: https://en.wikipedia.org/wiki/Curse_of_dimensionality

[2]: https://en.wikipedia.org/wiki/Combinatorial_explosion

OpenAI has a pretty good introduction to the Bellman equations in their Spinning Up in RL lessons [3]. Sutton's work in Reinforcement Learning also talks about Bellman's work quite a bit. Though Bellman was actually studying what he called dynamic programming problems, his work is now considered foundational in reinforcement learning.

[3]: https://spinningup.openai.com/en/latest/
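To give a flavor of what those lessons cover: the Bellman optimality equation can be solved by plain value iteration. Here is a minimal sketch on a two-state MDP whose states, rewards, and discount factor are entirely made up for illustration:

```python
# Value-iteration sketch of the Bellman optimality equation:
#   V(s) = max_a [ sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s')) ]
# The two-state MDP below is made up purely for illustration.

# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.9, "s1", 1.0), (0.1, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)],
           "go":   [(1.0, "s0", 0.0)]},
}
gamma = 0.9  # discount factor

V = {s: 0.0 for s in transitions}
for _ in range(1000):  # iterate the Bellman backup until converged
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values())
         for s, actions in transitions.items()}

# The "stay" loop at s1 earns 2.0 per step, so V[s1] = 2.0 / (1 - 0.9) = 20.
print(V)
```

Each sweep applies the max-over-actions backup to every state simultaneously; the gamma-contraction guarantees convergence to the unique fixed point.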

Uh, and for the dual-mode observations, the person who brought that to my attention was Noam Brown, not Bellman or Norvig. If you haven't already checked out his work, I recommend it above both Norvig and Bellman. He has some great talks on YouTube, and I consider it a shame they aren't more widely viewed [4].

[4]: https://www.youtube.com/watch?v=cn8Sld4xQjg


The summary includes a dangerous thought. For example: why North Korea develops a nuclear bomb is not important, just that they do. But only the why makes it problematic.


I'm not sure who was advocating for North Korea making nuclear weapons in this exchange.


Noble? You're anthropomorphising machine learning. One possible motivation would be to train a model, instead of training a model in order to create publicity around a model being trained.


I think you're misreading, nobody is anthropomorphizing anything other than the very 'anthro' component of the system we're talking about - the people distributing the funding.


I may have misread your comment, then. Either way, thank you for the explanation!



