Meta has released the model weights for OPT-175B, the model used in the paper. There are also a number of fully released LLMs from other labs on the way.
While OPT-175B is great to have publicly available, it needs a lot more training to achieve good results. Meta trained OPT on 180B tokens, compared to the 300B that GPT-3 saw. And the Chinchilla scaling laws suggest that almost 4T tokens would be required to get the most bang for the compute buck.
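As a rough sketch of where that ~4T figure comes from: the popular ~20-tokens-per-parameter rule of thumb (an approximation of the Chinchilla fit, not the paper's exact formula) gives:

```python
# Back-of-the-envelope Chinchilla estimate. The 20-tokens-per-parameter
# ratio is a common approximation of the paper's compute-optimal fit,
# not its exact scaling-law formula.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20) -> float:
    return n_params * tokens_per_param

opt_params = 175e9                                # OPT-175B parameter count
optimal = chinchilla_optimal_tokens(opt_params)   # 3.5e12 tokens, i.e. ~3.5T
seen = 180e9                                      # tokens OPT-175B actually saw
print(optimal / seen)                             # roughly 19x short of compute-optimal
```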
And on top of that, there are questions about the quality of open-source data (The Pile) versus OpenAI's proprietary dataset, which they seem to have spent a lot of effort cleaning. So: open-source models are probably data-constrained, in both quantity and quality.
OPT-175B isn't publicly available, sadly. It's available to research institutions, which is much better than "Open"AI, but it doesn't help us hobbyists/indie researchers much.
I wonder when we'll start putting these models on The Pirate Bay or similar. Seems like an excellent use for the tech. Has no one tried to upload OPT-175B anywhere like that yet?
No, Stable Diffusion isn't the only one to release weights. OpenAI hasn't been releasing weights for ChatGPT, but plenty of labs besides Stability AI are releasing theirs [1].
Well maybe not every day, but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development. If you actually want to keep developing the model, you need the funding to be able to train it more than once.
To summarize this discussion, we went from "this might mean we don't need a fleet of $10k+ GPUs to even run a LLM" to "yeah but an individual couldn't train one every day though". These goalposts are breaking the sound barrier.
>but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development
This is not "software development" in general, this is LLM training.
It's not like you're building some regular app, api, or backend.
If you are claiming that training a LLM literally only one time is enough and there is no need to train it more than once, you are wrong. The researchers who created OPT didn't go into a basement for 12 months, then come out, train their model once, hit publish, and go to coffee. That is a fantasy. Likewise, if a CS student wants to dabble in this research, they need the ability to train more than once.
I'm not gonna engage in a rhetorical argument about whether this should be called "software development" or "LLM development" or something else. That's unrelated to the question of how much training is required.
>If you are claiming that training a LLM literally only one time is enough and there is no need to train it more than once, you are wrong.
No, I'm rather claiming that what you claimed is wrong in the context of LLM training: "Well maybe not every day, but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development".
LLM training is not the same as writing a program and "running your code with different variations". For an LLM you don't need to quickly rerun everything with some new corpus - it would be nice, but it's neither a prerequisite nor even crucial for any current use.
Hell, it's not even a "prerequisite" in programming, just good to have. Tons of great programs have been written with very slow build times, without quick edit/compile/build/run cycles.
I wasn't talking about running the same code with a new corpus. For that kind of use case one can simply fine tune the pretrained model. The example I gave was "if a CS student wants to dabble in this research".
You said "LLM training is not the same as writing a program and running your code with different variations". How do you think these LLMs were made, seriously? Do you think Facebook researchers sat down for 12 months and wrote code non stop without compiling it once, until the program was finished and was used to train the LLM literally only one time?
Yes. There _is_ a need to train LLMs more than once, and training is prohibitively expensive, so you need workarounds such as training on a small subset of data, or a smaller version of the model. We're not yet at the point where a CS student on consumer hardware could afford to do this kind of research.
> We're not yet at the point where a CS student on consumer hardware could afford to do this kind of research.
Okay. But I was saying someone with millions of dollars to spend could do it. And then another poster was arguing that millions of dollars was not enough to be viable because you need lots of repeated runs.
Nobody was saying a student could train one of these models from scratch. The cool potential is for a student to run one, maybe fine tune it.
You wouldn't need to re-train from scratch for that, just fine-tune on the new data sources. I don't think constant re-training is the optimal strategy for that use-case anyway. Bing does it by letting the LLM search a more traditional web index to find the information it needs.
Okay but someone has to do the fine tuning. The code has to be updated. Parts of the training have to be redone. All of this has costs. It isn't a "do it once and forget about it" task that it is being touted as in this thread.
>It isn't a "do it once and forget about it" task that it is being touted as in this thread.
That's neither here nor there. Training the LLM itself is not a "do it multiple times per day if you want to compete with Google" thing, as has been stated in this subthread.
You can say that about any software. "You can use this software perfectly well without ever updating it." Sure, you can do that, but typically people have lots of reasons to update software. LLM isn't magic in this sense. An LLM does not mysteriously update its own code if you just wish hard enough. If you want to continue the development of the LLM then you need to make changes to the code, just like with any other software.
That's not necessary. Look at how Bing works: it's an LLM which can trigger searches, and then the search results get fed back to it as part of the prompt.
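Roughly, that loop looks something like this (a minimal sketch of retrieval-augmented prompting; the `llm` and `search` callables and the prompt wording are stand-ins, not Bing's actual implementation):

```python
def answer_with_search(llm, search, question):
    """Retrieval-augmented prompting: let the model drive a search,
    then feed the results back into its prompt as context."""
    # 1. Ask the model to formulate a search query for the question.
    query = llm(f"Write a short web search query for: {question}")
    # 2. Hit a traditional index with that query.
    results = search(query)
    # 3. Prepend the top results to the prompt and ask for the answer.
    context = "\n".join(results[:3])
    return llm(f"Search results:\n{context}\n\nUsing the results above, answer: {question}")
```

So the base model never retrains on new data; the fresh information arrives through the prompt at inference time.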
Is there information out there about how much it cost (in time or human-hours) to do the additional training necessary to make chatGPT? I am genuinely curious what the scale of the effort was.
Probably one where there isn't an intrinsic conflict of interest with AI risk. Or from a more traditional angle, one where the author's vanity isn't required to be appeased in order for users/customers to be happy. I'm of the opinion that you should do something with game-changing technology because the world needs it, not because you need an ego boost. All technology brings side effects, and there is no greater example of that than "democratized" AI...
People often (usually) do objectively useful things because it's in their selfish interests to do so, ego or otherwise. The surest road to failure is expecting people to act virtuously. Generally systems that assume virtue fail, and systems that assume selfish action and steer that selfish action towards the greater good succeed.
In other words, I don't care why people do things, only that they do.
That’s fine, as long as publicity isn’t the motivation. It’s safe to assume that publicity-seeking isn’t optimal for a project’s success (Satoshi understood this). Not sure where you got the idea that the inverse was beneficial to such a project. I’ve seen first hand where it becomes a problem.
I’m not aware of many examples of starry-eyed divas achieving great results. Usually you hear about them but only because they are exceptional cases, not the norm. It’s a matter of practicality and not virtue (to say otherwise is purely a straw man argument).
That publicity isn't causally connected to success is belied by the existence of the advertising industry. Beyond that general refutation across industries, it is worth noting that the most dominant AI company, Google, happens to be in this industry. They are explicitly known for (i.e., have publicity for) their generous compensation packages, and that is because of a causal model of talent attraction.
Success is obviously causally connected to publicity, and the idea that it isn't is not well supported by the evidence. Contrary to your assertion, it was not a safe assumption. Your appeal to Satoshi is an appeal to authority, not a causal model of how anonymity shielded the project from the effects of publicity-seeking.
> That publicity isn't causally connected to success is belied by the existence of the advertising industry.
The argument was about publicity as a reward motivator, not publicity itself, as a causal relation to success.
To phrase it plainly: which first-time founders do you think Paul Graham or Keith Rabois would be more likely to fund: those who aspire to solve a problem in the world that they care passionately about, or those looking for money or fame? Last time I checked, the latter would be seen as a strong negative. The appeal-to-authority objection doesn’t apply in this situation, because a VC’s portfolio performance is causally related to how accurately they predict the future success of a company.
On the scale of a smaller project like this, a common failure mode is for a maintainer to stop caring about the project and go to the next thing that motivates them. Someone else may attempt to use the code or project without understanding the theory behind it. And even worse: every time this happens is a signal that this is acceptable.
AI is a different beast. Software bugs with big AI systems will become more costly, and eventually deadly. Unfortunately I’m not sure what can be done about it without a global totalitarian regime to ban its use entirely (which is not an idea I support anyway). Eventually the broken clock will be right and some profit-driven AI project will succeed in making the world a not better place, if we are even around to notice :).
I would advise deeper thought into these topics when convenient. Read Nick Bostrom’s Superintelligence book or watch his talks, at least one of which was at Google HQ.
I think someone should train ChatGPT or similar to argue or teach traditional AGI Philosophy/Ethics and hopefully that will move the needle somewhat more than the OpenAI nannyism we have now.
> The argument was about publicity as a reward motivator, not publicity itself, as a causal relation to success.
That the causal model supports publicity-seeking leads us to ensemble models. When models are good for different reasons, an ensemble of them ends up better than any individual model. Reinforcement learning research has shown you can successfully train an agent from decomposed reward signals by building an ensemble model on top of them.
The fact that the causal structure says publicity matters means that agents who recognize the importance of publicity, and contribute it to the solution, can genuinely expect to be part of the solution.
It is very common to see this discussed in terms of diversity improving solution quality in the context of companies, and as a consequence it is generally considered a good idea to have a diverse team.
Anyway, I'm mostly responding because I disagree with the a priori declaration that all who disagree are attacking a straw man.
I think that was overconfident, because the causal structure of publicity and its relation to outcomes disagrees with it.
> Which first-time founders do you think Paul Graham or Keith Rabois would more likely fund: Those who aspire to solve a problem with the world that they care passionately about?
It is worth reflecting on the fact that the founder of OpenAI received the strongest possible endorsement from Paul Graham, who claimed he was among the greats before his successes: Paul Graham put him alongside Steve Jobs and Elon Musk. When Paul Graham stepped down from Y Combinator, he was so convinced of Sam's skills that he put Sam in his place. Later, Sam started OpenAI.
> I would advise deeper thought into these topics when convenient. Read Nick Bostrom’s Superintelligence book or watch his talks, at least one of which was at Google HQ.
I've read Superintelligence, the Sequences, PAIP, AIMA, Deep Learning, Reinforcement Learning, and Theory of Games and Economic Behavior, taken a course in control theory, and read a book about evolutionary algorithms. I've also built systems based on the techniques in literally every one of those, with the exception of Superintelligence and most of the Sequences; the part of the Sequences dealing with Bayesian reasoning I did implement and like, though I disagree with that community about its optimality, because the conditions of ledger arguments aren't true in the real world. In practice, Bayesian approaches are like building a sports car for a race: you get beaten even though you are doing the fastest thing, because the fastest thing isn't as fast as the slower methods.
Anyway, the combinatorics of multi-step, multi-agent decision problems implies a lot of problems for Bostrom's and Yudkowsky's positions on the limits of what intelligence can hope to achieve. I don't find them to be the most formidable thinkers on this subject. In Yudkowsky's case, he admits this, saying that he finds Norvig more formidable than himself. And Norvig disagreed with him on AI risk in exactly the context where I also disagree, and for the same reason. To put it in terms of Bostrom's own analogies: notice that there is, in fact, a speed limit: the speed of light. What Norvig notices, what I also notice, and what Bellman noticed when he coined the term combinatorial explosion, is that intractability is an actual issue you have to confront. It isn't something you can hand-wave away with analogy. We don't have enough atoms in our universe.
This is why we get dual-mode systems, by the way. Not just in humans: notice that it happens in chess engines too. The general solver provides a heuristic, which necessarily has error; then the specific solver uses that heuristic to improve, because it is in a more specific situation. Most of the people in the AI-risk camp are pretty Yudkowskian. They dwell for long periods on overcoming biased heuristics. For sure, this makes them more intelligent, but it misinforms them when they try to make inferences about general intelligence from the tractability of specific situations. It is because of, not despite, the general intractability that they find such evidence of tractability in the specific.
BTW, Bellman actually coined the term curse of dimensionality [1]; I got that confused with combinatorial explosion, since they're synonyms in the contexts where I typically encounter them [2].
OpenAI has a pretty good introduction to the Bellman equations in their Spinning Up in RL lessons [3]. Sutton's work in reinforcement learning also discusses Bellman's work quite a bit. Though Bellman was actually studying what he called dynamic programming problems, his work is now considered foundational in reinforcement learning.
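For concreteness, here is the Bellman optimality equation iterated to a fixed point on a toy two-state MDP (the states, actions, and rewards are invented for illustration; this mirrors the idea Spinning Up covers, not any code from it):

```python
# Value iteration on a toy MDP, repeatedly applying the Bellman optimality update
#   V(s) <- max_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
# The two states, two actions, and rewards below are made up for illustration.
P = {  # P[state][action] = list of (probability, next_state, reward)
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma = 0.9
V = {s: 0.0 for s in P}
for _ in range(200):  # the update is a contraction, so this converges
    V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
        for s, actions in P.items()
    }
# V converges to {0: 19.0, 1: 20.0}: from state 1, "stay" earns 2 forever
# (2 / (1 - 0.9) = 20); from state 0, "go" earns 1 + 0.9 * 20 = 19.
```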
Uh, and for the dual-mode observations, the person who brought that to my attention was Noam Brown, not Bellman or Norvig. If you haven't already checked out his work, I recommend it above both Norvig and Bellman. He has some great talks on YouTube, and I consider it a shame they aren't more widely viewed [4].
The summary includes a dangerous thought. For example: why North Korea develops a nuclear bomb is not important, just that they do.
But only the why makes it problematic.
Noble? You're anthropomorphising machine learning. One possible motivation would be to train a model for its own sake, instead of training a model in order to create publicity around a model being trained.
I think you're misreading: nobody is anthropomorphizing anything other than the very "anthro" component of the system we're talking about, namely the people distributing the funding.
Wonder if someone would be willing to start an open source project where we could crowdsource donations for training, and people could possibly donate their GPU usage for it.
My only problem with Stable Horde is that their anti-CP measure involves checking the prompt for words like "small", meaning I can't use an NSFW-capable model with certain prompts ("holding a very small bag", etc.). That, and seeing great things in the image ratings and being unable to reproduce them, because it doesn't provide the prompt.
A few million dollars. Kickstart the project: get $100 a head from 100,000 backers. Also check with Uncle Sam and see if there are any grants that can be used for this. Start a campaign and get rich, concerned people to donate. Jeff may also want to show AWS can train AI too, so maybe even get a break there, and Amazon can get some nice PR. The list of possibilities seems extensive given the price tag of $12MM and the upside of a fully public GPT.
At Voyager's speed it would take approximately 749,000,000 years to reach Canis Major Dwarf. OpenAI was founded in 2015, so it has been eight years. 8 / 749,000,000 ≈ 1.07e-8, or about 0.000001% of that astronomical timescale; rounded, that is about, uh, 0.00%ish.
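For anyone who wants to check the arithmetic, a quick sketch. I'm supplying the inputs myself (Voyager 1's roughly 17 km/s and a commonly quoted ~42,000 light-year distance to Canis Major Dwarf); both are approximations, so the result lands near, not exactly on, the 749-million-year figure:

```python
# Rough reproduction of the "~749 million years at Voyager speed" figure.
# Speed and distance below are approximations I'm assuming, not exact values.
LY_KM = 9.4607e12          # kilometers per light year
SECONDS_PER_YEAR = 3.156e7

distance_km = 42_000 * LY_KM   # assumed distance to Canis Major Dwarf
speed_km_s = 17.0              # assumed Voyager 1 speed
travel_years = distance_km / speed_km_s / SECONDS_PER_YEAR  # ~7.4e8 years

elapsed = 8  # years since OpenAI's 2015 founding
fraction = elapsed / travel_years  # ~1e-8, i.e. about 0.000001%
print(f"{travel_years:.3g} years; fraction elapsed: {fraction:.2e}")
```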
I mean, don't get me wrong. It is a very expensive project. It just isn't astronomical. Anyone reading this and thinking - oh I could never do that even in hundreds of millions of years - that would be wrong. If you won the lottery or just made good financial decisions you could do a project comparable to this instead of getting a very nice house in the Bay Area.
According to Christopher Potts (Stanford Professor and Chair, Department of Linguistics, and Professor, by courtesy, Department of Computer Science), training a large language model costs about 50 million [1].
Yeah, this is way wrong, unless it's counting the salaries of everyone involved for a few years in the lead-up while writing the software that ended up being used.