Meta has released the model weights for OPT-175B, the model used in the paper. There are also a number of fully released LLMs from other labs on the way.
While OPT-175B is great to have publicly available, it needs a lot more training to achieve good results. Meta trained OPT on 180B tokens, compared to the 300B that GPT-3 saw. And the Chinchilla scaling laws suggest that almost 4T tokens would be required to get the most bang for the compute buck.
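As a rough sketch of where that ~4T figure comes from: the popular ~20-tokens-per-parameter rule of thumb (an approximation of the Chinchilla fit, not the paper's exact formula) gives:

```python
# Back-of-the-envelope Chinchilla estimate. The 20-tokens-per-parameter
# ratio is a common approximation of the paper's compute-optimal fit,
# not its exact scaling-law formula.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20) -> float:
    return n_params * tokens_per_param

opt_params = 175e9                                # OPT-175B parameter count
optimal = chinchilla_optimal_tokens(opt_params)   # 3.5e12 tokens, i.e. ~3.5T
seen = 180e9                                      # tokens OPT-175B actually saw
print(optimal / seen)                             # roughly 19x short of compute-optimal
```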
And on top of that, there are questions about the quality of open-source data (The Pile) versus OpenAI's proprietary dataset, which they seem to have spent a lot of effort cleaning. So: open-source models are probably data-constrained, in both quantity and quality.
OPT-175B isn't publicly available, sadly. It's available to research institutions, which is much better than "Open"AI, but it doesn't help us hobbyists/indie researchers much.
I wonder when we'll start putting these models on The Pirate Bay or similar. Seems like an excellent use for the tech. Has no one tried to upload OPT-175B anywhere like that yet?
No, Stable Diffusion isn't the only one to release weights. OpenAI hasn't been releasing weights for ChatGPT, but plenty of labs besides Stability AI are releasing theirs [1].
Well maybe not every day, but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development. If you actually want to keep developing the model, you need the funding to be able to train it more than once.
To summarize this discussion, we went from "this might mean we don't need a fleet of $10k+ GPUs to even run a LLM" to "yeah but an individual couldn't train one every day though". These goalposts are breaking the sound barrier.
>but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development
This is not "software development" in general, this is LLM training.
It's not like you're building some regular app, api, or backend.
If you are claiming that training a LLM literally only one time is enough and there is no need to train it more than once, you are wrong. The researchers who created OPT didn't go into a basement for 12 months, then come out, train their model once, hit publish, and go to coffee. That is a fantasy. Likewise, if a CS student wants to dabble in this research, they need the ability to train more than once.
I'm not gonna engage in a rhetorical argument about whether this should be called "software development" or "LLM development" or something else. That's unrelated to the question of how much training is required.
>If you are claiming that training a LLM literally only one time is enough and there is no need to train it more than once, you are wrong.
No, I'm rather claiming that what you claimed is wrong in the context of LLM training: "Well maybe not every day, but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development".
LLM training is not the same as writing a program and "running your code with different variations". For an LLM you don't need to quickly rerun everything with some new corpus - it would be nice, but it's neither a prerequisite nor even crucial for any current use.
Hell, it's not even a "prerequisite" in programming, just good to have. Tons of great programs have been written with very slow build times, without quick edit/compile/build/run cycles.
I wasn't talking about running the same code with a new corpus. For that kind of use case one can simply fine tune the pretrained model. The example I gave was "if a CS student wants to dabble in this research".
You said "LLM training is not the same as writing a program and running your code with different variations". How do you think these LLMs were made, seriously? Do you think Facebook researchers sat down for 12 months and wrote code non stop without compiling it once, until the program was finished and was used to train the LLM literally only one time?
Yes. There _is_ a need to train LLMs more than once, and training is prohibitively expensive, so you need workarounds such as training on a small subset of data, or a smaller version of the model. We're not yet at the point where a CS student on consumer hardware could afford to do this kind of research.
> We're not yet at the point where a CS student on consumer hardware could afford to do this kind of research.
Okay. But I was saying someone with millions of dollars to spend could do it. And then another poster was arguing that millions of dollars was not enough to be viable because you need lots of repeated runs.
Nobody was saying a student could train one of these models from scratch. The cool potential is for a student to run one, maybe fine tune it.
You wouldn't need to re-train from scratch for that, just fine-tune on the new data sources. I don't think constant re-training is the optimal strategy for that use-case anyway. Bing does it by letting the LLM search a more traditional web index to find the information it needs.
Okay but someone has to do the fine tuning. The code has to be updated. Parts of the training have to be redone. All of this has costs. It isn't a "do it once and forget about it" task that it is being touted as in this thread.
>It isn't a "do it once and forget about it" task that it is being touted as in this thread.
That's neither here nor there. Training the LLM itself is not a "do it multiple times per day if you want to compete with Google" thing, as has been stated in this subthread.
You can say that about any software. "You can use this software perfectly well without ever updating it." Sure, you can do that, but typically people have lots of reasons to update software. LLM isn't magic in this sense. An LLM does not mysteriously update its own code if you just wish hard enough. If you want to continue the development of the LLM then you need to make changes to the code, just like with any other software.
That's not necessary. Look at how Bing works: it's an LLM which can trigger searches, and then the search results get fed back to it as part of the prompt.
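Roughly, that loop looks something like this (a minimal sketch of retrieval-augmented prompting; the `llm` and `search` callables and the prompt wording are stand-ins, not Bing's actual implementation):

```python
def answer_with_search(llm, search, question):
    """Retrieval-augmented prompting: let the model drive a search,
    then feed the results back into its prompt as context."""
    # 1. Ask the model to formulate a search query for the question.
    query = llm(f"Write a short web search query for: {question}")
    # 2. Hit a traditional index with that query.
    results = search(query)
    # 3. Prepend the top results to the prompt and ask for the answer.
    context = "\n".join(results[:3])
    return llm(f"Search results:\n{context}\n\nUsing the results above, answer: {question}")
```

So the base model never retrains on new data; the fresh information arrives through the prompt at inference time.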
Is there information out there about how much it cost (in time or human-hours) to do the additional training necessary to make chatGPT? I am genuinely curious what the scale of the effort was.
Probably one where there isn't an intrinsic conflict of interest with AI risk. Or from a more traditional angle, one where the author's vanity isn't required to be appeased in order for users/customers to be happy. I'm of the opinion that you should do something with game-changing technology because the world needs it, not because you need an ego boost. All technology brings side effects, and there is no greater example of that than "democratized" AI...
People often (usually) do objectively useful things because it's in their selfish interests to do so, ego or otherwise. The surest road to failure is expecting people to act virtuously. Generally systems that assume virtue fail, and systems that assume selfish action and steer that selfish action towards the greater good succeed.
In other words, I don't care why people do things, only that they do.
That’s fine, as long as publicity isn’t the motivation. It’s safe to assume that publicity-seeking isn’t optimal for a project’s success (Satoshi understood this). Not sure where you got the idea that the inverse was beneficial to such a project. I’ve seen first hand where it becomes a problem.
I’m not aware of many examples of starry-eyed divas achieving great results. Usually you hear about them but only because they are exceptional cases, not the norm. It’s a matter of practicality and not virtue (to say otherwise is purely a straw man argument).
That publicity isn't causally connected to success is belied by the existence of the advertising industry. Beyond that general refutation across industries, it is worth noting that the most dominant AI company, Google, happens to be in this industry. They are explicitly known for (i.e., have publicity for) their generous compensation packages, and that is because of a causal model of talent attraction.
Success is obviously causally connected to publicity, and the idea that it isn't is not well supported by the evidence. Contrary to your assertion, it was not a safe assumption. Your appeal to Satoshi is an appeal to authority, not a causal model of how anonymity shielded the project from the effects of publicity-seeking.
> That publicity isn't causally connected to success is belied by the existence of the advertising industry.
The argument was about publicity as a reward motivator, not publicity itself, as a causal relation to success.
To phrase it plainly: which first-time founders do you think Paul Graham or Keith Rabois would be more likely to fund: those who aspire to solve a problem in the world that they care passionately about, or those looking for money or fame? Last time I checked, the latter would be seen as a strong negative. The appeal-to-authority objection doesn’t apply in this situation, because a VC’s portfolio performance is causally related to how accurately they predict the future success of a company.
On the scale of a smaller project like this, a common failure mode is for a maintainer to stop caring about the project and go to the next thing that motivates them. Someone else may attempt to use the code or project without understanding the theory behind it. And even worse: every time this happens is a signal that this is acceptable.
AI is a different beast. Software bugs with big AI systems will become more costly, and eventually deadly. Unfortunately I’m not sure what can be done about it without a global totalitarian regime to ban its use entirely (which is not an idea I support anyway). Eventually the broken clock will be right and some profit-driven AI project will succeed in making the world a not better place, if we are even around to notice :).
I would advise deeper thought into these topics when convenient. Read Nick Bostrom’s Superintelligence book or watch his talks, at least one of which was at Google HQ.
I think someone should train ChatGPT or similar to argue or teach traditional AGI Philosophy/Ethics and hopefully that will move the needle somewhat more than the OpenAI nannyism we have now.
> The argument was about publicity as a reward motivator, not publicity itself, as a causal relation to success.
That the causal model supports publicity-seeking leads us to ensemble models. When models are good for different reasons, an ensemble of them ends up better than any individual model. Reinforcement learning research has shown you can successfully train an agent from decomposed reward signals by building an ensemble model on top of them.
The fact that the causal structure says publicity matters means that agents who recognize the importance of publicity, and contribute it to the solution, can genuinely expect to be part of the solution.
It is very common to see this discussed in terms of diversity improving solution quality in the context of companies, and as a consequence it is generally considered a good idea to have a diverse team.
Anyway, I'm mostly responding because I disagree with the a priori declaration that all who disagree are attacking a straw man.
I think that was overconfident, because the causal structure of publicity and its relation to outcomes disagrees with it.
> Which first-time founders do you think Paul Graham or Keith Rabois would more likely fund: Those who aspire to solve a problem with the world that they care passionately about?
It is worth reflecting on the fact that the founder of OpenAI received the strongest possible endorsement from Paul Graham, who claimed he was among the greats before his successes: Paul Graham put him alongside Steve Jobs and Elon Musk. When Paul Graham stepped down from Y Combinator, he was so convinced of Sam's skills that he put Sam in his place. Later, Sam started OpenAI.
> I would advise deeper thought into these topics when convenient. Read Nick Bostrom’s Superintelligence book or watch his talks, at least one of which was at Google HQ.
I've read Superintelligence, the Sequences, PAIP, AIMA, Deep Learning, Reinforcement Learning, and Theory of Games and Economic Behavior, taken a course in control theory, and read a book about evolutionary algorithms. I've also built systems based on the techniques in literally every one of those, with the exception of Superintelligence and most of the Sequences; the part of the Sequences dealing with Bayesian reasoning I did implement and like, though I disagree with that community about its optimality, because the conditions of ledger arguments aren't true in the real world. In practice, Bayesian approaches are like building a sports car for a race: you get beaten even though you are doing the fastest thing, because the fastest thing isn't as fast as the slower methods.
Anyway, the combinatorics of multi-step, multi-agent decision problems implies a lot of problems for Bostrom's and Yudkowsky's positions on the limits of what intelligence can hope to achieve. I don't find them to be the most formidable thinkers on this subject. In Yudkowsky's case, he admits this, saying that he finds Norvig more formidable than himself. And Norvig disagreed with him on AI risk in exactly the context where I also disagree, and for the same reason. To put it in terms of Bostrom's own analogies: notice that there is, in fact, a speed limit: the speed of light. What Norvig notices, what I also notice, and what Bellman noticed when he coined the term combinatorial explosion, is that intractability is an actual issue you have to confront. It isn't something you can hand-wave away with analogy. We don't have enough atoms in our universe.
This is why we get dual-mode systems, by the way. Not just in humans: notice that it happens in chess engines too. The general solver provides a heuristic, which necessarily has error; then the specific solver uses that heuristic to improve, because it is in a more specific situation. Most of the people in the AI-risk camp are pretty Yudkowskian. They dwell for long periods on overcoming biased heuristics. For sure, this makes them more intelligent, but it misinforms them when they try to make inferences about general intelligence from the tractability of specific situations. It is because of, not despite, the general intractability that they find such evidence of tractability in the specific.
BTW, Bellman actually coined the term curse of dimensionality [1]; I got that confused with combinatorial explosion, since they're synonyms in the contexts where I typically encounter them [2].
OpenAI has a pretty good introduction to the Bellman equations in their Spinning Up in RL lessons [3]. Sutton's work in reinforcement learning also discusses Bellman's work quite a bit. Though Bellman was actually studying what he called dynamic programming problems, his work is now considered foundational in reinforcement learning.
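For concreteness, here is the Bellman optimality equation iterated to a fixed point on a toy two-state MDP (the states, actions, and rewards are invented for illustration; this mirrors the idea Spinning Up covers, not any code from it):

```python
# Value iteration on a toy MDP, repeatedly applying the Bellman optimality update
#   V(s) <- max_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
# The two states, two actions, and rewards below are made up for illustration.
P = {  # P[state][action] = list of (probability, next_state, reward)
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma = 0.9
V = {s: 0.0 for s in P}
for _ in range(200):  # the update is a contraction, so this converges
    V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
        for s, actions in P.items()
    }
# V converges to {0: 19.0, 1: 20.0}: from state 1, "stay" earns 2 forever
# (2 / (1 - 0.9) = 20); from state 0, "go" earns 1 + 0.9 * 20 = 19.
```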
Uh, and for the dual-mode observations, the person who brought that to my attention was Noam Brown, not Bellman or Norvig. If you haven't already checked out his work, I recommend it above both Norvig and Bellman. He has some great talks on YouTube, and I consider it a shame they aren't more widely viewed [4].
The summary includes a dangerous thought. For example: why North Korea develops a nuclear bomb is not important, just that they do.
But only the why makes it problematic.
Noble? You're anthropomorphising machine learning. One possible motivation would be to train a model for its own sake, instead of training a model in order to create publicity around a model being trained.
I think you're misreading: nobody is anthropomorphizing anything other than the very "anthro" component of the system we're talking about, namely the people distributing the funding.
Wonder if someone would be willing to start an open source project where we could crowdsource donations for training, and people could possibly donate their GPU usage for it.
My only problem with Stable Horde is that their anti-CP measure involves checking the prompt for words like "small", meaning I can't use an NSFW-capable model with certain prompts ("holding a very small bag", etc.). That, and seeing great things in the image ratings and being unable to reproduce them, because it doesn't provide the prompt.
A few million dollars. Kickstart the project: get $100 a head from 100,000 backers. Also check with Uncle Sam and see if there are any grants that can be used for this. Start a campaign and get rich, concerned people to donate. Jeff may also want to show AWS can train AI too, so maybe even get a break there, and Amazon can get some nice PR. The list of possibilities seems extensive given the price tag of $12MM and the upside of a fully public GPT.
At Voyager's speed it would take approximately 749,000,000 years to reach Canis Major Dwarf. OpenAI was founded in 2015, so it has been eight years. 8 / 749,000,000 ≈ 1.07e-8, or about 0.000001% of that astronomical timescale; rounded, that is about, uh, 0.00%ish.
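For anyone who wants to check the arithmetic, a quick sketch. I'm supplying the inputs myself (Voyager 1's roughly 17 km/s and a commonly quoted ~42,000 light-year distance to Canis Major Dwarf); both are approximations, so the result lands near, not exactly on, the 749-million-year figure:

```python
# Rough reproduction of the "~749 million years at Voyager speed" figure.
# Speed and distance below are approximations I'm assuming, not exact values.
LY_KM = 9.4607e12          # kilometers per light year
SECONDS_PER_YEAR = 3.156e7

distance_km = 42_000 * LY_KM   # assumed distance to Canis Major Dwarf
speed_km_s = 17.0              # assumed Voyager 1 speed
travel_years = distance_km / speed_km_s / SECONDS_PER_YEAR  # ~7.4e8 years

elapsed = 8  # years since OpenAI's 2015 founding
fraction = elapsed / travel_years  # ~1e-8, i.e. about 0.000001%
print(f"{travel_years:.3g} years; fraction elapsed: {fraction:.2e}")
```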
I mean, don't get me wrong. It is a very expensive project. It just isn't astronomical. Anyone reading this and thinking - oh I could never do that even in hundreds of millions of years - that would be wrong. If you won the lottery or just made good financial decisions you could do a project comparable to this instead of getting a very nice house in the Bay Area.
According to Christopher Potts (Stanford Professor and Chair, Department of Linguistics, and Professor, by courtesy, Department of Computer Science), training a large language model costs about 50 million [1].
Yeah, this is way wrong, unless it's counting the salaries of everyone involved for a few years in the lead-up while writing the software that ended up being used.