For text (the GPT-3 case), that’d work to train a model that had no knowledge of the last century of popular culture or idiom, and was significantly biased to more formal and traditional writing styles. The effects of this would be really quite interesting, but I think it would significantly limit the places it could be usefully applied.
For DALL·E and Copilot, I’m confident that you couldn’t find anywhere near enough material to produce results anywhere near as good as what there is now. I strongly suspect the results would be too poor to be useful in most places where they may be useful now.
IANAL, but what you should be able to do is have a set of quotes of one or two sentences each from various sources (books, TV, movies, etc.) that contain the modern word or idiom you are specifying. Each quote should then be small enough to fall under fair use (the way IMDb, Wikiquote, etc. have quotes from films, and Goodreads, dictionaries, etc. have quotes from books), and be more than enough to capture the meaning of the words/idioms.
You could create your own sentences that you control the copyright of containing the word or idiom, as those words and idioms themselves are not copyrightable. For example: "I fracking hate ice cream!"
For the rest, there is a lot of text up to 1926 (depending on when the author died) that is available for use, so you only need to capture words and idioms that have changed since then, including any pop culture terms.
- "E.T. phone home" was considered copyrightable enough to stop others from making money with it[1]
- trademarks are protected even when the content itself wouldn't be copyrightable; you can't sell AI-generated T-shirts that "happen to" include the word Nike.
- NBC has a trademark on three tones, total length under 2 seconds[2]
There is tons of English text and images with permissive licensing. All of Stack Overflow and Wikipedia is Creative Commons, and anything created by the US government (and many other governments) is public domain.
The terms of the CC-BY-SA licenses that Stack Overflow and Wikipedia largely use cannot practically be satisfied in a trained model. By design, all outputs derive from all sources to some extent, and the licensing requires that sources be specifically identified, so you can’t just say “from Wikipedia” or “from Stack Overflow” but “from such-and-such a page, by so-and-so”.
“Permissive” is not enough. You need no-strings-attached, and attribution is a string. Hence mostly talking about public domain materials, which make up the vast majority of suitable materials.
I suspect that covered works of the USA federal government would be quite a large fraction of the public domain material (as reckoned by the USA) from the last 70 years. I don’t believe it’d be enough to be particularly useful, certainly not for pop culture knowledge or colloquial idiom.