For text (the GPT-3 case), that’d work to train a model that had no knowledge of the last century of popular culture or idiom, and was significantly biased to more formal and traditional writing styles. The effects of this would be really quite interesting, but I think it would significantly limit the places it could be usefully applied.
For DALL·E and Copilot, I’m confident that you couldn’t find anywhere near enough material to produce results anywhere near as good as what there is now. I strongly suspect the results would be too poor to be useful in most places where they may be useful now.
IANAL, but what you should be able to do is have a set of quotes of one or two sentences each from various sources (books, TV, movies, etc.) that contain the modern word or idiom you are specifying. Each quote should then be small enough to fall under fair use (the way IMDb, Wikiquote, etc. have quotes from films, and Goodreads, dictionaries, etc. have quotes from books), and be more than enough to capture the meaning of the words/idioms.
You could create your own sentences that you control the copyright of containing the word or idiom, as those words and idioms themselves are not copyrightable. For example: "I fracking hate ice cream!"
For the rest, there is a lot of text up to 1926 (depending on when the author died) that is available for use, so you only need to capture words and idioms that have changed since then, including any pop culture terms.
- "E.T. phone home" was considered copyrightable enough to stop others from making money with it[1]
- trademarks are protected even when the content itself wouldn't be copyrightable; you can't sell AI-generated T-shirts that "happen to" include the word Nike.
- NBC has a trademark on three tones, total length under 2 seconds[2]
There is tons of English text and images with permissive licensing. All of Stack Overflow and Wikipedia is Creative Commons, and anything created by the US government (and many other governments) is public domain.
The terms of the CC-BY-SA licenses that Stack Overflow and Wikipedia largely use cannot practically be satisfied in a trained model. By design, all outputs derive from all sources to some extent, and the licensing requires that sources be specifically identified, so you can’t just say “from Wikipedia” or “from Stack Overflow” but “from such-and-such a page, by so-and-so”.
“Permissive” is not enough. You need no-strings-attached, and attribution is a string. Hence mostly talking about public domain materials, which make up the vast majority of suitable materials.
I suspect that covered works of the USA federal government would be quite a large fraction of the public domain material (as reckoned by the USA) from the last 70 years. I don’t believe it’d be enough to be particularly useful, certainly not for pop culture knowledge or colloquial idiom.