There's a big difference between a model and almost every other form of compilation of human knowledge: the point at which human intervention occurs. A dictionary is a collection of definitions written by humans, derived from their experience reading texts. Human knowledge and experience moderate the flow from the text the human reads to the text in the dictionary. LLM training has no such moderation, and that is a significant factor in whether something is a derived work or not.
This is really all up to the courts, and I don't think anyone is confident about how it will shake out. However, the fact that OpenAI isn't even trying to argue that an LLM isn't a derived work of the training set (they are going straight to fair use, which is an acknowledgement of infringement) suggests that the derived-work question is not actually contentious.