But I thought they were being trained on pirated data... Seems like a problem to...

Tadpole9181 · on Oct 5, 2023

They're being trained on copyrighted data that was publicly accessible. The lawsuits focus on whether or not this is a copyright violation, not on the legality of accessing the material (AFAICT).

SirMaster · on Oct 5, 2023

So the world is just ignoring the legality of accessing the material in the way that they did?

Man, I wish that, as a human, I could use "but it was publicly accessible in a torrent" as a valid defense for having acquired and consumed some content.

Tadpole9181 · on Oct 5, 2023

Sorry, do you have some proof that they pirated content, proof that the scrupulous copyright holders themselves don't have? That's not the issue at hand and, AFAICT, nobody is accusing them of doing it.

Nobody is ignoring it. It would be a crime if they had done it, but since there's no evidence of it... you're getting upset at your own hypothetical.

SirMaster · on Oct 5, 2023

I'm really confused.

I see articles all over saying that LLMs used, for example, books3, which its creator himself has admitted was torrented from bibliotik, and which contains at least some books that are not otherwise freely available.

The issue that content creators have with LLMs is that they aren't getting proper attribution for their creations and for the source material that went into training these LLMs, and it seems pretty clear to me that some of the content used to train LLMs was not legally obtained or licensed for that use.