
I mean we never seem to care about the downward pressures we put on others, so this seems a contrived place to do it now.

My take on the matter is that facts aren't copyrightable. An LLM is basically recording facts.

I have a background in stylometry where you basically determine authorship by recording statistics on known pieces of writing to try to find the author of another piece of writing. So just extracting facts. You could easily now generate random sentences and filter ones that pass a certain threshold. LLMs are basically this on steroids. So it's just facts and data, not copyright infringement.

Otherwise, you'd have to argue that software reading the text contents of a book is copyright infringement.



> I have a background in stylometry where you basically determine authorship by recording statistics on known pieces of writing to try to find the author of another piece of writing. So just extracting facts

The logical jump that happens around that period is spectacular.


I was assuming people were familiar with the process and handwaving the rest.

You measure things like:

* What is the average sentence length?

* What is the ratio of adjectives to nouns?

* What percentage of sentences are in a passive voice?

* What is the distribution of words used?

* Is "the" used more often by the other than in general usage?

* "whence" vs "when".

* "tyre" vs "tire". Etc.

This could all be printed out and sold as a book and each measure would be a very boring fact that is not copyrightable and contains no copyrightable content.

I could give you a word frequency list from the A Song of Ice and Fire series and George R. R. Martin could do nothing about it. (Technically, he might have the rights to "Arya" and "Lannister", but saying "the word 'Lannister' appears 1,337 times in the series" would clearly fall under fair use.)
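The measurements listed above are easy to make concrete. A minimal sketch in Python (the function name and sample text are my own, purely illustrative):

```python
import re
from collections import Counter

def stylometric_features(text):
    """Compute a few simple stylometric statistics for a text.

    These are the kind of boring, factual measurements stylometry
    relies on; real systems use many more features.
    """
    # Crude sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased word tokens, keeping apostrophes for contractions.
    words = re.findall(r"[a-zA-Z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {
        "avg_sentence_length": total / len(sentences),
        "the_frequency": counts["the"] / total,
        "word_distribution": counts.most_common(5),
    }

sample = ("The cat sat on the mat. The dog barked. "
          "It was a quiet day in the village.")
print(stylometric_features(sample))
```

Every value it returns is a plain statistic about the text - exactly the kind of non-copyrightable fact being described.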


Sorry, it's nonsense. You're basically saying:

1. Anything existing is considered a fact.

2. Stating a fact doesn't infringe copyright.

The only conclusion your logic can lead to is that copyright doesn't exist.

For example, I can say "the first sentence of the A Song of Ice and Fire series is <insert the first sentence>, the second sentence is <insert the second sentence>, the third sentence is...". It's still just a list of facts, so no copyright infringement, right?


I think in that case, if you went through the whole book like that, a judge would rule that you are infringing, because the whole book can be recovered that way.

But if you went through and grabbed just the first sentence from each chapter, that is sufficiently in the clear. Especially when you are doing something transformative with it, like analysing what makes for an effective opening sentence.

With copyright you always get into ugly slippery-slope arguments; Tom Scott has a great video on it.

But if you cannot extract something that replaces the original work, you should already be in the clear. I'm pretty sure Coles Notes don't need the original author's permission to print, and they are arguably a substitute for reading the original work. I've skimmed through entire series by reading the individual episode summaries on Wikipedia. These are in the clear.

The fact that a machine speeds up a human process does not change legality.


How would reproduction of fictional works or the creation of derivative works based on fictional training inputs constitute “a recording of facts”?


I think you're conflating two issues. As I understand it, this justification is about whether the training itself is fair-use. The discussion on whether the outputs are derivative or transformative is separate. Analogously, if I publish a copy of copyrighted poems from memory that would be infringement, but there's nothing infringing about me just reading and memorizing them.


LLMs - as in the models, the weights - do not contain "reproduction of fictional works or the creation of derivative works based on fictional training inputs" in any meaningful sense. "Reproduction of fictional works" is what is present in the training data. "Creation of derivative works based on fictional training inputs" is what you might be doing as a user of an LLM. The models themselves are, like GP said, recordings of statistical properties of text, just turned up to 11.


> do not contain "reproduction of fictional works or the creation of derivative works based on fictional training inputs" in any meaningful sense.

Except they do, and in a quite literal sense.

You painted a painting. I took a picture of it and compressed it as a .jpeg. Byte-wise, the .jpeg file has no similarity to your original painting. For someone who doesn't know .jpeg, it's just garbage bytes.

So I didn't infringe your copyright by selling this .jpeg file. The user who decodes the .jpeg file and displays it on a monitor does.

Does that sound right? Because this is how the "weights do not contain training data" argument works.

And before "how about artists who store this information in their brains?" Well human being is a special case for every law in every country. Just like selling a cow's liver is never the same as selling a man's, even they're both organic tissues. A human's brain is always going to be treated differently than a hard drive.


My argument is that the models are somewhere between your JPEG example and the "artists' brains" example. Now, legally, it's usually not the bytes themselves that matter, but their colour (provenance). But this doesn't make the case of AI models any clearer - the training process is also somewhere between zipping up a folder full of JPEGs and a human practicing their art through a fuck-ton of repetitive reproductions of existing works until they grok the style.


I agree.

The thing I hate about this whole situation is that it's going to be decided solely on which side has more lobbying power[1], since just as you said, it's something in between.

[1]: One could say all the policies are decided this way...


> [1]: One could say all the policies are decided this way...

The silver lining here is that lobbyists can only afford to care about a finite number of things at a time - so even if they get some laws their way, other factions can push laws mitigating the damage somewhat.

I currently believe (weakly) that the outcome of more and better lobbying isn't laws getting directly worse for society, but rather the regulatory system grinding to a halt under an increasing number of laws that exist to cancel out parts of other laws...


> LLMs - as in the models, the weights - do not contain "reproduction of fictional works or the creation of derivative works based on fictional training inputs" in any meaningful sense.

Since portions of such works can be recovered in inference, they contain at least a lossily-compressed copy of the collection of works used in training. Reproduction isn’t pure coincidence.


Indeed, but learning is effectively[0] a form of lossy compression too. DNN weights are somewhere between zipping up a truckload of JPEGs and deriving facts from first principles. Where exactly they are on this spectrum, and how that affects copyright, is not obvious to me - at least not when trying to argue from fundamental principles. I feel the DNN copyright issues will be ruled on from a purely pragmatic position: what's the legal status that upends existing markets the least, and/or is most favored by lobbyists.

--

[0] - I believe it's actually fundamentally the same thing.
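For a toy illustration of "learning as lossy compression": a bigram table is nothing but recorded statistics, yet when trained on a single short text, greedily sampling from it regurgitates near-verbatim spans of that text. (Function names and the sample text are mine, purely illustrative.)

```python
from collections import defaultdict, Counter

def train_bigrams(tokens):
    """Count next-token frequencies: a pure 'recording of statistics'."""
    table = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        table[a][b] += 1
    return table

def generate(table, start, n):
    """Greedily follow the most frequent continuation each step."""
    out = [start]
    for _ in range(n):
        nxt = table.get(out[-1])
        if not nxt:
            break
        out.append(nxt.most_common(1)[0][0])
    return out

text = "it was the best of times it was the worst of times".split()
table = train_bigrams(text)
# With only one training text, the 'statistics' reproduce long spans of it.
print(" ".join(generate(table, "it", 9)))
```

Scale the training set up and the statistics blur into generalization; shrink it (or repeat a passage often enough) and they collapse back into memorization. The weights sit somewhere on that same spectrum.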


> Indeed, but learning is effectively a form of lossy compression

Yeah, it's pretty clear that copies in the brain of a human who has experienced the work, and not reduced to any other media, are neither “copies” nor “phonorecords” as covered by US copyright law, and it's also pretty clear that this does not apply to data stored by clever algorithms in computer storage, so I’m not sure what your point is.

If you want to argue LLMs are people to escape this, then, sure, copyright stops being a problem for their training data, but it still is for their output, plus you end up with a whole set of new legal problems with using them as people do now, starting with 13th Amendment problems.


> that this does not apply to data stored by clever algorithms in computer storage is also pretty clear,

I'm sorry, this is far from "clear", otherwise we wouldn't be having this discussion. Fair use is a thing. Does fair use apply in this situation? No one knows at this point.


There's a pretty big jump from "machine learning is not inherently different from human learning" to "machine learning models are people".


There's not a big jump from “human learning isn't treated as a copyright violation because it happens in people and not in external media” plus “machine learning should be legally treated like human learning” to “your argument depends on treating ML models as people”.

The fact that people making the second part of the first argument often either are ignorant of or deliberately ignore why human learning is treated the way it is doesn't change the essence of the situation.


That applies to both sides though: people arguing that machine learning should not be treated like human learning often justify it with "ML models aren't people, they're software on a computer". But that argument is also missing the point, and invites this kind of philosophical discussion. The essence of the matter is that there is no high-level principle involved - humans get special treatment because humans write the law. So we can legally round all machine learning down to "zipping a folder full of JPEGs" for pragmatic reasons, and get away with it until we actually create sentient AIs and recognize them as people.

(Which, knowing history and observing how we treat animals, will happen only when the sentient AIs coerce us to recognize them, through violence or the threat of it...)


> The essence of the matter is that there is no high-level principle involved - humans get special treatment because humans write the law.

“The purpose of law is to advance the collective interests of humans in a society” is, I would argue, a high-level principle, and (with variations as to whether it applies to all humans or some subset, and which interests are considered privileged relative to others) nearly universal. Yes, the fact that humans are writing the laws is a reason that principle is chosen, but the idea that there are no high-level principles is just false.


>“your argument depends on treating ML models as people”

They would be "legal people" in the sense that corporations are "people". Some jurisdictions have even granted personhood to non-sentient objects, such as rivers. There's no reason to get held up on the exact word "person" here.


>Otherwise, you'd have to argue that software reading the text contents of a book are copyright infringement.

How so?


Because that's all the AI model is doing. It's reading the books.

How an entity (human or not) changes after reading those books is not in the purview of copyright.

It has not shared them (copied). It has read them, learned from them, and changed itself in response.


In order to present a book to a ML algorithm you need to copy it, either using a camera, or using other means (eg as a file). In the USA that copying might be Fair Use; it's almost certainly not allowed under UK law, AIUI.

Copying a webpage into a cache that allows presentation of the page to a user is only allowed because it is part of rendering the page to a user. Even if a computer only copies two words at a time from a source text, if it copies a substantial part overall then it has still copied.

Honestly, I don't think slurping data to train ML models is allowed by copyright (though I do, probably, think it should be, as long as any significant reproductions are then prosecuted as infringements [also, we should reduce copyright terms to ~7 years!]).

This is all my own opinion, unrelated to my employment.


> In order to present a book to a ML algorithm you need to copy it, either using a camera, or using other means

Almost all modern books are available as ebooks for the Kindle or otherwise. You aren't doing anything to the content itself that could be a violation.


So, you copy the Kindle book from a server to your local host to feed into your training algo (ie other means). Or you stream it piecewise into a buffer, still copying.


As long as you only train the network and never use it for inference, I suppose that's a reasonable argument. But a person is restricted in action by copyright regardless of the writing implement they choose to use. Whether they use a pencil or an LLM, they cannot freely reproduce copyrighted works in whole or in part, excepting some narrow conditions.


IMHO, an LLM is not "reading" a book in any way, certainly there is no parallel with how people read. It is encoding the book. That's how it can regurgitate chunks of that book later, including, in some contexts, by providing verbatim spans of text from the book.


Why is an AI model allowed to read a book without paying for access to it, but I as a human have to pay for it to read it?


The AI should, no doubt, pay to read the book!

In technical terms, these companies are trying to avoid the need to consider content licensing, which is a major violation against content creators.


You aren't licensing it. You are reading it. A machine reading it is no different than a human. I mean it really shouldn't be. Should a visually impaired person using a screen reader have to "license" the book?


People come up with the most bizarre corner cases to justify freeloading on other’s intellectual property.

Content is obviously being relicensed as the models are not fully open.

On top of that, it is also being exposed as a pay-for-use subscription model without paying anything to the original authors. How is that fair use?


That particular transformation of the book does well on a fair use analysis. If the blind person were to use the screen reader to put the book on Spotify, it would almost certainly fail a fair use analysis.


But not all the works that have been consumed by LLMs are available to read without a license to do so.


You can check the book out at the library and read it. Someone has to buy it, but not necessarily you personally.


You can check out some books at libraries for free, sure.

LLMs are trained on tons of books and papers that are not available to humans for free anywhere.


Can you give an example of something that they are trained on that a human can't somehow read without buying?


Textbooks, for one example.

You are telling me that all the books in books2 and books3 can be acquired and read by myself completely free of charge legally without piracy?

If humans can get these books for free legally, why are so many people paying so much to buy them for their college courses?

I thought it's well known that LLMs are being trained on pirated content.


Textbooks aren't somehow banned from libraries. You'll find lots of textbooks in public libraries, school libraries, and other libraries where they can be read without paying for the book.

Now if you are saying they obtained the book from illegal sources, that is a valid argument. But the issue there is orthogonal to feeding the contents of the books into an LLM. It is legal for me to view and photograph the Mona Lisa. If I break into the Louvre to view the Mona Lisa, the issue isn't that I viewed the artwork. The issue is that I broke into the museum.


Yes, but then the issue isn't whether copyrighted works can be used as training data, but how you can obtain those works to use them as training data. I agree that you can't justify copyright infringement based on the fact that you are using it as training data. But I do not agree with those who say using a legally borrowed book as training data violates copyright.

But if you have an example of some types of books that it is impossible to borrow from anywhere on the planet please share. Textbooks are readily available from libraries.


No, they aren't banned, but not being able to find one at a library is not a legal justification for me to go and torrent it. It shouldn't be for a corporation training an LLM either.

It's widely known that some of the big LLMs were trained on book datasets that were acquired via torrents. And that those datasets do contain at least some books that are not available anywhere freely via legal channels.


There is nothing illegal about reading a book that you haven't paid for.

What's illegal is making and distributing copies of a book.

But the reader isn't in trouble for reading a book they don't own.


Really?

So I am free to torrent all the books that I want, and there is no legal action that can ever be taken against me?

If so that would be news to me.


If you start distributing torrents and copies to other people?

No of course not.

Creating copies, and distributing them to other people is against the law.

That has nothing to do with what I said though.

What I said is that it is not illegal for a reader to read something that they haven't paid for.

Do you see how that is different from distributing copies of something to other people?


That's not what I am asking though...

I am asking about reading content that is not available through any free means, and pirating it in order to read it instead of paying for it from the author.


> I am asking about reading content that is not available through any free means, and having to pirate it in order to read it.

Reading it is legal.

Distributing torrents is not, though. That is the part of piracy that is illegal: the part where you create and distribute copies.

Copyright law has nothing to do with reading stuff that you have not paid for.

Instead, it is about the illegality of creating and distributing copies.

That's why it's called "copyright" law. Because it is about copies.

It is not "readright law".

No, there is nothing illegal about reading something that you haven't paid for.


I didn't know it was legal to download books from torrents that are normally only available from the author for a cost.

What about movies and tv shows? I can download them via torrents to view without paying as well, completely legally?

Nobody can ever take any legal action against me ever for doing this?

How, though, would the creator of the content ever get paid if everyone can read and view it completely free? Why would they go through all the work of creating it with no gain to themselves?


Torrents are a somewhat special case since, by protocol, downloading and sharing happen at the same time. So by running a torrent of pirated media, you are necessarily infringing copyright (there are hacked clients that report bogus information that allow you to download only without sharing, but they are easily detectable and ~nobody uses them).

As far as I know, nobody in this country has ever been successfully prosecuted for the mere act of downloading, or even having obviously-illicitly-acquired pirated media in their possession.

As to your final question, the actual answer is because people are lazy. It requires more effort and technical sophistication to maintain a movie file collection than it does to run Netflix. Nearly any piece of remotely popular content is available for free with a bit of knowledge, but it's less hassle to just subscribe to services. There is no world in which piracy actually endangers creators.


I should have used an example other than torrenting, one that doesn't automatically distribute back.

Also, I'd say there is a difference between being illegal and whether someone has been successfully prosecuted or not.

I don't understand, though, how piracy doesn't endanger creators. If a small-time creator creates something, say a video game, and everyone pirates it instead of buying it, how has that not endangered them? Simply because you don't think it's plausible that enough people would refrain from paying?


Pretty much. For the simple reason that the "legitimate" way is always easier and the people that pirate are always a minority as a result.


> Nobody can ever take any legal action against me ever for doing this?

They can if you create or distribute copies.

That part is illegal and will get you in trouble, and is why people get in trouble for torrenting.

Just reading or watching other people's content isn't the illegal part.


I think it's safe to assume that AI has to pay for the book as well, sooner or later, as shown in the deals OpenAI has made with content publishers such as AP [0]. The question is how much they should pay.

[0] - https://www.pymnts.com/digital-payments/2023/58percent-of-co...


That's not the argument being made, is it? If the AI is trained on pirated data, that's a completely different case from the underlying principle that all AI-generated content is copyright infringement.


But I thought they were being trained on pirated data...

Seems like a problem to me.


They're being trained on copyrighted data that was publicly accessible. The lawsuits focus on whether or not this is copyright violation, not the legality of accessing the material (AFAICT).


So the world is just ignoring the legality of accessing the material in the way that they did?

Man, I wish as a human I could use the defense of "but it was publicly accessible in a torrent" as a valid reason that I acquired and consumed some content.


Sorry, do you have some proof that they pirated content - proof that the scrupulous copyright holders themselves don't have? That's not the issue at hand and, AFAICT, nobody is accusing them of doing it.

Nobody is ignoring it - that would be a crime if they did - but since there's no evidence of it... you're getting upset at your own hypothetical.


I'm really confused.

I see articles all over saying that LLMs, for example, used books3, which the creator himself has admitted came from torrenting from Bibliotik, and which contains at least some books that are not otherwise freely available.

The issue that content creators have with LLMs is not getting proper attribution for their creations and source material that went into training these LLMs, and it seems pretty clear to me that some of the content that was used to train LLMs was not legally obtained and licensed to consume.


'we' who?



