> but surely you can't just... use stock photos without paying for the license?

They aren't hosting the infringing content. Training on the data is probably covered under fair use. Generations are of _learned_ representations of the dataset, not the dataset itself. This makes it closer to outputting original works (probably owned by the person who used the model).

The players involved here are known for being litigious, however. I wouldn't be surprised if OpenAI did in fact pay some hefty fee upfront to get full permission to use these images.



> Training on the data is probably covered under fair use. Generations are of _learned_ representations of the dataset, not the dataset itself. This makes it closer to outputting original works (probably owned by the person who used the model).

"Probably" is doing a lot of heavy lifting in that sentence.

As for "_learned_", that's pretty debatable considering it's reproducing recognizable trademark infringement.

> The players involved here are known for being litigious, however. I wouldn't be surprised if OpenAI did in fact pay some hefty fee upfront to get full permission to use these images.

I have no idea why anyone would assume the "move fast and break things" disruption mindset that pervades tech companies these days, especially in spaces like ML/"AI", would mean they considered the legality, ethics, or good business sense of their training dataset.

As with Copilot, I suspect the DALL-E terms of use puts the onus on the user to avoid using infringing items.


> "Probably" is doing a lot of heavy lifting in that sentence.

Indeed, that's why I used it. It wasn't long ago that DALL-E 2 outputs were owned by OpenAI (they recently changed the terms so that the user owns the output). There's definitely plenty of room for debate over who the owner should be.

> As for "_learned_", that's pretty debatable considering it's reproducing recognizable trademark infringement.

I guess. I meant this strictly in the machine learning sense, where "learned" is typically used to describe models trained via stochastic gradient descent.
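To make the ML sense of "learned" concrete, here's a toy sketch (my own illustration, not anyone's actual training code) of stochastic gradient descent fitting a single parameter. The point is that what gets stored is the learned parameter, not the training pairs themselves:

```python
# Toy SGD: fit f(x) = w * x to data generated by y = 3 * x.
data = [(1.0, 3.0), (2.0, 6.0), (4.0, 12.0)]

w = 0.0    # the single learnable parameter
lr = 0.05  # learning rate

for _ in range(200):
    for x, y in data:                 # one sample at a time: "stochastic"
        grad = 2 * (w * x - y) * x    # d/dw of the squared error (w*x - y)^2
        w -= lr * grad                # gradient descent step

# w converges toward 3.0; the model retains only this learned value,
# not a copy of the dataset.
```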

> I have no idea why anyone would assume the "move fast and break things" disruption mindset that pervades tech companies these days, especially in spaces like ML/"AI", would mean they considered the legality, ethics, or good business sense of their training dataset.

I agree mostly, except that companies like Alamy have their hooks in everywhere so they can seek rent. I just figured they might be cautious about this if e.g. Microsoft (OpenAI's business partner) had an existing agreement in place for Bing or something.


Unlike Copilot, DALL-E et al. don't produce verbatim copies of trained data.

Copying ideas and styles has always been a fundamental part of art history, so a rights holder might have a hard time successfully suing a user because the user's generated image looks similar to the rights holder's artwork.


"Verbatim" is an interesting term since I'm not certain it matters. In this case OP here demonstrated DALL-E generating a trademarked watermark on top of an image. I doubt the courts, looking at that, would believe that that's not close enough to their trademark to infringe.

The art world's copyright suits are all over the place in terms of what's sufficient to meet the threshold of "fair use" or "not a copy".

It's hard for me, as a layperson, to see works by Richard Prince[1] as substantially transformative (one work is clearly derived from the other). Even the courts couldn't agree: the case was initially decided in favor of the plaintiffs, but Prince won his appeal.

My approach to this kind of thing is simply this: Does this technology inherently open me up to lawsuits in undecided or highly unreliable legal territory? If yes, steer well clear of using it in any capacity.

[1]: https://www.artnews.com/art-in-america/features/richard-prin...


The case you refer to (Cariou v. Prince) is also a case where part of the artwork is reproduced verbatim :)


If they had been paying for the images upfront, wouldn't you expect them to train the model on the non-watermarked versions?


Good point! They certainly had no obligation to pay, either. Perhaps they just scraped it all.


The watermarked versions might be more widely available, with better metatags and descriptions around them.

The non-watermarked versions are likely internal only and have far less diverse descriptions.


If they paid for access, or permission, why train on the watermarked versions?

I’m guessing they assumed fair use, and that there will be lawsuits.


Is that representation of the watermark a trademark? If so, then copyright infringement might not matter, but use of the trademark may.


I would be very surprised if OpenAI paid anything for these, because it would set precedent that copyright infringement was applicable, which would be fatal down the road. (The only argument they could possibly mount in their defence would be that they wanted to train on the original images without watermarks.)


What if my dataset is just the one Getty image I don’t want to pay for?


What if I write a machine learning algorithm that only generates images it has seen in the training dataset, with one pixel slightly different?
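That hypothetical is easy to make concrete. A sketch (pure Python, all names are my own; not a real ML model) of a "generator" that memorizes its training set and reproduces a stored image with exactly one pixel nudged:

```python
import copy
import random

def generate(training_set, rng=random.Random(0)):
    """Return a memorized training image with exactly one pixel changed."""
    # Copy a memorized image (a grid of grayscale values 0-255) verbatim...
    img = copy.deepcopy(rng.choice(training_set))
    # ...then alter exactly one pixel by the smallest possible amount.
    y = rng.randrange(len(img))
    x = rng.randrange(len(img[0]))
    img[y][x] = (img[y][x] + 1) % 256
    return img

dataset = [[[10, 20], [30, 40]]]   # a "training set" of one 2x2 image
out = generate(dataset)
diff = sum(out[y][x] != dataset[0][y][x] for y in range(2) for x in range(2))
# diff is 1: the output differs from the memorized original in one pixel.
```

The question is exactly where on the spectrum between this degenerate memorizer and a model that generalizes the law would draw the line.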


It won't be transformative enough and you'd probably lose the case.

(IANAL)


What about two pixels?


Not enough... three... four... At some point there's a blurry gray area where a human judge would have to decide whether it is infringement, though of course not over a few pixels but at the whole-image level.



