> but surely you can't just... use stock photos without paying for the license?

They aren't hosting the infringing content. Training on the data is probably covered under fair use. Generations are of _learned_ representations of the dataset, not the dataset itself. This makes it closer to outputting original works (probably owned by the person who used the model).

The players involved here are known for being litigious, however. I wouldn't be surprised if OpenAI did in fact pay some hefty fee upfront to get full permission to use these images.



> Training on the data is probably covered under fair use. Generations are of _learned_ representations of the dataset, not the dataset itself. This makes it closer to outputting original works (probably owned by the person who used the model).

"Probably" is doing a lot of heavy lifting in that sentence.

As for "_learned_", that's pretty debatable considering it's reproducing recognizable trademark infringement.

> The players involved here are known for being litigious, however. I wouldn't be surprised if OpenAI did in fact pay some hefty fee upfront to get full permission to use these images.

I have no idea why anyone would assume the "move fast and break things" disruption mindset that pervades tech companies these days, especially in spaces like ML/"AI", would mean they considered the legality, ethics, or good business sense of their training dataset.

As with Copilot, I suspect the DALL-E terms of use puts the onus on the user to avoid using infringing items.


> "Probably" is doing a lot of heavy lifting in that sentence.

Indeed, that's why I used it. It wasn't long ago that DALL-E 2 outputs were owned by OpenAI (they recently changed the terms so that the user owns the output). There's definitely plenty of room for debate over who the owner should be.

> As for "_learned_", that's pretty debatable considering it's reproducing recognizable trademark infringement.

I guess. I meant this strictly in the machine learning sense, where "learned" is typically used to describe models trained via stochastic gradient descent.
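To make the ML sense of "learned" concrete, here's a toy sketch (my own illustration, not anyone's actual training code) of stochastic gradient descent fitting a single parameter. The point is that what gets stored is the learned parameter, not the training pairs themselves:

```python
# Toy SGD: fit f(x) = w * x to data generated by y = 3 * x.
data = [(1.0, 3.0), (2.0, 6.0), (4.0, 12.0)]

w = 0.0    # the single learnable parameter
lr = 0.05  # learning rate

for _ in range(200):
    for x, y in data:                 # one sample at a time: "stochastic"
        grad = 2 * (w * x - y) * x    # d/dw of the squared error (w*x - y)^2
        w -= lr * grad                # gradient descent step

# w converges toward 3.0; the model retains only this learned value,
# not a copy of the dataset.
```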

> I have no idea why anyone would assume the "move fast and break things" disruption mindset that pervades tech companies these days, especially in spaces like ML/"AI", would mean they considered the legality, ethics, or good business sense of their training dataset.

I agree mostly, except that companies like Alamy have their hooks in everywhere so they can seek rent. I just figured they might be cautious about this if e.g. Microsoft (OpenAI's business partner) had an existing agreement in place for Bing or something.


Unlike Copilot, DALL-E et al. don't produce verbatim copies of trained data.

Copying ideas and styles has always been a fundamental part of art history, so a rights holder might have a hard time successfully suing a user because the user's generated image looks similar to the rights holder's artwork.


"Verbatim" is an interesting term since I'm not certain it matters. In this case OP here demonstrated DALL-E generating a trademarked watermark on top of an image. I doubt the courts, looking at that, would believe that that's not close enough to their trademark to infringe.

The art world's copyright suits are all over the place in terms of what's sufficient to meet the threshold of "fair use" or "not a copy".

It's hard for me, as a layperson, to see works by Richard Prince[1] as substantially transformative (one work is clearly derived from the other). Even the courts couldn't agree: the case was initially decided in favor of the plaintiffs, but Prince won his appeal.

My approach to this kind of thing is simply this: Does this technology inherently open me up to lawsuits in undecided or highly unreliable legal territory? If yes, steer well clear of using it in any capacity.

[1]: https://www.artnews.com/art-in-america/features/richard-prin...


The case you refer to (Cariou v. Prince) is also a case where part of the artwork is reproduced verbatim :)


If they had been paying for the images upfront, wouldn't you expect them to train the model on the non-watermarked versions?


Good point! They certainly had no obligation to pay, either. Perhaps they just scraped it all.


The watermarked versions might be more widely available, with better metatags and descriptions around them.

The non-watermarked versions are likely internal only and have far less diverse descriptions.


If they paid for access, or permission, why train on the watermarked versions?

I’m guessing they assumed fair use, and that there will be lawsuits.


Is that representation of the watermark a trademark? If so, then copyright infringement might not matter, but use of the trademark may.


I would be very surprised if OpenAI paid anything for these, because it would set precedent that copyright infringement was applicable, which would be fatal down the road. (The only argument they could possibly mount in their defence would be that they wanted to train on the original images without watermarks.)


What if my dataset is just the one Getty image I don’t want to pay for?


What if I write a machine learning algorithm that only generates images it has seen in the training dataset, with one pixel slightly different?
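That hypothetical is easy to make concrete. A sketch (pure Python, all names are my own; not a real ML model) of a "generator" that memorizes its training set and reproduces a stored image with exactly one pixel nudged:

```python
import copy
import random

def generate(training_set, rng=random.Random(0)):
    """Return a memorized training image with exactly one pixel changed."""
    # Copy a memorized image (a grid of grayscale values 0-255) verbatim...
    img = copy.deepcopy(rng.choice(training_set))
    # ...then alter exactly one pixel by the smallest possible amount.
    y = rng.randrange(len(img))
    x = rng.randrange(len(img[0]))
    img[y][x] = (img[y][x] + 1) % 256
    return img

dataset = [[[10, 20], [30, 40]]]   # a "training set" of one 2x2 image
out = generate(dataset)
diff = sum(out[y][x] != dataset[0][y][x] for y in range(2) for x in range(2))
# diff is 1: the output differs from the memorized original in one pixel.
```

The question is exactly where on the spectrum between this degenerate memorizer and a model that generalizes the law would draw the line.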


It won't be transformative enough and you'd probably lose the case.

(IANAL)


What about two pixels?


Not enough... three... four... At some point there's a blurry gray area where a human judge would have to decide whether it is infringement, though of course not over a few pixels but at the whole-image level.



