The New York Times (NYT) lawsuit against OpenAI and Microsoft breaks new ground in the ongoing legal challenges posed by the use of copyrighted data to "train," or improve, generative AI.
A number of lawsuits have already been filed against AI companies, including one brought by Getty Images against StabilityAI, the developer of the online text-to-image generator Stable Diffusion. Authors George R.R. Martin and John Grisham are also suing ChatGPT's owner, OpenAI, over alleged copyright infringement. But the NYT case is far from “more of the same” as it throws some interesting new arguments into the mix.
This legal action focuses on new issues around the value of training data and reputational damage. It is a potent mix of trademark and copyright claims that may test the fair use defense defendants in these cases normally rely on.
No doubt it will be watched closely by media organizations looking to challenge the usual “ask for forgiveness, not permission” approach to training data. Training data is used to improve the performance of AI systems and typically consists of real-world information, often obtained from the internet.
The case also introduces a novel argument, not raised in other, similar cases, relating to so-called "hallucinations," where AI systems generate false or misleading information but present it as fact. This argument may in fact be one of the most potent in the case.
The NYT case in particular advances three interesting takes on the usual approach. First, that because of their reputation for trustworthy news and information, NYT articles have enhanced value and desirability as training data for AI.
Second, that because of the NYT's paywall, the reproduction of articles on request is commercially damaging. Third, that ChatGPT's hallucinations are causing reputational damage to the New York Times through false attribution.
This is more than just another generative AI copyright dispute. The first argument presented by the NYT is that ChatGPT's training phase infringes copyright, because the training data OpenAI used is protected by copyright. We have seen this type of argument play out before in other disputes.
Fair use?
The obstacle to this line of attack is the shield of fair use. In the United States, fair use is a legal doctrine that permits the use of copyrighted material in certain circumstances, such as news reporting, academic work, and commentary.
While OpenAI's response has been very cautious so far, a key tenet of the company's statement is that its use of online data does fall under the doctrine of “fair use.”
Anticipating some of the difficulties such a fair use defense could cause, the NYT takes a slightly different angle. In particular, it seeks to differentiate its data from standard data, leaning on the accuracy, trustworthiness, and prestige of its reporting. It claims these qualities create a particularly desirable dataset.
As a reputable and trusted source, it argues, its writing carries additional weight and reliability when training generative AI, and so forms part of a data subset that is given extra weight in that training.
The NYT argues that because ChatGPT duplicates large portions of its articles in response to prompts, it can deprive the paywalled NYT of visitors and the revenue it would otherwise receive. Introducing this element of commercial competition and commercial advantage appears intended to head off the usual fair use defense common to these claims.
It will be interesting to see whether the assertion of special weighting in the training data has an impact. If it does, it sets a path for other media organizations to challenge the use of their reporting in training data without permission.
The final element of the NYT's claim presents a novel angle on the issue. It suggests that the NYT brand is being damaged by the material ChatGPT produces. While presented almost as an afterthought in the complaint, it may yet be the claim that causes OpenAI the most difficulty.
This is the argument concerning AI "hallucinations." The NYT argues that ChatGPT compounds the problem by presenting the false information as if it came from the NYT.
The paper further suggests that consumers may act on summaries ChatGPT provides, believing the information comes from the NYT and can be trusted. The reputational damage arises because the newspaper has no control over what ChatGPT produces.
This is an interesting challenge. Hallucinations are a recognized problem with AI-generated responses, and the NYT argues that the reputational harm they cause may not be easy to remedy.
The NYT's claims open a series of novel lines of attack that move the focus from copyright itself to how ChatGPT presents copyrighted data to users, and to the value of that data to the newspaper. This is much harder for OpenAI to defend against.
The case will be closely watched by other media publishers, especially those behind paywalls, particularly with regard to how it interacts with the usual fair use defense.
If the NYT's dataset is recognized as having the "enhanced value" it claims, it may open a path to monetizing that data for AI training, rather than the "ask for forgiveness, not permission" approach that prevails today.
Peter Vaughan is a Senior Lecturer at Nottingham Law School, Nottingham Trent University.
This article is republished from The Conversation under a Creative Commons license. Read the original article.