New artificial intelligence tools that write human-like prose and create stunning images have taken the world by storm. But these awe-inspiring technologies are not creating something out of nothing; they’re trained on lots and lots of data, some of which come from works under copyright protection.
Now, the writers, artists and others who own the rights to the material used to teach ChatGPT and other generative AI tools want to stop what they see as blatant copyright infringement on a massive scale.
With billions of dollars at stake, U.S. courts will most likely have to sort out who owns what, using the 1976 Copyright Act, the same law that has determined who owns much of the content published on the internet.
U.S. copyright law seeks to strike a balance between protecting the rights of content creators and fostering creativity and innovation. Among other things, the law gives content creators the exclusive right to reproduce their original work and to prepare derivative works.
But it also provides for an exception. Known as “fair use,” it permits the use of copyrighted material without the copyright holder’s permission for purposes such as criticism, comment, news reporting, teaching and research.
On the one hand, “we want to allow people who have currently invested time, money, creativity to reap the rewards of what they have done,” said Sean O’Connor, a professor of law at George Mason University. “On the other hand, we don’t want to give them such strong rights that we inhibit the next generation of innovation.”
Is AI ‘scraping’ fair use?
The development of generative AI tools is testing the limits of “fair use,” pitting content creators against technology companies, with the outcome of the dispute promising wide-ranging implications for innovation and society at large.
In the 10 months since ChatGPT’s groundbreaking launch, AI companies have faced a rapidly increasing number of lawsuits over content used to train generative AI tools. The plaintiffs are seeking damages and want the courts to end the alleged infringement.
In January, three visual artists filed a proposed class-action lawsuit against Stability AI Ltd. and two others in San Francisco, alleging that Stability “scraped” more than 5 billion images from the internet to train its popular image generator Stable Diffusion, without the consent of copyright holders.
Stable Diffusion is a “21st-century collage tool” that “remixes the copyrighted works of millions of artists whose work was used as training data,” according to the lawsuit.
In February, stock photo company Getty Images filed its own lawsuit against Stability AI in both the United States and Britain, saying the company copied more than 12 million photos from Getty’s collection without permission or compensation.
In June, two U.S.-based authors sued OpenAI, the creator of ChatGPT, claiming the company’s training data included nearly 300,000 books pulled from illegal “shadow library” websites that offer copyrighted books.
“A large language model’s output is entirely and uniquely reliant on the material in its training dataset,” the lawsuit says.
Last month, American comedian and author Sarah Silverman and two other writers sued OpenAI and Meta, the parent company of Facebook, over the same claims, saying their chatbots were trained on books that had been illegally acquired.
The lawsuit against OpenAI includes what it describes as “very accurate summaries” of the authors’ books generated by ChatGPT, suggesting the company illegally “copied” and then used them to train the chatbot.
The artificial intelligence companies have rejected the allegations and asked the courts to dismiss the lawsuits.
In a court filing in April, Stability AI, research lab Midjourney and online art gallery DeviantArt wrote that the visual artists suing them “fail to identify a single allegedly infringing output image, let alone one that is substantially similar to any of their copyrighted works.”
For its part, OpenAI has defended its use of copyrighted material as “fair use,” saying it pulled the works from publicly available datasets on the internet.
The cases are slowly making their way through the courts. It is too early to say how judges will decide.
Last month, a federal judge in San Francisco said he was inclined to toss out most of a lawsuit brought by the three artists against Stability AI but indicated that the claim of direct infringement may continue.
“The big question is fair use,” said Robert Brauneis, a law professor and co-director of the Intellectual Property Program at George Washington University. “I would not be surprised if some of the courts came out in different ways, that some of the cases said, ‘Yes, fair use.’ And others said, ‘No.’”
If the courts are split, the question could eventually go to the Supreme Court, Brauneis said.
Assessing copyright claims
Training generative AI tools to create new works raises two legal questions: Is the use of the data authorized? And are the new works the tools create “derivative” or “transformative”?
The answer is not clear-cut, O’Connor said.
“On the one hand, what the supporters of the generative AI models are saying is that they are acting not much differently than we as humans would do,” he said. “When we read books, watch movies, listen to music, and if we are talented, then we use those to train ourselves as models.
“The counterargument is that … it is categorically different from what humans do when they learn how to become creative themselves.”
While artificial intelligence companies claim their use of the data is fair, O’Connor said they still have to prove that the use was authorized.
“I think that’s a very close call, and I think they may lose on that,” he said.
On the other hand, the AI models can probably avoid liability for generating content that “seems sort of the style of a current author” but is not the same.
“That claim is probably not going to succeed,” O’Connor said. “It will be seen as just a different work.”
But Brauneis said content creators have a strong claim: The AI-generated output will likely compete with the original work.
Imagine you’re a magazine editor who wants an illustration to accompany an article about a particular bird, Brauneis suggested. You could do one of two things: Commission an artist or ask a generative AI tool like Stable Diffusion to create it for you. After a few attempts with the latter, you’ll probably get an image that you can use.
“One of the most important questions to ask about in fair use is, ‘Is this use a substitute, or is it competing with the work of art that is being copied?’” Brauneis said. “And the answer here may be yes. And if it is [competing], that really weighs strongly against fair use.”
This is not the first time that technology companies have been sued over their use of copyrighted material.
In 2005, the Authors Guild filed a class-action lawsuit against Google and three university libraries over Google’s digital books project, alleging “massive copyright infringement.”
In 2015, an appeals court ruled that the project, by then renamed Google Books, was protected under the fair use doctrine.
In 2007, Viacom sued both Google and YouTube for allowing users to upload and view copyrighted material owned by Viacom, including complete episodes of TV shows. The case was later settled out of court.
For Brauneis, the current “Wild West era of creating AI models” recalls YouTube’s freewheeling early days.
“They just wanted to get viewers, and they were willing to take a legal risk to do that,” Brauneis said. “That’s not the way YouTube operates now. YouTube has all sorts of precautions to identify copyrighted content that has not been permitted to be placed on YouTube and then to take it down.”
Artificial intelligence companies may make a similar pivot.
They may have justified using copyrighted material to test out their technology. But now that their models are working, they “may be willing to sit down and think about how to license content,” Brauneis said.