Training Generative AI Models on Copyrighted Works Is Fair Use

Last Updated on January 23, 2024, 4:48 pm ET

Photo of ChatGPT on a mobile phone; image by Mojahid Mottakin on Unsplash

ARL and ALA are founding members of the Library Copyright Alliance (LCA). ARL, ALA, and LCA are not involved in the litigation discussed in this post.

Among the proliferating AI-related litigation, the New York Times filed a copyright infringement lawsuit against Microsoft and OpenAI. Along with other allegations, the New York Times claims that Microsoft and OpenAI are infringing copyright when they train their large language models (LLMs) on material copyrighted by the Times.

OpenAI has responded that “training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents.” In a blog post about the case, OpenAI cites the Library Copyright Alliance (LCA) position that “based on well-established precedent, the ingestion of copyrighted works to create large language models or other AI training databases generally is a fair use.” LCA explained this position in our submission to the US Copyright Office notice of inquiry on copyright and AI, and in the LCA Principles for Copyright and AI.

LCA is not involved in any of the AI lawsuits. But as champions of fair use, free speech, and freedom of information, libraries have a stake in maintaining the balance of copyright law so that it is not used to block or restrict access to information. We drafted the principles on AI and copyright in response to efforts to amend copyright law to require licensing schemes for generative AI that could stunt the development of this technology and undermine its utility to researchers, students, creators, and the public. The LCA principles hold that copyright law, as applied and interpreted by the Copyright Office and the courts, is flexible and robust enough to address issues of copyright and AI without amendment. The LCA principles also draw a careful and critical distinction between input used to train an LLM and output, which could potentially be infringing if it is substantially similar to an original expressive work.

On the question of whether ingesting copyrighted works to train LLMs is fair use, LCA points to the history of courts applying the US Copyright Act to machine analysis of text. For instance, in Authors Guild v. HathiTrust, and again in Authors Guild v. Google, the US Court of Appeals for the Second Circuit held that mass digitization of a large volume of in-copyright books in order to distill and reveal new information about the books was a fair use. While these cases did not concern generative AI, they did involve machine learning. The courts now hearing the pending challenges to ingestion for training generative AI models are perfectly capable of applying these precedents to the cases before them.

Why are scholars and librarians so invested in protecting the precedent that training AI LLMs on copyright-protected works is a transformative fair use? Rachael G. Samberg, Timothy Vollmer, and Samantha Teremi (of UC Berkeley Library) recently wrote that maintaining the treatment of training AI models as fair use is “essential to protecting research,” including non-generative, nonprofit educational research methodologies like text and data mining (TDM). If fair use rights were overridden and licenses restricted researchers to training AI on public domain works, scholars would be limited in the scope of inquiries that could be made using AI tools. Works in the public domain are not representative of the full scope of culture, and training AI on public domain works would omit studies of contemporary history, culture, and society from the scholarly record, as Authors Alliance and LCA described in a recent petition to the US Copyright Office. Hampering researchers’ ability to interrogate modern in-copyright materials through a licensing regime would make research less relevant and useful to the concerns of the day.

As the lawsuits illustrate, the availability of generative AI trained on datasets that include copyrighted material has raised questions about the intersection of copyright law and AI. But as discussed above, many of these questions have already been litigated. Nick Garcia, policy counsel at Public Knowledge, pointed out during a recent Chamber of Progress panel on AI, art, and copyright that concerns about web crawling to collect data—a practice that the Times takes issue with in its lawsuit—have been around for decades, and courts have found web crawling to be a fair use.

New York Times v. Microsoft et al. is, of course, just one legal battle through which the courts will interpret copyright law in the US, and it may be years before these cases are resolved. Copyright law as it applies to AI will also be informed by the US Copyright Office study, which will culminate in a report this year. LCA will monitor these lawsuits and pursue opportunities to advance the interests of scholars, educators, students, and the public through selected amicus briefs and by discussing these issues and library concerns with legislators and regulators.

The post Training Generative AI Models on Copyrighted Works Is Fair Use appeared first on Association of Research Libraries.