Copyright Liability On LLMs Should Mostly Fall On The Prompter, Not The Service
Techdirt. Stories filed under "fair use" 2024-01-05
The technological marvel of large language models (LLMs) like ChatGPT, developed by AI engineers and experts, has posed a unique challenge in the realm of copyright law. These advanced AI systems, which undergo extensive training on diverse datasets, including copyrighted material, and provide output highly dependent on user “prompts,” have raised questions about the bounds of fair use and the responsibilities of both the AI developers and users.
Building upon the Sony Doctrine, which protects dual-use technologies with substantial non-infringing uses, I propose the TAO (“Training And Output”) Doctrine for AI LLMs like ChatGPT, Claude, and Bard. This AI Doctrine recognizes that if a good-faith AI LLM engine is trained using copyrighted works, where (1) the original work is not replicated but rather used to develop an understanding, and (2) the outputs generated are based on user prompts, the responsibility for any potential copyright infringement should lie with the user, not the AI system. This approach acknowledges the “dual-use nature” of AI technologies and emphasizes the crucial role of user intent and inputs such as prompts and URLs in determining the nature of the output and any downstream usage.
Understanding LLMs and Their Training Mechanism
LLMs operate by analyzing and synthesizing vast amounts of text data. Their ability to generate responses, write creatively, and even develop code stems from this training. However, unlike traditional methods of copying, LLMs like ChatGPT engage in a complex process of learning and generating new content based on patterns and structures learned from their training data. This process is akin to a person learning a language through various sources but then using that language independently to create new sentences. AI LLMs are important for the advancement of society as they are “idea engines” that allow for the efficient processing and sharing of ideas.
Copyright law does not protect facts, ideas, procedures, processes, systems, methods of operation, concepts, principles, or discoveries, even if they are expressed in copyrighted works. This principle implies that the syntactical, structural, and linguistic elements extracted during the training of LLMs fall outside the scope of copyright protection. The use of texts to train LLMs primarily involves analyzing these non-copyrightable elements to understand and statistically model language patterns.
The training of LLMs aligns with the principles of fair use as it involves an historically important transformative process that extends beyond the mere replication of copyrighted texts. It harnesses the non-copyrightable elements of language to create something new and valuable, without diminishing the market value of the original works. The LLM technology has brought society into the age of idea processors. Under the totality of the circumstances, the use of texts to train LLMs can be considered fair use under current copyright law.
The Proposed Sony Doctrine for AI LLMs or the “Training and Output” (“TAO”) Doctrine
The training of AI large language models (LLMs) on copyrighted works, and their subsequent outputs from user prompts, presents a compelling case for being recognized as a form of dual-use technology. This recognition could be encapsulated in what might be termed the “AI Training and Output” (“TAO”) Doctrine, which would protect developers from copyright infringement liability. Drawing parallels from the Sony Doctrine, which protected the manufacture of dual-use technologies like the VCR under the premise that they are capable of substantial non-infringing uses, the AI TAO Doctrine could safeguard AI development and hold back a flood of litigation.
LLMs, like the VCR, have a dual-use nature. They are capable of transient and modest infringing activities when prompted or used inappropriately by users, but more significantly, they possess a vast potential for beneficial, non-infringing uses such as educational enrichment, idea enhancements, and advances in language processing. The essence of the AI TAO Doctrine would center on this dual-use characteristic, emphasizing the substantial, legitimate applications of AI that far outweigh potential abuses.
Protecting developers of LLM training and automated output under such a doctrine aligns with fostering innovation and technological advancement while recognizing the need for responsible use. The AI TAO Doctrine would not relieve good-faith AI developers of the need to implement robust safeguards against copyright infringement, but it would acknowledge the inherent dual-use nature of AI technologies, thereby promoting a balanced approach to copyright considerations in AI development.
User Responsibility and AI Outputs
Users play a pivotal role in how LLMs are utilized. Now, consider the user “prompt,” the user’s recipe instruction. An LLM presented with a prompt regarding a copyrighted article can cook up fair or foul outputs. A thoughtful “summarize and critique” prompt extracting key points and offering analysis falls squarely under fair use. It’s like taking notes and forming opinions after reading a book – a crucial aspect of learning and criticism. The LLM used this way is an excellent “idea processor.”
However, a prompt demanding the LLM “regurgitate the entire article word-for-word” may cross the line. Did the team of “users” involved in the NY Times v. OpenAI complaint go too far and game the technology or act with unclean hands when they formulated aggressive prompts? Should the law tolerate such transient and rare gaming in favor of the greater good that LLMs have to offer? It’s likely that as that case moves forward, we’ll learn a lot about the process through which the eye-catching results were generated. AI LLMs will likely continue to be tuned against such copyright “gaming” incidents.
But there are still user contexts where such a verbatim output and subsequent usage can constitute fair use. Some degree of copyright “gaming” may need to be tolerated in order for fair use and free speech to flourish. For an ironic example, an LLM-assisted legal analysis of the NY Times v. OpenAI complaint, by its nature, should include the fair use of all parts of the complaint, including the NY Times’ “copyrighted” articles embedded in that court document. Wholesale “blocking” of such article text in LLM output is likely overbroad, especially for legal scholars.
Copyright infringement involving LLM output is also highly dependent on how the user uses the content off-site or downstream. Users like students, news organizations, and lawyers using the same outputs may each have a different copyright infringement analysis for their off-site use cases. One user’s fair use is another user’s infringement. If a user prompts an LLM to generate content that is used in a manner that infringes on copyright, the responsibility should lie with the user, not the LLM.
Here’s the crux: the LLM itself can’t predict the user’s intentions. It simply processes patterns based on prompts. The LLM learning machine and idea processor shouldn’t be stifled due to potential user misuse. Instead, in the rare circumstances when there is a legitimate copyright infringement, users ought to be held accountable for their prompts and subsequent usage, and the developers of AI LLM “dual use technology” ought to be given the non-infringing status of the VCR manufacturer under the Sony Doctrine.
This approach would benefit from developing privacy-oriented AI systems that can flag potentially infringing uses and guide users towards responsible usage, thereby fostering a culture of awareness and accountability in the digital domain. Care must be taken not to intrude on user privacy by analyzing and storing private prompts and outputs, which can reveal the most sensitive information about a person, from health care concerns to trade secrets.
Ironically, AI LLMs retaining the original copyrighted works, under the fair use doctrine, to hash portions and to bolster fingerprint technologies, can help create better copyright infringement filtering, feedback, and alert systems. How to balance copyright risk with innovation requires a holistic approach by all the stakeholders. The stakes are high. Too much friction and other countries with a “technology first” set of policies will take the lead in global AI.
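To make the hashing and fingerprinting idea concrete, here is a minimal Python sketch of one way such a filter might work. It is an illustration only, not a description of how any actual LLM service operates: the function names, the 8-word window, and the 0.5 threshold are all hypothetical choices made for this example.

```python
import hashlib
import re


def shingle_hashes(text: str, k: int = 8) -> set[str]:
    """Hash every run of k consecutive words (a "shingle") in the text."""
    # Lowercase and keep only simple word tokens so that spacing and
    # punctuation differences do not hide a verbatim match.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        hashlib.sha256(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(len(words) - k + 1)
    }


def verbatim_overlap(output: str, protected: str, k: int = 8) -> float:
    """Fraction of the output's k-word shingles that also occur in the protected work."""
    out = shingle_hashes(output, k)
    if not out:
        # Output shorter than k words: nothing to compare.
        return 0.0
    return len(out & shingle_hashes(protected, k)) / len(out)


# Hypothetical threshold: alert the user rather than block the output outright.
ALERT_THRESHOLD = 0.5


def should_alert(output: str, protected: str) -> bool:
    return verbatim_overlap(output, protected) >= ALERT_THRESHOLD
```

A score near 1.0 suggests the output is largely a word-for-word copy, while a summary or critique written in different words should score near 0.0. A score alone says nothing about fair use, which still depends on who the user is and how the output is used downstream, so a sketch like this is better suited to feedback and alerts than to wholesale blocking.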
The training of LLMs on copyrighted material, under the umbrella of fair use and subsequent outputs in response to user prompts within the proposed AI TAO Doctrine, presents a balanced approach to fostering innovation while respecting copyright laws. This perspective emphasizes the transformative nature of AI training, the importance of user intent in the generation of outputs, and the need for technological tools to assist in responsible usage. Such an approach not only supports the advancement of AI technology but also upholds the principles of intellectual property rights, ensuring a harmonious coexistence of technological innovation and copyright law.
Ira P. Rothken is a leading advisor and legal counsel to companies in the social network, entertainment, internet, cloud services, and videogame industries.