GenLaw 2024
The Laboratorium 2024-07-27
I’m virtually attending the GenLaw 2024 workshop today, and I will be liveblogging the presentations.
Introduction
A. Feder Cooper and Katherine Lee: Welcome!
The generative AI supply chain includes many stages, actors, and choices. But wherever there are choices, there are research questions: how do ML developers make those choices? And wherever there are choices, there are policy questions: what are the consequences of those choices for law and policy?
GenLaw is not an archival venue, but if you are interested in publishing work in this space, consider the ACM CS&Law conference, happening next in March 2025 in Munich.
Kyle Lo
Kyle Lo, Demystifying Data Curation for Language Models.
I think of data in three stages:
- Shopping for data, or acquiring it.
- Cooking your data, or transforming it.
- Tasting your data, or testing it.
Someone once told me, “Infinite tokens, you could just train on the whole Internet.” Scale is important. What’s the best way to get a lot of data? Our #1 choice is public APIs that provide bulk data. 80 to 100% of the data comes from web crawls (CommonCrawl, Internet Archive, etc.). These are nonprofits that have been operating since long before generative AI was a thing. A small percentage (about 1%) is user-created content like Wikipedia or ArXiv. And about 5% or less comes from open publishers, like PubMed. Datasets also heavily remix existing datasets.
Nobody crawls the data themselves unless they’re really big and have a lot of good programmers. You can either do deep domain-specific crawls, or a broad and wide crawl. A lot of websites require you to follow links and click buttons to get at the content. Writing the code to coax out this content—hidden behind JS—requires a lot of site-specific code. For each website, one has to ask whether going through this is worth the trouble.
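To make the “site-specific code” point concrete, here is a rough sketch of the kind of per-site scraper being described (my illustration, not Lo’s). It assumes the Playwright browser-automation library; the URL and CSS selectors are hypothetical placeholders.

```python
# Sketch of a site-specific scraper for content hidden behind JS.
# The URL and selectors are hypothetical; every real site needs its own code.
from playwright.sync_api import sync_playwright

def scrape_article(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.click("button.show-more")               # hypothetical "expand" button
        text = page.inner_text("div.article-body")   # hypothetical content div
        browser.close()
        return text

print(scrape_article("https://example.com/some-article"))
```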
It’s also getting harder to crawl. A lot more sites have robots.txt that ask not to be crawled or have terms of service restricting crawling. This makes CommonCrawl’s job harder. Especially if you’re polite, you spend a lot more energy working through a decreasing pile of sources. More data is now available only to those who pay for it. We’re not running out of training data, we’re running out of open training data, which raises serious issues of equitable access.
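The “polite” part is checkable in code. A minimal sketch using Python’s standard-library robotparser (the URLs and user agent are placeholders):

```python
# Sketch: a polite crawler consults robots.txt before fetching anything.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some/page.html"
if rp.can_fetch("MyResearchBot", url):   # hypothetical user agent
    print("allowed to fetch", url)
else:
    # A growing share of sites now disallow crawlers here.
    print("robots.txt asks us not to crawl", url)
```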
Moving on to transformation, the first step is to filter out low-quality pages (e.g., site navigation or r/microwavegang). You also typically need to remove sensitive data like passwords, NSFW content, and duplicates.
Next is linearization: remove header text, navigational links on pages, etc., and convert to a stream of tokens. Poor linearization can be irrecoverable. It can break up sentences and render source content incoherent.
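A minimal sketch of linearization (my illustration, not from the talk), assuming the third-party beautifulsoup4 package; the list of boilerplate tags is a heuristic:

```python
# Sketch of linearization: drop boilerplate tags, emit one clean text stream.
from bs4 import BeautifulSoup

def linearize(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove navigation, headers, footers, and scripts before extracting text.
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()
    # Joining with a separator keeps sentences from being fused together;
    # fusing them is exactly the kind of irrecoverable damage described above.
    return " ".join(soup.get_text(separator=" ").split())
```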
Then there is filtering: cleaning up the data. Every data source needs its own pipeline! For example, for code, you might want to include Python but not Fortran. Training on user-uploaded CSVs in a code repository is usually not helpful.
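A sketch of what such a source-specific filter for a code corpus might look like (the extensions and file names are illustrative):

```python
# Sketch: keep Python source files, drop Fortran and user-uploaded CSVs.
from pathlib import Path

KEEP = {".py"}
DROP = {".f", ".f90", ".csv"}

def keep_file(path: Path) -> bool:
    ext = path.suffix.lower()
    return ext not in DROP and ext in KEEP

files = [Path("model.py"), Path("solver.f90"), Path("data/upload.csv")]
print([str(f) for f in files if keep_file(f)])  # ['model.py']
```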
Using small-model classifiers to do filtering has side effects. There are a lot of terms of service out there; if you do deduplication, you may wind up throwing out a lot of them. Removing PII with low-precision classifiers can have legal consequences. Or sometimes we see data that mixes scientific text in English with pornography in Chinese; a poor classifier will misunderstand it.
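A minimal sketch of why deduplication has this side effect (my illustration; standard library only): exact dedup by hashing normalized text collapses near-identical boilerplate, such as terms of service, down to a single copy.

```python
# Sketch of exact deduplication by hashing normalized text.
import hashlib

def dedupe(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    kept = []
    for doc in docs:
        # Normalize whitespace and case so trivial variants hash together.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

# Two of these are the "same" terms-of-service text; only one survives.
print(len(dedupe(["Terms of Service", "terms  of service", "a unique page"])))  # 2
```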
My last point: people have pushed for a safe harbor for AI research. We need something similar for open-data research. In doing open research, am I taking on too much risk?
Gabriele Mazzini
Gabriele Mazzini, Introduction to the AI Act and Generative AI.
The AI Act is a first-of-its-kind law in the world. In the EU, the Commission proposes legislation and also implements it. The draft is sent to the Council, which represents the governments of the member states, and to the Parliament, which is directly elected. The Council and Parliament have to agree to enact legislation. Implementation is carried out by the member states. The Commission can provide some executive action and some guidance.
The AI Act required some complex choices: it should be horizontal, applying to all of AI, rather than being sector-specific. But different fields do have different legal regimes (e.g. financial regulation).
The most important concept in the AI Act is its risk-based approach. The greater the risk, the stricter the rules—but there is no regulation of AI as such. It focuses on use cases, with stricter rules for riskier uses.
- From the EU’s point of view, a few uses, such as social scoring, pose unacceptable risk and are prohibited.
- The high-risk category covers about 90% of the rules in the AI Act. This includes AI systems that are safety components of physical products (e.g. robotics). It also includes some specifically listed uses, such as recruitment in employment. These AI systems are subject to compliance with specific requirements ex ante.
- The transparency risk category requires disclosures (e.g. that you are interacting with an AI chatbot and not a human). This is where generative AI mostly comes in: that you know that content was created by AI.
- Everything else is minimal or no risk and is not regulated.
Most generative AI systems are in the transparency category (e.g. disclosure of training data). But some systems, e.g. those trained over a certain compute threshold, are subject to stricter rules.
Martin Senftleben
Martin Senftleben, Copyright and GenAI Development – Regulatory Approaches and Challenges in the EU and Beyond
AI forces us to confront the dethroning of the human author. Copyright has long been based on the unique creativity of human authors, but now generative AI generates outputs that appear as though they were human-created.
In copyright, we give one person a monopoly right to decide what can be done with a work, but that makes follow-on innovation difficult. That was difficult enough in the past, when the follow-on innovation came from other authors (parody, pastiche, etc.). Here, the follow-on innovation comes from the machine. Copyright policy right now is an attempt to reconcile fair remuneration for human authors with a successful AI sector.
The copyright answer would be licensing: on the input side, pay for each and every piece of data that goes into the data set, and on the output side, pay for outputs. If you do this, you get problems for the AI sector. You get very limited access to data, with a few large players paying publishers for data and others getting nothing. This produces bias, in the sense that the data only reflects mainstream inputs (English, but not Dutch or Slovak).
If you try to favor a vibrant AI sector, you don’t require licensing for training and you make all the outputs legal (e.g. fair use). This increases access and you have less bias on the output, but you have no remuneration for authors.
From a legal-comparative perspective, it’s fascinating to see how different legislators approach these questions. Japan and Southeast Asian countries have tried to support AI developers, e.g. broad text and data mining (TDM) exemptions as applied to AI training. In the U.S., the discussion is about fair use and there are about 25 lawsuits. Fair use opens up the copyright system immediately because users can push back.
In the E.U., forget about fair use. We have the 2019 Directive on Copyright in the Digital Single Market, which was written without generative AI in mind. The focus was on scientific TDM. That exception covers only scientific research, not commercial or even non-profit activity, though a research organization can work with a private partner. There is also a broader TDM exemption that permits TDM unless the copyright owner has opted out by “machine-readable means” (e.g., in robots.txt).
The AI Act makes things more complex; it has copyright-related components. It confirms that reproductions for TDM are still within the scope of copyright and require an exemption. It confirms that opt-outs must be observed. What about training in other countries? If you later want to offer your trained models in the EU, you must have evidence that you trained in accordance with EU policy. This is an intended Brussels effect.
The AI Act also has transparency obligations: specifically a “sufficiently detailed summary of the content used for training.” Good luck with that one! Even knowing what’s in the datasets you’re using is a challenge. There will be an AI Office, which will set up a template. Also, is there a risk that AI trained in the EU will simply be less clever than AI trained elsewhere? That it will marginalize the EU cultural heritage?
That’s where we stand in the E.U. Codes of practice will start in May 2025 and become enforceable against AI providers in August 2025. If you seek licenses now, make sure they cover the training you have done in the past.
Panel: Data Curation and IP
Panelists: Julia Powles, Kyle Lo, Martin Senftleben, A. Feder Cooper (moderator)
Cooper: Julia, tell us about the view from Australia.
Julia: Outside the U.S., copyright law also includes moral rights, especially attribution and integrity. Three things: (1) Artists are feeling disempowered. (2) Lawyers have gotten preoccupied with where (geographically) acts are taking place. (3) Governments are in a giant game of chicken over who will insist that AI providers comply. Everyone is waiting for artists to mount challenges that they don’t have the resources to mount. Most people who are savvy about IP hate copyright. We don’t show the concern for students and others impacted by copyright that we show for the AI industry. Australia is being very timid, as are most countries.
Cooper: Martin, can you fill us in on moral rights?
Martin: Copyright is not just about the money. It’s about the personal touch of what we create as human beings. Moral rights:
- To decide whether a work will be made available to the public at all.
- Attribution, to have your name associated with the work.
- Integrity, to decide on modifications to the work.
- Integrity, to object to the use of the work in unwanted contexts (such as pornography).
The impact on AI training is very unclear, and it’s not clear what will happen in the courts. Perhaps moral rights will let authors avoid machine training entirely. Or perhaps they will apply at the output level. It is also not clear whether these rights will fly, given the idea/expression dichotomy.
Cooper: Kyle, can you talk about copyright considerations in data curation?
Kyle: I’m worried about two things: (1) it’s important to develop techniques for fine-tuning, but (2) will my company let me work on projects where we hand off control to others? Without some sort of protection for developing unlearning, we won’t have research on these techniques.
Cooper: Follow-up: you went right to memorization. Are we caring too much about memorization?
Kyle: There’s a simplistic view that I want to get away from: that it’s only regurgitation that matters. There are other harmful behaviors, such as a perfect style imitator for an author. It’s hard to form an opinion about good legislation without knowledge of what the state of the technology is, and what’s possible or not.
Julia: It feels like the wave of large models we’ve had in the last few years have really consumed our thinking about the future of AI. Especially the idea that we “need” scale and access to all copyrighted works. Before ChatGPT, the idea was that these models were too legally dangerous to release. We have impeded the release of bioscience because we have gone through the work of deciding what we want to allow. In many cases, having the large general model is not the best solution to a problem. In many cases, the promise remains unrealized.
Martin: Memorization and the learning of concepts is one of the most fascinating and difficult problems. From a copyright perspective, getting knowledge about the black box is interesting and important. Cf. Matthew Sag’s “Snoopy problem.” CC licenses often come with a share-alike restriction. If it can be demonstrated that there are traces of this material in fully trained models, those models would need to be shared under those terms.
Kyle: Do we need scale? I go back and forth on this all the time. On the one hand, I detest the idea of a general-purpose model. It’s all domain effects. That’s ML 101. On the other hand, these models are really impressive. The science-specific models are worse than GPT-4 for their own use cases. I don’t know why these giant proprietary models are so good. The more my methods deviate from common practice, the less applicable my findings are. We have to hyperscale to be relevant, but I also hate it.
Cooper: How should we evaluate models?
Kyle: When I work on general-purpose models, I try to reproduce what closed models are doing. I set up evaluations to try to replicate how they think. But I haven’t even reached the point of being able to reproduce their results. Everyone’s hardware is different, and training runs can go wrong in lots of ways.
When I work on smaller and more specific models, not very much has changed. The story has been to focus on the target domain, and that’s still the case. It’s careful scientific work. Maybe the only wrench is that general-purpose models can be prompted for outputs different from the ones they were created to focus on.
Cooper: Let’s talk about guardrails.
Martin: Right now, the copyright discussion focuses on the AI training stage. In terms of costs, this means that AI training is burdened with copyright issues, which makes training more expensive. Perhaps we should diversify legal tools by moving from input to output. Let the trainers do what they want, and we’ll put requirements on outputs and require them to create appropriate filters.
Julia: I find the argument that it’ll be too costly to respect copyright to be bunk. There are 100 countries that have to negotiate with major publishers for access to copyrighted works. There are lots of humans that we don’t make these arguments for. We should give these permissions to humans before machines. It seems obvious that we’d have impressive results at hyperscale. For 25 years, IP has debated traditional cultural knowledge. There, we have belatedly recognized the origin of this knowledge. The same goes for AI: it’s about acknowledging the source of the knowledge they are trained on.
Turning to supply chains: in addition to the copying right, there are the authorizing, importing, and communicating rights, plus moral rights. An interesting avenue for regulation is to ask where the sweatshops of people doing content moderation and data labeling are located.
Cooper: Training is resource-intensive, but so is inference.
Question: Why are we treating AI differently than biotechnology?
Julia: We have a strong physical bias. Dolly the sheep had an impact that 3D avatars didn’t. Also, it’s different power players.
Martin: Pam Samuelson has a good paper on historical antecedents for new copying technologies. Although I think that generative AI dethrones human authors and that is something new.
Kyle: AI is a proxy for other things; it doesn’t feel genuine until it’s applied.
Question: There have been a lot of talks about the power of training on synthetic data. Is copyright the right mechanism for training on synthetic data?
Kyle: It is hard to govern these approaches on the output side; you would really have to deal with it on the input side.
Martin: I hate to say this as a lawyer, but … it depends.
Question: We live in a fragmented import/export market. (E.g., the data security executive order.)
Martin: There have been predictions that territoriality will die, but so far it has persisted.
Connor Dunlop
Connor Dunlop, GPAI Governance and Oversight in the EU – And How You Might be Able to Contribute
Three topics:
- Role of civil society
- My work and how we fit in
- How you can contribute
AI operates within a complex system of social and economic structures. The ecosystem includes industry and more. AI-and-society work includes government actors, and NGOs exist to support those actors. There are many types of expertise involved here. Ada Lovelace is an organization that thinks about how AI and data impact people in society. We aim to provide research expertise, promote AI literacy, and build technical tools like audits and evaluations. A possible gap in the ecosystem is strategic-litigation expertise.
At Ada Lovelace, we try to identify key topics early on and ground them in research. We do a lot of polling and engagement on public perspectives. And we recognize nuance and try to make sure that people know what the known unknowns are and where people disagree.
On AI governance, we have been asking about different accountability mechanisms. What mechanisms are available, how are they employed in the real world, do they work, and can they be reflected in standards, law, or policy?
Sabrina Küspert
Sabrina Küspert, Implementing the AI Act
The AI Act follows a risk-based approach. (Review of the risk-based-approach pyramid.) It adopts harmonized rules across all 27 member states. The idea is that if you create trust, you also create excellence. If a provider complies, it gets access to the entire EU market.
For general-purpose models, the rules are transparency obligations. Anyone who wants to build on a general-purpose model should be able to understand its capabilities and what it is based on. Providers must mitigate systemic risks with evaluation, mitigation, cybersecurity, incident reporting, and corrective measures.
The EU AI Office is part of the Commission and the center of AI expertise for the EU. It will facilitate a process to detail the rules around transparency, copyright, risk assessment, and risk mitigation via codes of practice. It is also building enforcement structures. It will have technical capacity and regulatory powers (e.g., to compel assessments).
Finally, we’re facilitating international cooperation on AI. We’re working with the U.S. AI Safety Institute, building an international network among key partners, and engaging in bilateral and multilateral activities.