GenLaw DC Workshop

The Laboratorium 2024-04-15

I’m at the GenLaw workshop on Evaluating Generative AI Systems: the Good, the Bad, and the Hype today, and I will be liveblogging the presentations.

Introduction

Hoda Heidari: Welcome! Today’s event is sponsored by the K&L Gates Endowment at CMU, and presented by a team from GenLaw, CDT, and the Georgetown ITLP.

Katherine Lee: It feels like we’re at an inflection point. There are lots of models, and they’re being evaluated against each other. There’s also a major policy push. There’s the Biden executive order, privacy legislation, the generative-AI disclosure bill, etc.

All of these require the ability to balance capabilities and risks. The buzzword today is evaluations. Today’s event is about what evaluations are: ways of measuring generative-AI systems. Evaluations are proxies for things we care about, like bias and fairness. These proxies are limited, and we need many of them. And there are things like justice that we can’t even hope to measure. Today’s four specific topics will explore the tools we have and their limits.

A. Feder Cooper: Here is a concrete example of the challenges. One popular benchmark is MMLU. It’s advertised as testing whether models “possess extensive world knowledge and problem solving ability.” It includes multiple-choice questions from tests found online, standardized tests of mathematics, history, computer science, and more.

But evaluations are surprisingly brittle, and these tests are imperfect proxies even for humans: CS programs don't always rely on the GRE. In addition, it's not clear what the benchmark measures. In the last week, MMLU has come under scrutiny: it turns out that if you reorder the questions as you give them to a language model, you get wide variations in overall scores.
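One way such order sensitivity can arise is position bias in multiple-choice scoring. The sketch below is a toy illustration of this mechanism, not the actual experiment discussed: a simulated "model" that sometimes falls back to picking the first option scores very differently depending on whether the correct answer always sits in slot A or is shuffled among the choices.

```python
# Toy illustration (assumption, not from the talk): a model with even a
# slight bias toward a fixed answer position scores differently when the
# order of the choices is permuted.
import random

random.seed(0)

# Hypothetical benchmark: each question has four choices; "right" is correct.
questions = [{"choices": ["right", "wrong1", "wrong2", "wrong3"]}
             for _ in range(1000)]

def biased_model(choices):
    """Stand-in for an LLM: knows the correct answer half the time,
    otherwise falls back to the first option (position bias)."""
    if random.random() < 0.5:
        return choices.index("right")
    return 0  # bias toward option "A"

def accuracy(permute):
    correct = 0
    for q in questions:
        choices = q["choices"][:]
        if permute:
            random.shuffle(choices)  # randomize where the answer appears
        pred = biased_model(choices)
        correct += choices[pred] == "right"
    return correct / len(questions)

# With the answer fixed in slot A, the position bias inflates the score;
# shuffling the same questions drops the same "model" substantially.
print(accuracy(permute=False), accuracy(permute=True))
```

The gap between the two numbers comes entirely from presentation order, not from any change in the underlying "capability," which is the brittleness being described.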

This gets at another set of questions about generative-AI systems. MMLU benchmarks a model, but systems are much more than just models. Most people interact with a deployed system that wraps the model in a program with a user interface, filters, etc. There are numerous levels of indirection between the user’s engagement and the model itself.

And even the system itself is embedded in a much larger supply chain, from data through training to alignment to generation. There are numerous stages, each of which may be carried out by a different actor handling a different portion. We have only just started to reason about all of these actors and how they interact with each other. These policy questions are not just about technology; they're about the actors and their interactions.

Alexandra Givens: CDT’s work involves making sure that policy interventions are grounded in a solid understanding of how technologies work. Today’s focus is on how we can evaluate systems, in four concrete areas:

  1. Training-data attribution
  2. Privacy
  3. Trust and safety
  4. Data provenance and watermarks

We also have a number of representatives from government giving presentations on their ongoing work.

Our goals for today are to provide insights on cross-disciplinary, cross-community engagement, to pose concrete questions for research and policy, and to help people find future collaborators.