New Study Suggests ChatGPT Vulnerability with Potential Privacy Implications
newsletter via Feeds on Inoreader 2023-11-30
Summary:
Prithvi Iyer is Program Manager at Tech Policy Press.
What would happen if you asked OpenAI’s ChatGPT to repeat a word such as “poem” forever? A new preprint research paper reveals that this prompt could lead the chatbot to leak training data, including personally identifiable information and other material scraped from the web. The results, which have not been peer reviewed, raise questions about the safety and security of ChatGPT and other large language model (LLM) systems.
“This research would appear to confirm once again why the ‘publicly available information’ approach to web scraping and training data is incredibly reductive and outdated,” Justin Sherman, founder of Global Cyber Strategies, a research and advisory firm, told Tech Policy Press.
The researchers – a team from Google DeepMind, the University of Washington, Cornell, Carnegie Mellon, University of California Berkeley, and ETH Zurich – explored the phenomenon of “extractable memorization,” which is when an adversary extracts training data by querying a machine learning model (in this case, asking ChatGPT to repeat the word “poem” forever”). With open source models that make their model weights and training data publicly available, training data extraction is easier. However, models like ChatGPT are “aligned” with human feedback, which is supposed to prevent the model from “regurgitating training data.”
Before discussing the potential data leak and its implications for privacy, it is important to understand how the researchers were able to verify if the generated output was part of the training data despite the fact that ChatGPT does not make its training set publicly available. To circumvent the proverbial “blackbox”, the researchers first downloaded a large corpus of text from the Internet to build an auxiliary dataset which is then cross referenced with the text generated by the chatbot. This auxiliary dataset had 9 terabytes of text, combining four of the largest open LLM pre-training datasets.If a sequence of words appears verbatim in both cases, it is unlikely to be a coincidence, making it an effective proxy for testing if the generated text was part of the training data. This approach is similar to previous efforts at training data extraction, wherein researchers like Carlini et.al performed manual Google searches to verify if the data extracted corresponded to training sets.
Recovering training data from ChatGPT is not easy. The researchers had to find a way to make the model “escape” out of its alignment training and fall back on the base model, causing it to issue responses that mirror the data it was initially trained on. So when the authors asked ChatGPT to repeat the word “poem” forever, it initially repeated that word “several hundred times” but eventually it diverged and started spitting out “nonsensical” information.
However, a small fraction of the text generated was “copied directly from the pre-training data.” What’s even more worrisome is that by merely spending $200 to issue queries to ChatGPT, the authors were able to “extract over 10,000 unique verbatim- memorized training examples.” The memorizations reflected a diverse range of text sources, including:
- Personal Identifiable Information: The attack led the chatbot to reveal the personal information of dozens of individuals. This included names, email addresses, phone numbers, personal website URLs etc.
- NSFW content: When asked to repeat a NSFW word instead of the word “poem” forever, the authors found explicit content, dating websites and content related to guns and war. These are precisely the kinds of content that AI developers claim to mitigate via red-teaming and other safety checks.
- Literature and Academic Articles: The model also spit out Verbatim paragraphs of text from published books and poems. For example, The Raven by Edgar Allen Poe. Similarly, the authors successfully extracted snippets of text from academic articles and bibliographic information of numerous authors.This is especially worrying as much of this content is proprietary and the fact that published works are part of training data without compensating the authors raises questions about ownership and safeguarding the rights of creators and academics alike.
Interestingly, when the authors increased the size of their auxiliary datas