AI2 Dolma: 3 Trillion Token Open Corpus for LLMs | AI2 Blog
peter.suber's bookmarks 2023-09-07
"Since March, we at the Allen Institute for AI have been creating OLMo, an open language model to promote the study of large-scale NLP systems. One of our major goals is to build OLMo in a transparent and open manner by releasing artifacts and documenting processes we followed throughout this project. Today, we release our first data artifact in this project — Dolma, a dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. Openly available for download on the HuggingFace Hub under AI2’s ImpACT license, Dolma is the largest open dataset to date...."