From software heritage to code commons: A vision for transparent and responsible AI in code-based model training | Scuola Superiore Sant'Anna

Hanna_S's bookmarks 2024-12-09

Summary:

"There is a strong interplay between software development and machine learning: AI models are providing new tools to develop software, while the inclusion of large publicly available codebases in training datasets helps improve large language models’ reasoning abilities, well beyond coding tasks. In the specific domain of source code, the issue of transparency of the training dataset assumes a special weight in the broader debate around open versus closed models.

Software Heritage, launched by Inria in partnership with UNESCO, has been building the largest archive of publicly available source code for nearly a decade, and today provides the Software Hash Identifier for the more than 50 billion software artifacts it has collected from over 300 million projects, ensuring availability, guaranteeing integrity and enabling traceability of all its contents. Because of the core values that inform its approach to open access and code preservation, it is naturally concerned by these challenges.

In this talk we will start from the principled stance on the use of the Software Heritage archive for training models, report on the lessons learned from the collaboration with the BigCode project that created StarCoder2, and then focus on the challenges, ethical considerations, and technical limitations that arise in the current approaches to using open codebases in AI, in particular when it comes to transparency, accountability, and resource efficiency. These limitations underscore the need for a Code Commons: a dedicated initiative to expand Software Heritage into a central resource for transparency, quality, accountability, and sustainability in machine learning on code. By promoting transparency and responsible stewardship, Software Heritage aims to help researchers, developers, and organizations navigate the challenges of AI in code-based applications. This talk invites all stakeholders to collaborate on this ambitious vision."

Link:

https://www.santannapisa.it/it/node/298984

From feeds:

Open Access Tracking Project (OATP) » Hanna_S's bookmarks

Tags:

oa.new oa.events oa.software_heritage oa.code oa.ai oa.software

Date tagged:

12/09/2024, 07:17

Date published:

12/09/2024, 02:17