AI coding assistants are quickly becoming indispensable tools for developers. But the provenance of the code they’re trained on is often murky, leading to concerns around transparency and author rights. A new initiative launched today by the nonprofit Software Heritage hopes to change this by providing the world’s largest repository of ethically sourced code for training AI.
The large language models (LLMs) that underlie chatbots and coding assistants are trained on vast reams of data scraped from the Internet. But AI developers rarely provide details of what’s included in their training datasets, says Roberto Di Cosmo, director of Software Heritage. This makes it hard to reproduce results, to tell whether models were trained on data from benchmark tests, and for developers to control whether their code is used to train AI.
Software Heritage thinks it can help change this situation. The organization was founded in 2016 to collect and preserve all publicly available source code. By web crawling code hosting platforms like Bitbucket, GitHub, and the Python Package Index, Software Heritage has built up a collection of more than 22 billion source files from around 345 million projects in more than 600 programming languages.
Using AI’s Largest Training Dataset for Good
The project’s goal is to create a freely accessible archive of the world’s digital heritage, but following the recent rise of LLMs, Di Cosmo says they quickly realized they were sitting on a goldmine. “After the ChatGPT explosion, it became clear rather quickly that we have at Software Heritage the largest dataset for training AI models on code in the world,” he says.
So now the group is launching a project called CodeCommons, which will provide access to those willing to sign up to ethical principles aimed at boosting transparency and accountability in AI training. The group has secured €5 million (about US $5.2 million) from the French government over the next two years to build the supporting technology, with a kick-off event held in Paris today to start the development process.
Software Heritage originally published ethical principles for AI developers keen to use their archive in October 2023. These include releasing the resulting models under an open license, publishing a record of all the Software Heritage data used in training, and providing mechanisms for authors to opt out of their code being used to train AI.
In February 2024, the BigCode project, a scientific collaboration aimed at open and responsible AI development, unveiled the coding assistant StarCoder2, which was the first LLM trained on Software Heritage data. But Di Cosmo says the project highlighted many limitations and inefficiencies with the way people were building these models.
After being provided with access to the dataset, the BigCode team had to go through a painstaking data cleaning process—removing duplicate entries, filtering out low-quality or malicious code, and removing personally identifiable information.
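The cleaning steps described above can be sketched in a few lines of code. This is an illustrative toy pass, not BigCode’s actual pipeline: it drops exact-duplicate files by content hash and crudely redacts email addresses as a stand-in for real PII removal (the function and regex here are assumptions for demonstration only).

```python
import hashlib
import re

# Simple pattern for email addresses; real pipelines use far more
# sophisticated PII detection.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def clean_corpus(files):
    """files: iterable of (path, source_text) pairs.

    Returns a deduplicated list with email addresses redacted.
    """
    seen = set()
    cleaned = []
    for path, text in files:
        # Exact-duplicate detection via a content hash.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # skip files whose content we've already kept
        seen.add(digest)
        # Crude PII redaction: replace email addresses with a placeholder.
        text = EMAIL_RE.sub("<EMAIL>", text)
        cleaned.append((path, text))
    return cleaned

corpus = [
    ("a.py", "print('hello')  # contact: dev@example.com"),
    ("b.py", "print('hello')  # contact: dev@example.com"),  # duplicate of a.py
    ("c.py", "x = 1"),
]
result = clean_corpus(corpus)  # keeps a.py (redacted) and c.py
```

In practice, production pipelines also perform near-duplicate detection (e.g., MinHash over token shingles) and quality filtering, which this sketch omits for brevity.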
The post “CommonCode Is a New Project for Open-Source Coding AIs” by Edd Gent was published on 01/28/2025 by spectrum.ieee.org