There’s finally an “official” definition of open source AI.
The Open Source Initiative (OSI), a long-running institution aiming to define and “steward” all things open source, today released version 1.0 of its Open Source AI Definition (OSAID). The product of several years of collaboration with academia and industry, the OSAID is intended to offer a standard by which anyone can determine whether AI is open source — or not.
To be considered open source under the OSAID, an AI model has to provide enough information about its design that a person could “substantially” recreate it. The model must also disclose any pertinent details about its training data, including its provenance, how it was processed, and how it can be obtained or licensed.
“An open source AI is an AI model that allows you to fully understand how it’s been built,” said Stefano Maffulli, the OSI’s executive director. “That means that you have access to all the components, such as the complete code used for training and data filtering.”
“Our hope is that when someone tries to abuse the term, the AI community will say, ‘We don’t recognize this as open source,’ and it gets corrected,” Maffulli said. Historically, this has had mixed results, but it isn’t entirely without effect.
Many startups and big tech companies, most prominently Meta, have employed the term “open source” to describe their AI model release strategies — but few meet the OSAID’s criteria. For example, Meta mandates that platforms with more than 700 million monthly active users request a special license to use its Llama models.
Maffulli has been openly critical of Meta’s decision to call its models “open source.” After discussions with the OSI, Google and Microsoft agreed to drop their use of the term for models that aren’t fully open, but Meta hasn’t, he said.
Stability AI, which has long advertised its models as “open,” requires that businesses making more than $1 million in revenue obtain an enterprise license. And French AI upstart Mistral’s license bars the use of certain models and outputs for commercial ventures.
Instead of democratizing AI, these “open source” projects tend to entrench and expand centralized power, one recent study concluded. Indeed, Meta’s Llama models have racked up hundreds of millions of downloads, and Stability claims that its models power up to 80% of all AI-generated imagery.
Meta disagrees with this assessment, unsurprisingly — and takes issue with the OSAID as written (despite having participated in the drafting process). A spokesperson defended the company’s license for Llama, arguing that the terms — and accompanying acceptable use policy — act as guardrails against harmful deployments.
“We agree with our partner the OSI on many things, but we, like others across the industry, disagree with their new definition,” the spokesperson said. “There is no single open source AI definition, and defining it is a challenge because previous open source definitions do not encompass the complexities of today’s rapidly advancing AI models. We make Llama free and openly available, and our license and acceptable use policy help keep people safe by having some restrictions in place. We will continue working with the OSI and other industry groups to make AI more accessible and free responsibly, regardless of technical definitions.”
The spokesperson pointed to other efforts to codify “open source” AI, like the Linux Foundation’s suggested definitions, the Free Software Foundation’s criteria for “free machine learning applications,” and proposals from other AI researchers.
Meta, incongruously enough, is one of the companies funding the OSI’s work — along with tech giants like Amazon, Google, Microsoft, Cisco, Intel, and Salesforce. (The OSI recently secured a grant from the nonprofit Sloan Foundation to lessen its reliance on tech industry backers.)
Meta’s reluctance to reveal training data likely has to do with the way its — and most — AI models are developed.
AI companies scrape vast quantities of images, audio, video, and more from social media and websites, and train their models on this “publicly available data,” as it is usually called. In today’s cutthroat market, a company’s methods of assembling and refining datasets are considered a competitive advantage, and companies cite this as one of the main reasons for their nondisclosure.
But training data details can also paint a legal target on developers’ backs. Authors and publishers claim that Meta used copyrighted books for training. Artists have filed suits against Stability for scraping their work and reproducing it without credit, an act they compare to theft.
Some suggest the definition doesn’t go far enough, for instance in how it deals with the licensing of proprietary training data. Luca Antiga, the CTO of Lightning AI, points out that a model may meet all of the OSAID’s requirements even though the data used to train it isn’t freely available. Is it “open” if you have to pay thousands to inspect the private stores of images that a model’s creators paid to license?
“To be of practical value, especially for businesses, any definition of open source AI needs to give reasonable confidence that what is being licensed can be licensed for the way that an organization is using it,” Antiga told TechCrunch. “By neglecting to deal with licensing of training data, the OSI is leaving a gaping hole that will make terms less effective in determining whether OSI-licensed AI models can be adopted in real-world situations.”
In version 1.0 of the OSAID, the OSI also doesn’t address copyright as it pertains to AI models, and whether granting a copyright license would be enough to ensure a model satisfies the open source definition. It’s not clear yet whether models — or components of models — can be copyrighted under current IP law. But if the courts decide they can be, the OSI suggests new “legal instruments” may be needed to properly open source IP-protected models.
Maffulli agreed that the definition will need updates — perhaps sooner rather than later. To this end, the OSI has established a committee responsible for monitoring how the OSAID is applied and for proposing amendments for future versions.
“This isn’t the work of lone geniuses in a basement,” he said. “It’s work that’s being done in the open with wide stakeholders and different interest groups.”