Data continues to be essential for better models: We see continued evidence from published research, open-source experiments, and the open-source community that better data can lead to better models.
Empowering the community to build and improve datasets collectively will allow people to:
• Contribute to the development of Open Source ML with no ML or programming skills required.
• Create chat datasets for a particular language.
• Develop benchmark datasets for a specific domain.
• Create preference datasets from a diverse range of participants.
• Build datasets for a particular task.
• Build completely new types of datasets collectively as a community.
One challenge for many previous efforts to build AI datasets collectively was setting up an efficient annotation task. Argilla is an open-source tool that can help create datasets for LLMs and smaller, specialised task-specific models. Hugging Face Spaces is a platform for building and hosting machine learning demos and applications. Recently, Argilla added support for authentication via a Hugging Face account for Argilla instances hosted on Spaces. This means it now takes seconds for users to start contributing to an annotation task.