
Metadata

Highlights

  • At Hugging Face, we are building the Dataset Hub as the place for the community to collaborate on open datasets. So we built tools like Dataset Search and the Dataset Viewer, as well as a rich open source ecosystem of tools. Today we are announcing four new features that will take Dataset Search on the Hub to the next level. (View Highlight)
  • We released a set of filters that lets you filter datasets by one or several of the following modalities: Text, Image, Audio, Tabular, Time-Series, 3D, Video, Geospatial (View Highlight)
  • For example, it is possible to look for datasets that contain both text and image data (View Highlight)
  • We recently released a new feature in the interface to show the number of rows of each dataset. (View Highlight)
  • Following this, it is now possible to search datasets by number of rows by specifying a minimum and a maximum row count. This lets you look for anything from small datasets to the biggest ones that exist (for example, those used to pretrain LLMs). The number of rows is available for all datasets in supported formats. Even for the biggest datasets, where the number of rows is not included in the metadata, the total is estimated accurately from the content of the first 5GB. (View Highlight)
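
The row-count search described above can be sketched in a few lines. This is only an illustration, not the Hub's actual implementation: the highlight says the total is estimated from the first 5GB, and the proportional extrapolation below (assuming rows are roughly uniform in size) is an assumption about how such an estimate could work. The function names `estimate_total_rows` and `filter_by_rows` and the dataset names are hypothetical.

```python
def estimate_total_rows(rows_in_sample, sample_bytes, total_bytes):
    """Extrapolate a total row count from a scanned prefix of the data.

    Assumption: rows in the prefix are representative in size of the rest.
    """
    if sample_bytes <= 0:
        raise ValueError("sample_bytes must be positive")
    if total_bytes <= sample_bytes:
        # The whole dataset was scanned, so the count is exact.
        return rows_in_sample
    return round(rows_in_sample * total_bytes / sample_bytes)


def filter_by_rows(row_counts, min_rows=None, max_rows=None):
    """Return sorted dataset names whose row count lies in [min_rows, max_rows].

    A bound left as None is treated as open.
    """
    return sorted(
        name
        for name, n in row_counts.items()
        if (min_rows is None or n >= min_rows)
        and (max_rows is None or n <= max_rows)
    )


GB = 10**9

# Hypothetical large dataset: 2M rows found in the first 5GB of a 50GB dataset.
estimate = estimate_total_rows(2_000_000, 5 * GB, 50 * GB)

# Hypothetical catalogue mapping dataset names to row counts.
row_counts = {
    "tiny-demo": 500,
    "medium-demo": 250_000,
    "huge-demo": estimate,
}

mid_sized = filter_by_rows(row_counts, min_rows=1_000, max_rows=1_000_000)
at_least_a_million = filter_by_rows(row_counts, min_rows=1_000_000)
print(estimate, mid_sized, at_least_a_million)
```

Leaving either bound open mirrors a search that specifies only a minimum or only a maximum number of rows.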