Data Scientists and Machine Learning Engineers have long been involved in preparing the ideal datasets for training and evaluating successful models. However, over the past few years, I’ve noticed a significant increase in the importance of curating these datasets within the data function.
The availability of large pre-trained models has simplified the process of creating custom versions to address specific business problems. In this context, we can’t overlook the significant role played by Hugging Face in democratizing access to highly competent models. Furthermore, Hugging Face libraries like Datasets and 🤗 Transformers have facilitated easier access to vast datasets and models. They’ve also simplified transfer learning, fine-tuning, and efficient adaptation of specific Deep Learning architectures to current problems.
A few years back, training a Deep Learning model required highly specialized individuals: people who were passionate about reading Computer Science papers and implementing them using TensorFlow, PyTorch or pure CUDA (see tweet below). However, high-level libraries like Keras, fastai or PyTorch Lightning have simplified the training process.
🔥llm.c update: Our single file of 2,000 ~clean lines of C/CUDA code now trains GPT-2 (124M) on GPU at speeds ~matching PyTorch (fp32, no flash attention) https://t.co/ze4LKbnHKV
On my A100 I'm seeing 78ms/iter for llm.c and 80ms/iter for PyTorch. Keeping in mind this is fp32,… pic.twitter.com/Ep8MwJFoWV
- Andrej Karpathy (@karpathy) April 19, 2024
This has enabled Data Scientists who aren’t necessarily academically inclined to train models. Today’s landscape is even more user-friendly: you can simply upload a dataset to a fine-tuning service and fine-tune Stable Diffusion LoRAs, Computer Vision networks, Large Language Models or Large Vision Models with minimal prior knowledge.
We are currently witnessing an influx of tech enthusiasts and business roles joining Data Science teams to build Artificial Intelligence solutions. This is fantastic news for companies as they can leverage cutting-edge technology to solve business problems without encountering bottlenecks caused by understaffed data teams. The key ingredients often missing from these initiatives are sound evaluation practices and high-quality data. Fortunately, good evaluation techniques can be taught, which helps non-data roles comprehend their importance and how evaluation fits into the Data Science iterative cycle.
Note
Constructing an optimal, unbiased, high-quality dataset requires an engineering, statistical and operations mindset.
Regarding high-quality data, it previously required significant manual labor to obtain labeled and heavily curated data. However, we now have excellent data annotation and curation tools at our disposal. Data annotation tools focus on adding metadata to data observations—this could take the form of NER spans, segmented images or preferences in chat interactions. Data curation tools allow us to assess and curate the data—for instance, ensuring that we have an unbiased dataset representing our target population or avoiding the addition of near duplicate samples to our training set.
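To make the near-duplicate point concrete, here is a minimal sketch of how such a filter could work. It is not how any particular curation tool does it: the function names, the word 3-gram shingles and the 0.6 Jaccard threshold are all assumptions I picked for illustration, and real tools typically rely on embeddings or MinHash to scale beyond small datasets.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a lowercased text."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts are represented by a single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(texts, threshold=0.6, n=3):
    """Keep a text only if it is not too similar to any text kept so far.

    The threshold is an illustrative assumption, not a recommended value.
    This pairwise comparison is O(N^2), so it only suits small datasets.
    """
    kept, kept_shingles = [], []
    for text in texts:
        current = shingles(text, n)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept


samples = [
    "The delivery arrived two days late and the box was damaged",
    "The delivery arrived two days late and the box was damaged!",
    "Great product, works exactly as described",
]
unique = drop_near_duplicates(samples)
print(unique)  # the second review is dropped as a near duplicate of the first
```

The first sample seen always wins, so the order of the input matters. In practice you would probably want a human to review the dropped pairs before trusting the threshold.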
In addition to a more mature ecosystem of annotation and curation tools, it’s now possible to utilize the exceptional capabilities of LLMs, YOLO and SAM for pre-annotating data. Consequently, creating a dataset no longer starts from scratch. This shift has resulted in a significant portion of work moving from programming algorithms towards efficiently labeling data using techniques such as few-shot learning, Active Learning or other forms of models in the loop.
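As a rough illustration of what "models in the loop" can look like, the sketch below implements uncertainty sampling, one common Active Learning strategy: the samples the model is least sure about are sent to annotators first. The sample ids and probabilities are made up, and `select_for_annotation` is a hypothetical helper, not part of any annotation tool.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions, k=2):
    """Return the ids of the k samples with the most uncertain predictions.

    `predictions` is a list of (sample_id, class_probabilities) pairs,
    assumed to come from whatever model pre-annotates the data.
    """
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]


# Made-up model outputs for four unlabeled documents and three classes.
predictions = [
    ("doc-1", [0.98, 0.01, 0.01]),
    ("doc-2", [0.40, 0.35, 0.25]),
    ("doc-3", [0.55, 0.44, 0.01]),
    ("doc-4", [0.90, 0.05, 0.05]),
]
to_label = select_for_annotation(predictions, k=2)
print(to_label)  # ['doc-2', 'doc-3']
```

Confident predictions such as `doc-1` could be accepted as pre-annotations and only spot-checked, while the selected samples get full human attention. After each annotation round the model is retrained and the ranking is recomputed.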
Tools like Argilla, CVAT and FiftyOne can be used alongside pre-annotated or synthetic data for training, evaluating and monitoring our models. In this scenario, an engineering mindset that understands the iterative nature of the Data Science process is crucial for optimizing data annotation or synthetic data generation.
Industry leaders like Andrew Ng have long advocated for a Data-centric approach to AI modeling. Recently, Chip Huyen included Dataset Engineering as a central topic in her upcoming book.
While I’m not entirely convinced about introducing another term under the broad umbrella of Data Science roles just yet, it is clear that we’re all thinking about how to procure high-quality datasets for training our models. Like it or not, the old data preparation function is evolving. Initially, programmatic labeling may be tackled by existing roles such as Data Scientists or Machine Learning Engineers. However, considering the constant evolution of roles, the industry tendency to incorporate new and confusing terms, and the emergence of highly specialized and well-compensated positions, we may soon see Dataset Engineering listed as a desired skill in job postings.