Drawing an owl is no longer a daunting task. Sketch a few circles, write a prompt on krea.ai, Stable Diffusion, or a similar service, and voilà, you’re done!
“How to draw an owl. 2023. Thanks to @midjourney” pic.twitter.com/VWZdgIe0ZJ
Vadik One (@vadik_hq), January 9, 2023
Drawing an owl by hand means mastering the pencil to get correct proportions, sensible light and shadow, and intricate details. In the same way, ‘generating a dataset’ presents its own overlooked challenges.
DALL·E 3 appears to have been trained successfully on captions generated by GPT (as discussed in the YouTube talk “LLMs, OpenAI Dev Day, and the Existential Crisis for Machine Learning Engineering”).
The misconception, however, is that a powerful GPT model is enough to generate synthetic datasets on its own, with no further refinement. The brilliant team at Argilla consistently demonstrates that high-quality, human-curated datasets are the key to success.
This brings us to an intriguing question: what exactly constitutes good synthetic data?