In this blog post, we provide a tutorial on how to use a new data augmentation technique for document images, developed in collaboration with Albumentations AI.

Motivation
Vision Language Models (VLMs) have an immense range of applications, but they often need to be fine-tuned to specific use cases, particularly for datasets containing document images, i.e., images with high textual content. In these cases, it is crucial for text and image to interact with each other at all stages of model training, and applying augmentation to both modalities ensures this interaction. Essentially, we want the model to learn to read properly, which is challenging when training data is scarce, as it often is in practice.
Hence, the need for effective data augmentation techniques for document images became evident when addressing challenges in fine-tuning models with limited datasets. A common concern is that typical image transformations, such as resizing, blurring, or changing background colors, can negatively impact text extraction accuracy.
To address this need, we introduce a new data augmentation pipeline developed in collaboration with Albumentations AI. This pipeline handles both images and text within them, providing a comprehensive solution for document images. This class of data augmentation is multimodal as it modifies both the image content and the text annotations simultaneously.
As discussed in a previous blog post, our goal is to test the hypothesis that integrating augmentations on both text and images during pretraining of VLMs is effective. Detailed parameters and use case illustrations can be found in the Albumentations AI documentation. Albumentations AI enables the dynamic design of these augmentations and their integration with other types of augmentations.
To augment document images, we begin by randomly selecting lines within the document. The fraction of bounding boxes to be modified is controlled by the hyperparameter fraction_range.
Next, we apply one of several text augmentation methods, commonly used in text generation tasks, to the corresponding lines of text. These include random insertion, random deletion, random swap, and stopword replacement.
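As a rough illustration of what these operations do, the word-level edits can be written in a few lines of Python. This is a standalone sketch, not Albumentations' internal implementation, and the stop-word list is a toy example:

```python
import random

STOPWORDS = ["the", "and", "a", "of", "to", "in"]  # toy stop-word list

def random_deletion(words, p=0.1):
    # Drop each word with probability p, keeping at least one word.
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def random_swap(words, n=1):
    # Swap n random pairs of word positions.
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def stopword_insertion(words, n=1):
    # Insert n random stop words at random positions.
    words = list(words)
    for _ in range(n):
        words.insert(random.randrange(len(words) + 1), random.choice(STOPWORDS))
    return words

print(" ".join(stopword_insertion("amount due on receipt".split())))
```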
After modifying the text, we black out the parts of the image where the new text will be inserted and inpaint them, using the original bounding box size as a proxy for the new text’s font size. The font size can be specified with the parameter font_size_fraction_range, which determines the range for selecting the font size as a fraction of the bounding box height. Note that the modified text and corresponding bounding box can be retrieved and used for training. This process results in a dataset with semantically similar textual content and visually distorted images.
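Below is a minimal usage sketch of this pipeline, assuming the TextImage transform available in recent Albumentations releases. The image path, font path, and example bounding box are placeholders, and the metadata format follows the Albumentations documentation; exact argument and output key names may vary across versions, so double-check the docs for your install:

```python
import albumentations as A
import cv2

# Assumes a local document image and a TrueType font file are available.
image = cv2.imread("document.png")

transform = A.Compose([
    A.TextImage(
        font_path="LiberationSerif-Regular.ttf",  # path to a .ttf font
        fraction_range=(0.5, 0.8),                # fraction of bboxes to modify
        font_size_fraction_range=(0.8, 0.9),      # font size as fraction of bbox height
        p=1.0,
    )
])

# The metadata pairs a normalized bbox (x_min, y_min, x_max, y_max)
# with the text currently inside it.
transformed = transform(
    image=image,
    textimage_metadata={"bbox": (0.1, 0.4, 0.5, 0.5), "text": "Some text"},
)

augmented_image = transformed["image"]
# The rendered text and its bbox can be recovered for supervision; the exact
# output key ("overlay_data" here) may differ between versions.
print(transformed.get("overlay_data"))
```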
Inserting any text on the image: This feature allows you to overlay text on document images, effectively generating synthetic data. By using any random image as a background and rendering completely new text, you can create diverse training samples. A similar technique, called SynthDoG, was introduced in the OCR-free Document Understanding Transformer (Donut) paper.
Inserting augmented text on the image: This applies the following text augmentations (see the sketch after this list):
• Random deletion: Randomly removes words from the text.
• Random swapping: Swaps words within the text.
• Stop-word insertion: Inserts common stop words into the text.
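Here is a hedged sketch covering both capabilities above. The augmentations argument and its values ("insertion", "swap", "deletion"), along with font_color and clear_bg, follow the Albumentations documentation for TextImage, but verify them against your installed version; the image and font paths are placeholders:

```python
import albumentations as A
import cv2

# Any background image works here, which is what enables SynthDoG-style
# synthetic data: supply a random image plus brand-new text.
image = cv2.imread("background.png")

transform = A.Compose([
    A.TextImage(
        font_path="LiberationSerif-Regular.ttf",
        augmentations=["insertion", "swap", "deletion"],  # text-level edits
        fraction_range=(0.5, 0.8),
        font_size_fraction_range=(0.8, 0.9),
        font_color="red",   # render inserted text in red, e.g. to inspect edits
        clear_bg=True,      # clear the original text region before drawing
        p=1.0,
    )
])

transformed = transform(
    image=image,
    textimage_metadata={"bbox": (0.1, 0.1, 0.9, 0.2),
                        "text": "Payment due within 30 days"},
)
```

Leaving augmentations empty renders the supplied text verbatim, which corresponds to the synthetic-data case; passing the options shown yields the augmented-text variant.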
Combining these augmentations with other image transformations from Albumentations allows for simultaneous modification of images and text. You can retrieve the augmented text as well.
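As a final sketch (same hypothetical paths and metadata key as above), the text-aware transform composes with ordinary image transforms in a single pipeline, and the augmented text can be read back from the output:

```python
import albumentations as A
import cv2

image = cv2.imread("document.png")

transform = A.Compose([
    A.TextImage(
        font_path="LiberationSerif-Regular.ttf",
        augmentations=["swap", "deletion"],
        fraction_range=(0.5, 0.8),
        p=1.0,
    ),
    # Ordinary image-level transforms run in the same pipeline, so image and
    # text are modified together.
    A.RandomBrightnessContrast(p=0.5),
    A.GaussNoise(p=0.3),
])

transformed = transform(
    image=image,
    textimage_metadata={"bbox": (0.1, 0.4, 0.5, 0.5), "text": "Total: $120.00"},
)

# Retrieve the augmented text/bbox for training labels; key name may vary.
print(transformed.get("overlay_data"))
```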