Highlights

  • PDFs are ubiquitous in industry and daily life. Paper is scanned, documents are sent and received as PDF, and they’re often kept as the archival copy. Unfortunately, processing PDFs is hard. In this blog post, I’ll present a new modular workflow for converting PDFs and similar documents to structured data and show how to build end-to-end document understanding and information extraction pipelines for industry use cases.
  • With more powerful Vision Language Models (VLMs), it’s finally become viable to complete many end-to-end tasks using PDFs as inputs, like question answering or more classic information extraction. This makes it tempting to consider PDF processing “solved” and treat PDF documents like yet another data type. I’ve even heard from people now converting plain text to PDFs because their AI-powered tool of choice was designed for PDFs. (Note: Don’t do this!)
  • When working with data, you typically want to operate from a “source of truth” with a structure you can rely on and develop against. This is a big reason why we use relational databases. The problem is, saying “I have the data in a PDF” is about as meaningful as saying “I have it on my computer” – it can mean anything. It may be plain text, scanned photos of text with varying image quality, or a combination of both. The layout properties and images embedded in the document may be extremely relevant, or they may not. All of these things fundamentally change the approach required to extract useful information. Machine learning rarely happens in a vacuum. There’s always an end goal: a product feature or a business question you want to answer.
  • So I believe it’s crucial to get your data out of PDFs as early as possible. If you’re dealing with text, it shouldn’t matter whether it came from a PDF, a Word document or a database. All of these formats are used interchangeably to store the same information.
  • If you use PDFs as the “source of truth” for machine learning, you end up with a monolithic and operationally complex approach. For example, to sort PDFs into different categories, the model has to do many things at once: process the document, find text, extract it where necessary, embed it all, and predict a classification label. And in the case of Retrieval-Augmented Generation (RAG), additionally parse the question, find the relevant document, find the relevant slice of the document and formulate a response. If we remove the PDF format and its intricacies from the equation, the task suddenly becomes fairly straightforward: text classification, with optional layout features.
  • At their core, many NLP systems consist of relatively flat classifications. You can shove them all into a single prompt, or you can decompose them into smaller pieces that you can work on independently. A lot of classification tasks are actually very straightforward to solve nowadays – but they become vastly more complicated if one model needs to do them all at once.
  • These are all considerations that went into developing some of our own workflows for handling PDFs, specifically in the context of Natural Language Processing (NLP) and large-scale information extraction. It’s been one of the bigger missing pieces for smooth, end-to-end NLP in industry and will hopefully be useful for teams working with various input formats, including PDFs, Word documents and scans.
  • Docling is developed by a team at IBM Research, who have also trained their own layout analysis and table recognition models. It takes a pipeline approach, combining modules for file parsing, layout analysis, Optical Character Recognition (OCR), table structure recognition and postprocessing to generate a unified, structured format. This makes it a great complement to spaCy, which is designed around the structured Doc object, a container for linguistic annotations that always map back into the original document.
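To make the pipeline approach concrete, here’s a minimal sketch of converting a PDF with Docling on its own, assuming Docling’s `DocumentConverter` API and a hypothetical file path:

```python
from docling.document_converter import DocumentConverter

# run Docling's full pipeline: file parsing, layout analysis,
# OCR where needed, and table structure recognition
converter = DocumentConverter()
result = converter.convert("./report.pdf")  # hypothetical path

# export the unified, structured result
print(result.document.export_to_markdown())
```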
  • spacy-layout extends spaCy with document processing capabilities for PDFs, Word documents and other formats, and outputs clean, text-based data in a structured format. Document and section layout features are accessible via a layout extension attribute and can be serialized in an efficient binary format.
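For example, a minimal usage sketch following the patterns described above (the file path is hypothetical):

```python
import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

# process a document and create a spaCy Doc object
doc = layout("./report.pdf")  # hypothetical path
print(doc.text)       # the clean document text
print(doc._.layout)   # document-level layout features
for section in doc.spans["layout"]:
    # each section is a span pointing back into the document text,
    # with its own layout features like the bounding box
    print(section.label_, section._.layout)
```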
  • Tables are an interesting case, because conceptually, they’re exactly what we like: structured information, mostly stripped from natural language. However, if we come across them in documents, they typically need to be interpreted in relation to the rest of the contents. It’s important to remember here that humans often struggle with interpreting figures, too, and we can’t always trust that tables present their data well.
  • Docling uses the TableFormer model developed by its team, and tables are integrated into spacy-layout via the layout spans, and the shortcut doc._.tables. Each table is anchored into the original document text and also accessible as a pandas.DataFrame, a convenient data structure for storing and manipulating tabular data.
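A short sketch of accessing the tables, reusing the `doc` created in the snippet above:

```python
# reusing the Doc created with spaCyLayout above
for table in doc._.tables:
    df = table._.data  # table contents as a pandas.DataFrame
    # the table span is anchored in the document text via token indices
    print(table.start, table.end, df.shape)
```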
  • An important consideration is how to represent the tabular data in the document text, i.e. the doc.text, which is plain unicode that’s tokenized and then used by further components in the spaCy processing pipeline for predictions like linguistic attributes, named entities and text categories. By default, a placeholder TABLE is used, but this can be customized via a callback function that receives the DataFrame and returns its textual representation:
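A sketch of such a callback, assuming the keyword argument is called `display_table`:

```python
import pandas as pd
from spacy_layout import spaCyLayout

def display_table(df: pd.DataFrame) -> str:
    # the returned string stands in for the table in doc.text
    return f"Table with columns: {', '.join(str(col) for col in df.columns)}"

# reusing the nlp object from the earlier snippet
layout = spaCyLayout(nlp, display_table=display_table)
```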
  • This offers many opportunities for preprocessing tabular data to make it easier for a model to extract information. One hypothesis we want to test is whether we can achieve better results by using a Large Language Model (LLM) to rephrase the tabular data as natural language, i.e. sentences, so it becomes more accessible for tasks like question answering or classification.
  • With a workflow for extracting PDF contents as structured Doc objects, we can now choose from an array of NLP techniques, components and pretrained pipelines, and fine-tune our own for specific business use cases. We can also take advantage of LLMs and other models to automate data creation and use human-in-the-loop distillation to produce smaller, faster and fully private task-specific components.
  • Transfer learning is a robust and very scalable method to improve performance with examples specific to your use case. Even just a few hundred task-specific examples can make a meaningful impact, but these examples need to be of high quality and apply the label scheme consistently. In any case, you’ll always want a stable evaluation – as a rule of thumb, we typically recommend 10 samples per significant figure to avoid reporting meaningless precision, so reporting a two-digit score like 85% calls for on the order of 100 evaluation examples. Using models as judges can give you a helpful estimation, but it won’t replace testing your system against questions that you know the answer to.
  • The prodigy-pdf plugin adds text- and image-based workflows for annotating and transcribing PDF documents, including for selecting document sections, correcting OCR results and adding or correcting spans in the document text.
  • The pdf.spans.manual recipe extracts the PDF contents as text and presents it side-by-side with the original document layout. This example uses “focus mode” and walks through the document section by section. The original layout span and its bounding box coordinates are preserved in the data, in case you need to reference them later on.
  • An alternative approach to PDF annotation is to take the pre-selection one step further and start by highlighting the relevant parts of the document visually using a recipe like pdf.image.manual. This can be helpful if the documents contain a lot of information that’s not relevant for the task, like design elements or images. The bounding boxes don’t have to be exact, making the process relatively quick, and can later be extracted individually, e.g. to correct the OCR transcription.
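As a rough sketch of how these recipes are invoked – the dataset names, label schemes and the --focus flag here are assumptions, not verified signatures:

```bash
prodigy pdf.spans.manual pdf_spans ./pdfs --label SKILL,COMPANY --focus text
prodigy pdf.image.manual pdf_regions ./pdfs --label TABLE,CHART,TEXT
```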
  • While multi-step workflows may sound like more work, experiments have shown that breaking down tasks into simpler questions can make overall annotation up to 10 times faster, even if it means making more than one pass over the data. This makes sense: you’re reducing the cognitive load and helping the human focus, thus making them significantly faster in total.
  • When designing efficient workflows to handle documents with layout information, it’s important to examine the role the layout plays in the overall context. Layout and visual cues help us humans convey and understand information, which makes them feel very important. But it helps to take a step back and ask if it actually matters for the task at hand. In many cases, you’ll find that it’s much less relevant for the model than you think.
  • It’s also worth considering that learning from layout features will make the model worse at generalizing to new documents. If you want the model to make the same predictions on two documents that are formatted differently, it’s best to abstract away the layout. The less incidental information you can give to the model, the better. However, for cases where documents mostly follow the same structure, incorporating layout information can be very beneficial to the model. Doing your own annotation early on can help you make this decision, so it’s a vital part of the development process.
  • With this modular approach, you’re able to separately improve the information extraction components and train them on data extracted from PDFs or other sources to increase the size of your training corpus. You can also work on the PDF extraction logic alone, adding fix-up rules and modifying the Doc before it’s processed by the model.
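One way to implement such a fix-up rule is as a custom pipeline component that runs before the trained components. This is a sketch under assumptions: the component name and the filtering rule are illustrative, and the header/footer label names may differ from what Docling actually produces:

```python
from spacy.language import Language

@Language.component("strip_page_furniture")
def strip_page_furniture(doc):
    # hypothetical fix-up rule: remove page header/footer spans from
    # the layout span group so later logic that consumes
    # doc.spans["layout"] skips them
    doc.spans["layout"] = [
        span for span in doc.spans["layout"]
        if span.label_ not in ("page_header", "page_footer")  # assumed labels
    ]
    return doc

# run the fix-up before the trained components in the pipeline
nlp.add_pipe("strip_page_furniture", first=True)
```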
  • Robust extraction workflows should ideally be non-destructive: the result should represent the original document as accurately as possible and at any stage of the process, you should be able to relate annotations back to the original input. This is also a core principle of spaCy’s data structures and tokenization, and is reflected in the Doc object and layout spans created by spaCyLayout.
  • For instance, when you process the Doc with a named entity recognition component, the created entity spans are pointers into the document and can be matched up with layout sections, which are also pointers to document slices:
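A sketch of what that matching can look like, assuming the `layout` object from the earlier snippets and an `nlp` pipeline with a trained NER component:

```python
doc = layout("./report.pdf")  # hypothetical path
doc = nlp(doc)  # add predictions, e.g. named entities

for ent in doc.ents:
    # entities and layout sections are both pointers into the same
    # document text, so containment is a simple token index check
    sections = [
        section.label_ for section in doc.spans["layout"]
        if section.start <= ent.start and ent.end <= section.end
    ]
    print(ent.text, ent.label_, sections)
```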
  • Docling runs at 1-3 pages per second on CPU, which makes it feasible to do PDF extraction in the loop during annotation and at runtime. Processing speeds will likely improve further once GPU support is added. spaCy is very fast, so the overhead it adds is absolutely minimal.
  • Accuracy on your specific task will depend on the document type. Docling’s layout analysis model is based on their DocLayNet corpus, a human-annotated dataset for document-layout segmentation, as well as other proprietary datasets. There’s a high representation of scientific and financial documents, as well as company reports, which indicates that it’ll translate well to many common industry use cases.