Highlights

  • As the field of natural language processing advances and new ideas develop, we’re seeing more and more ways to use compute efficiently, producing AI systems that are cheaper to run and easier to control. (View Highlight)
  • Large Language Models (LLMs) have enormous potential, but also challenge existing workflows in industry that require modularity, transparency and data privacy. (View Highlight)
  • Amidst all the hype, it’s important to keep in mind that AI development is still just software development, and a lot of the same principles apply. Workflows are changing quickly and many new paradigms are emerging – but in the end, there are basic things about technologies that help teams build on them reliably. We’ve learned over time as an industry that we want our solutions to be:
    • modular, so that we can combine a smaller set of primitives we understand in different ways
    • transparent, so that we can debug or prevent issues
    • explainable, so that we can build the right mental model of how things are working (View Highlight)
  • If the AI system is a black box that’s tasked with solving the whole problem at once, we get very little insight into how our application performs in different scenarios and how we can improve it. It becomes a purely empirical question: just try and see. (View Highlight)
  • There’s also a second dimension that applies to the whole development workflow and lifecycle. We often need that to be:
    • data-private, so that internal data doesn’t leave our servers and we can meet legal and regulatory requirements
    • reliable, so that we have stable request times with few failures
    • affordable, so that it fits within our budget (View Highlight)
  • This can make it impractical to rely on third-party APIs. Unfortunately it’s not necessarily true that running open-source models is cheaper than using APIs, as providers like OpenAI do have solid economies of scale. However, moving the dependency on large generative models from runtime to development changes this calculation. If we’re able to produce a smaller, task-specific model, we can achieve a much more cost-effective solution that also provides full data privacy and lower operational complexity, leading to significantly higher reliability. (View Highlight)
  • In-context learning: using a natural language prompt to access information without fine-tuning the model weights for a specific task
    Transfer learning: using knowledge learned from a different task (e.g. a language modelling objective like predicting the next word) to improve generalization of another (View Highlight)
  • The focus on in-context learning as a new technology doesn’t mean that transfer learning has somehow been replaced or become obsolete. It’s simply a different tool used for a different purpose. The recent literature shows that transfer learning using embeddings like BERT-base is still very competitive compared to few-shot in-context learning approaches (View Highlight)
  • This does make sense: it would be weird if there was no scalable way to improve a model with labelled data. It’s also worth keeping in mind that evaluations in research are typically performed on benchmark datasets, i.e. datasets we inherently don’t control. In applied work, however, we have full control of the data. So if accuracies are already promising using benchmark data, we’ll likely be able to achieve even better results by creating our own task-specific data. (View Highlight)
  • In-context learning introduces a new workflow for using pretrained models to perform predictive tasks: instead of training a task network on top of an encoder model, you typically use a prompt template and a corresponding parser that transforms the raw text output into the task-specific structured data. (View Highlight)
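
A minimal sketch of what such a template-plus-parser pair can look like for a simple sentiment classification task; the prompt wording, label set and the `call_llm` stub are hypothetical placeholders rather than anything from the original post:

```python
# Sketch: prompt template + parser turning raw LLM text into structured output.
# The labels, prompt wording and `call_llm` stub are illustrative assumptions.

LABELS = ["POSITIVE", "NEGATIVE", "NEUTRAL"]

PROMPT_TEMPLATE = """Classify the sentiment of the following product review.
Answer with exactly one of: {labels}.

Review: {text}
Answer:"""


def build_prompt(text: str) -> str:
    return PROMPT_TEMPLATE.format(labels=", ".join(LABELS), text=text)


def parse_response(raw: str) -> dict:
    """Map the raw text completion onto task-specific structured data."""
    answer = raw.strip().upper()
    # Fall back to NEUTRAL if the model answers outside the label set.
    label = answer if answer in LABELS else "NEUTRAL"
    return {lbl: float(lbl == label) for lbl in LABELS}


def classify(text: str, call_llm) -> dict:
    """`call_llm` is whatever function sends the prompt to your model of choice."""
    return parse_response(call_llm(build_prompt(text)))
```
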
  • For distillation, we can then use the outputs as training data for a task-specific model. Ideally, we’ll want to include a human-in-the-loop step, so we can correct any errors the LLM is making. If we’re not correcting errors, we can’t really hope to do better than the LLM – garbage in, garbage out, as they say. (View Highlight)
  • It’s never been easier to build impressive demos and prototypes. But at the same time, many teams are experiencing what I call the “prototype plateau”: when it comes time to productionize a solution, the prototype doesn’t translate into a usable system. (View Highlight)
  • One of the root causes often lies in diverging workflows. If the prototype assumes fundamentally different inputs and outputs compared to the production system, like sample text to text vs. user-generated text to structured data, there’s no guarantee that it will actually work the same way in practice. Ideally, you want a prototyping workflow that standardizes inputs and outputs and uses the same data structures that your application will need to work on and produce at runtime. (View Highlight)
  • Taking a step back and working on a robust and stable evaluation may not be the most exciting part of the prototyping process, but it’s crucial. Without a good evaluation, you’ll have no way of knowing whether your system is improving and whether changes you make actually lead to better results. Your evaluation should consist of concrete examples from your runtime data that you know the correct answer to – it’s not enough to just try it out with arbitrary examples you can think of, or ask another LLM to do the evaluation for you. This is like evaluating on your training data: it doesn’t give you meaningful insights and only reproduces what a model predicts, which is what you want to evaluate in the first place. (View Highlight)
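
As a small illustration of that kind of evaluation, here is a sketch that scores a prediction function against a handful of hand-labelled runtime examples; the texts and gold labels below are invented for illustration:

```python
# Sketch: evaluate against runtime examples you know the correct answer to.
# The texts and gold labels are invented placeholders.
EVAL_EXAMPLES = [
    ("The Nebula's battery dies before lunch.", "NEGATIVE"),
    ("Best camera I've used on any phone.", "POSITIVE"),
    ("It's a phone. It makes calls.", "NEUTRAL"),
]


def evaluate(predict) -> float:
    """`predict` maps a text to a label; returns plain accuracy over the gold examples."""
    correct = sum(predict(text) == gold for text, gold in EVAL_EXAMPLES)
    return correct / len(EVAL_EXAMPLES)
```
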
  • When evaluating your system, the accuracy score isn’t the only thing you should track: what’s even more important is the utility of your system: whether it solves the problem it’s supposed to, and whether it fits into the larger context of your application. (View Highlight)
  • In applied NLP, it’s important to pay attention to the difference between utility and accuracy. “Accuracy” here stands for any objective score you can calculate on a test set […]. This is a figure you can track, and if all goes well, that figure should go up. In contrast, the “utility” of the model is its impact in the application or project. This is a lot harder to measure, and always depends on the application context. (View Highlight)
  • Once you start engaging with your data, you’ll inevitably come across edge cases you hadn’t considered, and will need to adapt. This means that you need a workflow that allows you to work iteratively and take into account the structure and ambiguity of natural language. (View Highlight)
  • I believe that closing the gap between prototype and production ultimately comes down to the right developer tooling. We thought about this a lot when designing spacy-llm, an extension library for spaCy that provides components to use open-source and proprietary LLMs for a variety of structured prediction tasks like named entity recognition, text classification or relation extraction, with battle-tested prompts under the hood that can be customized if needed. (View Highlight)
  • Previously, trying out an idea and creating a model that predicts something and that you can improve upon required significant data annotation work with pretty uncertain outcomes. With spacy-llm you can now build a working prototype in mere minutes and access the predictions via the structured Doc object, representing the input text and document-, span- and token-level predictions. This also provides a standardized interface for inputs and outputs during prototyping and production, whether you decide to ship the LLM-powered approach, or replace the components with distilled versions for the production system. (View Highlight)
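
For context, adding an LLM-powered component with spacy-llm looks roughly like the sketch below. It assumes spacy-llm is installed and an API key for the chosen provider is configured; the registered task and model names (here "spacy.TextCat.v3" and "spacy.GPT-4.v2") depend on the installed version, so check them against your setup rather than copying verbatim:

```python
import spacy

# Sketch: an LLM-backed text classifier as a regular spaCy pipeline component.
# Assumes spacy-llm is installed; registered task/model names vary by version,
# and a remote model needs the provider's API key set in the environment.
nlp = spacy.blank("en")
nlp.add_pipe(
    "llm",
    config={
        "task": {
            "@llm_tasks": "spacy.TextCat.v3",
            "labels": ["POSITIVE", "NEGATIVE"],
        },
        "model": {"@llm_models": "spacy.GPT-4.v2"},
    },
)

doc = nlp("The Nebula's screen is gorgeous, but the battery is a letdown.")
print(doc.cats)  # document-level predictions, same Doc API as a trained pipeline
```
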
  • Bootstrapping the pipeline with LLM components gets us to a working system very quickly and produces annotated data for a specific task like text classification or named entity recognition. If we can improve the labels somewhat — either by doing some data aggregation to discard obvious problems, or by correcting them manually, or both — we can train on data that’s more correct than the LLM output, which gives us a chance of getting a model that’s not only smaller and faster, but more accurate, too. (View Highlight)
  • The goal is to create gold-standard data by streaming in and correcting the LLM’s predictions, extracting only the information we’re interested in, until the accuracy of the task-specific model reaches or exceeds the zero- or few-shot LLM baseline. Using transfer learning and initializing the model with pretrained embeddings means that we’ll actually need relatively few task-specific examples. (View Highlight)
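
One way the distillation step can look in practice, sketched with spaCy's standard DocBin serialization; the function name and paths are placeholders:

```python
from typing import Iterable

from spacy.tokens import Doc, DocBin


def save_training_data(corrected_docs: Iterable[Doc], path: str = "./corpus/train.spacy") -> None:
    """Serialize reviewed, LLM-bootstrapped Doc objects for use with `spacy train`."""
    db = DocBin(docs=corrected_docs, store_user_data=True)
    db.to_disk(path)
```

From there, a small task-specific pipeline can be trained with the usual CLI, e.g. `python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy`, where the config and paths are again placeholders.
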
  • When putting a human in the loop, we do need to consider their natural limitations: humans are great at context and ambiguity but much worse than machines at remembering things and performing a series of repetitive steps consistently. We can also get overwhelmed and lose focus if presented with too much information, which leads to mistakes due to inattention. (View Highlight)
  • This means that the best way to ask a human for feedback and corrections can differ a lot from how we would structure a task for a machine learning model. Ideally, we want to construct the interface to reduce cognitive load and ask the human only for the minimum information needed to correct the annotations pre-selected by the LLM. (View Highlight)
  • The above example illustrates two possible interfaces defined with Prodigy. The multiple-choice interface (left) includes both positive and negative sentiment for various attributes, while also allowing null values if no option is selected, since not all reviews cover all available aspects. An alternative approach (right) is to make multiple passes over the data and annotate sentiment for one aspect at a time. While this seems like it creates more work for the human, experiments have shown that this approach can actually be significantly faster than annotating all categories at the same time, as it helps the annotator focus and lets them move through the examples much quicker. (Also see the S&P Global case study for a 10× increase in data development speed by annotating labels individually.) (View Highlight)
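
For reference, the one-aspect-at-a-time pass maps onto a built-in Prodigy recipe call along the lines of `prodigy textcat.manual phone_reviews_battery ./reviews.jsonl --label BATTERY_POSITIVE,BATTERY_NEGATIVE`, run once per aspect; the dataset name, source file and labels here are placeholders, not values from the case study.
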
  • It’s easy to forget that in applied work, you’re allowed to make the problem easier! This isn’t university or academia, where you’re handed a dataset of fixed inputs and outputs. And as with other types of engineering, it isn’t a contest to come up with the most clever solutions to the most difficult problems. We’re trying to get things done. If we can replace a difficult problem with a much easier one, that’s a win. (View Highlight)
  • Reducing your system’s operational complexity also means that less can go wrong. And if things do, it becomes a lot easier to diagnose and fix them. At their core, many NLP systems consist of relatively flat classifications. You can shove them all into a single prompt, or you can decompose them into smaller pieces that you can work on independently. A lot of classification tasks are actually very straightforward to solve nowadays – but they become vastly more complicated if one model needs to do them all at once. (View Highlight)
  • Instead of throwing away everything we’ve learned about software design and asking the LLM to do the whole thing all at once, we can break up our problem into pieces, and treat the LLM as just another module in the system. When our requirements change or get extended, we don’t have to go back and change the whole thing. (View Highlight)
  • Just because the result is easier, it doesn’t necessarily mean that getting there is trivial. There just isn’t much discussion of this process, since research typically (and rightly) focuses on generalizable solutions to the hardest problems. You do need to understand all the techniques you have available, and make an informed choice. Just like with programming, this skill develops through experience and you get better at it the more problems you solve. (View Highlight)
  • When we refactor code, we’re often trying to regroup the logic into functions that have a clear responsibility, and clearly defined behavior. We shouldn’t have to understand the whole program to know how some function should be behaving – it should make sense by itself. When you’re creating an NLP pipeline, you’re also dividing up computation into a number of functions. In simple terms, this can be broken down into a formula like this: result = business_logic(classification(text)) (View Highlight)
  • Going back to our example of phone reviews, how can we divide up the work? It’s the same type of question that comes up in any other refactoring discussion, only that the stakes are much higher. You can really change how your model performs and how it generalizes to future data by the choices you make here. We can train a model to predict mentions of phone models, but we shouldn’t train it to tag mentions with “latest model”. Maybe some phone was the latest when the review was written, but isn’t now. There’s no way to know that from the text itself, which is what the model is working from. You could try to include all these extra database features as well to make the prediction, but… why? If you know what phone the review is talking about, and you have a database that tells you when that model was released, you can compute the information deterministically. (View Highlight)
  • Classification: product name = SpacePhone, product model name = Nebula, phone = true
    Business logic: catalog reference = P3204-W2130, latest model = true, touchscreen phone = true
    The model really has no chance to predict labels like “latest model” or “touchscreen” reliably. At best it can try to learn what’s latest at a given time, or memorize that “nebula” is always “touchscreen” (some very expensive conditional logic). This is really bad for your application: the model is learning something that doesn’t generalize, and your evaluation data might not show you the problem. (View Highlight)
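
A sketch of how that split can look in code, following the result = business_logic(classification(text)) shape from above; the catalog contents and helper logic are invented for illustration, and the classification function is just a stand-in for the trained component:

```python
from datetime import date

# Invented catalog standing in for a real product database.
CATALOG = {
    "Nebula": {
        "catalog_reference": "P3204-W2130",
        "released": date(2023, 9, 1),
        "touchscreen": True,
    },
}


def classification(text: str) -> dict:
    """Stand-in for the learned component: it only extracts what's in the text."""
    mentioned = [name for name in CATALOG if name.lower() in text.lower()]
    return {"product_model_name": mentioned[0] if mentioned else None}


def business_logic(prediction: dict) -> dict:
    """Deterministic lookups for everything that depends on external, changing facts."""
    model_name = prediction["product_model_name"]
    if model_name is None:
        return prediction
    record = CATALOG[model_name]
    latest_release = max(entry["released"] for entry in CATALOG.values())
    return {
        **prediction,
        "catalog_reference": record["catalog_reference"],
        "touchscreen": record["touchscreen"],
        "latest_model": record["released"] == latest_release,
    }


result = business_logic(classification("The Nebula's camera is stunning."))
```
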
  • Try to look through the texts from the model’s point of view. Imagine this is a task you’re being told to do, and the input is all you’ve got to work with. What do you need to know to get to the answer? If the model just needs to know what words generally mean and how sentences are structured, and then apply a fairly simple labelling policy, that’s good. If the model needs to combine that language knowledge with a lot of external knowledge, that’s bad. If the answer will depend on facts that change over time, that’s really bad. (View Highlight)
  • In many real-world applications, AI models are going to be just one function in a larger system. So as with any function, you can move the work around. You can have a function that does a whole task end-to-end, or you can break off some bits and coordinate the functions together. (View Highlight)
  • This refactoring process is very important, because it changes what the model learns and how it can generalize. With the right tooling, you can refactor faster and make sure you don’t just commit to your first idea. Taking control of more of the stack is also helpful, because you’re inheriting fewer constraints from how the upstream component happens to work. (View Highlight)
  • Finally, with the tools and techniques we have available, there’s absolutely no need to compromise on software development best practices or data privacy. People promoting “one model to rule them all” approaches might be telling that story, but the rest of us aren’t obliged to believe them. (View Highlight)