Metadata
- Author: Benjamin Clavié
- Full Title: Rerankers: A Lightweight Python Library to Unify Ranking Methods
- URL: https://www.answer.ai/posts/2024-09-16-rerankers.html
Highlights
- We’ve released (a while ago, now, with no further report of any major issues, warranting this blog post!) `rerankers`, a low-dependency Python library to provide a unified interface to all commonly used re-ranking models. It’s available on GitHub here. In this post, we quickly discuss:
  - Why two-stage pipelines are so popular, and how they’re born of various trade-offs.
  - The various methods now commonly used in re-ranking.
  - `rerankers` itself, its design philosophy and how to use it. (View Highlight)
- In Information Retrieval, the use of two-stage pipelines is often regarded as the best approach to maximise retrieval performance. In effect, this means that a small set of candidate documents is first retrieved by a computationally efficient retrieval method, to then be re-scored by a stronger, generally neural network-based, model. This latter stage is widely known as re-ranking, as the list of retrieved documents is re-ordered by the second model. (View Highlight)
- However, using re-ranking models is often more complex than it needs to be. For starters, there are many methods, each with their own pros and cons, and it’s often difficult to know which one is best for a given use case. This issue is compounded by the fact that most of these methods are implemented in sometimes wildly different code-bases. As a result, trying out different approaches can require a non-trivial amount of work, which would be better spent in other areas. (View Highlight)
- A while back, I posted on Twitter a quick overview of the “best starter re-ranking model” for every use case, based on latency requirements and environment constraints, to help people get started in their exploration. It got unexpectedly popular, as it’s quite a difficult landscape to map. Below is an updated version of that chart, incorporating a few new models, including our very own answerdotai/answer-colbert-small-v1: (View Highlight)
- As you can see, even figuring out your starting point can be complicated! In production situations, this often means that re-ranking gets neglected, as the first couple of solutions tried are make-or-break: either they’re “good enough” and get used, even if suboptimal, or they’re not good enough, and re-ranking gets relegated to future explorations.
To help solve this problem, we introduced the `rerankers` library. `rerankers` is a low-dependency, compact library which aims to provide a common interface to all commonly used re-ranking methods. It allows for easy swapping between different methods, with minimal code changes, while keeping a unified input/output format. `rerankers` is designed with extensibility in mind, making it very easy to add new methods, which can either be re-implementations, or simply wrappers for existing code-bases. (View Highlight)
- The problem essentially boils down to the trade-off between performance and efficiency. The most common way to do retrieval is to use a lightweight approach, either keyword-based (BM25), or based on neural-network generated embeddings. In the case of the latter, you will simply embed your query with the same model that you previously embedded your documents with, and will use cosine similarity to measure how “relevant” certain documents are to the query: this is what gets called “vector search”. (View Highlight)
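To make the “vector search” half of that trade-off concrete, here is a minimal sketch using the sentence-transformers library (the model name is just a commonly used example): documents are embedded once ahead of time, and at query time only the query needs to be encoded before a cheap cosine-similarity comparison.

```python
# Minimal "vector search" sketch: documents are embedded offline, queries at
# request time, and relevance is approximated by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # example model

docs = ["Paris is the capital of France.", "The cat sat on the mat."]
doc_embs = model.encode(docs, normalize_embeddings=True)  # pre-computed, "frozen in time"

query_emb = model.encode("What is the capital of France?", normalize_embeddings=True)
scores = doc_embs @ query_emb  # cosine similarity, since embeddings are normalised

for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```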
- In the case of both keyword-based retrieval and vector search, the computational cost of the retrieval step is extremely low: at most, you need to run inference for a single, most likely short, query, and perform some very computationally cheap similarity computations. However, this comes at a cost: the retrieval step is performed in a “cold” way: your documents were processed a long time ago, and their representations are frozen in time. This means that they’re entirely unaware of the information you’re looking for with your query, making the task harder, as the model is expected to represent both documents and queries in a way that’ll make them easily comparable. Moreover, it has to do so without even knowing what kind of information we’ll be looking for! (View Highlight)
- This is where re-ranking comes in. A ranking model, typically, will always consider both queries and documents at inference-time, and will accordingly rank the documents by relevance. This is great: your model is both query-aware and document-aware at inference time, meaning it can capture much more fine-grained interactions between the two. As a result, it can capture nuances that your query might require which would otherwise be missed. However, the computational cost is steep: in this set-up, representations cannot be pre-computed, and inference must be run on all potentially relevant documents. This makes this kind of model completely unsuitable for any sort of large, or even medium, scale retrieval task, as the computational cost would be prohibitive. (View Highlight)
- You can probably see where I’m going with this, now: why not combine the two? If we’ve got families of models that are able to very efficiently retrieve potentially relevant documents, and another set of models which are much less efficient, but able to rank documents more accurately, why not use both? By using the former, you can generate a much more restricted set of candidate documents, by fetching the 10, 50, or even 100 most “similar” documents to your query. You can then use the latter to re-rank this manageable-sized set of documents, to produce your final ordered ranking: (View Highlight)
- This is essentially what two-stage pipelines boil down to: they work around the trade-offs of various retrieval approaches to produce the best possible final ranking, with fast-but-less-accurate retrieval models feeding into slow-but-more-accurate ranking models. (View Highlight)
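As a rough, end-to-end illustration of such a pipeline (not taken from the post itself), the sketch below uses BM25 via the rank_bm25 package as the cheap first stage and a `rerankers` `Reranker` as the second stage; the toy corpus, the reranker model name and the `model_type` value are illustrative assumptions.

```python
# Two-stage sketch: BM25 narrows the corpus to a few candidates, then a stronger
# neural re-ranker re-orders only those candidates.
from rank_bm25 import BM25Okapi
from rerankers import Reranker

corpus = [
    "Two-stage pipelines pair a fast retriever with a slower, stronger re-ranker.",
    "BM25 is a classic keyword-based retrieval function.",
    "Cats are popular pets.",
]

# Stage 1: cheap keyword retrieval over the full collection.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
query = "how do two-stage retrieval pipelines work?"
scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:2]

# Stage 2: re-rank only the retrieved candidates with a stronger model.
ranker = Reranker("mixedbread-ai/mxbai-rerank-base-v1", model_type="cross-encoder")
results = ranker.rank(query=query, docs=[corpus[i] for i in top_k], doc_ids=top_k)
```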
- For a long time, re-ranking was dominated by cross-encoder models, which are essentially just binary sentence classification models built on BERT-like encoders: these models are given both the query and a document as input, and they output a “relevance” score for the pair, which is the probability it assigns to the positive class. This approach, outputting a score for each query-document pair, is called Pointwise re-ranking. (View Highlight)
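For a sense of what a pointwise cross-encoder looks like in practice, here is a small sketch using sentence-transformers’ CrossEncoder class with a commonly used example model; depending on the checkpoint, the score is a probability or a raw logit, but either way higher means more relevant.

```python
# Pointwise re-ranking: score each (query, document) pair independently.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model
query = "What is the capital of France?"
docs = ["Paris is the capital of France.", "The cat sat on the mat."]

scores = model.predict([(query, doc) for doc in docs])  # one score per pair
for doc, score in sorted(zip(docs, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```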
- However, as time went on, an increasing number of new, powerful re-ranking methods have emerged. One such example is MonoT5, where the model is trained to output a “relevant” or “irrelevant” token, with the likelihood of the “relevant” token being used as a relevance score. This line of work has recently been revisited with LLMs, with models such as BGE-Gemma2 calibrating a 9 billion parameter model to output relevance scores through the log-likelihood of the “relevant” token. (View Highlight)
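The snippet below is a rough illustration of that token-likelihood idea, scoring a query-document pair with a MonoT5 checkpoint by comparing the probabilities of the “true” and “false” target tokens (the checkpoint’s own labels for relevant/irrelevant) at the first decoding step. The prompt format and token choices follow my understanding of the MonoT5 recipe, so treat this as an approximation rather than a faithful reimplementation.

```python
# MonoT5-style scoring: relevance = P("true") vs P("false") for the first
# generated token, given a "Query: ... Document: ... Relevant:" prompt.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("castorini/monot5-base-msmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("castorini/monot5-base-msmarco")

query = "What is the capital of France?"
doc = "Paris is the capital of France."
inputs = tok(f"Query: {query} Document: {doc} Relevant:", return_tensors="pt")

with torch.no_grad():
    # Only the very first decoding step is needed to read the relevance token.
    decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]

true_id = tok.encode("true", add_special_tokens=False)[0]
false_id = tok.encode("false", add_special_tokens=False)[0]
score = torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()
print(f"relevance ≈ {score:.3f}")
```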
- Another example is the use of late-interaction retrieval models, such as our own answerdotai/answer-colbert-small-v1 (read more about it here), repurposed as re-ranking models. (View Highlight)
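Late-interaction models score documents with “MaxSim”: each query token embedding is matched against its most similar document token embedding, and those maxima are summed. The sketch below shows only that scoring step, on random dummy embeddings rather than real model outputs.

```python
# MaxSim scoring as used by ColBERT-style late-interaction models.
import numpy as np

rng = np.random.default_rng(0)
query_embs = rng.normal(size=(5, 128))   # 5 query tokens, 128-dim each (dummy)
doc_embs = rng.normal(size=(40, 128))    # 40 document tokens (dummy)

# Normalise so dot products are cosine similarities.
query_embs /= np.linalg.norm(query_embs, axis=1, keepdims=True)
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

sim = query_embs @ doc_embs.T            # (5, 40) token-level similarity matrix
maxsim_score = sim.max(axis=1).sum()     # best document token per query token, summed
print(maxsim_score)
```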
- Other methods do not directly output relevance scores, but simply re-order documents by relevance. These are called Listwise methods: they take in a list of documents, and output the same documents in an updated order, based on relevance. This has traditionally been done using T5-based models. However, recent work is now exploring the use of LLMs for this, either in a zero-shot fashion (RankGPT), or by fine-tuning smaller models on the output of frontier models (RankZephyr). (View Highlight)
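A heavily simplified, zero-shot listwise prompt in the spirit of RankGPT might look like the sketch below; the prompt wording is a toy version rather than the actual RankGPT prompt, and the model name is just an example.

```python
# Listwise re-ranking with an LLM: show numbered passages, ask for an ordering.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

query = "What is the capital of France?"
docs = ["The cat sat on the mat.", "Paris is the capital of France."]
listing = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs))

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[{
        "role": "user",
        "content": (
            f"Rank the passages below by relevance to the query.\n"
            f"Query: {query}\n{listing}\n"
            f"Answer with the passage numbers only, most relevant first, e.g. 1 > 0."
        ),
    }],
)
print(response.choices[0].message.content)  # e.g. "1 > 0"
```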
- The main point is that there exist many different approaches to re-ranking, each with their own pros and cons. The more annoying truth is that there currently is no silver-bullet re-ranking method that’ll work for all use cases: you have to figure out exactly which one works best for your situation (and sometimes, that even involves fine-tuning your own!). Even more annoying is that doing so requires quite a lot of code iteration, as most of the methods listed above are not implemented in a way that allows for easy swapping of one for another. They all expect inputs formatted in a certain way, while also outputting scores in their own way. (View Highlight)
- `rerankers` as a library follows a clear design philosophy, with a few key points:
  - As with our other retrieval libraries, RAGatouille and Byaldi, the goal is to be fully-featured while requiring the fewest lines of code possible.
  - It aims to provide support for all common re-ranking methods, through a common interface, without any retrieval performance degradation compared to official implementations.
  - `rerankers` must be lightweight and modular. It is low-dependency, and it should allow users to only install the dependencies required for their chosen methods.
  - It should be easy to extend. It should be very easy to add new methods, whether they’re custom re-implementations, or wrappers around existing libraries. (View Highlight)
- Every method supported by `rerankers` is implemented around the `Reranker` class. It is used as the main interface to load models, no matter the underlying implementation or requirements. You can initialise a `Reranker` with a model name or path, with full HuggingFace Hub support, and a `model_type` parameter, which specifies the type of model you’re loading. By default, a `Reranker` will attempt to use the GPU and half-precision if available on your system, but you can also pass a `dtype` and `device` (when relevant) to further control how the model is loaded. API-based methods can be passed an `API_KEY`, although the better way is to use the API provider’s preferred environment variable. (View Highlight)
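Based on the parameters described above, loading models could look roughly like this. The model names are examples, and the exact `model_type` strings (and accepted `dtype` values) should be checked against the library’s README, so treat this as a sketch rather than canonical usage.

```python
from rerankers import Reranker

# A local cross-encoder, letting rerankers pick GPU + half-precision if available.
ranker = Reranker("mixedbread-ai/mxbai-rerank-base-v1", model_type="cross-encoder")

# A late-interaction (ColBERT-style) model, explicitly pinned to the CPU;
# a dtype can also be passed to override the half-precision default.
colbert_ranker = Reranker("colbert-ir/colbertv2.0", model_type="colbert", device="cpu")

# An API-based re-ranker: an API key can be passed directly, but setting the
# provider's preferred environment variable is the recommended approach.
api_ranker = Reranker("rerank-english-v3.0", model_type="cohere")
```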
- Similarly to how `Reranker` serves as a single interface to various models, `RankedResults` objects are a centralised way to represent the outputs of various models, themselves containing `Result` objects. Both `RankedResults` and `Result` are fully transparent, allowing you to iterate through `RankedResults` and retrieve their associated attributes. (View Highlight)
- `RankedResults` and `Result`’s main aim is to serve as a helper. Most notably, each `Result` object stores the original document, as well as the score outputted by the model, in the case of pointwise methods. They also contain the document ID and, optionally, document meta-data, to facilitate usage in production settings. The output of `rank()` is always a `RankedResults` object, and will always preserve all the information associated with the documents. Ranking a set of documents returns a `RankedResults` object, preserving meta-data and document-ids:
```python
results = ranker.rank(
    query="I love you",
    docs=["I hate you", "I really like you"],
    doc_ids=[0, 1],
    metadata=[{"source": "twitter"}, {"source": "reddit"}],
)
results
# RankedResults(results=[Result(document=Document(text='I really like you', doc_id=1, metadata={'source': 'twitter'}), score=-2.453125, rank=1), Result(document=Document(text='I hate you', doc_id=0, metadata={'source': 'reddit'}), score=-4.14453125, rank=2)], query='I love you', has_scores=True)
```
(View Highlight)
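Since `RankedResults` is iterable and each `Result` exposes the attributes shown in the output above, pulling out what you need is straightforward. A small sketch, reusing the `results` object from the example:

```python
# Iterate over the ranked results and read the attributes shown above.
for result in results:
    print(result.rank, result.score, result.document.doc_id, result.document.text)

# Or grab the text of the top-ranked document directly:
best_text = results.results[0].document.text
```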
- Modularity: `rerankers` is designed specifically with ease of extensibility in mind. All approaches are independently-implemented and have individually-defined sets of dependencies, which users are free to install or not based on their needs. Informative error messages are shown when a user attempts to load a model type that is not supported by their currently installed dependencies.
- Extensibility: As a result, adding a new method simply requires making its inputs and outputs compatible with the `rerankers`-defined format, as well as a simple modification of the main `Reranker` class to specify a default model. This approach to modularity has allowed us to support all the approaches with minimal engineering efforts. We fully encourage researchers to integrate their novel methods into the library and will provide support for those seeking to do so. (View Highlight)
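To make that extension story concrete, here is a purely hypothetical sketch of the shape a new back-end needs: a `rank()` method that accepts a query, documents and ids, and returns rank-ordered, score-carrying results. The class and result types below are illustrative stand-ins, not the library’s actual base classes; a real contribution would subclass `rerankers`’ own base ranker and return its `RankedResults` objects.

```python
# Hypothetical sketch of a custom re-ranking back-end conforming to the same
# query/docs/doc_ids -> ranked-results shape used throughout rerankers.
from dataclasses import dataclass

@dataclass
class ScoredDoc:
    doc_id: int
    text: str
    score: float
    rank: int

class MyCustomRanker:
    def __init__(self, scoring_fn):
        # scoring_fn(query, doc) -> float, e.g. a wrapper around an existing codebase.
        self.scoring_fn = scoring_fn

    def rank(self, query: str, docs: list[str], doc_ids: list[int]) -> list[ScoredDoc]:
        scored = [
            (doc_id, doc, self.scoring_fn(query, doc))
            for doc_id, doc in zip(doc_ids, docs)
        ]
        scored.sort(key=lambda item: item[2], reverse=True)
        return [
            ScoredDoc(doc_id=d, text=t, score=s, rank=i + 1)
            for i, (d, t, s) in enumerate(scored)
        ]

# Toy usage: "score" documents by naive word overlap with the query.
ranker = MyCustomRanker(lambda q, d: len(set(q.lower().split()) & set(d.lower().split())))
print(ranker.rank("I love you", ["I hate you", "I really like you"], [0, 1]))
```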