Highlights

  • We’ve released rerankers (a while ago now, with no major issues reported since, warranting this blog post!), a low-dependency Python library providing a unified interface to all commonly used re-ranking models. It’s available on GitHub here. In this post, we quickly discuss:
    1. Why two-stage pipelines are so popular, and how they’re born of various trade-offs.
    2. The various methods now commonly used in re-ranking.
    3. rerankers itself, its design philosophy and how to use it. (View Highlight)
  • In Information Retrieval, the use of two-stage pipelines is often regarded as the best approach to maximise retrieval performance. In effect, this means that a small set of candidate documents is first retrieved by a computationally efficient retrieval method, to then be re-scored by a stronger, generally neural network-based, model. This latter stage is widely known as re-ranking, as the list of retrieved documents is re-ordered by the second model. (View Highlight)
  • However, using re-ranking models is often more complex than it needs to be. For starters, there are a lot of methods, each with their own pros and cons, and it’s often difficult to know which one is best for a given use case. This issue is compounded by the fact that most of these methods are implemented in sometimes wildly different code-bases. As a result, trying out different approaches can require a non-trivial amount of work, which would be better spent in other areas. (View Highlight)
  • A while back, I posted on Twitter a quick overview of the “best starter re-ranking model” for every use case, based on latency requirements and environment constraints, to help people get started in their exploration. It got unexpectedly popular, as it’s quite a difficult landscape to map. Below is an updated version of that chart, incorporating a few new models, including our very own answerdotai/answer-colbert-small-v1: (View Highlight)
  • As you can see, even figuring out your starting point can be complicated! In production settings, this often means that re-ranking gets neglected, as the first couple of solutions are make-or-break: either they’re “good enough” and get used, even if suboptimal, or they’re not good enough, and re-ranking gets relegated to future explorations. To help solve this problem, we introduced the rerankers library. rerankers is a low-dependency, compact library which aims to provide a common interface to all commonly used re-ranking methods. It allows for easy swapping between different methods, with minimal code changes, while keeping a unified input/output format. rerankers is designed with extensibility in mind, making it very easy to add new methods, which can either be re-implementations, or simply wrappers for existing code-bases. (View Highlight)
  • The problem essentially boils down to the trade-off between performance and efficiency. The most common way to do retrieval is to use a lightweight approach, either keyword-based (BM25), or based on neural-network generated embeddings. In the case of the latter, you will simply embed your query with the same model that you previously embedded your documents with, and will use cosine similarity to measure how “relevant” certain documents are to the query: this is what gets called “vector search”. (View Highlight)
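
    A rough sketch of what this “vector search” step can look like (the sentence-transformers model name is just an illustrative example): documents are embedded ahead of time, and at query time only the query is embedded and compared by cosine similarity.

    from sentence_transformers import SentenceTransformer, util

    # Documents are embedded once, offline: these are the "cold" representations.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    docs = ["ColBERT is a late-interaction retrieval model.",
            "BM25 is a classic keyword-based ranking function."]
    doc_embeddings = model.encode(docs, convert_to_tensor=True)

    # At query time, only the (short) query needs to be embedded.
    query_embedding = model.encode("what is a late-interaction model?", convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]  # cosine similarity to every doc
    for doc, score in sorted(zip(docs, scores.tolist()), key=lambda x: x[1], reverse=True):
        print(f"{score:.3f}  {doc}")
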
  • In the case of both keyword-based retrieval and vector search, the computational cost of the retrieval step is extremely low: at most, you need to run inference on a single, most likely short, query, and perform some very computationally cheap similarity computations. However, this comes at a cost: the retrieval step is performed in a “cold” way: your documents were processed a long time ago, and their representations are frozen in time. This means they’re entirely unaware of the information you’re looking for with your query, making the task harder, as the model is expected to represent both documents and queries in a way that makes them easily comparable. Moreover, it has to do so without even knowing what kind of information we’ll be looking for! (View Highlight)
  • This is where re-ranking comes in. A ranking model will typically consider both the query and each document at inference time, and rank the documents by relevance accordingly. This is great: your model is both query-aware and document-aware at inference time, meaning it can capture much more fine-grained interactions between the two. As a result, it can capture nuances required by your query which would otherwise be missed. However, the computational cost is steep: in this set-up, representations cannot be pre-computed, and inference must be run on all potentially relevant documents. This makes such models completely unsuitable for any sort of large, or even medium, scale retrieval task, as the computational cost would be prohibitive. (View Highlight)
  • You can probably see where I’m going with this, now: why not combine the two? If we’ve got families of models that are able to very efficiently retrieve potentially relevant documents, and another set of models which are much less efficient, but able to rank documents more accurately, why not use both? By using the former, you can generate a much more restricted set of candidate documents, fetching the 10, 50, or even 100 most “similar” documents to your query. You can then use the latter to re-rank this manageably sized set of documents and produce your final ordered ranking, as sketched below: (View Highlight)
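
    A rough sketch of such a two-stage pipeline (the first-stage retriever is left as a placeholder, and the cross-encoder model name is only an illustrative choice):

    from rerankers import Reranker

    # Second stage: a stronger but slower re-ranking model. The model name is just
    # an example; any re-ranker supported by the library could be used here.
    ranker = Reranker("mixedbread-ai/mxbai-rerank-base-v1", model_type="cross-encoder")

    def two_stage_search(query, corpus, first_stage_search, k=100):
        # Stage 1: a cheap retriever (BM25 or vector search) narrows the corpus down to
        # the k most "similar" candidates. `first_stage_search` is a placeholder for
        # whatever retriever you use; assume it returns (doc_ids, documents).
        candidate_ids, candidates = first_stage_search(query, corpus, k=k)
        # Stage 2: the slower, more accurate model re-scores only those k candidates.
        return ranker.rank(query=query, docs=candidates, doc_ids=candidate_ids)
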
  • This is essentially what two-stage pipelines boil down to: they work around the trade-offs of various retrieval approaches to produce the best possible final ranking, with fast-but-less-accurate retrieval models feeding into slow-but-more-accurate ranking models. (View Highlight)
  • For a long time, re-ranking was dominated by cross-encoder models, which are essentially just binary sequence classification models built on BERT-like encoders: these models are given both the query and a document as input, and they output a “relevance” score for the pair, which is the probability assigned to the positive class. This approach, outputting a score for each query-document pair, is called Pointwise re-ranking. (View Highlight)
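
    A minimal sketch of pointwise cross-encoder scoring (using sentence-transformers’ CrossEncoder; the model name is just an example, and depending on the checkpoint the raw score may be a logit rather than a calibrated probability):

    from sentence_transformers import CrossEncoder

    # The cross-encoder sees each (query, document) pair jointly and outputs one
    # relevance score per pair: this is pointwise re-ranking.
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    pairs = [("how do I cook pasta?", "Boil salted water, then cook the pasta for ten minutes."),
             ("how do I cook pasta?", "The stock market closed higher today.")]
    scores = cross_encoder.predict(pairs)  # higher score = more relevant
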
  • However, as time went on, an increasing number of new, powerful re-ranking methods have emerged. One such example is MonoT5, where the model is trained to output a “relevant” or “irrelevant” token, with the likelihood of the “relevant” token being used as a relevance score. This line of work has recently been revisited with LLMs, with models such as BGE-Gemma2 calibrating a 9-billion-parameter model to output relevance scores through the log-likelihood of the “relevant” token. (View Highlight)
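
    A sketch of the MonoT5-style token-likelihood idea (the checkpoint name, prompt template and “true”/“false” target tokens are assumptions based on the original MonoT5 setup; other models may use different ones):

    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco-10k")
    model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco-10k")

    def monot5_score(query: str, doc: str) -> float:
        # The model reads "Query: ... Document: ... Relevant:" and the probability it
        # assigns to its "relevant" target token is used as the relevance score.
        inputs = tokenizer(f"Query: {query} Document: {doc} Relevant:",
                           return_tensors="pt", truncation=True)
        # Run a single decoding step and inspect the logits of the first generated token.
        decoder_input = torch.full((1, 1), model.config.decoder_start_token_id, dtype=torch.long)
        logits = model(**inputs, decoder_input_ids=decoder_input).logits[0, -1]
        true_id = tokenizer.encode("true", add_special_tokens=False)[0]    # "relevant" token
        false_id = tokenizer.encode("false", add_special_tokens=False)[0]  # "irrelevant" token
        return torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()
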
  • Another example is the use of late-interaction retrieval models, such as our own answerdotai/answer-colbert-small-v1 (read more about it here), repurposed as re-ranking models. (View Highlight)
  • Other methods do not directly output relevance scores, but simply re-order documents by relevancy. These are called Listwise methods: they take in a list of documents, and output the documents in an updated order, based on relevance. This has traditionally been done using T5-based models. However, recent work is now exploring the use of LLMs for this, either in a zero-shot fashion (RankGPT), or by fine-tuning smaller models on the output of frontier models (RankZephyr). (View Highlight)
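
    A simplified listwise sketch in the spirit of RankGPT (the prompt, the model name and the “[2] > [1] > [3]” output format are simplifying assumptions, not RankGPT’s actual prompt):

    import re
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def listwise_rank(query: str, docs: list[str], model: str = "gpt-4o-mini") -> list[str]:
        # The LLM sees every candidate at once and returns an ordering, not per-document scores.
        passages = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
        prompt = (f"Rank the passages below by relevance to the query.\n"
                  f"Query: {query}\nPassages:\n{passages}\n"
                  f"Answer with identifiers only, most relevant first, e.g. [2] > [1] > [3].")
        answer = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        ).choices[0].message.content
        order = [int(m) - 1 for m in re.findall(r"\[(\d+)\]", answer)]  # parse "[2] > [1] > [3]"
        return [docs[i] for i in order if 0 <= i < len(docs)]
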
  • The main point is that there exist many different approaches to re-ranking, each with their own pros and cons. The more annoying truth is that there currently is no silver-bullet re-ranking method that’ll work for all use cases: you have to figure out exactly which one works best for your situation (and sometimes, that even involves fine-tuning your own!). Even more annoying is that doing so requires quite a lot of code iteration, as most of the methods listed above are not implemented in a way that allows for easily swapping one out for another: each expects inputs formatted in a certain way, and each outputs scores in its own way. (View Highlight)
  • rerankers as a library follows a clear design philosophy, with a few key points:
    • As with our other retrieval libraries, RAGatouille and Byaldi, the goal is to be fully-featured while requiring the fewest lines of code possible.
    • It aims to provide support for all common re-ranking methods, through a common interface, without any retrieval performance degradation compared to official implementations.
    • rerankers must be lightweight and modular. It is low-dependency, and it should allow users to only install the dependencies required for their chosen methods.
    • It should be easy to extend. It should be very easy to add new methods, whether they’re custom re-implementations, or wrappers around existing libraries. (View Highlight)
  • Every method supported by rerankers is implemented around the Reranker class. It is used as the main interface to load models, no matter the underlying implementation or requirements. You can initialise a Reranker with a model name or path, with full HuggingFace Hub support, and a model_type parameter, which specifies the type of model you’re loading. By default, a Reranker will attempt to use the GPU and half-precision if available on your system, but you can also pass a dtype and device (when relevant) to further control how the model is loaded. API-based methods can be passed an API key directly, although the better approach is to use the API provider’s preferred environment variable. (View Highlight)
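
    For example (the model names, model_type values and dtype/device strings below are illustrative; check the library’s documentation for the exact supported options):

    from rerankers import Reranker

    # The same interface loads very different methods.
    ranker = Reranker("mixedbread-ai/mxbai-rerank-base-v1", model_type="cross-encoder")

    # Optionally override how the model is loaded:
    ranker = Reranker("mixedbread-ai/mxbai-rerank-base-v1", model_type="cross-encoder",
                      dtype="fp16", device="cuda")

    # API-based methods pick their key up from the provider's usual environment
    # variable, or accept it explicitly:
    ranker = Reranker("cohere", api_key="YOUR_API_KEY")
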
  • Similarly to how Reranker serves as a single interface to various models, RankedResults objects are a centralised way to represent the outputs of various models, themselves containing Result objects. Both RankedResults and Result are fully transparent, allowing you to iterate through RankedResults and retrieve their associated attributes. (View Highlight)
  • The main aim of RankedResults and Result is to serve as helpers. Most notably, each Result object stores the original document, as well as, in the case of pointwise methods, the score outputted by the model. They also contain the document ID, and, optionally, document meta-data, to facilitate usage in production settings. The output of rank() is always a RankedResults object, and will always preserve all the information associated with the documents:

    Ranking a set of documents returns a RankedResults object, preserving meta-data and document-ids.

    results = ranker.rank(query="I love you", docs=["I hate you", "I really like you"], doc_ids=[0,1], metadata=[{'source': 'twitter'}, {'source': 'reddit'}])
    results

    RankedResults(results=[Result(document=Document(text='I really like you', doc_id=1, metadata={'source': 'reddit'}), score=-2.453125, rank=1), Result(document=Document(text='I hate you', doc_id=0, metadata={'source': 'twitter'}), score=-4.14453125, rank=2)], query='I love you', has_scores=True) (View Highlight)
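
    Since each Result exposes its document (text, doc_id, metadata) as well as its score and rank, the output can be consumed directly, for example:

    for result in results:
        print(result.rank, result.score, result.document.doc_id, result.document.text)
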

  • Modularity: rerankers is designed specifically with ease of extensibility in mind. All approaches are independently implemented and have individually-defined sets of dependencies, which users are free to install or not based on their needs. Informative error messages are shown when a user attempts to load a model type that is not supported by their currently installed dependencies.
    Extensibility: As a result, adding a new method simply requires making its inputs and outputs compatible with the rerankers-defined format, as well as a simple modification of the main Reranker class to specify a default model. This approach to modularity has allowed us to support all of these approaches with minimal engineering effort. We fully encourage researchers to integrate their novel methods into the library and will provide support for those seeking to do so. (View Highlight)
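
    A hypothetical sketch of what wrapping a new method can look like (the import paths are assumptions about the library’s internals, and my_model.score is a placeholder for your own scoring code; the Result/RankedResults fields match the output format shown above):

    from rerankers.results import RankedResults, Result
    from rerankers.documents import Document

    class MyRanker:
        def __init__(self, my_model):
            self.model = my_model  # any existing scoring code you want to wrap

        def rank(self, query, docs, doc_ids=None):
            doc_ids = doc_ids if doc_ids is not None else list(range(len(docs)))
            scores = self.model.score(query, docs)  # placeholder for your own logic
            ranked = sorted(zip(doc_ids, docs, scores), key=lambda x: x[2], reverse=True)
            results = [Result(document=Document(text=d, doc_id=i), score=s, rank=r + 1)
                       for r, (i, d, s) in enumerate(ranked)]
            return RankedResults(results=results, query=query, has_scores=True)
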