We are excited to introduce our new and powerful embeddings model. It comes with an Apache 2.0 license and is available on Hugging Face.
A significant hurdle for modern generative models is their inability to directly interact with your data. Consider a scenario where your task is to generate a report on recent market trends based on internal research documents. Traditional generative models fall short here as they don’t have access to or understanding of your internal documents, making it impossible for them to generate the required report.
To address this challenge, the Retrieval-Augmented Generation (RAG) technique offers a solution. Imagine you have a repository of internal research on market trends. This repository can be processed through an embeddings model to convert the documents into a searchable format within a vector database. When you need a report on market trends, the embeddings model can locate and fetch the most relevant documents. These documents can then inform a generative model, enabling it to produce a detailed report based on your specific data.
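To make the flow concrete, here is a minimal sketch of the retrieval step. Everything in it is illustrative: a toy hashing function stands in for a real embeddings model, and an in-memory matrix stands in for the vector database.

```python
import numpy as np

def embed(texts):
    """Toy stand-in for a real embeddings model: hashed bag-of-words vectors,
    normalized so dot products equal cosine similarity."""
    dim = 256
    vectors = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for token in text.lower().split():
            vectors[i, hash(token) % dim] += 1.0
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-9)

# Index step: embed the internal research documents once.
corpus = [
    "The smartphone market grew 4% year over year.",
    "The cafeteria menu changes every Monday.",
    "Analysts expect the wearables market to consolidate in 2024.",
]
index = embed(corpus)  # the "vector database": one row per document

# Query step: embed the question and rank documents by cosine similarity.
query = "recent market trends"
scores = index @ embed([query])[0]
top_docs = [corpus[i] for i in np.argsort(-scores)[:2]]

print(top_docs)  # the context a generative model would use to write the report
```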
Today, we want to return some of our attention to slightly better-charted waters: we are releasing our flagship, state-of-the-art English embeddings model, which can be dropped straight into your existing search pipeline. No fancy custom code or trust_remote_code flag required.
As of March 2024, our model achieves state-of-the-art performance on the Massive Text Embedding Benchmark (MTEB) among open-source models of its size class (for closed-source models, model sizes are not public, so a size-matched comparison isn’t possible).
Our model is extremely easy to use with your existing search stack: replace your first-stage retrieval with our model, and you’re ready to go. You have two options: host the model yourself (offline) or use our upcoming API (online).
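For the self-hosting option, a minimal sketch with sentence-transformers might look like the following. Note that the model ID is a placeholder, not the actual release name.

```python
# Self-hosting sketch via sentence-transformers. The model ID below is a
# placeholder; substitute the Hugging Face ID of the released model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("your-org/your-embeddings-model")

passages = [
    "Smartphone shipments grew 4% in Q1.",
    "Analysts expect the wearables market to consolidate.",
]
# One embedding per passage; normalized so dot product == cosine similarity.
passage_embeddings = model.encode(passages, normalize_embeddings=True)
print(passage_embeddings.shape)  # (num_passages, embedding_dim)
```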
The prompt improves the model’s understanding of how the embedding will be used in downstream tasks, which in turn improves performance. For now, we support only one prompt, but our experiments show that domain-specific prompts can increase performance further. If you are doing information retrieval, prepend the prompt “Represent this sentence for searching relevant passages: ” to your query. For everything else, just use the text as it is.
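Continuing the sketch above, only the query gets the prompt; passages are embedded as-is:

```python
# Only the query is prefixed with the retrieval prompt; the passages above
# were embedded without it.
prompt = "Represent this sentence for searching relevant passages: "
query_embedding = model.encode(
    prompt + "What are the recent market trends?",
    normalize_embeddings=True,
)

# Dot product of normalized vectors == cosine similarity.
scores = passage_embeddings @ query_embedding
print(passages[scores.argmax()])
```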
While many models are trained on ready-made datasets, which tend to be outdated and far removed from real-world use cases, we spent a lot of time building our own. We scraped a large part of the internet, cleaned the data, and used it to construct our training dataset.
MTEB is a large text embedding benchmark that measures embeddings models across seven tasks: classification, clustering, pair classification, re-ranking, retrieval, STS (semantic textual similarity), and summarization. It includes 56 datasets from various domains and with various text lengths.
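For reference, evaluations like this can be reproduced with the open-source mteb package. The snippet below is a sketch: the task selection and model ID are illustrative, and the package API may differ across versions.

```python
# Sketch: evaluate an embeddings model on a couple of MTEB tasks.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("your-org/your-embeddings-model")  # placeholder ID
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="results")
```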
Our new model ranks first among embeddings models of similar size, outperforms OpenAI’s new embeddings model, text-embedding-3-large, and matches the performance of models 20x its size, such as echo-mistral-7b. You can find the evaluation results on the official MTEB leaderboard.
PS: You can boost performance even further by combining our embeddings model with our rerank model.
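A retrieve-then-rerank setup could look like the sketch below. The reranker ID is a placeholder, and a sentence-transformers CrossEncoder stands in for whatever rerank interface you use.

```python
# Second stage: rescore the first-stage candidates with a reranker.
# The model ID is a placeholder for an actual rerank model.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("your-org/your-rerank-model")

query = "What are the recent market trends?"
candidates = [  # e.g. the top-k passages from first-stage retrieval
    "The smartphone market grew 4% year over year.",
    "Analysts expect the wearables market to consolidate in 2024.",
]
scores = reranker.predict([(query, passage) for passage in candidates])
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # the highest-scoring passage after reranking
```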
The results on MTEB show that our model performs well across many different tasks and domains. This means the model can adapt to a wide range of use cases and topics, making it an obvious choice for users.
Recently, we’ve observed that some models are advertised as supporting long context in order to mitigate chunking. While we recognize that chunking sucks, and it’s something we are working to solve, using a long-context model is not the solution.
With embeddings, we aim to capture the semantics of a text. For illustrative purposes, think of your own long-context documents. They can contain many different pieces of information and multiple topics, which can be unrelated or even contradictory. Accurately representing all of this with a single embedding is almost prohibitively difficult, which is why we decided not to support long context and to solve this issue in a smarter, more sensible way instead. Stay tuned, cool stuff is coming soon!
This is our first production-ready embeddings model, and we greatly welcome any feedback that helps us make our models better, more user-friendly, or more capable. Please let us know if you’re hungry for new features or have encountered any issues. We value your feedback!