Introducing Marqo Specialized Embedding Models for Ecommerce: Powering Multimodal AI Search
Our benchmarking process was divided into two distinct regimes, each using a different dataset of ecommerce product listings: marqo-ecommerce-hard and marqo-ecommerce-easy. Both datasets contained product images and text and differed only in size. The “easy” dataset is roughly 20 times smaller (200k vs 4M products) and is designed to accommodate rate-limited models, specifically Cohere-Embeddings-v3 and GCP-Vertex (with limits of 0.66 rps and 2 rps respectively). The “hard” dataset represents the true challenge: it contains four million ecommerce product listings and is more representative of real-world ecommerce search scenarios. For both datasets, the models were benchmarked on three different tasks:
• GoogleShopping-Text2Image: uses the product title to search product images from Google Shopping data. This is representative of descriptive queries in search.
• GoogleShopping-Category2Image: uses the product categories as queries to search product images from Google Shopping data. This is analogous to short, keyword-like queries in search.
• AmazonProducts-Text2Image: uses the product title to search product images from Amazon product data. This is representative of descriptive queries in search.
We have made these datasets available on Hugging Face along with scripts to reproduce the evaluation.
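Each of these tasks reduces to the same mechanical step: embed the query text, embed the candidate product images, and rank by cosine similarity. The sketch below illustrates that ranking step with stand-in random vectors; in practice the embeddings would come from one of the models (the 1024 dimension matches Marqo-Ecommerce-L).

```python
import numpy as np

# Stand-in embeddings, purely for illustration: 5 product images, 1024-d,
# plus one query embedding built to be close to image 2.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(5, 1024)).astype(np.float32)
query_emb = image_embs[2] + 0.01 * rng.normal(size=1024).astype(np.float32)

# Normalize so that a dot product equals cosine similarity.
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)
query_emb /= np.linalg.norm(query_emb)

scores = image_embs @ query_emb   # one similarity score per product image
ranking = np.argsort(-scores)     # best match first
print(ranking[0])                 # -> 2, the image the query was derived from
```

At the 4M-product scale of the hard dataset, the brute-force matrix product here would typically be replaced by an approximate nearest-neighbor index, but the ranking semantics are the same.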
The benchmarking results show that the Marqo-Ecommerce models consistently outperformed all other models across various metrics. Specifically, marqo-ecommerce-L achieved an average improvement of 17.6% in MRR and 20.5% in nDCG@10 when compared with the current best open-source model, ViT-SO400M-14-SigLIP, across all three tasks in the marqo-ecommerce-hard dataset. When compared with the best private model, Amazon-Titan Multimodal, we saw an average improvement of 38.9% in MRR and 45.1% in nDCG@10 across all three tasks, and 35.9% in Recall across the Text-to-Image tasks in the marqo-ecommerce-hard dataset.
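For readers unfamiliar with the two headline metrics, here are minimal reference implementations. They assume the simplest evaluation setup, where each query has exactly one correct image (so the ideal DCG is 1); the actual evaluation scripts on Hugging Face are the authoritative definitions.

```python
import math

def mrr(ranked, relevant):
    # Mean Reciprocal Rank for a single query: the reciprocal of the
    # rank at which the relevant item first appears (0 if absent).
    for rank, item in enumerate(ranked, start=1):
        if item == relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_10(ranked, relevant):
    # With one relevant item, ideal DCG is 1, so nDCG@10 is simply the
    # discounted gain of the hit if it lands in the top 10.
    for rank, item in enumerate(ranked[:10], start=1):
        if item == relevant:
            return 1.0 / math.log2(rank + 1)
    return 0.0

print(mrr(["b", "a", "c"], "a"))         # 0.5 (hit at rank 2)
print(ndcg_at_10(["b", "a", "c"], "a"))  # 1/log2(3), about 0.631
```

Averaging either function over all queries in a task yields the aggregate numbers reported above.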
While contrastive learning models like CLIP and SigLIP are powerful, they are not optimized for the needs of ecommerce. They were trained on a large collection of images, many of which aren’t related to ecommerce, with little curation or domain specificity. The product data in ecommerce datasets differs significantly from general-purpose datasets, resulting in suboptimal performance when these models are used for search and recommendations. Additionally, these models were trained on data that is now several years old, and they have no understanding of recent products or trends.
We built the Marqo-Ecommerce-B and Marqo-Ecommerce-L models, which excel at ecommerce search, retrieval, and recommendation tasks. The models were trained on hundreds of millions of samples from ~50 million unique products across 20,000 Amazon ASIN categories, spanning appliances, automotive, office products, pet supplies, and more. The models were evaluated on extensive benchmark datasets covering over 4 million unique products across the same 20,000 categories, which are taken from Amazon’s product taxonomy.
The Marqo-Ecommerce embedding models are designed specifically to work seamlessly with Marqo Cloud, our end-to-end embeddings platform. Additionally, you can fine-tune our embedding models on your own product catalogs and user behavior using Marqtune, our embedding model training platform backed by our contrastive learning framework, GCL.
We’ve released both models, Marqo-Ecommerce-B and Marqo-Ecommerce-L, on Hugging Face. The B model is smaller and faster for inference (5.1 ms for single-batch text and 5.7 ms for single-batch image) and has a smaller embedding dimension (768). The L model is larger (652M parameters) with a larger embedding dimension (1024), but has better retrieval performance: up to 7.3% higher MRR and 7.4% higher nDCG@10 on average than Marqo-Ecommerce-B across the three tasks on the 4M evaluation dataset.
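The embedding dimension also matters at index time, not just for retrieval quality. A quick back-of-envelope calculation for a float32 index over the 4M-product evaluation set (vector storage only, ignoring any index overhead):

```python
# Raw vector storage for a float32 embedding index over 4M products,
# comparing the two released embedding dimensions.
N_PRODUCTS = 4_000_000
BYTES_PER_FLOAT32 = 4

for name, dim in [("Marqo-Ecommerce-B", 768), ("Marqo-Ecommerce-L", 1024)]:
    gib = N_PRODUCTS * dim * BYTES_PER_FLOAT32 / 2**30
    print(f"{name}: {dim}-d -> {gib:.1f} GiB")
# Marqo-Ecommerce-B: 768-d -> 11.4 GiB
# Marqo-Ecommerce-L: 1024-d -> 15.3 GiB
```

So the L model trades roughly a third more index storage (and proportionally more similarity-computation work per query) for its retrieval gains.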
Here are the detailed results for three general ecommerce retrieval tasks. These tasks measure the performance of various embedding models in retrieving images based on long and short text descriptions and categories. We focus on Precision, Recall, MRR (Mean Reciprocal Rank), and nDCG to showcase how our Marqo-Ecommerce models stack up against existing solutions, such as Amazon Titan Multimodal and other popular open-weight SigLIP ViT models from Google.
With the release of Marqo-Ecommerce-B and Marqo-Ecommerce-L, ecommerce platforms now have access to powerful, purpose-built embedding models that outperform existing solutions by up to 88%. These models are specifically tailored for the unique challenges of ecommerce, delivering highly accurate retrieval results, whether it’s matching product titles to images or associating products with broader categories. The Marqo-Ecommerce models are set to transform search, retrieval, and recommendation tasks in the ecommerce industry.