With the release of the Aya Expanse family, featuring 8B and 32B parameter models, we are addressing one of the most urgent challenges in AI: the lack of highly performant multilingual models that can rival the capabilities of monolingual ones. While AI has made tremendous progress, there remains a stark gap in the performance of models across multiple languages. Aya Expanse is the result of several years of dedicated research at C4AI --- data arbitrage, multilingual preference training, safety tuning, and model merging. (View Highlight)
These combined breakthroughs have resulted in new state-of-the-art performance on multilingual. We evaluate our models on a set of evaluations including the Arena-Hard-Auto dataset (paper), translated to the 23 languages which we release for others to use here. In pairwise comparison, Aya Expanse 32B outperforms Gemma 2 27B, Mistral 8x22B, and Llama 3.1 70B, a model more than 2x its size, setting a new state-of-the-art for multilingual performance. We also release Aya Expanse 8B, which outperforms the leading open-weights models in its parameter class such as Gemma 2 9B, Llama 3.1 8B, and the recently released Ministral 8B with win rates ranging from 60.4% to 70.6%. We observe even larger gains across less challenging evals. (View Highlight)
We release both models as open weights for the research community, and hope it will further accelerate multilingual progress. In this blog post, we share technical details behind each of the key algorithmic components used in the training pipeline. (View Highlight)
The use of synthetic data – data generated by an expert or “teacher” model to train another model – has become increasingly central to the development of LLMs, particularly as model training has exhausted current data sources. However, for multilingual data, especially with low-resource languages, there are few good examples of teacher models, creating an extra added challenge to leveraging synthetic data. Furthermore, recent research has suggested that an over-reliance on synthetic data leads to model collapse. (View Highlight)
In our recent work we demonstrate that these limitations can be addressed through “data arbitrage” – strategically sampling from a pool of teacher models. This approach has important implications as it challenges the traditional reliance on a single-teacher model for generating synthetic data. Instead, data arbitrage leverages performance variations among a pool of models. Although this technique is applicable to any domain, it is particularly suited to the multilingual setting, where the absence of a universally effective teacher that excels across all languages presents significant challenges In the creation of high-quality synthetic multilingual datasets, multilingual arbitrage proves valuable by utilizing a diverse pool of models to strategically sample different parts of the data distribution for improved multilingual generations. (View Highlight)
We first train a model pool for groups of languages and employ an Arbiter to evaluate and select the optimal generation. The Arbiter here is an internal reward model (RM) to score the model generations. In Reward-Based Routing, for each prompt in a given language, we generate completions from all models in the pool and score them using the reward model. The completion with the highest score is chosen as the final completion for that prompt. Our 8B model, even at the SFT stage trained with Multilingual Arbitrage, had over 9.1% improvement in win-rate measured against Gemma 2 9B compared to the previous Aya 23 model, demonstrating the effectiveness of this approach in leveraging diverse model strengths across languages. (View Highlight)
Following supervised fine-tuning, alignment to human preferences is a key step for training today’s state-of-the-art LLMs. Although heavily adopted, It is known that preference training is already challenging in a monolingual setting. Maximizing gains from preference training in a multilingual setting introduces even more challenges. The vast majority of existing preference datasets are exclusively English and the few existing multilingual preference datasets are often of low-quality. Moreover, modeling many diverse languages simultaneously is known to be a difficult optimization problem where naively optimizing for performance in some languages often leads to regressions in performance in other languages. (View Highlight)
In our recent work, we leverage a novel synthetic data generation technique to construct high-quality multilingual preference data pairs by contrasting in-language completions from a highly performant multilingual LLM with lower quality completions translated from English which were generated by a weaker model. This steers our model away from generating low-quality multilingual completions which often contain undesirable artifacts, such as those introduced by poor translation. We show that this method unlocks substantial gains in performance across all languages and often also results in gains for languages not included in the preference training data. (View Highlight)
While this work also shows that preference training with online data outperforms its offline variant, during training of Aya Expanse, we found that the combination of first preference-training with offline data followed by preference-training with online data to be better than either online or offline training alone. In the first preference training stage, we train on data curated by taking the highest and lowest reward responses from the Arbitrage stage as the chosen and rejected completions, which makes the first stage of DPO training offline. (View Highlight)
After offline preference training, we run online iterative DPO, where we sample multiple online generations for each prompt from the model trained during the last iteration, rank these generations with a Reward Model, and then further train on these preference pairs. For both models, we repeat this process for 3 iterations as we found that going beyond 3 iterations led to minimal gains at the cost of additional re-tuning parameters like regularization coefficient (beta) and sometimes introduced reward hacking behavior. Overall, for Aya Expanse 8B, the combination of offline and online preference training on top of the model trained with arbitrage, led to 7.1% additional gains in win rate against Gemma 2 9B. (View Highlight)
A reappearing problem throughout any post-training (and pre-training) pipeline, whether it consists of a single stage such as SFT, or a more complex multi-stage optimization pipeline, such as our pipeline above, is choosing the right data mixtures for training. The intricacies of this process demand considerable effort in fine-tuning hyperparameters and data combinations. Merging multiple models is an alternative approach for enabling complex multi-tasking at a reduced aggregate computational cost. In Aya Expanse, we directly build on the findings of our recent research paper and apply merging in both the Arbitrage phase, and at each iteration of preference training. (View Highlight)
When training multiple separate models with the goal of merging, it is important to maximize diversity between checkpoints. However, this should be balanced with ensuring that each individual model within the pool achieves high performance. To balance these objectives, we maximize diversity between checkpoints by training models for different language families. This takes advantage of cross-lingual transfer which often provides significant performance benefits while ensuring that linguistic differences provide sufficient differentiation between checkpoints. (View Highlight)
These diagrams show our end-to-end post-training pipeline, which resulted in the step-by-step gains discussed earlier. It is truly special to look back and see how far the Aya model series has come, since its inception with Aya 101 accompanied by the Aya Collection, which stretched the limits of open-source collaboration, to now which combines steady progress in key open fundamental research questions to set a new standard for multilingual performance. (View Highlight)