MarkTechPost@AI, March 14
Google AI Introduces Gemini Embedding: A Novel Embedding Model Initialized from the Powerful Gemini Large Language Model

The Google Gemini team has introduced Gemini Embedding, a state-of-the-art model that generates highly generalizable text representations. Built on Google's powerful Gemini large language model, it leverages multilingual and code comprehension capabilities to improve embedding quality across tasks such as retrieval and semantic similarity. Through contrastive learning and fine-tuning, Gemini Embedding achieves state-of-the-art performance on MMTEB, surpassing previous models on multilingual, English, and code benchmarks. It is trained on a high-quality, heterogeneous dataset curated with Gemini's filtering, positive/negative example selection, and synthetic data generation.

💪 The Gemini Embedding model builds on Gemini's broad knowledge to generate representations for tasks such as retrieval, classification, and ranking. It refines parameters initialized from Gemini and applies a pooling strategy to create compact embeddings.

🎯 The model is trained with a noise-contrastive estimation (NCE) loss using in-batch negatives, together with a multi-loss approach that adapts the embeddings across sub-dimensions.

📚 The Gemini Embedding model was evaluated on multiple benchmarks, including multilingual, English, and code-based tasks covering more than 250 languages. It delivers strong classification, clustering, and retrieval performance, consistently surpassing other leading models.

🌐 The model shows strong generalization even when trained only on English data, outperforming other models on multilingual benchmarks. To improve quality, it benefits from synthetic data generation, dataset filtering, and hard negative mining.

Recent advancements in embedding models have focused on transforming general-purpose text representations for diverse applications like semantic similarity, clustering, and classification. Traditional embedding models, such as Universal Sentence Encoder and Sentence-T5, aimed to provide generic text representations, but recent research highlights their limitations in generalisation. Consequently, integrating LLMs has revolutionised embedding model development through two primary approaches: improving training datasets via synthetic data generation and hard negative mining, and leveraging pre-trained LLM parameters for initialisation. These methods significantly enhance embedding quality and downstream task performance but increase computational costs.

Recent studies have also explored adapting pre-trained LLMs for embedding tasks. Sentence-BERT, DPR, and Contriever have demonstrated the benefits of contrastive learning and language-agnostic training for embedding quality. More recently, models like E5-Mistral and LaBSE, initialised from LLM backbones such as GPT-3 and Mistral, have outperformed traditional BERT and T5-based embeddings. Despite their success, these models often require large in-domain datasets, leading to overfitting. Efforts like MTEB aim to benchmark embedding models across diverse tasks and domains, fostering more robust generalisation capabilities in future research.

The Gemini Embedding Team at Google introduces Gemini Embedding, a state-of-the-art model that generates highly generalisable text representations. Built on Google’s powerful Gemini large language model, it leverages multilingual and code comprehension capabilities to enhance embedding quality across diverse tasks such as retrieval and semantic similarity. The model is trained using a high-quality, heterogeneous dataset curated with Gemini’s filtering, selection of positive/negative passages, and generation of synthetic data. Gemini Embedding achieves state-of-the-art performance on the Massive Multilingual Text Embedding Benchmark (MMTEB) through contrastive learning and fine-tuning, surpassing previous models in multilingual, English, and code benchmarks.
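For readers who want to experiment with a Gemini embedding model directly, here is a minimal sketch using the embed_content call from the google-generativeai Python SDK. The model identifier is an assumption (the experimental name in circulation around the model's announcement) and the API key is a placeholder; check Google's current model list before relying on either.

```python
# Minimal sketch: requesting an embedding from the Gemini API.
# The model name below is an assumption; substitute the identifier
# listed in Google's official documentation.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

result = genai.embed_content(
    model="models/gemini-embedding-exp-03-14",   # assumed identifier
    content="What is the Massive Multilingual Text Embedding Benchmark?",
    task_type="retrieval_query",                 # e.g. retrieval_document, clustering
)

vector = result["embedding"]   # a list of floats
print(len(vector))             # embedding dimensionality
```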

The Gemini Embedding model builds on Gemini’s extensive knowledge to generate representations for tasks like retrieval, classification, and ranking. It refines Gemini’s initialised parameters and applies a pooling strategy to create compact embeddings. The model is trained using a noise-contrastive estimation (NCE) loss with in-batch negatives, while a multi-loss approach adapts embeddings across sub-dimensions. The training process includes a two-stage pipeline: pre-finetuning on large datasets and fine-tuning on diverse tasks. Additionally, model ensembling enhances generalisation. Gemini also aids in synthetic data generation, filtering, and hard negative mining to refine the model’s performance across multilingual and retrieval tasks.
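The two training ingredients described above, an in-batch noise-contrastive (NCE) loss and a multi-loss applied across embedding sub-dimensions, can be illustrated with a short PyTorch sketch. This is not the authors' code: the temperature, batch size, embedding width, and sub-dimension schedule are illustrative assumptions.

```python
# Illustrative sketch of in-batch NCE plus a multi-loss over sub-dimensions.
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, p: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: p[i] is the positive for q[i];
    every other passage in the batch acts as a negative."""
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature                      # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)

def multi_dim_loss(q, p, dims=(256, 768, 1536, 3072)):
    """Sum the contrastive loss over truncated prefixes of the embedding,
    so shorter embeddings remain useful (the multi-loss across sub-dimensions)."""
    return sum(info_nce(q[:, :d], p[:, :d]) for d in dims)

# Toy usage with random "pooled" query/passage embeddings.
q = torch.randn(8, 3072)
p = torch.randn(8, 3072)
print(multi_dim_loss(q, p).item())
```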

The Gemini Embedding model was evaluated across multiple benchmarks, including multilingual, English, and code-based tasks, covering over 250 languages. It demonstrated superior classification, clustering, and retrieval performance, consistently surpassing other leading models. The model achieved the highest ranking based on Borda scores and excelled in cross-lingual retrieval tasks. Additionally, it outperformed competitors in code-related evaluations, even when certain tasks were excluded. These results highlight Gemini Embedding as a highly effective multilingual embedding model, capable of delivering state-of-the-art performance across diverse linguistic and technical challenges.
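Borda scoring, the aggregation behind that "highest ranking" result, simply converts each task's ranks into points and sums them across tasks. The sketch below uses made-up model names and scores purely to show the mechanics.

```python
# Toy Borda-count aggregation: each task awards (n_models - 1 - rank) points.
def borda_scores(per_task_scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """per_task_scores[task][model] -> score (higher is better)."""
    totals: dict[str, float] = {}
    for scores in per_task_scores.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, model in enumerate(ranked):
            totals[model] = totals.get(model, 0) + (len(ranked) - 1 - rank)
    return totals

demo = {
    "classification": {"model_a": 0.81, "model_b": 0.78, "model_c": 0.80},
    "retrieval":      {"model_a": 0.62, "model_b": 0.65, "model_c": 0.60},
}
print(sorted(borda_scores(demo).items(), key=lambda kv: -kv[1]))
```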

In conclusion, the Gemini Embedding model is a robust, multilingual embedding solution that excels across various tasks, including classification, retrieval, clustering, and ranking. It demonstrates strong generalisation even when trained on English-only data, outperforming other models on multilingual benchmarks. To enhance quality, the model benefits from synthetic data generation, dataset filtering, and hard negative mining. Future work aims to extend its capabilities to multimodal embeddings, integrating text, image, video, and audio. Evaluations on large-scale multilingual benchmarks confirm its superiority, making it a powerful tool for researchers and developers seeking efficient, high-performance embeddings for diverse applications.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

Meet Parlant: An LLM-first conversational AI framework designed to provide developers with the control and precision they need over their AI customer service agents, utilizing behavioral guidelines and runtime supervision. It's operated using an easy-to-use CLI and native client SDKs in Python and TypeScript.

The post Google AI Introduces Gemini Embedding: A Novel Embedding Model Initialized from the Powerful Gemini Large Language Model appeared first on MarkTechPost.

