MarkTechPost@AI 2024年12月08日
Snowflake Releases Arctic Embed L 2.0 and Arctic Embed M 2.0: A Set of Extremely Strong Yet Small Embedding Models for English and Multilingual Retrieval

Snowflake recently released two small but powerful embedding models, Arctic Embed L 2.0 and Arctic Embed M 2.0, designed for multilingual search and retrieval. The models come in medium and large variants, support context lengths of up to 8,192 tokens, and suit a broad range of applications. Arctic Embed 2.0 delivers high-quality retrieval across many languages while retaining its predecessors' excellent English retrieval capability. It performs strongly on multilingual benchmarks and shows efficient throughput on an NVIDIA A10 GPU. The models also support Matryoshka Representation Learning (MRL), which can compress embeddings to 128 bytes per vector. Arctic Embed 2.0 is released under the Apache 2.0 license, allowing organizations to modify and deploy the models and ensuring broad applicability across industries and use cases.

🚀 Snowflake introduced Arctic Embed L 2.0 and Arctic Embed M 2.0, two small yet powerful embedding models designed for multilingual search and retrieval. The medium model builds on Alibaba's GTE-multilingual framework, and both models support context lengths of up to 8,192 tokens, making them suitable for applications that require extensive contextual understanding.

⚖️ The innovation behind Arctic Embed 2.0 lies in its ability to deliver high-quality retrieval across many languages while retaining its predecessors' excellent English retrieval capability. Snowflake's team carefully balanced these multilingual demands, enabling Arctic Embed 2.0 to outperform even English-only models on English benchmarks such as the MTEB Retrieval benchmark.

⚡️ Despite their compact size relative to other frontier models, the Arctic Embed 2.0 models deliver fast embedding throughput. Tests on an NVIDIA A10 GPU showed the large model processing more than 100 documents per second with query embedding latency below 10 ms. This efficiency enables deployment on cost-effective hardware, a key advantage for enterprises managing data at scale.

🗜️ The release also includes advanced features such as Matryoshka Representation Learning (MRL), a technique designed for scalable retrieval. With MRL, users can compress embeddings to as little as 128 bytes per vector, 96 times smaller than the uncompressed embeddings of some proprietary models such as OpenAI's text-embedding-3-large.

🌐 Arctic Embed 2.0 performs strongly both in in-domain evaluations such as MIRACL and in out-of-domain scenarios tested through CLEF benchmarks. This generalization is a significant improvement over earlier models, which often showed a tendency to overfit specific datasets.

Snowflake recently announced the launch of Arctic Embed L 2.0 and Arctic Embed M 2.0, two small and powerful embedding models tailored for multilingual search and retrieval. The Arctic Embed 2.0 models are available in two distinct variants: medium and large. Based on Alibaba’s GTE-multilingual framework, the medium model incorporates 305 million parameters, of which 113 million are non-embedding parameters. The large variant builds on a long-context adaptation of Facebook’s XLM-R Large and houses 568 million parameters, including 303 million non-embedding parameters. Both models support context lengths of up to 8,192 tokens, making them versatile for applications requiring extensive contextual understanding.

The innovation behind Arctic Embed 2.0 lies in its ability to provide high-quality retrieval across multiple languages while retaining its predecessors’ superior English retrieval capabilities. Snowflake’s team carefully balanced these multilingual demands, enabling Arctic Embed 2.0 to outperform even English-only models in English-language benchmarks such as the MTEB Retrieval benchmark. Also, these models demonstrated remarkable performance on multilingual benchmarks, including CLEF and MIRACL, achieving higher nDCG@10 scores across languages like German, French, Spanish, and Italian.
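The nDCG@10 scores cited above reward rankings that place relevant documents near the top, discounting gains logarithmically by rank position. A minimal NumPy sketch of the metric (binary relevance labels are assumed here for simplicity; graded relevance works the same way):

```python
import numpy as np

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank+1)
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A ranking that puts the only relevant document first scores a perfect 1.0.
print(ndcg_at_k([1, 0, 0]))  # 1.0
# The same document at rank 2 is discounted to 1/log2(3) ≈ 0.631.
print(ndcg_at_k([0, 1, 0]))
```

Averaging this score over a benchmark's queries gives the per-language numbers reported for CLEF and MIRACL.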

Despite their compact size relative to other frontier models, Arctic Embed 2.0 models deliver rapid embedding throughput. Testing on NVIDIA A10 GPUs revealed the large model’s capacity to process over 100 documents per second with sub-10ms query embedding latency. This efficiency facilitates deployment on cost-effective hardware, a crucial advantage for enterprises managing large-scale data. The release also includes advanced features such as Matryoshka Representation Learning (MRL), a technique designed for scalable retrieval. With MRL, users can compress embeddings to as little as 128 bytes per vector, 96 times smaller than the uncompressed embeddings of some proprietary models like OpenAI’s text-embedding-3-large.
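The article does not spell out how MRL reaches 128 bytes per vector, but one arithmetic that fits is keeping the 256 leading MRL dimensions at 4 bits each (256 × 0.5 bytes = 128). The truncate-then-quantize sketch below illustrates that idea under those assumptions; the specific quantization scheme is hypothetical, not Snowflake's documented pipeline:

```python
import numpy as np

def mrl_truncate(embeddings, dim=256):
    """Keep the leading `dim` MRL dimensions and re-normalize to unit length."""
    truncated = np.asarray(embeddings, dtype=np.float32)[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

def quantize_int4(embeddings):
    """Crude scalar quantization to 4-bit codes, packing two codes per byte."""
    # Map component values from [-1, 1] onto integer codes 0..15.
    codes = np.clip(((embeddings + 1.0) * 7.5).round(), 0, 15).astype(np.uint8)
    # Pack adjacent pairs of 4-bit codes into single bytes.
    return (codes[:, 0::2] << 4) | codes[:, 1::2]

rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1024)).astype(np.float32)  # stand-in full vectors
small = mrl_truncate(full, dim=256)                   # 256 leading dims, unit norm
packed = quantize_int4(small)
print(packed.shape)  # (4, 128): 128 bytes per vector
```

Because MRL training concentrates information in the leading dimensions, the truncated prefix remains a usable embedding on its own, which is what makes this kind of storage reduction possible without retraining.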

Arctic Embed 2.0, released under the Apache 2.0 license, allows organizations to modify and deploy models, ensuring wide applicability across various industries and use cases. This move underscores Snowflake’s dedication to democratizing AI tools, as highlighted by Clément Delangue, CEO of Hugging Face, who praised the contribution of these models to the global AI community. The models excel in in-domain evaluations like MIRACL and out-of-domain scenarios tested through CLEF benchmarks. This generalization is a critical improvement over earlier models, which often showed overfitting tendencies toward specific datasets.

Compared with other open-source and proprietary models, Arctic Embed 2.0 is a leader in multilingual and English-language retrieval quality. While some existing models force users to choose between maintaining high English retrieval performance or adding operational complexity for multilingual support, Arctic Embed 2.0 offers a unified solution. Its multilingual embeddings eliminate the need for separate models, simplifying workflows while achieving top-tier results. Another highlight of this release is its support for enterprise-grade retrieval at scale. The models’ compact embeddings and robust performance make them ideal for businesses aiming to handle vast document repositories efficiently.
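At query time, the unified workflow described above reduces to cosine similarity between one query embedding and a matrix of document embeddings, regardless of the documents' languages. A minimal sketch with stand-in random vectors (in practice these would come from the embedding model):

```python
import numpy as np

def top_k(query_emb, doc_embs, k=3):
    """Rank documents by cosine similarity (dot product of unit vectors)."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:k]  # indices of the k highest scores
    return order, scores[order]

# Stand-in vectors; a real index would hold one embedding per document.
rng = np.random.default_rng(1)
docs = rng.normal(size=(100, 256))
query = docs[42] + 0.1 * rng.normal(size=256)  # query close to document 42
idx, scores = top_k(query, docs)
print(idx[0])  # 42
```

A single model producing these vectors for every language is what removes the operational overhead of maintaining one index per language.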

In conclusion, Arctic Embed L 2.0 and Arctic Embed M 2.0 represent a leap in multilingual embedding models. With their unparalleled efficiency, scalability, and quality, these models set a new standard for global-scale retrieval tasks. Snowflake’s release empowers organizations to address multilingual challenges effectively and reinforces its role as a trailblazer in the AI landscape.


Check out the Arctic Embed L 2.0 and Arctic Embed M 2.0. All credit for this research goes to the researchers of this project.


