MarkTechPost@AI | August 17, 2024
Marqo Releases Marqo-FashionCLIP and Marqo-FashionSigLIP: A Family of Embedding Models for E-Commerce and Retail

Marqo has released two new state-of-the-art multimodal models for fashion search and recommendation, Marqo-FashionCLIP and Marqo-FashionSigLIP. Both models generate embeddings for text and images for use in downstream search and recommendation systems. They were trained on more than one million fashion items with rich metadata, including materials, colors, styles, keywords, and descriptions.

😊 **Training and optimization:** Marqo-FashionCLIP and Marqo-FashionSigLIP were trained by fine-tuning two existing base models (ViT-B-16-laion and ViT-B-16-SigLIP-webli) with GCL. The models optimize a seven-part loss covering keywords, categories, details, colors, materials, and detailed descriptions. This multi-part loss is far superior to the conventional text-image InfoNCE loss for contrastive learning and fine-tuning, yielding models that produce better search results on shorter descriptive text and keyword-like material.

🤩 **Evaluation and performance:** The researchers evaluated the models on seven publicly available fashion datasets that were not part of the training data: iMaterialist, DeepFashion (In-shop), DeepFashion (Multimodal), Fashion200K, KAGL, Atlas, and Polyvore. Each dataset is linked to different downstream tasks depending on its available metadata. The evaluation focuses on three interactions: text-to-image, category-to-product, and sub-category-to-product. The text-to-image task uses different text fields to mimic longer descriptive queries (such as tail queries), while shorter keyword-like queries (similar to head queries) are covered by the category and sub-category tasks, which can have multiple valid results. In a comprehensive comparison, Marqo-FashionCLIP and Marqo-FashionSigLIP outperform both their fashion-specific and general-purpose predecessors across the board. For example, compared with FashionCLIP 2.0, Marqo-FashionCLIP improves recall@1 (text-to-image) and precision@1 (category/sub-category-to-product) by 22%, 8%, and 11%, respectively; Marqo-FashionSigLIP posts corresponding improvements of 57%, 11%, and 13%, demonstrating its advantage over other models.

😎 **Release and availability:** The researchers have released Marqo-FashionCLIP and Marqo-FashionSigLIP under the Apache 2.0 license. Users can download them directly from Hugging Face with the standard implementation and use them anywhere.

🤗 **Strengths and impact:** Marqo-FashionCLIP and Marqo-FashionSigLIP perform well across query lengths, from simple categories to detailed descriptions. Results broken down by query type demonstrate the models' robustness across different query lengths and types. The models excel in both performance and efficiency, offering roughly a 10% improvement in inference time over current fashion-specific models.

When it comes to fashion recommendation and search, multimodal techniques merge textual and visual data for better accuracy and customization. Because such a system can assess both visual and textual descriptions of clothing, it can return more accurate search results and more personalized recommendations. By combining image recognition with natural language processing, these systems offer a more natural, context-aware way to shop, helping users discover clothing that fits their tastes and preferences.

Marqo has released two new state-of-the-art multimodal models for fashion-domain search and recommendation, Marqo-FashionCLIP and Marqo-FashionSigLIP. For use in downstream search and recommendation systems, both models generate embeddings for text and images. They were trained on more than one million fashion items with extensive metadata, including materials, colors, styles, keywords, and descriptions.
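
To make the embedding-based search workflow concrete, here is a minimal sketch of ranking catalog items against a text query by cosine similarity. The `embed_text` helper and the precomputed `product_embeddings` are hypothetical stand-ins for whatever embedding model (for example, one of these checkpoints) and index a deployment actually uses.

```python
import numpy as np

def search_products(query_text, product_embeddings, product_ids, embed_text, top_k=5):
    """Rank products for a text query by cosine similarity in a shared embedding space."""
    q = embed_text(query_text)                     # hypothetical text-encoder call
    q = q / np.linalg.norm(q)                      # L2-normalize the query embedding
    p = product_embeddings / np.linalg.norm(product_embeddings, axis=1, keepdims=True)
    scores = p @ q                                 # cosine similarity against every product
    best = np.argsort(-scores)[:top_k]             # indices of the top-k matches
    return [(product_ids[i], float(scores[i])) for i in best]
```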

The team fine-tuned two pre-existing base models (ViT-B-16-laion and ViT-B-16-SigLIP-webli) using GCL. The seven-part loss is optimized over keywords, categories, details, colors, materials, and extensive descriptions. This multi-part loss proved far superior to the conventional text-image InfoNCE loss for contrastive learning and fine-tuning, producing models that yield better search results when dealing with shorter descriptive text and keyword-like material.
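
GCL is Marqo's generalized contrastive learning framework, and the article does not reproduce the exact seven-part objective. The sketch below only illustrates the general idea of a multi-part loss: a weighted sum of InfoNCE-style terms, one per metadata field. The field names and weights are hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce(image_emb, text_emb, temperature=0.07):
    # Symmetric CLIP-style InfoNCE over L2-normalized embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

def multi_part_loss(image_emb, text_embs_by_field, field_weights):
    # text_embs_by_field maps a metadata field (e.g. "keywords", "category",
    # "color", "material", "description") to a batch of text embeddings.
    # The total loss is a weighted sum of per-field contrastive terms.
    total = 0.0
    for field, text_emb in text_embs_by_field.items():
        total = total + field_weights.get(field, 1.0) * info_nce(image_emb, text_emb)
    return total
```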

The models were evaluated on seven publicly available fashion datasets that were not part of the training data: iMaterialist, DeepFashion (In-shop), DeepFashion (Multimodal), Fashion200K, KAGL, Atlas, and Polyvore. Each dataset is linked to distinct downstream tasks depending on the available metadata. The evaluation focuses on three interactions: text-to-image, category-to-product, and sub-category-to-product. The text-to-image task mimics longer descriptive queries (such as tail queries) using distinct text fields, while shorter keyword-like queries (similar to head queries) are covered by the category and sub-category tasks, which can have several valid results.

In a comprehensive performance comparison, Marqo-FashionCLIP and Marqo-FashionSigLIP outperform both their fashion-specific and general-purpose predecessors in every aspect. For instance, compared to FashionCLIP 2.0, Marqo-FashionCLIP achieved improvements of 22% in recall@1 (text-to-image) and 8% and 11% in precision@1 (category- and sub-category-to-product); Marqo-FashionSigLIP achieved corresponding improvements of 57%, 11%, and 13%, demonstrating its superiority over other models.
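
As a reference for how these numbers are typically computed, here is a minimal sketch of recall@1 (one correct item per query, as in text-to-image retrieval) and precision@1 (several valid items per query, as in category-to-product retrieval). This is illustrative only, not the benchmark's evaluation code.

```python
import numpy as np

def recall_at_1(query_embs, item_embs, ground_truth):
    # query_embs: (Q, D) query embeddings; item_embs: (N, D) item embeddings.
    # ground_truth[q] is the index of the single correct item for query q.
    sims = query_embs @ item_embs.T
    top1 = sims.argmax(axis=1)
    return float((top1 == np.asarray(ground_truth)).mean())

def precision_at_1(query_embs, item_embs, relevant_sets):
    # relevant_sets[q] is the set of item indices that are valid for query q
    # (e.g. all products in the queried category), so several results can be correct.
    sims = query_embs @ item_embs.T
    top1 = sims.argmax(axis=1)
    hits = [1.0 if top1[q] in relevant_sets[q] else 0.0 for q in range(len(relevant_sets))]
    return float(np.mean(hits))
```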

The evaluation covers various query lengths, from simple categories to extensive descriptions, and the results broken down by query type demonstrate the models' robustness across different query lengths and types. Marqo-FashionCLIP and Marqo-FashionSigLIP deliver superior performance while remaining efficient: compared to current fashion-specific models, they offer a 10% improvement in inference time.

The researchers have released Marqo-FashionCLIP and Marqo-FashionSigLIP under the Apache 2.0 license. Using the standard implementation, users can download the models straight from Hugging Face and use them anywhere.
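
For readers who want to try the models, the following is a minimal sketch of loading a checkpoint with open_clip and embedding an image alongside two text queries. The repository id `Marqo/marqo-fashionCLIP`, the image filename, and the query strings are assumptions for illustration; the model cards on Hugging Face are the authoritative reference for usage.

```python
import torch
import open_clip
from PIL import Image

# Load the released checkpoint from the Hugging Face Hub via open_clip (assumed repo id).
model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:Marqo/marqo-fashionCLIP")
tokenizer = open_clip.get_tokenizer("hf-hub:Marqo/marqo-fashionCLIP")
model.eval()

image = preprocess(Image.open("dress.jpg")).unsqueeze(0)        # example product photo
texts = tokenizer(["a red evening dress", "blue denim jeans"])  # example queries

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    # L2-normalize so that dot products are cosine similarities.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)

print(similarity)  # higher score = closer match between the image and each query
```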


Check out the Details and Model Card. All credit for this research goes to the researchers of this project.

