Cross-Domain Web Information Extraction at Pinterest

cs.AI updates on arXiv.org, 22 hours ago

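This paper presents Pinterest's e-commerce data extraction system, which combines the structural, visual, and textual information of webpages to extract structured product data efficiently and at low cost. The system maintains accuracy while offering significant advantages in processing speed and cost.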

arXiv:2508.01096v1 Announce Type: cross

Abstract: The internet offers a massive repository of unstructured information, but it's a significant challenge to convert this into a structured format. At Pinterest, the ability to accurately extract structured product data from e-commerce websites is essential to enhance user experiences and improve content distribution. In this paper, we present Pinterest's system for attribute extraction, which achieves remarkable accuracy and scalability at a manageable cost. Our approach leverages a novel webpage representation that combines structural, visual, and text modalities into a compact form, optimizing it for small model learning. This representation captures each visible HTML node with its text, style and layout information. We show how this allows simple models such as eXtreme Gradient Boosting (XGBoost) to extract attributes more accurately than much more complex Large Language Models (LLMs) such as Generative Pre-trained Transformer (GPT). Our results demonstrate a system that is highly scalable, processing over 1,000 URLs per second, while being 1,000 times more cost-effective than the cheapest GPT alternatives.
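
To make the idea of a per-node representation more concrete, here is a minimal sketch of how one visible HTML node might be turned into a small numeric feature vector that mixes text, style, and layout signals. The abstract does not list the actual features, so everything here is an assumption for illustration: the `Node` fields, the `featurize` function, the price regex, and the choice of nine features are all hypothetical and are not Pinterest's implementation.

```python
from dataclasses import dataclass
import re

# Hypothetical pattern for price-like text; not taken from the paper.
PRICE_RE = re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?")


@dataclass
class Node:
    """One visible HTML node. All field names are illustrative assumptions."""
    tag: str
    text: str
    font_size: float   # px, as reported by the rendered style
    font_weight: int   # CSS weight, e.g. 400 or 700
    x: float           # left edge of the bounding box, px
    y: float           # top edge of the bounding box, px
    width: float
    height: float


def featurize(node: Node, page_width: float, page_height: float) -> list[float]:
    """Map a node to a fixed-length vector of text, style, and layout features."""
    text = node.text.strip()
    n_chars = max(len(text), 1)
    return [
        float(len(text)),                                   # text length
        sum(ch.isdigit() for ch in text) / n_chars,         # digit ratio
        1.0 if PRICE_RE.search(text) else 0.0,              # looks like a price
        node.font_size,                                     # style: size
        1.0 if node.font_weight >= 600 else 0.0,            # style: bold
        node.x / page_width,                                # layout: relative x
        node.y / page_height,                               # layout: relative y
        (node.width * node.height) / (page_width * page_height),  # relative area
        1.0 if node.tag in {"h1", "h2"} else 0.0,           # structure: heading
    ]


# Two made-up nodes from an imaginary product page.
title = Node("h1", "Linen Throw Pillow", 28.0, 700, 40, 120, 600, 40)
price = Node("span", "$24.99", 20.0, 400, 40, 180, 80, 24)
rows = [featurize(n, 1280, 2000) for n in (title, price)]
```

Under these assumptions, each row is a compact tabular record, which is the kind of input a gradient-boosted tree model such as XGBoost accepts. A per-node classifier could then score every node for each attribute (title, price, and so on) and pick the highest-scoring node per page.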
