ByteByteGo
How Uber Eats Deduplicates Hundreds of Millions of Product Images

As Uber Eats expanded into groceries, alcohol, and other categories, its existing image pipeline came under heavy strain because it could not recognize duplicate content, driving up network bandwidth, processing time, and storage costs. To solve this, Uber Eats introduced content-addressable caching, computing a cryptographic hash of each image to identify and deduplicate it. The new pipeline reuses processed images based on content rather than URL, and uses the HTTP Last-Modified header to detect content updates behind the same URL. The optimization sharply reduced duplicate downloads, transformations, and storage, turning an inefficient, redundant flow into a lean, high-throughput one: over 99% of requests need no reprocessing, and the new architecture shipped in under two months.

💡 **Content addressing enables deduplication and efficiency:** Uber Eats previously treated different URLs for the same image as independent items, causing massive duplicate downloading, processing, and storage. The new pipeline computes a cryptographic hash of each image, so identical content gets the same hash regardless of source or URL, enabling content-level deduplication and sharply cutting wasted resources.

⚙️ **A processing flow redesigned for category expansion:** As Uber Eats expanded into more categories, image processing grew more complex. The new flow defines three paths: return a cached result directly, download and process a new image, or apply a new processing spec to already-known content. This fine-grained handling keeps every case efficient.

🔄 **Handling URL-stable updates via the Last-Modified header:** Merchants may update an image's content without changing its URL. The new system uses the HTTP Last-Modified header to detect such changes: if the timestamp moves, the image is reprocessed; otherwise the cached hash is reused, keeping data fresh while avoiding redundant work.

📊 **Storage and processing specifications:** Images (raw and processed) are stored in Terrablob, while metadata such as the mappings and processing specs lives in Docstore. Each processing request carries a detailed spec (resolution, format, size, and so on), which together with the image hash forms a unique key; this speeds up processing and also caches failures, so known-bad attempts are never repeated.

🚀 **Fast rollout, significant results:** The new image-processing architecture went live in under two months and now serves one of Uber Eats' highest-traffic data paths. More than 99% of requests require no reprocessing, with a median latency of about 100 ms and P90 latency under 500 ms, dramatically improving performance and reliability.

Take Control of User Access with WorkOS (Sponsored)

As your app grows, managing “who can do what” becomes complex. Hard-coded roles and scattered permissions slow you down and fail to meet enterprise demands for fine-grained access.

WorkOS RBAC is the fastest way to implement structured, scalable permissions. Define roles, group permissions, and update access for entire user groups in a single step. With developer-friendly APIs, a powerful dashboard, and native integrations for SSO and Directory Sync, WorkOS gives you enterprise-grade access control out of the box.

Integrate RBAC with WorkOS


Disclaimer: The details in this post have been derived from the official documentation shared online by the Uber Eats Engineering Team. All credit for the technical details goes to the Uber Eats Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

At the scale of Uber Eats, image handling is an operational necessity. The platform manages hundreds of millions of product images, with millions of updates flowing through the system every hour. Every image carries cost: network bandwidth, processing time, storage space, and CDN footprint.

As Uber Eats expanded beyond restaurants into groceries, alcohol, and household items, the image pipeline started to strain. For example, a single product, like a can of Coca-Cola, might appear across thousands of storefronts. However, the backend treated each appearance as a fresh upload. There was no concept of shared assets across merchants. Each upload triggered a new download, a new transformation, and a new storage operation, even when the image was identical to one already in the system.

The old approach also assumed that a URL change would accompany any image change. It didn’t track content updates if the URL stayed the same. This blocked image refreshes, resulting in awkward workarounds.

The engineering goal was clear: reduce unnecessary processing, cut down storage and CDN costs, and reuse existing work wherever possible. In this article, we will look at how Uber achieved this goal of deduplicating hundreds of millions of images.

The Limitations of the Old System

The original image pipeline operated on a simple assumption: if the URL is new, the image must be new. If the URL is the same, skip everything. 

However, there was no mechanism to detect whether two different URLs pointed to the same image. The system treated every incoming URL as unique, even if the underlying image bytes were identical. As a result, the same image, uploaded by different merchants or listed in different contexts, would be downloaded, processed, and stored multiple times.
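For contrast, here is a minimal Python sketch of the old URL-keyed assumption (the names are hypothetical, not Uber's code); both blind spots fall out immediately:

```python
# Sketch of the old assumption: a new URL means a new image, and the
# same URL means nothing to do.
seen_urls = set()

def old_pipeline(url, download, process):
    if url in seen_urls:
        return None                   # same URL: skipped, even if the bytes changed
    seen_urls.add(url)
    return process(download(url))     # new URL: full work, even for duplicate bytes
```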

See the diagram below:

Even worse, the system couldn’t detect content changes when the URL stayed constant. If a merchant updated an image without modifying the URL, the system ignored it entirely. There was no validation, no reprocessing, no cache invalidation.

This had a few major disadvantages:

- The same image bytes were downloaded, transformed, and stored many times over, wasting network bandwidth, compute, storage, and CDN capacity.
- Content updates behind a stable URL were silently ignored, so stale images kept being served.
- Merchants were forced into awkward workarounds, such as changing URLs, just to push an image refresh through.

The New Image Pipeline

The redesigned image pipeline shifts focus from URLs to actual image content. 

Instead of relying on external signals like URL changes, the system now uses content-addressable caching. Every image is identified by a cryptographic hash of its bytes. If two images are identical, their hashes match, regardless of where they came from or what URL they used.

This change enables the system to reuse work across uploads, merchants, and catalog updates without relying on fragile assumptions.
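As a minimal illustration, the identity of an image can be derived from its bytes alone. The post does not name the exact hash function, so SHA-256 is an assumption in this sketch:

```python
import hashlib

def content_key(image_bytes: bytes) -> str:
    """Derive a stable identifier from the image bytes themselves.

    Identical bytes produce the same key regardless of merchant or URL.
    """
    return hashlib.sha256(image_bytes).hexdigest()

# The same Coca-Cola can uploaded via two different storefront URLs
# collapses to a single cache entry:
a = content_key(b"<coca-cola-can image bytes>")
b = content_key(b"<coca-cola-can image bytes>")
assert a == b  # same bytes, same identity, so prior work is reused
```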

The new image service follows three main paths, depending on what it knows about the image:

- Cached content, cached spec: the processed result already exists and is returned directly.
- New content: the image is downloaded, processed, stored, and then returned.
- Known content, new spec: the stored raw bytes are reprocessed under the new specification, with no re-download.

See the diagram below that shows the three flows:

To support these flows, the system maintains three logical maps, as shown in the diagram below:

- A URL map linking each source URL to the content hash (and Last-Modified timestamp) of the image behind it.
- A Processed Image Map linking a content hash plus processing spec to the processed output, or to a cached error.
- A raw image map linking a content hash to the stored original bytes.

Each map handles a distinct concern: identifying content, linking processed outputs, and tracking raw assets.
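A compact Python sketch ties the three maps to the three request paths. The in-memory dicts, names, and shapes are illustrative assumptions; in production the metadata lives in Docstore and the image bytes in Terrablob:

```python
import hashlib

# In-memory stand-ins for the three logical maps.
url_map = {}        # source URL -> content hash
raw_image_map = {}  # content hash -> raw image bytes
processed_map = {}  # (content hash, spec) -> processed result

def get_image(url, spec, download, process):
    """Route a request down one of the three paths."""
    content_hash = url_map.get(url)

    # Path: new content. Download it once and record it everywhere.
    if content_hash is None:
        raw = download(url)
        content_hash = hashlib.sha256(raw).hexdigest()
        url_map[url] = content_hash
        raw_image_map.setdefault(content_hash, raw)

    # Path: known content and known spec. Return the cached output.
    if (content_hash, spec) in processed_map:
        return processed_map[(content_hash, spec)]

    # Path: known content, new spec. Transform the stored bytes only.
    result = process(raw_image_map[content_hash], spec)
    processed_map[(content_hash, spec)] = result
    return result
```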

The storage details are as follows: the images themselves, both raw and processed, are stored in Terrablob, while the metadata, including the maps and processing specs, is stored in Docstore.

Processing Specifications

Every image transformation request includes a processing specification that defines exactly how the image should be handled. This includes:

- The target resolution and output dimensions
- The output format
- Any other parameters that affect the final asset

Together, the image hash and the processing spec form a unique key. If that combination has been processed before, the system can return the result immediately without doing any work. This caching mechanism applies equally to successful transformations and known failures.

Errors are treated as first-class results. For example, if an uploaded image is too small to meet the requested resolution, the system logs the failure in the Processed Image Map using the same hash-plus-spec key. The next time that image comes in with the same spec, the system skips the download and transformation and returns the cached error.

This avoids repeated failures on the same bad input and prevents wasted compute cycles on requests that are guaranteed to fail. It also makes error reporting faster and more consistent across clients.
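A small sketch of this behavior, with the spec tuple and ValueError standing in for whatever real spec type and failure mode the system uses:

```python
# Failures are cached under the same hash-plus-spec key as successes,
# so a known-bad input short-circuits without download or transform.
def process_cached(content_hash, spec, raw_bytes, transform, processed_map):
    key = (content_hash, spec)
    if key in processed_map:
        return processed_map[key]   # hit: a cached success OR a cached error

    try:
        outcome = ("ok", transform(raw_bytes, spec))
    except ValueError as err:       # e.g. source too small for the requested size
        outcome = ("error", str(err))

    processed_map[key] = outcome    # both outcomes are first-class results
    return outcome
```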

Handling Image Updates Behind Stable URLs

Not every image update comes with a new URL. Merchants often replace the content behind a URL without changing the URL itself. In the old system, this meant updates were silently ignored. The system assumed that a known URL always pointed to the same image, which led to stale or incorrect data being served.

To solve this, the new pipeline uses the HTTP Last-Modified header to detect whether an image has changed behind the same URL. During image processing:

- The system reads the image's current Last-Modified timestamp and compares it with the one recorded for that URL.
- If the timestamp has changed, the image is re-downloaded, re-hashed, and reprocessed as new content.
- If it has not, the cached content hash is reused and no download takes place.

See the diagram below:

This approach allows merchants to maintain stable URLs while still delivering updated content. The image pipeline respects those updates without blindly reprocessing every request. It also avoids unnecessary work when nothing has changed.
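Here is a minimal Python sketch of that check, assuming the merchant's server sets the Last-Modified header; the url_map shape and helper names are illustrative, not Uber's actual code:

```python
import urllib.request

def resolve_content_hash(url, url_map, download_and_hash):
    """Reuse the cached hash unless Last-Modified has moved."""
    # Fetch only the headers, not the image bytes.
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        stamp = resp.headers.get("Last-Modified")

    cached_hash, cached_stamp = url_map.get(url, (None, None))
    if cached_hash is not None and stamp == cached_stamp:
        return cached_hash                 # unchanged: skip the download

    new_hash = download_and_hash(url)      # new or changed: reprocess
    url_map[url] = (new_hash, stamp)
    return new_hash
```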

Conclusion

The new content-addressable image pipeline transformed a noisy, redundant workflow into a lean, high-throughput system. 

By deduplicating at the content level, the system avoids repeated downloads, transformations, and storage. It handles image updates gracefully, even when merchants reuse URLs. More than 99% of requests now need no reprocessing, with a median latency of roughly 100 ms and P90 latency under 500 ms. These changes significantly reduced infrastructure demands while improving reliability.

Perhaps most impressive is the speed of delivery. The new architecture rolled out in under two months, yet supports one of Uber Eats’ highest-volume data paths.

It’s a strong example of how targeted improvements in core systems can unlock broader product velocity, especially when the solution is fast, scalable, and simple to reason about.

References:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing sponsorship@bytebytego.com.
