How to Engineer Your Way Out of Slow Models

This article describes how Taboola solved the problem of slow neural-network inference by caching the model's intermediate results (embeddings). Using a CTR prediction model as an example, it explains in detail how they used a Cassandra database as a cache, built microservice components such as EmbArk, KFC, and Retina, and used Kafka as an asynchronous message queue, ultimately caching embeddings for multiple modalities such as images and text and significantly speeding up model inference. The article also discusses where this caching approach applies and what its limitations are, as well as applications to transfer learning and duplicate detection, offering a practical engineering solution to slow model inference.

🤔 **Caching intermediate model results (embeddings) to speed up inference:** Taboola caches the model's intermediate results (embeddings), such as image and text embeddings, to solve slow model inference and meet production performance requirements.

🖼️ **Building the EmbArk, KFC, and Retina microservices:** To implement the embedding cache, Taboola built microservice components including EmbArk (embedding archive), KFC (Kafka front-end client), and Retina (embedding service), using a Cassandra database as the cache store and Kafka as an asynchronous message queue.

🔄 **Asynchronous processing ensures fast responses:** On a cache miss, EmbArk asynchronously sends a request to Retina to generate the embedding and returns a special OOV embedding, guaranteeing a fast response to the client.

🚀 **Applications in transfer learning and duplicate detection:** Beyond speeding up inference, EmbArk can also be applied to transfer learning and duplicate detection, for example using Inception-model embeddings for CTR prediction, or using image and text embeddings to detect duplicate items.

⚠️ **Where the caching approach applies and its limitations:** This caching approach requires three conditions: the rate of new inputs is low, the OOV embedding doesn't hurt the model much, and embeddings don't change frequently. Otherwise, the advantages of the cache are lost.

So you just finished designing that great neural network architecture of yours. It has a blazing number of 300 fully connected layers interleaved with 200 convolutional layers with 20 channels each, where the result is fed as the seed of a glorious bidirectional stacked LSTM with a pinch of attention. After training you get an accuracy of 99.99%, and you're ready to ship it to production.

But then you realize the production constraints won't allow you to run inference using this beast. You need the inference to be done in under 200 milliseconds.

In other words, you need to chop off half of the layers, give up on using convolutions, and let's not get started about the costly LSTM...

If only you could make that amazing model faster!

Sometimes you can

Here at Taboola we did it. Well, not exactly... Let me explain.

One of our models has to predict the CTR (Click Through Rate) of an item, or in other words, the probability that the user will like an article recommendation and click on it.

The model takes multiple modalities as input, each of which goes through a different transformation. Two of them are the item's image and its text.

These processed modalities are then passed through fully connected layers in order to learn the interactions between the modalities, and finally, they are passed through an MDN (Mixture Density Network) layer.
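
To make the shape of this model concrete, here is a minimal Keras sketch of such a multimodal network (illustrative only: the layer sizes, input shapes, and number of mixture components are made up, and this is not Taboola's actual code). Per-modality subnetworks produce embeddings, fully connected layers fuse them, and the head outputs the parameters of a mixture density network:

```python
import tensorflow as tf

K = 5  # number of mixture components (illustrative)

# Per-modality subnetworks: each turns raw features into an embedding.
# Input shapes are placeholders, not the real model's.
image_in = tf.keras.Input(shape=(2048,), name="image_features")
text_in = tf.keras.Input(shape=(300,), name="text_features")
image_emb = tf.keras.layers.Dense(64, activation="relu")(image_in)
text_emb = tf.keras.layers.Dense(64, activation="relu")(text_in)

# Fully connected layers learn interactions between the modalities.
fused = tf.keras.layers.Concatenate()([image_emb, text_emb])
fused = tf.keras.layers.Dense(128, activation="relu")(fused)

# MDN head: mixture weights, means, and spreads of K Gaussians.
logits = tf.keras.layers.Dense(K, name="mixture_logits")(fused)
means = tf.keras.layers.Dense(K, name="means")(fused)
log_sigmas = tf.keras.layers.Dense(K, name="log_sigmas")(fused)

model = tf.keras.Model([image_in, text_in], [logits, means, log_sigmas])
```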

As you can imagine, this model is slow.

We decided to insist on the predictive power of the model, instead of trimmingcomponents, and came up with an engineering solution.

Cache me if you can

Let's focus on the image component. The output of this component is a learned representation of the image. In other words, given an image, the image component outputs an embedding.

The model is deterministic, so the same image will always result in the same embedding. Computing the embedding is costly, so we can cache it. Let me elaborate on how we implemented it.

The architecture (of the cache, not the model)

At a high level, a microservice called EmbArk (Embedding Archive) sits in front of the model and looks embeddings up in a Cassandra cache. On a cache hit, the stored embedding is returned immediately. On a cache miss, EmbArk returns a special OOV (out of vocabulary) embedding right away so the client isn't blocked, and asynchronously sends the input through KFC (a Kafka front-end client) to Retina, the service that runs the heavy component of the model, computes the embedding, and writes it back to the cache.

This architecture was initially used for images. After proving its worth, we decided to use it for other components as well, such as text.
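
Here is a minimal sketch of that lookup path, with an in-memory dict standing in for the Cassandra table and a plain list standing in for the Kafka topic that Retina consumes (all names and sizes are illustrative; the real components are networked services):

```python
import hashlib

OOV_EMBEDDING = [0.0] * 64  # special "unknown input" vector (size is illustrative)
cache = {}                  # stands in for the Cassandra table
pending = []                # stands in for the Kafka topic consumed by Retina

def embark_lookup(raw_input: bytes) -> list:
    """Return a cached embedding, or an OOV embedding while Retina computes it."""
    key = hashlib.sha256(raw_input).hexdigest()  # deterministic input -> deterministic key
    if key in cache:
        return cache[key]             # cache hit: fast path, no model involved
    pending.append((key, raw_input))  # cache miss: enqueue for async computation
    return OOV_EMBEDDING              # respond immediately; don't block on the model

def retina_worker(compute_embedding):
    """Drain the queue: run the heavy model component and fill the cache."""
    while pending:
        key, raw_input = pending.pop(0)
        cache[key] = compute_embedding(raw_input)
```

The second request for the same input then hits the cache and skips the expensive forward pass entirely.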

EmbArk proved to be a nice solution for transfer learning too. Let's say we believe the content of the image has a good signal for predicting CTR. Thus, a model trained for classifying the object in an image, such as Inception, would be valuable for our needs. We can load Inception into Retina, tell the model we intend to train that we want to use the Inception embedding, and that's it.
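
As an illustrative sketch (not Taboola's code) of what the consuming model sees: once Retina serves Inception embeddings, the CTR model simply takes a fixed-size vector as input and never runs Inception itself. InceptionV3's pooled features are 2048-dimensional, so that's the input shape assumed here:

```python
import tensorflow as tf

# The CTR model consumes cached Inception embeddings as a plain input;
# Inception itself lives behind EmbArk/Retina and is never run here.
inception_emb = tf.keras.Input(shape=(2048,), name="inception_embedding")
x = tf.keras.layers.Dense(64, activation="relu")(inception_emb)
ctr = tf.keras.layers.Dense(1, activation="sigmoid", name="ctr")(x)

ctr_model = tf.keras.Model(inception_emb, ctr)
ctr_model.compile(optimizer="adam", loss="binary_crossentropy")
```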

Not only was the inference time improved, but so was the training process. This is possible only when we don't want to train end to end, since gradients can't backpropagate through EmbArk.

So whenever you use a model in production you should use EmbArk, right? Well,not always...

Caveats

There are three pretty strict assumptions here.

1. OOV embedding for new inputs is not a big deal

It doesn’t hurt us that the first time we see an image we won’t have itsembedding.

In our production system it's OK, since CTR is evaluated multiple times for the same item during a short period of time. We create lists of items we want to recommend every few minutes, so even if an item doesn't make it into the list because of a non-optimal CTR prediction, it will in the next cycle.

2. The rate of new inputs is low

It's true that at Taboola we get lots of new items all the time. But relative to the number of inferences we need to perform for already-known items, they are not that many.

3. Embeddings don’t change frequently

Since the embeddings are cached, we count on the fact that they don't change over time. If they do, we'll need to perform cache invalidation and recalculate the embeddings using Retina. If this happened a lot, we would lose the advantage of the architecture. For cases such as Inception or language modeling, this assumption holds, since semantics don't change significantly over time.
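
One common way to keep such invalidation cheap when it does happen (a standard caching pattern, not something described in the post) is to include the embedding model's version in the cache key, so a retrained model never reads stale embeddings and old entries simply stop being read:

```python
import hashlib

MODEL_VERSION = "inception-v3-2021"  # illustrative; bump when the embedding model changes

def cache_key(raw_input: bytes, model_version: str = MODEL_VERSION) -> str:
    # Keying on (model version, input) means a new model never serves stale
    # embeddings; entries for old versions can be garbage-collected lazily.
    digest = hashlib.sha256(raw_input).hexdigest()
    return f"{model_version}:{digest}"
```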

Some final thoughts

Sometimes using state-of-the-art models can be problematic due to their computational demands. By caching intermediate results (embeddings) we were able to overcome this challenge, and still enjoy state-of-the-art results.

This solution isn’t right for everyone, but if the three aforementionedassumptions hold for your application, you could consider using a similararchitecture.

By using a microservices paradigm, other teams in the company were able to useEmbArk for needs other than CTR prediction. One team for instance used EmbArk toget image and text embeddings for detecting duplicates across different items.But I’ll leave that story for another post...


Originally published by me at engineering.taboola.com.

