So you just finished designing that great neural network architecture of yours. It has a blazing number of 300 fully connected layers interleaved with 200 convolutional layers with 20 channels each, where the result is fed as the seed of a glorious bidirectional stacked LSTM with a pinch of attention. After training you get an accuracy of 99.99%, and you’re ready to ship it to production.
But then you realize the production constraints won’t allow you to run inference using this beast. You need the inference to be done in under 200 milliseconds.
In other words, you need to chop off half of the layers, give up on using convolutions, and let’s not even get started on the costly LSTM...
If only you could make that amazing model faster!
Sometimes you can
Here at Taboola we did it. Well, not exactly... Let me explain.
One of our models has to predict CTR (Click Through Rate) of an item, or in other words — the probability the user will like an article recommendation and click on it.
The model has multiple modalities as input, each of which goes through a different transformation. Some of them are:
- categorical features: these are embedded into a dense representation
- image: the pixels are passed through convolutional and fully connected layers
- text: after being tokenized, the text is passed through an LSTM which is followed by self attention
These processed modalities are then passed through fully connected layers in order to learn the interactions between the modalities, and finally, they are passed through an MDN (Mixture Density Network) layer.
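To make the shape of the model concrete, here’s a toy Keras sketch of the architecture described above. All layer sizes are made up, and a plain sigmoid head stands in for the MDN layer; the real production model is larger and more elaborate.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Categorical modality: embedded into a dense representation.
cat_in = layers.Input(shape=(1,), dtype="int32")
cat = layers.Flatten()(layers.Embedding(input_dim=10000, output_dim=16)(cat_in))

# Image modality: pixels go through convolutional and fully connected layers.
img_in = layers.Input(shape=(64, 64, 3))
img = layers.Conv2D(20, 3, activation="relu")(img_in)
img = layers.Dense(32, activation="relu")(layers.Flatten()(img))

# Text modality: tokens go through an LSTM followed by self attention.
txt_in = layers.Input(shape=(50,), dtype="int32")
txt = layers.Embedding(input_dim=20000, output_dim=32)(txt_in)
txt = layers.LSTM(32, return_sequences=True)(txt)
txt = layers.GlobalAveragePooling1D()(layers.Attention()([txt, txt]))

# Fully connected layers learn the interactions between the modalities;
# a sigmoid output stands in here for the MDN head used in production.
x = layers.Concatenate()([cat, img, txt])
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs=[cat_in, img_in, txt_in], outputs=out)
```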
As you can imagine, this model is slow.
We decided not to compromise on the predictive power of the model by trimming components, and instead came up with an engineering solution.
Cache me if you can
Let’s focus on the image component. The output of this component is a learned representation of the image. In other words, given an image, the image component outputs an embedding.
The model is deterministic, so the same image will always result in the same embedding. Computing that embedding is costly, so we can cache it. Let me elaborate on how we implemented it.
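In its simplest form this is just memoization. Here’s a minimal sketch of the idea, with a fake embedding function standing in for the expensive forward pass (in production the cache lives in Cassandra, not in process memory):

```python
from functools import lru_cache
import hashlib

@lru_cache(maxsize=100_000)
def embed(image_url: str) -> bytes:
    # Stand-in for the real, expensive forward pass (download + conv layers).
    # Like the real model it's deterministic in the URL, so memoizing is safe
    # as long as the model's weights don't change.
    return hashlib.sha256(image_url.encode()).digest()  # fake "embedding"
```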
The architecture (of the cache, not the model)
- We used a Cassandra database as the cache, which maps an image URL to its embedding.
- The service which queries Cassandra is called EmbArk (Embedding Archive, misspelled of course). It’s a gRPC server which gets an image URL from a client and retrieves the embedding from Cassandra. On a cache miss EmbArk sends an async request to embed that image (a code sketch of this path follows the list). Why async? Because we need EmbArk to respond with the result as fast as it can. Given it can’t wait for the image to be embedded, it returns a special OOV (Out Of Vocabulary) embedding.
- The async mechanism we chose to use is Kafka — a streaming platform used as a message queue.
- The next link is KFC (Kafka Frontend Client) — a Kafka consumer we implemented to pass messages synchronously to the embedding service, and to save the resulting embeddings in Cassandra.
- The embedding service is called Retina. It gets an image URL from KFC, downloads it, preprocesses it, and evaluates the convolutional layers to get the final embedding.
- The load balancing of all the components is done using Linkerd.
- EmbArk, KFC, Retina and Linkerd run inside Docker, and they are orchestrated by Nomad. This allows us to easily scale each component as we see fit.
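Here’s a minimal Python sketch of EmbArk’s cache-miss path. The table, topic and host names are assumptions for illustration, and the gRPC layer is elided; the real service isn’t published.

```python
import numpy as np
from cassandra.cluster import Cluster   # pip install cassandra-driver
from kafka import KafkaProducer         # pip install kafka-python

EMBEDDING_DIM = 128  # assumed; the real dimension isn't published
OOV_EMBEDDING = np.zeros(EMBEDDING_DIM, dtype=np.float32)

session = Cluster(["cassandra-host"]).connect("embeddings")   # assumed names
producer = KafkaProducer(bootstrap_servers="kafka-host:9092")

def get_embedding(image_url: str) -> np.ndarray:
    # Cache hit: the embedding is already in Cassandra (stored as a blob here).
    row = session.execute(
        "SELECT embedding FROM image_embeddings WHERE url = %s", (image_url,)
    ).one()
    if row is not None:
        return np.frombuffer(row.embedding, dtype=np.float32)
    # Cache miss: ask Retina (through Kafka, consumed by KFC) to embed the
    # image asynchronously, and answer right away with the OOV embedding.
    producer.send("images-to-embed", image_url.encode("utf-8"))
    return OOV_EMBEDDING
```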
This architecture was initially used for images. After proving its worth, wedecided to use it for other components as well, such as text.
EmbArk proved to be a nice solution for transfer learning too. Let’s say we believe the content of the image has a good signal for predicting CTR. Thus, a model trained for classifying the object in an image, such as Inception, would be valuable for our needs. We can load Inception into Retina, tell the model we intend to train that we want to use the Inception embeddings, and that’s it.
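As an illustration, loading a pretrained Inception into Retina could look something like this (an assumed setup using the publicly available InceptionV3 weights in Keras; Retina’s internals aren’t public):

```python
import numpy as np
import tensorflow as tf

# InceptionV3 without its classification head; global average pooling turns
# the last convolutional feature map into a 2048-d embedding.
inception = tf.keras.applications.InceptionV3(include_top=False, pooling="avg")

def embed_image(image: np.ndarray) -> np.ndarray:
    # image: an HxWx3 uint8 array with values in [0, 255]
    x = tf.image.resize(tf.cast(image, tf.float32), (299, 299))[tf.newaxis, ...]
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    return inception(x).numpy()[0]  # shape: (2048,)
```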
Not only was the inference time improved, but so was the training process. This is possible only when we don’t want to train end to end, since gradients can’t backpropagate through EmbArk.
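Concretely, in the toy Keras sketch from earlier, training against cached embeddings means replacing the image branch with a plain input of precomputed vectors. Nothing upstream of that input exists in the graph, which is exactly why no gradients can reach it:

```python
# Reusing cat_in/cat and txt_in/txt from the earlier sketch: the image branch
# becomes a fixed input of precomputed EmbArk embeddings.
img_emb_in = layers.Input(shape=(2048,), name="cached_image_embedding")

x = layers.Concatenate()([cat, img_emb_in, txt])
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model([cat_in, img_emb_in, txt_in], out)
```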
So whenever you use a model in production you should use EmbArk, right? Well,not always...
Caveats
There are three pretty strict assumptions here.
1. OOV embedding for new inputs is not a big deal
It doesn’t hurt us that the first time we see an image we won’t have itsembedding.
In our production system it’s OK, since CTR is evaluated multiple times for the same item during a short period of time. We create lists of items we want to recommend every few minutes, so even if an item doesn’t make it into the list because of a non-optimal CTR prediction, it will in the next cycle.
2. The rate of new inputs is low
It’s true that at Taboola we get lots of new items all the time. But relative to the number of inferences we need to perform for already known items, they are not that many.
3. Embeddings don’t change frequently
Since the embeddings are cached, we count on the fact that they don’t change over time. If they do, we’ll need to perform cache invalidation and recalculate the embeddings using Retina. If this happened a lot we would lose the advantage of the architecture. For cases such as Inception or language modeling, this assumption holds, since semantics don’t change significantly over time.
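If embeddings did change, say after retraining Retina’s model, one simple invalidation scheme (an assumption for illustration, not necessarily what we do) is to make the model version part of the cache key:

```python
MODEL_VERSION = "inception-v3-2019-01"  # hypothetical; bump on weight updates

def cache_key(image_url: str) -> str:
    # Keying by (model version, URL) makes every old embedding unreachable
    # after a model update, so "invalidation" is just a key change; stale
    # rows can then be dropped lazily, e.g. with a TTL.
    return f"{MODEL_VERSION}:{image_url}"
```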
Some final thoughts
Sometimes using state of the art models can be problematic due to theircomputational demands. By caching intermediate results (embeddings) we were ableto overcome this challenge, and still enjoy state of the art results.
This solution isn’t right for everyone, but if the three aforementionedassumptions hold for your application, you could consider using a similararchitecture.
By using a microservices paradigm, other teams in the company were able to useEmbArk for needs other than CTR prediction. One team for instance used EmbArk toget image and text embeddings for detecting duplicates across different items.But I’ll leave that story for another post...
Originally published by me at engineering.taboola.com.