Uncertainty for CTR Prediction: One Model to Clarify Them All

 

The Taboola team developed a neural network model that simultaneously estimates the probability that a recommended item is relevant to a user and the uncertainty of that prediction. The model consists of several modules that process item features and context information separately and then fuse the two for prediction. It also accounts for data uncertainty and measurement uncertainty, using techniques such as a Mixture Density Network (MDN) to better explain the data and make the model more robust. Finally, the model can identify sparse regions of the data and encourage exploration of those regions, leading to more accurate and effective recommendations.

🤔 **Item module**: Extracts item features such as the title and thumbnail and converts them into a dense vector representation used for downstream prediction and analysis.

🌍 **Context module**: Extracts context features such as the publisher, the user and the time of day, and converts them into a dense vector representation that reflects the recommendation environment.

🤝 **Fusion module**: Fuses the item and context information, in a fashion similar to collaborative filtering, to capture the interaction between the two.

📊 **Estimation module**: Predicts the click-through rate (CTR) and simultaneously estimates the uncertainty of that prediction. It accounts for data, measurement and model uncertainty, and uses a Mixture Density Network (MDN) to model them.

💡 **Uncertainty-driven exploration**: The model can identify sparse regions of the data and encourage exploration of those regions to improve recommendations. For example, it assigns higher uncertainty to unique titles or sparse item features, prompting the system to explore them more.

This is a joint post with Inbar Naor. Originally published at engineering.taboola.com.


In the first post of the series we discussed three types of uncertainty that can affect your model — data uncertainty, model uncertainty and measurement uncertainty. In the second post we talked about various methods to handle the model uncertainty specifically. Then, in our third post we showed how we can use the model's uncertainty to encourage exploration of new items in recommender systems.

Wouldn't it be great if we could handle all three types of uncertainty in a principled way using one unified model? In this post we'll show you how we at Taboola implemented a neural network that estimates both the probability of an item being relevant to the user and the uncertainty of this prediction.

Let's jump into the deep water

A picture is worth a thousand words, isn't it? And a picture containing a thousand neurons?...

In any case, this is the model we use. The model is composed of several modules. We'll explain the goal of each one, and then the picture will become clearer...

Item module

The model tries to predict the probability that an item will be clicked, i.e. — the CTR (Click Through Rate). To do so, we have a module that gets as input the item's features such as its title and thumbnail, and outputs a dense representation — a vector of numbers if you will.

Once the model is trained, this vector will contain the important information extracted out of the item.

Context module

We said the model predicts the probability of a click on an item, right? But inwhich context is the item shown?

Context can mean many things — the publisher, the user, the time of day, etc. This module gets as input the features of the context. It then outputs the dense representation of the context.

Fusion module

So we have the information extracted out of both the item and the context. For sure, there's some interaction between the two. For instance, an item about soccer will probably have a higher CTR on a sports publisher compared to a finance publisher.

This module fuses the two representations into one, in a similar fashion to collaborative filtering.
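To make the three modules above more concrete, here is a minimal PyTorch-style sketch of how they could fit together. The feature choices, embedding sizes and layer widths are illustrative assumptions for this post, not Taboola's production architecture.

```python
import torch
import torch.nn as nn

class ItemModule(nn.Module):
    """Maps item features (e.g. title and thumbnail embeddings) to a dense vector."""
    def __init__(self, title_dim, thumb_dim, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(title_dim + thumb_dim, 128), nn.ReLU(),
            nn.Linear(128, out_dim), nn.ReLU(),
        )

    def forward(self, title_emb, thumb_emb):
        return self.net(torch.cat([title_emb, thumb_emb], dim=-1))

class ContextModule(nn.Module):
    """Maps context features (publisher, user, time of day, ...) to a dense vector."""
    def __init__(self, n_publishers, n_user_buckets, out_dim=64):
        super().__init__()
        self.publisher_emb = nn.Embedding(n_publishers, 32)
        self.user_emb = nn.Embedding(n_user_buckets, 32)
        self.net = nn.Sequential(nn.Linear(64, out_dim), nn.ReLU())

    def forward(self, publisher_id, user_bucket):
        x = torch.cat([self.publisher_emb(publisher_id),
                       self.user_emb(user_bucket)], dim=-1)
        return self.net(x)

class FusionModule(nn.Module):
    """Fuses the two representations; the element-wise product echoes collaborative filtering."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())

    def forward(self, item_vec, context_vec):
        fused = torch.cat([item_vec, context_vec, item_vec * context_vec], dim=-1)
        return self.net(fused)
```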

Estimation module

At the end we have a module whose goal is to predict the CTR. In addition, it also estimates the uncertainty of that CTR estimate.

I guess you're mostly uncertain about how this module works, so let's shed some light on it.

We'll walk you through the three types of uncertainty we've mentioned, and show you how each one is handled by our model. First, let's tackle the data uncertainty.

Data uncertainty

Let's take some generic neural network trained on a regression task. One common loss function is MSE — Mean Squared Error. We like this loss because it's intuitive, right? You want to minimize the errors... But it turns out that when you minimize MSE, you implicitly maximize the likelihood of the data — assuming the label is distributed normally with a fixed standard deviation \(\sigma\). This \(\sigma\) is the noise inherent in the data.
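To see why (this is the standard derivation, not something specific to our model): if we assume \(y \mid x \sim \mathcal{N}(\mu_\theta(x), \sigma^2)\) with a fixed \(\sigma\), the negative log-likelihood of a single example is

\[ -\log p(y \mid x) = \frac{(y - \mu_\theta(x))^2}{2\sigma^2} + \log \sigma + \tfrac{1}{2}\log 2\pi, \]

and with \(\sigma\) held constant, minimizing the squared error term is exactly maximizing the likelihood.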

One thing we can do is explicitly maximize the likelihood by introducing a new node which we'll call \(\sigma\). Plugging it into the likelihood equation and letting the gradients propagate enables this node to learn to output the data noise.

We didn't achieve anything different, right? We got an equivalent result to the initial MSE-based model. However, now we can introduce a link from the last layer to \(\sigma\):

Now we're getting into something interesting! \(\sigma\) is now a function of the input. It means the model can learn to associate different levels of data uncertainty with different inputs.
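As a minimal sketch of this idea (not the production code — the layer shapes are placeholders), an input-dependent \(\sigma\) head and the matching Gaussian negative log-likelihood loss could look like this:

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Predicts a mean and an input-dependent standard deviation."""
    def __init__(self, in_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, 1)
        self.log_sigma = nn.Linear(in_dim, 1)  # predict log(sigma) for numerical stability

    def forward(self, h):
        mu = self.mu(h).squeeze(-1)
        sigma = self.log_sigma(h).squeeze(-1).exp()  # exp() keeps sigma positive
        return mu, sigma

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2), averaged over the batch."""
    return ((y - mu) ** 2 / (2 * sigma ** 2) + torch.log(sigma)).mean()
```

Training against `gaussian_nll` instead of MSE lets the \(\sigma\) branch absorb the data noise separately for each input.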

We can make the model even more powerful. Instead of estimating a single Gaussian distribution, we can estimate a mixture of Gaussians. The more Gaussians we put into the mix, the more capacity the model will have — and the more prone it will be to overfitting, so be careful with that.

This architecture is called MDN — Mixture Density Network. It was introduced by Bishop in 1994. Here is an example of what it captures:

We have two groups of similar items — one about shopping, and the other about sports.

It turns out the shopping items tend to have more variable CTR — maybe due to trendiness. Indeed, if we ask the model to estimate the uncertainty of one item in each group (the dotted graph in the figure), we get higher uncertainty for shopping compared to sports.
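For completeness, here is a small sketch of what an MDN head and its mixture negative log-likelihood could look like; the number of components and the layer shapes are illustrative, not taken from the paper:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNHead(nn.Module):
    """Mixture Density Network head: predicts K mixture weights, means and stds."""
    def __init__(self, in_dim, k=3):
        super().__init__()
        self.pi = nn.Linear(in_dim, k)         # mixture weight logits
        self.mu = nn.Linear(in_dim, k)         # component means
        self.log_sigma = nn.Linear(in_dim, k)  # component log standard deviations

    def forward(self, h):
        log_pi = F.log_softmax(self.pi(h), dim=-1)
        return log_pi, self.mu(h), self.log_sigma(h).exp()

def mdn_nll(y, log_pi, mu, sigma):
    """Negative log-likelihood of y under the predicted mixture of Gaussians."""
    y = y.unsqueeze(-1)  # (batch, 1) so it broadcasts against the K components
    log_component = (-(y - mu) ** 2 / (2 * sigma ** 2)
                     - torch.log(sigma) - 0.5 * math.log(2 * math.pi))
    return -torch.logsumexp(log_pi + log_component, dim=-1).mean()
```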

So data uncertainty is behind us. What’s next?

Measurement uncertainty

This one is a bit more tricky. In the first post we explained that sometimes measurement can be noisy. This might result in noisy features or even noisy labels. In our case, our label \(y\) is the empirical CTR of an item — the number of times it was clicked so far, divided by the number of times it was shown.

Let's say the true CTR of an item is \(y^*\) — that is, without measurement noise. This would be the CTR had we shown the item an infinite number of times in the context. But time is finite (at least the time we've got), so we showed it only a finite number of times. We measured an observed CTR \(y\). This \(y\) has measurement noise — which we denote by \(\epsilon\).

Next, we assume \(\epsilon\) is distributed normally with \(\sigma_\epsilon\) as its standard deviation. \(\sigma_\epsilon\) is a function of \(r\) — the number of times we showed the item. The bigger \(r\) is, the smaller \(\sigma_\epsilon\) gets, which makes \(y\) more similar to \(y^*\).

At the end of the day, after sparing you the mathematical details (which you can find in our paper), we get a new likelihood equation for the observed CTR.

It is the same as the likelihood we have in the MDN architecture of a mixture of Gaussians, with one difference — the error term is split in two:
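In a form consistent with the definitions above (up to the exact notation used in the paper), the combined likelihood is a mixture of Gaussians whose component variances are the sum of the two terms:

\[ p(y \mid x, r) = \sum_{i=1}^{K} \pi_i(x)\, \mathcal{N}\!\left(y \mid \mu_i(x),\; \sigma_i^2(x) + \sigma_\epsilon^2(r)\right), \]

where \(\sigma_i(x)\) captures the data uncertainty and \(\sigma_\epsilon(r)\) the measurement uncertainty, which shrinks as the number of impressions \(r\) grows.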

Now that the model is able to explain each uncertainty using a different term, the data uncertainty is not polluted by the measurement uncertainty.

Besides being able to explain the data in a better way, this allows us to use more data in the training process. This is due to the fact that prior to this work we filtered out data with too much noise.

Last but not least

In a previous post we discussed how to handle model uncertainty. One of the approaches we described was using dropout at inference time.
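As a rough sketch of that approach (Monte Carlo dropout; the number of passes here is an arbitrary choice for illustration):

```python
import torch

def mc_dropout_uncertainty(model, x, n_passes=30):
    """Estimate model uncertainty by keeping dropout active at inference time.

    Runs n_passes stochastic forward passes and returns the mean prediction
    and its standard deviation across passes.
    """
    model.train()  # keeps dropout sampling active; assumes no batch-norm layers to worry about
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_passes)], dim=0)
    return preds.mean(dim=0), preds.std(dim=0)
```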

Being able to estimate model uncertainty allows us to better understand what the model doesn't know because of a lack of data. So let's put it to the test!

Let's see if unique titles are associated with high uncertainty. We'll map each title in the training set to a dense representation (e.g. average word2vec embeddings) and expect the model to be less certain about unique titles — titles that are mapped to sparse regions of the embedding space.

To test it, we identified the sparse and dense regions using KDE (Kernel Density Estimation). This is a method for estimating the PDF (Probability Density Function) of our space. Next, we asked the model to estimate the uncertainty associated with each title. It turns out that indeed the model has higher uncertainty in the sparse regions!
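A sketch of that check with scikit-learn, assuming we already have one embedding vector per title (the file name and bandwidth are placeholders):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# title_embeddings: (n_titles, d) array, e.g. averaged word2vec vectors per title
title_embeddings = np.load("title_embeddings.npy")  # placeholder path

# Fit a KDE over the embedding space and score each title's neighborhood density.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(title_embeddings)
log_density = kde.score_samples(title_embeddings)  # low values = sparse regions

# Split titles into the sparsest and densest deciles for comparison.
n = len(log_density) // 10
sparse_idx = np.argsort(log_density)[:n]
dense_idx = np.argsort(log_density)[-n:]
# We would then query the model's uncertainty for both groups and expect it
# to be higher for sparse_idx than for dense_idx.
```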

Nice... What would happen if we show the model more titles from the sparse regions? Will it be more certain about these regions? Let's test it out!

We took a bunch of similar titles about cars and removed them from the training set. Effectively, this altered their region in the space from dense to sparse. Next, we estimated the model uncertainty over these titles. As expected, the uncertainty was high.

Finally, we added only one of the titles back to the training set and retrained the model. To our satisfaction, the uncertainty was now reduced for all of these items. Neat!

As we saw in the post about exploration-exploitation, we can encourage exploration of these sparse regions. After doing so, the uncertainty will decrease. This will result in a natural decay of exploration of that region.

Final thoughts

In this post we elaborated on how we model all three types of uncertainty — data, model and measurement — in a principled way, using one unified model.

We encourage you to think about how you can use uncertainty in your application as well! Even if you don't need to explicitly model uncertainty in your prediction, you might benefit from using it in the training process — if your model can better understand how data is generated and how uncertainty affects the game, it might be able to improve.


This is the fourth post of a series related to a paper we're presenting in a workshop at this year's KDD conference: deep density networks and uncertainty in recommender systems.

The first post can be found here.

The second post can be found here.

The third post can be found here.


Originally published at engineering.taboola.com by me and Inbar Naor.
