This is a joint post with Inbar Naor. Originally published at engineering.taboola.com.
In the first post of the series we discussed three types of uncertainty that can affect your model — data uncertainty, model uncertainty and measurement uncertainty. In the second post we talked about various methods to handle the model uncertainty specifically. Then, in our third post we showed how we can use the model's uncertainty to encourage exploration of new items in recommender systems.
Wouldn't it be great if we could handle all three types of uncertainty in a principled way using one unified model? In this post we'll show you how we at Taboola implemented a neural network that estimates both the probability of an item being relevant to the user, and the uncertainty of this prediction.
Let's jump into the deep water
A picture is worth a thousand words, isn't it? And a picture containing a thousand neurons?...
In any case, this is the model we use. The model is composed of several modules. We'll explain the goal of each one, and then the picture will become clearer...
Item module
The model tries to predict the probability that an item will be clicked, i.e. — the CTR (Click Through Rate). To do so, we have a module that gets as input the item's features such as its title and thumbnail, and outputs a dense representation — a vector of numbers if you will.
Once the model is trained, this vector will contain the important information extracted out of the item.
Context module
We said the model predicts the probability of a click on an item, right? But inwhich context is the item shown?
Context can mean many things — the publisher, the user, the time of day, etc. This module gets as input the features of the context. It then outputs the dense representation of the context.
Fusion module
So we have the information extracted out of both the item and the context. For sure, there's some interaction between the two. For instance, an item about soccer will probably have a higher CTR on a sports publisher than on a finance publisher.
This module fuses the two representations into one, in a similar fashion to collaborative filtering.
Estimation module
At the end we have a module whose goal is to predict the CTR. In addition, it also estimates uncertainty about the CTR estimation.
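The flow through the four modules can be sketched as a simple forward pass. This is a hypothetical minimal sketch in numpy — the layer sizes, activations and weights are made up for illustration and are not Taboola's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    """A single fully connected layer with ReLU activation."""
    return np.maximum(0.0, x @ w + b)

# Hypothetical dimensions: 16 item features and 10 context features,
# each mapped to an 8-dimensional dense representation.
W_item, b_item = rng.normal(size=(16, 8)), np.zeros(8)
W_ctx, b_ctx = rng.normal(size=(10, 8)), np.zeros(8)
W_fuse, b_fuse = rng.normal(size=(16, 8)), np.zeros(8)
W_mu, b_mu = rng.normal(size=(8, 1)), np.zeros(1)
W_sigma, b_sigma = rng.normal(size=(8, 1)), np.zeros(1)

def forward(item_feats, ctx_feats):
    item_vec = dense(item_feats, W_item, b_item)   # item module
    ctx_vec = dense(ctx_feats, W_ctx, b_ctx)       # context module
    fused = dense(np.concatenate([item_vec, ctx_vec]),
                  W_fuse, b_fuse)                  # fusion module
    mu = fused @ W_mu + b_mu                       # estimation: CTR prediction
    sigma = np.exp(fused @ W_sigma + b_sigma)      # estimation: uncertainty (>0)
    return mu.item(), sigma.item()

mu, sigma = forward(rng.normal(size=16), rng.normal(size=10))
```

The `exp` on the uncertainty head is one common way to keep a predicted standard deviation positive; the estimation module is described in detail below.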
I guess you're mostly uncertain about how this module works, so let's shed some light on it.
We'll walk you through the three types of uncertainty we've mentioned, and show you how each one is handled by our model. First, let's tackle the data uncertainty.
Data uncertainty
Let's take some generic neural network trained on a regression task. One common loss function is MSE — Mean Squared Error. We like this loss because it's intuitive, right? You want to minimize the errors... But it turns out that when you minimize MSE, you implicitly maximize the likelihood of the data — assuming the label is distributed normally with a fixed standard deviation \(\sigma\). This \(\sigma\) is the noise inherent in the data.
One thing we can do is explicitly maximize the likelihood by introducing a new node which we'll call \(\sigma\). Plugging it into the likelihood equation and letting the gradients propagate enables this node to learn to output the data noise.
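To see why such a \(\sigma\) node ends up learning the data noise, here is a toy numpy sketch (the data values are made up): with the predictions fixed at the MSE-optimal value, the Gaussian negative log-likelihood is minimized exactly when \(\sigma\) equals the standard deviation of the residuals.

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of labels y under N(mu, sigma^2)."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + (y - mu)**2 / (2 * sigma**2))

y = np.array([0.8, 1.2, 1.0, 0.6, 1.4])  # toy labels
mu = y.mean()                            # the MSE-optimal prediction

# Scanning sigma shows the NLL is minimized at the residual std —
# i.e. gradient descent would drive the sigma node to the data noise.
sigmas = np.linspace(0.05, 1.0, 200)
best = sigmas[np.argmin([gaussian_nll(y, mu, s) for s in sigmas])]
print(best, y.std())  # best sigma is (up to grid resolution) the residual std
```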
We didn't achieve anything different, right? We got an equivalent result to the initial MSE based model. However, now we can introduce a link from the last layer to \(\sigma\):
Now we're getting into something interesting! \(\sigma\) is now a function of the input. It means the model can learn to associate different levels of data uncertainty with different inputs.
We can make the model even more powerful. Instead of estimating a Gaussian distribution, we can estimate a mixture of Gaussians. The more Gaussians we put into the mix, the more capacity the model will have — and the more prone to overfitting it will be, so be careful with that.
This architecture is called MDN — Mixture Density Network. It was introduced by Bishop in 1994. Here is an example of what it captures:
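The MDN loss is just the negative log of a mixture-of-Gaussians density, where the network outputs the raw parameters. A minimal sketch of that loss (the component values below are invented, and a real head would emit the raw outputs per input):

```python
import numpy as np

def mdn_nll(y, logits, mus, log_sigmas):
    """Negative log-likelihood of a label y under a mixture of Gaussians,
    parameterized by raw network outputs per component."""
    weights = np.exp(logits) / np.exp(logits).sum()  # softmax mixing weights
    sigmas = np.exp(log_sigmas)                      # positive std devs
    densities = (weights / (sigmas * np.sqrt(2 * np.pi))
                 * np.exp(-(y - mus)**2 / (2 * sigmas**2)))
    return -np.log(densities.sum())

# Two hypothetical components: a narrow one around CTR 0.02
# and a wide one around CTR 0.10.
nll = mdn_nll(0.03,
              logits=np.array([0.0, 0.0]),
              mus=np.array([0.02, 0.10]),
              log_sigmas=np.log(np.array([0.01, 0.05])))
```

Summing over a batch of labels and backpropagating through this expression is what trains the mixture parameters.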
We have two groups of similar items — one about shopping, and the other about sports.
It turns out the shopping items tend to have more variable CTR — maybe due to trendiness. Indeed, if we ask the model to estimate the uncertainty of one item in each group (the dotted graph in the figure), we get higher uncertainty for shopping compared to sports.
So data uncertainty is behind us. What’s next?
Measurement uncertainty
This one is a bit more tricky. In the first post we explained that sometimes measurements can be noisy. This might result in noisy features or even noisy labels. In our case, our label \(y\) is the empirical CTR of an item — the number of times it was clicked so far, divided by the number of times it was shown.
Let's say the true CTR of an item is \(y^*\) — that is, without measurement noise. This would be the CTR had we shown the item an infinite number of times in the context. But time is finite (at least the time we've got), so we showed it only a finite number of times. We measured an observed CTR \(y\). This \(y\) has measurement noise — which we denote by \(\epsilon\).
Next, we assume \(\epsilon\) is distributed normally with \(\sigma_\epsilon\) as the standard deviation. \(\sigma_\epsilon\) is a function of \(r\) — the number of times we showed the item. The bigger \(r\) is, the smaller \(\sigma_\epsilon\) gets, which makes \(y\) more similar to \(y^*\).
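One natural choice for such a \(\sigma_\epsilon(r)\) — an illustrative assumption here, treating each of the \(r\) impressions as a Bernoulli click trial, not necessarily the exact form used in the paper — is the binomial standard error:

```python
import numpy as np

def measurement_std(ctr, impressions):
    """Std of the observed CTR around the true CTR, modeling clicks as
    `impressions` independent Bernoulli trials (an illustrative assumption)."""
    return np.sqrt(ctr * (1 - ctr) / impressions)

# The more impressions r, the smaller the measurement noise:
print(measurement_std(0.05, 100))     # ~0.0218
print(measurement_std(0.05, 10_000))  # ~0.00218
```

Either way, the key property is the one stated above: \(\sigma_\epsilon\) shrinks as \(r\) grows.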
At the end of the day, after we spare you the mathematical details (which you can find in our paper), we get this likelihood equation:
This is the same as the likelihood we have in the MDN architecture of a mixture of Gaussians, with one difference — the error term is split into two:
- data uncertainty (\(\sigma_i\))
- measurement uncertainty (\(\sigma_\epsilon\))
Now that the model is able to explain each uncertainty using a different term, the data uncertainty is not polluted by the measurement uncertainty.
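The split can be sketched for a single Gaussian component: the total variance of an observation is the learned data variance plus the known measurement variance, and only the former is a trainable output. A minimal illustration (not the full mixture likelihood from the paper):

```python
import numpy as np

def split_nll(y, mu, sigma_data, sigma_meas):
    """Gaussian NLL where the total variance is the sum of the learned
    data variance and the (known, per-item) measurement variance."""
    var = sigma_data**2 + sigma_meas**2
    return 0.5 * np.log(2 * np.pi * var) + (y - mu)**2 / (2 * var)

# With sigma_meas supplied from the impression count, the gradient only
# pushes sigma_data to explain the spread that measurement noise cannot.
loss = split_nll(y=0.04, mu=0.05, sigma_data=0.01, sigma_meas=0.005)
```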
Besides being able to explain the data in a better way, this allows us to use more data in the training process. This is due to the fact that prior to this work we filtered out data with too much noise.
Last but not least
In a previous post we discussed how to handle model uncertainty. One of the approaches we described was using dropout at inference time.
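As a reminder, that approach (MC dropout) keeps dropout active at inference and aggregates many stochastic forward passes; the spread across passes serves as the model uncertainty. A toy numpy sketch with made-up weights and a single layer:

```python
import numpy as np

rng = np.random.default_rng(42)

# A hypothetical "trained" layer — the weights are invented for the sketch.
W = rng.normal(size=(4, 1))

def predict_with_dropout(x, n_samples=200, p_drop=0.5):
    """MC dropout: keep dropout ON at inference and aggregate many
    stochastic forward passes into a mean prediction and an uncertainty."""
    preds = []
    for _ in range(n_samples):
        mask = rng.random(x.shape) > p_drop          # drop units at random
        preds.append((x * mask / (1 - p_drop)) @ W)  # inverted-dropout scaling
    preds = np.array(preds)
    return preds.mean(), preds.std()  # std across passes ~ model uncertainty

mu, model_uncertainty = predict_with_dropout(np.ones(4))
```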
Being able to estimate model uncertainty allows us to better understand what the model doesn't know because of lack of data. So let's put it to the test!
Let's see if unique titles are associated with high uncertainty. We'll map each title in the training set to a dense representation (e.g. average word2vec embeddings) and expect the model to be less certain about unique titles — titles that are mapped to sparse regions of the embedding space.
To test it, we identified the sparse and dense regions by calculating KDE (Kernel Density Estimation). This is a method for estimating the PDF (Probability Density Function) of our space. Next, we asked the model to estimate the uncertainty associated with each title. It turns out that indeed the model has higher uncertainty in the sparse regions!
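A Gaussian KDE sparsity check can be sketched in a few lines — here with random 2-d points standing in for title embeddings, and an arbitrary bandwidth, purely for illustration:

```python
import numpy as np

def kde_density(point, data, bandwidth=0.5):
    """Gaussian kernel density estimate at `point`, given embedded
    titles in `data` (one embedding vector per row)."""
    dists = np.linalg.norm(data - point, axis=1)
    return np.mean(np.exp(-0.5 * (dists / bandwidth)**2))

rng = np.random.default_rng(1)
titles = rng.normal(size=(500, 2))   # stand-in for title embeddings

dense_point = np.zeros(2)            # at the center of the cloud
sparse_point = np.array([4.0, 4.0])  # far away from all titles
print(kde_density(dense_point, titles) > kde_density(sparse_point, titles))  # True
```

Titles whose KDE value falls below some threshold would be flagged as living in a sparse region, where we expect (and observed) higher model uncertainty.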
Nice... What would happen if we showed the model more titles from the sparse regions? Would it be more certain about these regions? Let's test it out!
We took a bunch of similar titles about cars and removed them from the training set. Effectively, this altered their region in the space from dense to sparse. Next, we estimated the model uncertainty over these titles. As expected, the uncertainty was high.
Finally, we added just one of the titles back to the training set and retrained the model. To our satisfaction, the uncertainty was now reduced for all of these items. Neat!
As we saw in the post about exploration-exploitation, we can encourage exploration of these sparse regions. After doing so, the uncertainty will decrease. This will result in a natural decay of exploration of that region.
Final thoughts
In this post we elaborated on how we model all three types of uncertainty —data, model and measurement — in a principled way, using one unified model.
We encourage you to think about how you can use uncertainty in your application as well! Even if you don't need to explicitly model uncertainty in your prediction, you might benefit from using it in the training process — if your model can better understand how data is generated and how uncertainty affects the game, it might be able to improve.
This is the fourth post of a series related to a paper we're presenting at a workshop at this year's KDD conference: deep density networks and uncertainty in recommender systems.
The first post can be found here.
The second post can be found here.
The third post can be found here.