This is a joint post with Inbar Naor. Originally published at engineering.taboola.com.
Now that we know what uncertainty types exist and learned some ways to model them, we can start talking about how to use them in our application.
In this post we’ll introduce the exploration-exploitation problem and show you how uncertainty can help in solving it. We’ll focus on exploration in recommender systems, but the same idea can be applied in many applications of reinforcement learning — self-driving cars, robots, etc.
Problem Setting
The goal of a recommender system is to recommend items that the users might find relevant. At Taboola, relevance is expressed via a click: we show a widget containing content recommendations, and the users choose if they want to click on one of the items.
The probability of the user clicking on an item is called Click Through Rate (CTR). If we knew the CTR of all the items, the problem of which items to recommend would be easy: simply recommend the items with the highest CTR.
The problem is that we don’t know what the CTR is. We have a model that estimates it, but it’s obviously not perfect. Some of the reasons for the imperfection are the uncertainty types inherent in recommender systems, which we discussed in the first post of the series.
The Exploitation vs Exploration Tradeoff
So now we’re facing a challenging situation — one that we’re all familiar with from our day-to-day lives: imagine you’ve just entered an ice cream shop. You now face a crucial decision — out of about 30 flavors you need to choose only one!
You can go with two strategies: either go with that favorite flavor of yours that you already know is the best; or explore new flavors you’ve never tried before, and maybe find a new best flavor.
These two strategies — exploitation and exploration — can also be used when recommending content. We can either exploit items that have high CTR with high certainty — maybe because these items have been shown thousands of times to similar users; or we can explore new items we haven’t shown to many users in the past. Incorporating exploration into your recommendation strategy is crucial — without it, new items don’t stand a chance against older, more familiar ones.
Let’s explore exploration approaches
The easiest exploration-exploitation approach you can implement is the ϵ-greedy algorithm, where you allocate ϵ percent of the traffic to exploring new items in a random manner. The rest of the traffic is reserved for exploitation.
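A minimal sketch of the ϵ-greedy scheme, assuming we already have per-item CTR estimates (the function and variable names are illustrative, not our production code):

```python
import random

def epsilon_greedy(items, estimated_ctr, epsilon=0.05):
    """Pick an item: with probability epsilon explore a uniformly random item,
    otherwise exploit the item with the highest estimated CTR."""
    if random.random() < epsilon:
        return random.choice(items)                        # exploration
    return max(items, key=lambda item: estimated_ctr[item])  # exploitation
```

With `epsilon=0.0` this degenerates to pure exploitation; with `epsilon=1.0` it is pure random exploration.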
Despite not being optimal, this method is easy to understand. It can serve as a solid baseline for more sophisticated approaches. So how can we look for good items in a wiser manner?
A more advanced approach — Upper Confidence Bound (UCB) — uses uncertainty. Each item is associated with its expected CTR, and a confidence bound over that CTR. The confidence bound captures how uncertain we are about the item’s CTR. The vanilla UCB algorithm keeps track of the expected CTR and confidence bound using empirical information alone: for each item we keep track of the empirical CTR (what percent of similar users have clicked on it), and the confidence bound is calculated by assuming a binomial distribution.
Take for example the plain chocolate flavor you always order. You know it’s good — you give it 8 stars out of 10. Today a new flavor has arrived. You have no empirical information about it, which means it could be anything from 1 to 10 stars. Using this confidence interval, if you wanted to explore you’d go with the new flavor, since there’s a chance it’ll be a 10-star flavor.
That strategy is exactly what UCB is all about — you choose the item with the highest upper confidence bound value — in our case, the confidence bound over the CTR estimation. The motivation behind this strategy is that over time the empirical CTR will tend towards the true CTR, and the confidence bound will shrink to 0. After enough time, we’ll explore everything.
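A sketch of vanilla UCB over per-item click counts, assuming a normal approximation to the binomial confidence bound (the threshold `z` and the convention of giving never-shown items the maximal score are illustrative choices, not necessarily the ones used in production):

```python
import math

def ucb_score(clicks, impressions, z=1.96):
    """Empirical CTR plus a binomial (normal-approximation) confidence bound.
    Items never shown get the maximal score of 1.0, so they are explored first."""
    if impressions == 0:
        return 1.0
    ctr = clicks / impressions
    bound = z * math.sqrt(ctr * (1.0 - ctr) / impressions)
    return min(ctr + bound, 1.0)

def choose_ucb(stats):
    """stats maps item -> (clicks, impressions); pick the highest upper bound."""
    return max(stats, key=lambda item: ucb_score(*stats[item]))
```

Note how the bound shrinks as impressions accumulate: an item with 50 clicks out of 1,000 impressions has a wider bound than one with 500 out of 10,000, even though both have the same empirical CTR.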
Another popular approach is the Thompson Sampling method. In this approach we use the entire estimated distribution over the item’s CTR instead of only a confidence bound: for each item we sample a CTR out of its distribution, and recommend the item whose sampled CTR is the highest.
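A minimal Thompson Sampling sketch, assuming a Beta posterior over each item’s CTR with a uniform Beta(1, 1) prior (a standard textbook choice for binary click feedback, used here for illustration):

```python
import random

def thompson_sample(stats):
    """stats maps item -> (clicks, impressions). Model each item's CTR as a
    Beta(clicks + 1, misses + 1) posterior, draw one CTR sample per item,
    and recommend the item with the highest sampled value."""
    sampled = {
        item: random.betavariate(clicks + 1, (impressions - clicks) + 1)
        for item, (clicks, impressions) in stats.items()
    }
    return max(sampled, key=sampled.get)
```

Uncertain items get wide posteriors, so they occasionally win the draw and get shown — exploration falls out of the sampling itself, with no explicit bound to tune.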
These approaches might work well when the number of available items is fixed. Unfortunately, at Taboola thousands of new items enter the system every day, and others become obsolete. By the time we get a reasonable confidence bound for an item, it might have left the system. Our efforts would have been in vain. It’s like doing a world tour, each day visiting a new town with a vast amount of ice cream flavors to explore. The horror!
We need an approach that can estimate the CTR of a new item without showing it even once. We need some food critic magazine that will guide us through the buffet of content recommendations.
Consider a new type of chocolate flavor that has just arrived. Since you know you love chocolate, you have a pretty good guess that you’ll like the new flavor too. In the vanilla UCB approach (no, that’s not the name of a flavor), you won’t be able to infer this — it relies on empirical information only.
In a future post we’ll elaborate on how we use a neural network to estimate the CTR of a new item, as well as the level of uncertainty. Using this uncertainty, we can apply the UCB approach in order to explore new items. Unlike vanilla UCB, which relies on empirical data only, here we can use the model’s estimation to avoid showing items with low CTR. We can gamble on the horses we think will win.
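The idea can be sketched as a model-based UCB score, assuming the network outputs a CTR estimate and an uncertainty (standard deviation) per item; the exploration weight `k` is a hypothetical knob, not a value from our system:

```python
def model_ucb_score(predicted_ctr, predicted_std, k=2.0):
    """Model-based UCB: the model's CTR estimate plus k standard deviations
    of its uncertainty. A new item can score high through a promising CTR
    estimate or high uncertainty -- but unlike vanilla UCB, an item the model
    is confident is bad stays low even with zero impressions."""
    return predicted_ctr + k * predicted_std
```

This is what lets us skip the cold-start penalty of empirical UCB: the model’s prior knowledge (e.g. “users who like chocolate will probably like this new chocolate flavor”) replaces the maximal bound that vanilla UCB assigns to every unseen item.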
Online metrics and results
How can we know how well we explore new items? We need some exploration throughput metric. At Taboola we have an A/B testing infrastructure supporting many models running on different shares of traffic.
Back to ice cream! Let’s say you brought your friends to help you explore the different flavors. Obviously, the friend who randomly picks flavors has the best exploration throughput, but not the smartest. The other friend, who orders the flavor the others have found tasty, enjoys the most, but contributes nothing to the exploration effort.
At Taboola we measure exploration throughput as follows: for each item that has been shown enough times, and in enough different contexts (e.g., different websites), we declare that the item has crossed the exploration phase. Next, we analyze which models contributed to this successful effort. In order to count, a model has to have shown that item enough times.
Using this perspective, the throughput of a model is defined to be the number of items it contributed to.
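The metric above can be sketched as follows; the thresholds and the data layout are illustrative assumptions, since the post doesn’t specify the actual values:

```python
from collections import Counter

def model_throughput(item_stats, min_total=1000, min_contexts=5, min_per_model=100):
    """item_stats maps item -> (impressions_per_model: dict, contexts: set).
    An item crosses the exploration phase once its total impressions and its
    number of distinct contexts (e.g. websites) pass the thresholds; every
    model that showed it at least min_per_model times gets credit. A model's
    throughput is the number of items it got credit for."""
    throughput = Counter()
    for impressions_per_model, contexts in item_stats.values():
        if sum(impressions_per_model.values()) < min_total:
            continue  # not shown enough times overall
        if len(contexts) < min_contexts:
            continue  # not shown in enough different contexts
        for model, n in impressions_per_model.items():
            if n >= min_per_model:
                throughput[model] += 1
    return throughput
```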
Using this metric, we were able to assert that showing items randomly indeed yields the best throughput, but with a tendency for bad items. A model that doesn’t use the UCB approach shows good items, but has worse throughput. The model with UCB is somewhere in between in terms of throughput, and shows only slightly worse items compared to the non-UCB model.
Hence, we conclude that our UCB model strikes a good tradeoff between exploring new items and choosing good ones. We believe this tradeoff is worthwhile in the long term.
Final thoughts
The exploration-exploitation problem is an exciting challenge for many companies in the recommender systems domain. We hope our advancements will serve others in their journey of providing the best service to their users. We believe this is a small step in a big journey yet to be fulfilled, and we’re intrigued by the thought of what shape this field will take in the following years.
In the next post of the series we’ll elaborate on the model we used for estimating CTR and uncertainty, so stay tuned.
This is the third post of a series related to a paper we’re presenting in a workshop at this year’s KDD conference: deep density networks and uncertainty in recommender systems.
The first post can be found here.
The second post can be found here.