VRDS, an effective method to reduce the variance of Data Shapley, based on a stratified sampling technique with precomputed coefficients.In two of our previous paper pills about Data Valuation methods, namely Beta-Shapley and Data Banzhaf, we discussed how to make thecomputed values more robust to noise in the utility evaluation.Whereas [Kwo22B] and [Wan23D] addressthe noise in stochastic utilities, in the paper we study here, [Wu23V], the authors improve the quality of the values bytackling a different problem. They reduce the variance of the Monte Carloestimate using stratified sampling, similarly to previous work in [Mal14B] and [Cas17I]. They call theirmethod Variance Reduce Data Shapley (VRDS).Starting with the combinatorial formulation of Shapley values, one can dividethe coalitions into strata according to their size when calculating the marginalcontributions:$$\begin{align}v(xi)&= \frac{1}{n} \sum{S \subseteq D \setminus{x_i}, |S| = k} \binom{n-1}{k}^{-1} \left[u(S \cup {xi}) − u(S)\right]\&= \frac{1}{n} \sum{k=0}^{n-1} \binom{n-1}{k}^{-1} \sum_{S \subseteq D \setminus{x_i}, |S| = k} \left[u(S \cup {x_i}) − u(S)\right].\end{align}$$Figure 1: Example of stratified samplingfor the set ${1, 2, 3, 4}.$Grouping by size the coalitions which do not contain the data point $i,$we have $n$ strata $S_0, S1, \dots , S{n−1}$ such that$S_k = {S \subseteq N \ {i}, |S| = k}$ contains all the coalitions with size $k.$The marginal contribution of data point $i$ within stratum $S_k$ is:$$v^k(xi) = \binom{n-1}{k}^{-1} \sum{S \subseteq D \setminus{x_i}, |S| = k} \left[u(S \cup {x_i}) − u(S)\right],$$and with it, the Shapley value of point $i$ can be written as:$$v(xi) = \frac{1}{n} \sum{k=0}^{n-1} v^k(x_i).$$A standard Monte Carlo approximation (to reduce the number of evaluations of $u$) isthen:$$\hat{v}(xi)= \frac{1}{n} \sum{k=0}^{n-1} \hat{v}^k(xi)= \frac{1}{n} \sum{k=0}^{n-1} \frac{1}{mk} \sum{j=1}^{m_k}\left[u(S^k_j \cup {x_i}) − u(S^k_j)\right],$$where $S^k_j$ is the $j$th sample coalition in stratum $k$ and $m_k$is the number of samples in the stratum. Grouping coalitions by size intostrata allows us to optimise the number of samples taken from each stratum.In Theorem 4.1 of [Wu23V], the authors show that thechoice of $mk$ which optimally reduces variance for a given number of samples$m$ is proportional to $\sigma{i,k},$ the standard deviation of$\hat{v}^k(x_i)$:$$\tilde{m}k = m\frac{\sigma{i,k}}{\sum{j=0}^{n−1} \sigma{i,j}},\quad k = 0, \dots , n − 1.$$Because the variance of the estimators $\hat{v}^k(xi)$ is unknown, the authorssuggest substituting a general $f(k)$ for $\sigma{i,k}$ and propose a fewdifferent possibilities. They pick$$f(k) = (k+1)^a, \quad a < 0,$$a choice that, when plugged into the formula above, is a generalization of theoptimal value obtained in [Mal14B]. 11 Contrary to VRDS, this is derived from aHoeffding bound on the quality of the estimator, after solving a simpleminimization problem, leading to a specific choice ofexponent $a =-\frac{2}{3},$ cf [Mal14B] Eq. (14).Taking into account that sample size should be an integer, we can set the valueof $m_k = \min(1, \lfloor \tilde{m}k \rfloor).$ However, this implies thatsome samples may be left unused as $\sum{k=0}^{n-1} m_k \le m.$ In that case,the authors sequentially increase the value of $m_k$ from $k = 0$ to $n − 1$until the sum exceeds $m.$We note that previously, [Cas17I] used estimatedstandard deviations $\tilde{\sigma}{i,k}$ instead:$$\tilde{m}{i,k} = m \frac{\tilde{\sigma}{i,k}^2}{\sum{j=0}^{n-1} \sum{l=0}^{n-1} \tilde{\sigma}{j,l}^2},\quad k = 0, \dots , n − 1.$$Sadly, all of these approaches assume that the number of samplesis known beforehand which is almost never the case in practice.To empirically demonstrate that the stratified sampling algorithmfor VRDS can estimate the exact data Shapley values with a variance smallerthan that obtained by the permutation sampling algorithm,the authors then run a few different experiments on a plethora of datasets:Comparison of variance of VRDS and Permutation sampling:Figure 2: Data Shapley valuesof 20 randomly selected data points from the FashionMNIST datasetand their estimates. The experiment is repeated 30 times to estimatevalues.From Figure 2, we can clearly see that VRDS’svariance is lower than Permutation sampling. However, in many data valuationapplications we are more interested in the order or ranking thanin the actual value, and it is not clear whether the rank of VRDS valuesremains truly stable across experiments.The authors should have added an experiment to evaluate the ranking stabilityof VRDS, similar to what has been done inData Banzhaf [Wan23D].Performance of VRDS using different exponents $a:$Figure 3: Variance of VRDS when $a$ takesdifferent values in the range $[−3, 3].$ The dataset is FashionMNIST,the models are KNN and LR and the data sizes are 100 and 30, respectively.Figure 3 shows that VRDS consistently achieves lower variancethan permutation sampling, and that $a=-1$ (black, bottom line) seems to yieldthe best result. It would have been interesting to see how the choice in[Mal14B] of $a=-\frac{2}{3}$ would have performed.Removal of best and worst data groups:Figure 4: Data group removal experiment on theFashionMNIST data set. There are 20 groups and each group has 100 data points.VRDS ($a = −\frac{1}{2}$) and permutation sampling algorithms are used tocalculate the value of data groups. (a) shows the removal of the most valuabledata groups according to the estimated data Shapley values. (b) shows theremoval of the most valuable data groups where the estimated data Shapley values$v_i$ have been shifted by the variances $\sigma^2_i$ with $−100 \sigma^2_i.$The plots in Figure 4 don’t clearly show that VRDS performsbetter than permutation sampling for point removal tasks, and, crucially, lackconfidence intervals. The similar performance could be due to the fact that eventhough VRDS’ variance is lower it doesn’t necessarily converge to correctvalues, which may indicate that the estimation is biased. It would also havebeen beneficial to compare it to other methods such as Data Banzhaf.Alas, only the first experiment includes confidence intervals. Without moredetails and a study of the quality of the approximation, possibly using rankstability statistics, it is hard to decide on the interest of VRDS in practice.The experiments in general lack details and no code was provided to reproducetheir results.However, the use of stratified sampling seems like a worthwhile additionto the data Shapley methods as the authors wrote in their conclusion: it wouldbe interesting to apply it to other data valuations methods that relyon sampling to estimate their values (e.g. Beta Shapley, Data Banzhaf,Truncated Monte Carlo Shapley).If you are interested in testing out data valuation algorithms considerusing the TransferLab’s pyDVL package for datavaluation. To try it out, install with pip install pydvl andread the documentation.