
Counterfactual evaluation for recommendation systems


When I first started working on recommendation systems, I thought there was something weird about the way we did offline evaluation. First, we split customer interaction data into training and validation sets. Then, we train our recommenders on the training set before evaluating them on the validation set, usually on metrics such as recall, precision, and NDCG. This is similar to how we evaluate supervised machine learning models and doesn’t seem unusual at first glance.
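To make that protocol concrete, here's a minimal sketch of the offline loop, assuming a hypothetical `recommender.top_k(user_id, k)` interface; the metric definitions are standard, but the data shapes are illustrative.

```python
import numpy as np

def recall_at_k(recommended, relevant, k=10):
    """Fraction of held-out items that appear in the top-k recommendations."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG: discounted gain of hits vs. the ideal ordering."""
    dcg = sum(1.0 / np.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    idcg = sum(1.0 / np.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

def offline_evaluate(recommender, validation, k=10):
    """validation: dict of user_id -> list of held-out items the user interacted with."""
    recalls, ndcgs = [], []
    for user_id, relevant in validation.items():
        recommended = recommender.top_k(user_id, k)  # hypothetical interface
        recalls.append(recall_at_k(recommended, relevant, k))
        ndcgs.append(ndcg_at_k(recommended, relevant, k))
    return np.mean(recalls), np.mean(ndcgs)
```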

But don’t our recommendations change how customers click or purchase? If customers can only interact with items shown to them, why do we perform offline evaluation on static historical data?

Observational vs. interventional problems

It took me a while to put my finger on it, but I think this is why it felt weird: we're treating recommendations as an observational problem when they're really an interventional problem.

Problems solved via supervised machine learning are usually observational problems. Given an observation such as product title, description, and image, we try to predict the product category. Our model learns P(category=phone|title=“…”, description=“…”, image=image01.jpeg).

On the other hand, recommendations are an interventional problem. We want to learn how different interventions (i.e., item recommendations) lead to different outcomes (i.e., clicks, purchases). By using logged customer interaction data as labels, the observational offline evaluation approach ignores the interventional nature of recommendations.

As a result, we’re not evaluating if users would click or purchase more due to our new recommendations; we’re evaluating how well the new recommendations fit logged data. Thus, what our model learns is P(view3=iphone|view1=pixel, view2=galaxy) when what we really want is P(click=True|recommend=iphone, view1=pixel, view2=galaxy).
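One way to see the difference is in how the training examples are constructed. The sketch below uses made-up field names to contrast the two targets:

```python
# Observational framing: predict the next viewed item from previously viewed items.
# Each logged session becomes (features=past views, label=next view), i.e.,
#   P(view3=iphone | view1=pixel, view2=galaxy)
observational_example = {
    "features": {"view1": "pixel", "view2": "galaxy"},
    "label": "iphone",  # whatever the user happened to view next in the log
}

# Interventional framing: the recommendation itself is part of the input,
# and the label is the outcome we care about, i.e.,
#   P(click=True | recommend=iphone, view1=pixel, view2=galaxy)
interventional_example = {
    "features": {"recommend": "iphone", "view1": "pixel", "view2": "galaxy"},
    "label": True,  # did the user click the item we chose to show?
}
```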

Evaluating recsys as an interventional problem

The straightforward way to evaluate recommendations as an interventional problem is via A/B testing. Our interventions (i.e., new recommendations) are shown to users, we log the behavior attributed to those recommendations, and then measure how metrics such as click-through rate and conversion change. However, this requires more effort relative to offline evaluation, experiment cycles may be long because we need enough data to make a judgement, and there's the risk of deploying terrible experiments. Also, we may not have easy access to A/B testing if we're working on the research side of things.
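For comparison, measuring a click-through-rate change in an A/B test can be as simple as a two-proportion z-test on logged clicks and impressions. A minimal sketch with placeholder counts:

```python
from statsmodels.stats.proportion import proportions_ztest

# Placeholder counts: clicks and impressions for the control and treatment arms.
clicks = [4_200, 4_550]          # control, treatment
impressions = [100_000, 100_000]

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
ctr_control = clicks[0] / impressions[0]
ctr_treatment = clicks[1] / impressions[1]
print(f"CTR control={ctr_control:.3%}, treatment={ctr_treatment:.3%}, p={p_value:.4f}")
```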

The less direct approach is counterfactual evaluation. Counterfactual evaluation tries to answer "what would have happened if we had shown users our new recommendations instead of the existing ones?" This allows us to estimate the outcomes of potential A/B tests without actually running them.
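A common starting point for this kind of estimate (not necessarily the one this article goes on to use) is inverse propensity scoring (IPS): reweight each logged reward by how likely the new policy is to show the logged item relative to how likely the logging policy was to show it. A minimal sketch, assuming those per-impression probabilities are available:

```python
import numpy as np

def ips_estimate(logged_rewards, logging_probs, new_policy_probs, clip=10.0):
    """Inverse propensity scoring estimate of the new policy's expected reward.

    logged_rewards:    reward observed for each logged impression (e.g., click = 1/0)
    logging_probs:     probability the logging policy showed that item
    new_policy_probs:  probability the new policy would show the same item
    clip:              cap on importance weights to keep variance in check
    """
    weights = np.minimum(np.asarray(new_policy_probs) / np.asarray(logging_probs), clip)
    return float(np.mean(weights * np.asarray(logged_rewards)))

# Toy example: three logged impressions with observed clicks.
rewards = [1, 0, 1]
logging_probs = [0.5, 0.2, 0.1]
new_probs = [0.4, 0.3, 0.2]
print(ips_estimate(rewards, logging_probs, new_probs))  # estimated click rate under the new policy
```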
