Demystifying Noise Contrastive Estimation

January 21, 2022

Introduction

This document describes two machine learning methods: Noise Contrastive Estimation (NCE) [5] and its follow-up, InfoNCE [15]. We cover two variants of NCE, InfoNCE itself, and a related technique called partition function estimation. NCE-based methods estimate the parameters of a statistical distribution by training a model to distinguish “real data” from “noise”.
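To make the “real data versus noise” idea concrete, here is a minimal sketch of the binary classification view behind the original NCE. It assumes the usual setup in which every real sample is accompanied by $k$ samples drawn from a known noise distribution $q$, so that the probability a sample is real is $\frac{p_\theta(x)}{p_\theta(x) + k\,q(x)}$. The function and argument names are ours, chosen for illustration, and do not come from any library.

```python
import numpy as np

def nce_posterior(log_model, log_noise, k):
    """Probability that x is real data rather than noise.

    Assumes k noise samples per real sample (an illustrative setup).
    log_model: log p_theta(x), log_noise: log q(x).
    """
    # logit = log p_theta(x) - log(k * q(x)); the sigmoid maps it to a probability
    logit = log_model - (np.log(k) + log_noise)
    return 1.0 / (1.0 + np.exp(-logit))

def nce_binary_loss(log_model_at_data, log_noise_at_data,
                    log_model_at_noise, log_noise_at_noise, k):
    """Binary cross-entropy with real samples labeled 1 and noise samples labeled 0."""
    p_real_given_data = nce_posterior(log_model_at_data, log_noise_at_data, k)
    p_real_given_noise = nce_posterior(log_model_at_noise, log_noise_at_noise, k)
    return -(np.sum(np.log(p_real_given_data))
             + np.sum(np.log(1.0 - p_real_given_noise)))
```

When the model assigns a sample the same probability as the noise distribution does, the classifier can do no better than the prior: the posterior is $\frac{1}{1+k}$.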

We will try to keep our terminology consistent. The original NCE method is sometimes referred to as Local NCE [3,13] or Binary NCE. We’ll call it Local NCE. The follow-up InfoNCE is very similar to an NCE variant known as both Global NCE [13] and Ranking NCE [8], which we’ll call Global NCE.

Applications. Local NCE and Global NCE are both computationally inexpensive methods for learning a conditional likelihood $p_\theta(x \mid c)$. NCE is most useful when the number of possible $x$’s is very large, as in language modeling [6,8], where $|X|$ is the number of words in a vocabulary. InfoNCE is a method for maximizing a lower bound on the mutual information between two variables. It forms the foundation of joint text-to-image learning methods like CLIP [11], where $x$ is an image and $c$ is the text of its caption, as well as general contrastive learning techniques like SimCLR [1], where $x$ and $c$ are two different ‘views’ of the same image.
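As a rough illustration of the contrastive setup behind CLIP- and SimCLR-style methods, the sketch below computes an InfoNCE-style loss over a batch. It assumes a hypothetical score matrix whose entry $(i, j)$ rates context $c_i$ against candidate $x_j$, with the true pairs on the diagonal and every other entry in a row acting as a negative. This is one common batch formulation, not necessarily the exact loss used in either paper.

```python
import numpy as np

def info_nce_loss(scores):
    """InfoNCE-style loss for a square score matrix.

    scores[i, j]: similarity of context c_i with candidate x_j
    (hypothetical input; true pairs are assumed to sit on the diagonal).
    """
    scores = np.asarray(scores, dtype=float)
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Average negative log-probability assigned to the true pair in each row
    return -np.mean(np.diag(log_softmax))
```

With uninformative (all-equal) scores the loss equals $\log N$ for a batch of size $N$, and it approaches zero as the true pairs score far above the rest.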

Overview. We begin by discussing how NCE works to approximate $p(x \mid c)$. We then consider how we could approximate $p(x \mid c)$ via importance sampling, a method sometimes known as partition function estimation. Next, we connect partition function estimation to Global NCE and InfoNCE. We end with a discussion of the differences between the methods and suggestions for which to use in practice.

What are x and c?

Local NCE and Global NCE focus on learning $p(x \mid c)$, the likelihood of some $x$ given some ‘context’ $c$. InfoNCE maximizes $\frac{p(x \mid c)}{p(x)}$, a proxy for mutual information between $x$ and $c$. But what are $x$ and $c$ anyway? Here are examples of $x$’s and $c$’s from the literature:

In NLP: Local NCE and Global NCE are both used for language modeling, where $x$ is a word and $c$ is the word’s context (the other words in a window around $x$). Here, we want to learn $p(x \mid c)$, the probability of the next word given context [6].

In speech recognition: Local NCE is used for learning $p(x \mid c)$, where $x$ is a word predicted from audio $c$ in which the word was spoken [18].
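Whatever $x$ and $c$ stand for, the ratio $\frac{p(x \mid c)}{p(x)}$ mentioned earlier in this section has a direct link to mutual information: the mutual information between $x$ and $c$ is the expected value of $\log \frac{p(x \mid c)}{p(x)}$ under the joint distribution. The sketch below checks this on a small discrete joint table. The helper name and the toy tables are ours, for illustration only.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) from a joint probability table.

    joint[i, j] = p(x = i, c = j). Assumes every column (value of c)
    has nonzero total probability.
    """
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)   # marginal over c, shape (n_x, 1)
    p_c = joint.sum(axis=0, keepdims=True)   # marginal over x, shape (1, n_c)
    p_x_given_c = joint / p_c                # conditional p(x | c)
    ratio = p_x_given_c / p_x                # p(x | c) / p(x)
    mask = joint > 0                         # zero-probability cells contribute nothing
    return np.sum(joint[mask] * np.log(ratio[mask]))

# x and c independent: the ratio is 1 everywhere, so the mutual information is 0
independent = np.outer([0.5, 0.5], [0.3, 0.7])
# x fully determined by c: the ratio is 2 on the diagonal, giving log 2
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])
```

When $c$ tells us nothing about $x$, the ratio is 1 and its log averages to zero. When $c$ determines $x$ completely, the ratio is as large as it can be for that table.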
