chainer.functions.negative_sampling¶

chainer.functions.negative_sampling(x, t, W, sampler, sample_size, reduce='sum')[source]¶

Negative sampling loss function.
In natural language processing, especially language modeling, the number of words in a vocabulary can be very large, so computing the gradient of the full embedding matrix is expensive. With the negative sampling trick, you only need to calculate the gradient for a few sampled negative examples.
The loss is defined as follows.
\[f(x, p) = - \log \sigma(x^\top w_p) - k E_{i \sim P(i)}[\log \sigma(- x^\top w_i)]\]

where \(\sigma(\cdot)\) is the sigmoid function, \(w_i\) is the weight vector for word \(i\), and \(p\) is a positive example. The expectation is approximated with a set \(N\) of \(k\) examples sampled from the distribution \(P(i)\):

\[f(x, p) \approx - \log \sigma(x^\top w_p) - \sum_{n \in N} \log \sigma(-x^\top w_n)\]

Each element of \(N\) is drawn from the word distribution \(P(w) = \frac{1}{Z} c(w)^\alpha\), where \(c(w)\) is the unigram count of the word \(w\), \(\alpha\) is a hyper-parameter, and \(Z\) is the normalization constant.
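As a minimal sketch of the sampling side, assuming hypothetical unigram counts and the common choice \(\alpha = 0.75\) (the actual exponent is up to the caller), the distribution \(P(w)\) and a sampler with the expected interface can be built in plain NumPy:

```python
import numpy as np

# Hypothetical unigram counts c(w) for a 5-word vocabulary.
counts = np.array([100, 40, 10, 5, 1], dtype=np.float64)
alpha = 0.75  # hyper-parameter; 0.75 is a common choice

# P(w) = c(w)^alpha / Z, where Z normalizes the powered counts.
powered = counts ** alpha
p = powered / powered.sum()

# A sampler matching the expected interface: it takes a shape and
# returns an integer array of that shape drawn from P(w).
rng = np.random.default_rng(0)
def sampler(shape):
    return rng.choice(len(p), size=shape, p=p)

samples = sampler((2, 3))  # e.g. 3 negative samples for each of 2 inputs
```

In practice a WalkerAlias table serves the same role more efficiently; this sketch only illustrates the distribution and the sampler's call signature.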
Parameters:
- x (Variable) – Batch of input vectors.
- t (Variable) – Vector of ground truth labels.
- W (Variable) – Weight matrix.
- sampler (FunctionType) – Sampling function. It takes a shape and returns an integer array of that shape. Each element of this array is a sample from the word distribution. A WalkerAlias object built with the power distribution of word frequencies is recommended.
- sample_size (int) – Number of samples.
- reduce (str) – Reduction option. Its value must be either 'sum' or 'no'. Otherwise, ValueError is raised.
Returns: A variable holding the loss value(s) calculated by the above equation. If reduce is 'no', the output variable holds an array whose shape is the same as that of (hence both of) the input variables. If it is 'sum', the output variable holds a scalar value.

Return type: Variable

See: Distributed Representations of Words and Phrases and their Compositionality