chainer.functions.negative_sampling¶

chainer.functions.negative_sampling(x, t, W, sampler, sample_size, reduce='sum')[source]¶

Negative sampling loss function.
In natural language processing, especially language modeling, the number of words in a vocabulary can be very large, so computing the gradient of the full embedding matrix is expensive. With the negative sampling trick, you only need to calculate the gradient for a few sampled negative examples.
The loss is defined as follows.
\[f(x, p) = - \log \sigma(x^\top w_p) - k E_{i \sim P(i)}[\log \sigma(- x^\top w_i)]\]

where \(\sigma(\cdot)\) is the sigmoid function, \(w_i\) is the weight vector for word \(i\), and \(p\) is a positive example. The expectation is approximated with a set \(N\) of \(k\) examples sampled from the distribution \(P(i)\):

\[f(x, p) \approx - \log \sigma(x^\top w_p) - \sum_{n \in N} \log \sigma(-x^\top w_n)\]

Each element of \(N\) is drawn from the word distribution \(P(w) = \frac{1}{Z} c(w)^\alpha\), where \(c(w)\) is the unigram count of the word \(w\), \(\alpha\) is a hyper-parameter, and \(Z\) is the normalization constant.
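As a minimal sketch of the sampling side, assuming hypothetical unigram counts and the common choice \(\alpha = 0.75\) (the actual exponent is up to the caller), the distribution \(P(w)\) and a sampler with the expected interface can be built in plain NumPy:

```python
import numpy as np

# Hypothetical unigram counts c(w) for a 5-word vocabulary.
counts = np.array([100, 40, 10, 5, 1], dtype=np.float64)
alpha = 0.75  # hyper-parameter; 0.75 is a common choice

# P(w) = c(w)^alpha / Z, where Z normalizes the powered counts.
powered = counts ** alpha
p = powered / powered.sum()

# A sampler matching the expected interface: it takes a shape and
# returns an integer array of that shape drawn from P(w).
rng = np.random.default_rng(0)
def sampler(shape):
    return rng.choice(len(p), size=shape, p=p)

samples = sampler((2, 3))  # e.g. 3 negative samples for each of 2 inputs
```

In practice a WalkerAlias table serves the same role more efficiently; this sketch only illustrates the distribution and the sampler's call signature.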
Parameters:
- x (Variable) – Batch of input vectors.
- t (Variable) – Vector of ground truth labels.
- W (Variable) – Weight matrix.
- sampler (FunctionType) – Sampling function. It takes a shape and returns an integer array of that shape. Each element of this array is a sample from the word distribution. A WalkerAlias object built with the power distribution of word frequencies is recommended.
- sample_size (int) – Number of samples.
- reduce (str) – Reduction option. Its value must be either 'sum' or 'no'. Otherwise, ValueError is raised.
Returns: A variable holding the loss value(s) calculated by the above equation. If reduce is 'no', the output variable holds an array whose shape is the same as that of (hence both of) the input variables. If it is 'sum', the output variable holds a scalar value.

Return type: Variable

See: Distributed Representations of Words and Phrases and their Compositionality