Standard Link implementations¶
Chainer provides many Link implementations in the chainer.links package.
Note
Some of the links are originally defined in the chainer.functions
namespace. They are still left in the namespace for backward compatibility,
though it is strongly recommended to use them via the chainer.links
package.
Learnable connections¶
Bias¶

class
chainer.links.
Bias
(axis=1, shape=None)[source]¶ Broadcasted elementwise summation with learnable parameters.
Computes an elementwise summation as the bias() function does, except that its second input is a learnable bias parameter \(b\) held by the link.
Parameters:  axis (int) – The first axis of the first input of the bias() function along which its second input is applied.
 shape (tuple of ints) – Shape of the learnable bias parameter. If None, this link does not have learnable parameters, so an explicit bias needs to be given as the second input of its __call__ method.
See also
See bias() for details.
Variables:  b (Variable) – Bias parameter if shape is given. Otherwise, no attributes.
 axis (int) – The first axis of the first input of the bias() function along which its second input is applied.
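The broadcasting behaviour described above can be sketched in NumPy. This is an illustrative sketch of the bias() semantics only, not the Chainer implementation; broadcast_bias is a hypothetical helper name.

```python
import numpy as np

def broadcast_bias(x, b, axis=1):
    # Hypothetical helper: align b's shape with x starting at `axis`,
    # then add elementwise (the broadcasting that bias() performs).
    shape = [1] * x.ndim
    shape[axis:axis + b.ndim] = b.shape
    return x + b.reshape(shape)

x = np.zeros((2, 3, 4, 4), dtype=np.float32)      # e.g. (batch, channels, H, W)
b = np.array([1.0, 2.0, 3.0], dtype=np.float32)   # one bias per channel
y = broadcast_bias(x, b, axis=1)
print(y[0, 1, 0, 0])  # 2.0: channel 1 received bias b[1]
```

With axis=1 and a 4D input, the bias is effectively reshaped to (1, C, 1, 1) before the elementwise sum.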
Bilinear¶

class
chainer.links.
Bilinear
(left_size, right_size, out_size, nobias=False, initialW=None, initial_bias=None)[source]¶ Bilinear layer that performs tensor multiplication.
Bilinear is a primitive link that wraps the bilinear() function. It holds parameters W, V1, V2, and b corresponding to the arguments of bilinear().
Parameters:  left_size (int) – Dimension of input vector \(e^1\) (\(J\))
 right_size (int) – Dimension of input vector \(e^2\) (\(K\))
 out_size (int) – Dimension of output vector \(y\) (\(L\))
 nobias (bool) – If True, parameters V1, V2, and b are omitted.
 initialW (3D numpy array) – Initial value of \(W\). The shape of this argument must be (left_size, right_size, out_size). If None, \(W\) is initialized by a centered Gaussian distribution properly scaled according to the dimensions of inputs and outputs. May also be a callable that takes a numpy.ndarray or cupy.ndarray and edits its value.
 initial_bias (tuple) – Initial values of \(V^1\), \(V^2\), and \(b\). The length of this argument must be 3. The elements of this tuple must have the shapes (left_size, out_size), (right_size, out_size), and (out_size,), respectively. If None, \(V^1\) and \(V^2\) are initialized by scaled centered Gaussian distributions and \(b\) is set to \(0\). May also be a tuple of callables that take a numpy.ndarray or cupy.ndarray and edit its value.
See also
See chainer.functions.bilinear() for details.
Variables:
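The bilinear computation \(y_l = \sum_{jk} e^1_j W_{jkl} e^2_k + \sum_j e^1_j V^1_{jl} + \sum_k e^2_k V^2_{kl} + b_l\) can be sketched in NumPy. This is an illustrative sketch with random parameter values, not the Chainer implementation.

```python
import numpy as np

rng = np.random.RandomState(0)
J, K, L = 3, 4, 2                      # left_size, right_size, out_size
W = rng.randn(J, K, L)
V1, V2 = rng.randn(J, L), rng.randn(K, L)
b = np.zeros(L)

def bilinear(e1, e2):
    # y_bl = sum_jk e1_bj W_jkl e2_bk + e1 V1 + e2 V2 + b
    y = np.einsum('bj,jkl,bk->bl', e1, W, e2)
    return y + e1 @ V1 + e2 @ V2 + b

e1, e2 = rng.randn(5, J), rng.randn(5, K)   # minibatch of 5
print(bilinear(e1, e2).shape)  # (5, 2)
```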
Convolution2D¶

class
chainer.links.
Convolution2D
(in_channels, out_channels, ksize, stride=1, pad=0, wscale=1, bias=0, nobias=False, use_cudnn=True, initialW=None, initial_bias=None)[source]¶ Two-dimensional convolutional layer.
This link wraps the convolution_2d() function and holds the filter weight and bias vector as parameters.
Parameters:  in_channels (int) – Number of channels of input arrays. If None, parameter initialization will be deferred until the first forward data pass, at which time the size will be determined.
 out_channels (int) – Number of channels of output arrays.
 ksize (int or pair of ints) – Size of filters (a.k.a. kernels). ksize=k and ksize=(k, k) are equivalent.
 stride (int or pair of ints) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
 wscale (float) – Scaling factor of the initial weight.
 bias (float) – Initial bias value.
 nobias (bool) – If True, then this link does not use the bias term.
 use_cudnn (bool) – If True, then this link uses cuDNN if available.
 initialW (4D array) – Initial weight value. If None, then the weight is initialized using wscale. May also be a callable that takes a numpy.ndarray or cupy.ndarray and edits its value.
 initial_bias (1D array) – Initial bias value. If None, then the bias is initialized using bias. May also be a callable that takes a numpy.ndarray or cupy.ndarray and edits its value.
See also
See chainer.functions.convolution_2d() for the definition of two-dimensional convolution.
Variables:
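For reference, the spatial output size of a convolution with the parameters above follows the usual floor-division formula. A minimal sketch, assuming cover_all=False:

```python
def conv_output_size(size, k, s=1, p=0):
    # Spatial output size of a convolution (cover_all=False):
    # floor((size + 2*pad - ksize) / stride) + 1
    return (size + 2 * p - k) // s + 1

print(conv_output_size(32, k=3, s=1, p=1))  # 32: "same"-size output for k=3, p=1
print(conv_output_size(32, k=3, s=2, p=1))  # 16
```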
ConvolutionND¶

class
chainer.links.
ConvolutionND
(ndim, in_channels, out_channels, ksize, stride=1, pad=0, initialW=None, initial_bias=None, use_cudnn=True, cover_all=False)[source]¶ N-dimensional convolution layer.
This link wraps the convolution_nd() function and holds the filter weight and bias vector as parameters.
Parameters:  ndim (int) – Number of spatial dimensions.
 in_channels (int) – Number of channels of input arrays.
 out_channels (int) – Number of channels of output arrays.
 ksize (int or tuple of ints) – Size of filters (a.k.a. kernels). ksize=k and ksize=(k, k, ..., k) are equivalent.
 stride (int or tuple of ints) – Stride of filter application. stride=s and stride=(s, s, ..., s) are equivalent.
 pad (int or tuple of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p, ..., p) are equivalent.
 initialW – Value used to initialize the filter weight. May be an initializer instance or another value that the init_weight() helper function can take. This link uses init_weight() to initialize the filter weight and passes the value of initialW to it as is.
 initial_bias – Value used to initialize the bias vector. May be an initializer instance or another value, except None, that the init_weight() helper function can take. If None is given, this link does not use the bias vector. This link uses init_weight() to initialize the bias vector and passes the value of initial_bias, if it is not None, to it as is.
 use_cudnn (bool) – If True, then this link uses cuDNN if available. See convolution_nd() for the exact conditions of cuDNN availability.
 cover_all (bool) – If True, all spatial locations are convoluted into some output pixels. It may make the output size larger. cover_all must be False if you want to use cuDNN.
See also
See convolution_nd() for the definition of N-dimensional convolution. See convolution_2d() for the definition of two-dimensional convolution.
Variables:
Deconvolution2D¶

class
chainer.links.
Deconvolution2D
(in_channels, out_channels, ksize, stride=1, pad=0, wscale=1, bias=0, nobias=False, outsize=None, use_cudnn=True, initialW=None, initial_bias=None)[source]¶ Two-dimensional deconvolution layer.
This link wraps the deconvolution_2d() function and holds the filter weight and bias vector as parameters.
Parameters:  in_channels (int) – Number of channels of input arrays.
 out_channels (int) – Number of channels of output arrays.
 ksize (int or pair of ints) – Size of filters (a.k.a. kernels). ksize=k and ksize=(k, k) are equivalent.
 stride (int or pair of ints) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
 wscale (float) – Scaling factor of the initial weight.
 bias (float) – Initial bias value.
 nobias (bool) – If True, then this link does not use the bias term.
 outsize (tuple) – Expected output size of the deconvolution operation. It should be a pair of height and width \((out_H, out_W)\). The default value is None, in which case the output size is estimated from the input size, stride, and pad.
 use_cudnn (bool) – If True, then this link uses cuDNN if available.
 initialW (4D array) – Initial weight value. If None, then the weight is initialized using wscale. May also be a callable that takes a numpy.ndarray or cupy.ndarray and edits its value.
 initial_bias (1D array) – Initial bias value. If None, then the bias is initialized using bias. May also be a callable that takes a numpy.ndarray or cupy.ndarray and edits its value.
The filter weight has four dimensions \((c_I, c_O, k_H, k_W)\), which indicate the number of input channels, the number of output channels, and the height and width of the kernels, respectively. The filter weight is initialized with i.i.d. Gaussian random samples, each of which has zero mean and deviation \(\sqrt{1/(c_I k_H k_W)}\) by default. The deviation is scaled by wscale if specified.
The bias vector is of size \(c_O\). Its elements are initialized by the bias argument. If the nobias argument is set to True, then this function does not hold the bias parameter.
See also
See chainer.functions.deconvolution_2d() for the definition of two-dimensional deconvolution.
EmbedID¶

class
chainer.links.
EmbedID
(in_size, out_size, initialW=None, ignore_label=None)[source]¶ Efficient linear layer for one-hot input.
This is a link that wraps the embed_id() function. This link holds the ID (word) embedding matrix W as a parameter.
Parameters:  in_size (int) – Number of different identifiers (a.k.a. vocabulary size).
 out_size (int) – Size of embedding vector.
 initialW (2D array) – Initial weight value. If None, then the matrix is initialized from the standard normal distribution. May also be a callable that takes a numpy.ndarray or cupy.ndarray and edits its value.
 ignore_label (int or None) – If ignore_label is an int value, the i-th row of the return value is filled with 0 when the i-th input element equals ignore_label.
See also
See embed_id() for details.
Variables: W (Variable) – Embedding parameter matrix.
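The lookup semantics can be sketched in NumPy. This is an illustrative sketch, not the Chainer implementation; the zero-filling of ignored labels follows the ignore_label description above.

```python
import numpy as np

vocab_size, embed_dim = 5, 3
W = np.arange(vocab_size * embed_dim, dtype=np.float32).reshape(vocab_size, embed_dim)

def embed_id(ids, W, ignore_label=None):
    # Look up row ids[i] of W; outputs for ids equal to ignore_label
    # are filled with zeros.
    if ignore_label is None:
        return W[ids]
    mask = ids == ignore_label
    safe = np.where(mask, 0, ids)   # avoid indexing with the ignored label
    out = W[safe]
    out[mask] = 0.0
    return out

ids = np.array([0, 2, -1])          # -1 marks padding here
vecs = embed_id(ids, W, ignore_label=-1)
print(vecs.shape)  # (3, 3)
```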
GRU¶

class
chainer.links.
GRU
(n_units, n_inputs=None, init=None, inner_init=None, bias_init=0)[source]¶ Stateless Gated Recurrent Unit function (GRU).
GRU function has six parameters \(W_r\), \(W_z\), \(W\), \(U_r\), \(U_z\), and \(U\). All these parameters are \(n \times n\) matrices, where \(n\) is the dimension of hidden vectors.
Given two inputs, a previous hidden vector \(h\) and an input vector \(x\), GRU returns the next hidden vector \(h'\) defined as
\[\begin{split}r &=& \sigma(W_r x + U_r h), \\ z &=& \sigma(W_z x + U_z h), \\ \bar{h} &=& \tanh(W x + U (r \odot h)), \\ h' &=& (1 - z) \odot h + z \odot \bar{h},\end{split}\]where \(\sigma\) is the sigmoid function, and \(\odot\) is the elementwise product.
GRU does not hold the value of the hidden vector \(h\), so it is stateless. Use StatefulGRU as a stateful GRU.
See:
 On the Properties of Neural Machine Translation: Encoder-Decoder Approaches [Cho+, SSST2014].
 Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling [Chung+, NIPS2014 DL Workshop].
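The update equations above can be checked with a small NumPy sketch. gru_step is a hypothetical helper and the parameter values are random; this illustrates the math, not the Chainer API.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h, Wr, Wz, W, Ur, Uz, U):
    # One GRU step, following the equations above.
    r = sigmoid(Wr @ x + Ur @ h)
    z = sigmoid(Wz @ x + Uz @ h)
    h_bar = np.tanh(W @ x + U @ (r * h))
    return (1 - z) * h + z * h_bar

n = 4
rng = np.random.RandomState(0)
Wr, Wz, W, Ur, Uz, U = [rng.randn(n, n) * 0.1 for _ in range(6)]
x, h = rng.randn(n), np.zeros(n)
print(gru_step(x, h, Wr, Wz, W, Ur, Uz, U).shape)  # (4,)
```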
See also
Inception¶

class
chainer.links.
Inception
(in_channels, out1, proj3, out3, proj5, out5, proj_pool, conv_init=None, bias_init=None)[source]¶ Inception module of GoogLeNet.
It applies four different functions to the input array and concatenates their outputs along the channel dimension. Three of them are 2D convolutions of sizes 1x1, 3x3 and 5x5. Convolution paths of 3x3 and 5x5 sizes have 1x1 convolutions (called projections) ahead of them. The other path consists of 1x1 convolution (projection) and 3x3 max pooling.
The output array has the same spatial size as the input. In order to satisfy this, Inception module uses appropriate padding for each convolution and pooling.
See: Going Deeper with Convolutions.
Parameters:  in_channels (int) – Number of channels of input arrays.
 out1 (int) – Output size of 1x1 convolution path.
 proj3 (int) – Projection size of 3x3 convolution path.
 out3 (int) – Output size of 3x3 convolution path.
 proj5 (int) – Projection size of 5x5 convolution path.
 out5 (int) – Output size of 5x5 convolution path.
 proj_pool (int) – Projection size of max pooling path.
 conv_init – A callable that takes a numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the convolution matrix weights. May be None to use default initialization.
 bias_init – A callable that takes a numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the convolution bias weights. May be None to use default initialization.

__call__
(x)[source]¶ Computes the output of the Inception module.
Parameters: x (Variable) – Input variable.
Returns: Output variable. Its array has the same spatial size and the same minibatch size as the input array. The channel dimension has size out1 + out3 + out5 + proj_pool.
Return type: Variable
InceptionBN¶

class
chainer.links.
InceptionBN
(in_channels, out1, proj3, out3, proj33, out33, pooltype, proj_pool=None, stride=1, conv_init=None)[source]¶ Inception module of the new GoogLeNet with BatchNormalization.
This chain acts like Inception, but it uses BatchNormalization on top of each convolution, replaces the 5x5 convolution path with two consecutive 3x3 convolutions, and makes the pooling method configurable.
See: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
Parameters:  in_channels (int) – Number of channels of input arrays.
 out1 (int) – Output size of the 1x1 convolution path.
 proj3 (int) – Projection size of the single 3x3 convolution path.
 out3 (int) – Output size of the single 3x3 convolution path.
 proj33 (int) – Projection size of the double 3x3 convolutions path.
 out33 (int) – Output size of the double 3x3 convolutions path.
 pooltype (str) – Pooling type. It must be either 'max' or 'avg'.
 proj_pool (bool) – If True, do projection in the pooling path.
 stride (int) – Stride parameter of the last convolution of each path.
 conv_init – A callable that takes a numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the convolution matrix weights. May be None to use default initialization.
See also
Variables: train (bool) – If True, then batch normalization layers are used in training mode. If False, they are used in testing mode.
Linear¶

class
chainer.links.
Linear
(in_size, out_size, wscale=1, bias=0, nobias=False, initialW=None, initial_bias=None)[source]¶ Linear layer (a.k.a. fully-connected layer).
This is a link that wraps the linear() function, and holds a weight matrix W and optionally a bias vector b as parameters.
The weight matrix W is initialized with i.i.d. Gaussian samples, each of which has zero mean and deviation \(\sqrt{1/\text{in_size}}\). The bias vector b is of size out_size. Each element is initialized with the bias value. If the nobias argument is set to True, then this link does not hold a bias vector.
Parameters:  in_size (int) – Dimension of input vectors. If None, parameter initialization will be deferred until the first forward data pass, at which time the size will be determined.
 out_size (int) – Dimension of output vectors.
 wscale (float) – Scaling factor of the weight matrix.
 bias (float) – Initial bias value.
 nobias (bool) – If True, then this function does not use the bias.
 initialW (2D array) – Initial weight value. If None, then the weight is initialized using wscale. May also be a callable that takes a numpy.ndarray or cupy.ndarray and edits its value.
 initial_bias (1D array) – Initial bias value. If None, then the bias is initialized using bias. May also be a callable that takes a numpy.ndarray or cupy.ndarray and edits its value.
See also
Variables:
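A NumPy sketch of the computation, assuming a weight of shape (out_size, in_size) applied as \(y = x W^\top + b\) with the Gaussian initialization described above. Illustrative only, not the Chainer implementation.

```python
import numpy as np

in_size, out_size = 4, 3
rng = np.random.RandomState(0)
# Gaussian init with deviation sqrt(1/in_size), as described above.
W = rng.randn(out_size, in_size) * np.sqrt(1.0 / in_size)
b = np.zeros(out_size)

def linear(x):
    # y = x W^T + b
    return x @ W.T + b

x = rng.randn(2, in_size)   # minibatch of 2
print(linear(x).shape)      # (2, 3)
```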
LSTM¶

class
chainer.links.
LSTM
(in_size, out_size, **kwargs)[source]¶ Fully-connected LSTM layer.
This is a fully-connected LSTM layer as a chain. Unlike the lstm() function, which is defined as a stateless activation function, this chain holds upward and lateral connections as child links.
It also maintains states, including the cell state and the output at the previous time step. Therefore, it can be used as a stateful LSTM.
This link supports variable-length inputs. The minibatch size of the current input must be equal to or smaller than that of the previous one. The minibatch size of c and h is determined as that of the first input x. When the minibatch size of the i-th input is smaller than that of the previous input, this link only updates c[0:len(x)] and h[0:len(x)] and doesn't change the rest of c and h. So, please sort input sequences in descending order of length before applying the function.
Parameters:  in_size (int) – Dimensionality of input vectors.
 out_size (int) – Dimensionality of output vectors.
 lateral_init – A callable that takes a numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the lateral connections. May be None to use default initialization.
 upward_init – A callable that takes a numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the upward connections. May be None to use default initialization.
 bias_init – A callable that takes a numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the biases of the cell input, input gate, and output gate of the upward connection. May be a scalar, in which case the bias is initialized by this value. May be None to use default initialization.
 forget_bias_init – A callable that takes a numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the bias of the forget gate of the upward connection. May be a scalar, in which case the bias is initialized by this value. May be None to use default initialization.
Variables:
MLPConvolution2D¶

class
chainer.links.
MLPConvolution2D
(in_channels, out_channels, ksize, stride=1, pad=0, wscale=1, activation=<function relu>, use_cudnn=True, conv_init=None, bias_init=None)[source]¶ Two-dimensional MLP convolution layer of Network in Network.
This is an “mlpconv” layer from the Network in Network paper. This layer is a two-dimensional convolution layer followed by 1x1 convolution layers and interleaved activation functions.
Note that it does not apply the activation function to the output of the last 1x1 convolution layer.
Parameters:  in_channels (int) – Number of channels of input arrays.
 out_channels (tuple of ints) – Tuple of numbers of channels. The i-th integer indicates the number of filters of the i-th convolution.
 ksize (int or pair of ints) – Size of filters (a.k.a. kernels) of the first convolution layer. ksize=k and ksize=(k, k) are equivalent.
 stride (int or pair of ints) – Stride of filter applications at the first convolution layer. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays at the first convolution layer. pad=p and pad=(p, p) are equivalent.
are equivalent.  activation (function) – Activation function for internal hidden units. Note that this function is not applied to the output of this link.
 use_cudnn (bool) – If True, then this link uses cuDNN if available.
 conv_init – An initializer of weight matrices passed to the convolution layers.
 bias_init – An initializer of bias vectors passed to the convolution layers.
See: Network in Network <http://arxiv.org/abs/1312.4400v3>.
Variables: activation (function) – Activation function.
Scale¶

class
chainer.links.
Scale
(axis=1, W_shape=None, bias_term=False, bias_shape=None)[source]¶ Broadcasted elementwise product with learnable parameters.
Computes an elementwise product as the scale() function does, except that its second input is a learnable weight parameter \(W\) held by the link.
Parameters:  axis (int) – The first axis of the first input of the scale() function along which its second input is applied.
 W_shape (tuple of ints) – Shape of the learnable weight parameter. If None, this link does not have a learnable weight parameter, so an explicit weight needs to be given as the second input of its __call__ method.
 bias_term (bool) – Whether to also learn a bias (equivalent to a Scale link + Bias link).
 bias_shape (tuple of ints) – Shape of the learnable bias. If W_shape is None, this should be given to determine the shape. Otherwise, the bias has the same shape as the weight parameter (W_shape) and bias_shape is ignored.
See also
See scale() for details.
Variables:  axis (int) – The first axis of the first input of the scale() function along which its second input is applied.
StatefulGRU¶

class
chainer.links.
StatefulGRU
(in_size, out_size, init=None, inner_init=None, bias_init=0)[source]¶ Stateful Gated Recurrent Unit function (GRU).
Stateful GRU function has six parameters \(W_r\), \(W_z\), \(W\), \(U_r\), \(U_z\), and \(U\). All these parameters are \(n \times n\) matrices, where \(n\) is the dimension of hidden vectors.
Given an input vector \(x\), Stateful GRU returns the next hidden vector \(h'\) defined as
\[\begin{split}r &=& \sigma(W_r x + U_r h), \\ z &=& \sigma(W_z x + U_z h), \\ \bar{h} &=& \tanh(W x + U (r \odot h)), \\ h' &=& (1 - z) \odot h + z \odot \bar{h},\end{split}\]where \(h\) is the current hidden vector.
As the name indicates, StatefulGRU is stateful, meaning that it also holds the next hidden vector \(h'\) as a state. Use GRU as a stateless version of GRU.
Parameters:  in_size (int) – Dimension of input vector \(x\).
 out_size (int) – Dimension of hidden vector \(h\).
 init – A callable that takes a numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the GRU’s input units (\(W\)). May be None to use default initialization.
 inner_init – A callable that takes a numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the GRU’s inner recurrent units (\(U\)). May be None to use default initialization.
 bias_init – A callable or scalar used to initialize the bias values for both the GRU’s inner and input units. May be None to use default initialization.
Variables: h (Variable) – Hidden vector that indicates the state of StatefulGRU.
See also
GRU
StatefulPeepholeLSTM¶

class
chainer.links.
StatefulPeepholeLSTM
(in_size, out_size)[source]¶ Fully-connected LSTM layer with peephole connections.
This is a fully-connected LSTM layer with peephole connections as a chain. Unlike the LSTM link, this chain holds peep_i, peep_f, and peep_o as child links besides upward and lateral.
Given an input vector \(x\), the peephole LSTM returns the next hidden vector \(h'\) defined as
\[\begin{split}a &=& \tanh(upward \, x + lateral \, h), \\ i &=& \sigma(upward \, x + lateral \, h + peep_i \, c), \\ f &=& \sigma(upward \, x + lateral \, h + peep_f \, c), \\ c' &=& a \odot i + f \odot c, \\ o &=& \sigma(upward \, x + lateral \, h + peep_o \, c'), \\ h' &=& o \odot \tanh(c'),\end{split}\]where \(\sigma\) is the sigmoid function, \(\odot\) is the elementwise product, \(c\) is the current cell state, \(c'\) is the next cell state and \(h\) is the current hidden vector.
Parameters:  in_size (int) – Dimension of input vectors.
 out_size (int) – Dimension of hidden vectors.
Variables:  upward (Linear) – Linear layer of upward connections.
 lateral (Linear) – Linear layer of lateral connections.
 peep_i (Linear) – Linear layer of peephole connections to the input gate.
 peep_f (Linear) – Linear layer of peephole connections to the forget gate.
 peep_o (Linear) – Linear layer of peephole connections to the output gate.
 c (Variable) – Cell states of LSTM units.
 h (Variable) – Output at the current time step.
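The equations above can be sketched step by step in NumPy. This sketch uses hypothetical per-gate parameter names (Wa, Wi, ..., Pi, Pf, Po) with random values; the actual link holds the gates inside its upward, lateral, and peep_* Linear child links.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def peephole_step(x, c, h, Wa, Wi, Wf, Wo, Ua, Ui, Uf, Uo, Pi, Pf, Po):
    # One step of the peephole LSTM equations above: separate upward (W*)
    # and lateral (U*) weights per gate, and peephole matrices P* applied
    # to the cell state (hypothetical per-gate parameter names).
    a = np.tanh(Wa @ x + Ua @ h)
    i = sigmoid(Wi @ x + Ui @ h + Pi @ c)
    f = sigmoid(Wf @ x + Uf @ h + Pf @ c)
    c_new = a * i + f * c
    o = sigmoid(Wo @ x + Uo @ h + Po @ c_new)
    return c_new, o * np.tanh(c_new)

n = 3
rng = np.random.RandomState(0)
params = [rng.randn(n, n) * 0.1 for _ in range(11)]  # 4 upward, 4 lateral, 3 peephole
x = rng.randn(n)
c_next, h_next = peephole_step(x, np.zeros(n), np.zeros(n), *params)
print(c_next.shape, h_next.shape)
```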
StatelessLSTM¶

class
chainer.links.
StatelessLSTM
(in_size, out_size, lateral_init=None, upward_init=None, bias_init=0, forget_bias_init=0)[source]¶ Stateless LSTM layer.
This is a fully-connected LSTM layer as a chain. Unlike the lstm() function, this chain holds upward and lateral connections as child links. This link doesn't keep cell and hidden states.
Parameters:  in_size (int) – Dimension of input vectors.
 out_size (int) – Dimension of output vectors.
Variables:  upward (chainer.links.Linear) – Linear layer of upward connections.
 lateral (chainer.links.Linear) – Linear layer of lateral connections.
Activation/loss/normalization functions with parameters¶
BatchNormalization¶

class
chainer.links.
BatchNormalization
(size, decay=0.9, eps=2e-05, dtype=<type 'numpy.float32'>, use_gamma=True, use_beta=True, initial_gamma=None, initial_beta=None)[source]¶ Batch normalization layer on outputs of linear or convolution functions.
This link wraps the batch_normalization() and fixed_batch_normalization() functions.
It runs in three modes: training mode, fine-tuning mode, and testing mode.
In training mode, it normalizes the input by batch statistics. It also maintains approximated population statistics by moving averages, which can be used for instant evaluation in testing mode.
In fine-tuning mode, it accumulates the input to compute population statistics. In order to correctly compute the population statistics, a user must use this mode to feed minibatches running through the whole training dataset.
In testing mode, it uses precomputed population statistics to normalize the input variable. The population statistics are approximate if computed in training mode, or accurate if correctly computed in fine-tuning mode.
Parameters:  size (int or tuple of ints) – Size (or shape) of channel dimensions.
 decay (float) – Decay rate of moving average. It is used on training.
 eps (float) – Epsilon value for numerical stability.
 dtype (numpy.dtype) – Type to use in computing.
 use_gamma (bool) – If True, use the scaling parameter. Otherwise, a constant 1 is used, which has no effect.
 use_beta (bool) – If True, use the shifting parameter. Otherwise, a constant 0 is used, which has no effect.
See: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Variables:  gamma (Variable) – Scaling parameter.
 beta (Variable) – Shifting parameter.
 avg_mean (Variable) – Population mean.
 avg_var (Variable) – Population variance.
 N (int) – Count of batches given for fine-tuning.
 decay (float) – Decay rate of moving average. It is used on training.
 eps (float) – Epsilon value for numerical stability. This value is added to the batch variances.

__call__
(x, test=False, finetune=False)[source]¶ Invokes the forward propagation of BatchNormalization.
BatchNormalization accepts additional arguments, which control its three different running modes.
Parameters:  x (Variable) – An input variable.
 test (bool) – If True, BatchNormalization runs in testing mode; it normalizes the input using precomputed statistics.
 finetune (bool) – If finetune is True and test is False, BatchNormalization runs in fine-tuning mode; it accumulates the input array to compute population statistics for normalization, and normalizes the input using batch statistics.
If test is False, then BatchNormalization runs in training mode; it computes moving averages of the mean and variance for evaluation during training, and normalizes the input using batch statistics.
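The training-mode normalization itself is simple. A NumPy sketch over the batch axis (illustrative only; it omits the moving averages, the fine-tuning accumulation, and learning of gamma and beta):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=2e-5):
    # Training mode: normalize by batch mean/variance, then scale and shift.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.RandomState(0)
x = rng.randn(8, 3) * 5.0 + 2.0
y = batch_norm_train(x, gamma=np.ones(3), beta=np.zeros(3))
print(np.allclose(y.mean(axis=0), 0.0, atol=1e-6))  # True
```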
BinaryHierarchicalSoftmax¶

class
chainer.links.
BinaryHierarchicalSoftmax
(in_size, tree)[source]¶ Hierarchical softmax layer over binary tree.
In natural language applications, the vocabulary size is too large to use softmax loss. Instead, the hierarchical softmax uses a product of sigmoid functions. It costs only \(O(\log(n))\) time on average, where \(n\) is the vocabulary size.
First, a user needs to prepare a binary tree in which each leaf corresponds to a word in the vocabulary. When a word \(x\) is given, exactly one path from the root of the tree to the leaf of the word exists. Let \(\mbox{path}(x) = ((e_1, b_1), \dots, (e_m, b_m))\) be the path of \(x\), where \(e_i\) is an index of the \(i\)-th internal node, and \(b_i \in \{-1, 1\}\) indicates the direction to move at the \(i\)-th internal node (-1 is left, and 1 is right). Then, the probability of \(x\) is given as below:
\[\begin{split}P(x) &= \prod_{(e_i, b_i) \in \mbox{path}(x)}P(b_i  e_i) \\ &= \prod_{(e_i, b_i) \in \mbox{path}(x)}\sigma(b_i x^\top w_{e_i}),\end{split}\]where \(\sigma(\cdot)\) is a sigmoid function, and \(w\) is a weight matrix.
This function costs \(O(\log(n))\) time, as the average length of paths is \(O(\log(n))\), and \(O(n)\) memory, as the number of internal nodes equals \(n - 1\).
Parameters:  in_size (int) – Dimension of input vectors.
 tree – A binary tree made with tuples like ((1, 2), 3).
Variables: W (Variable) – Weight parameter matrix.
See: Hierarchical Probabilistic Neural Network Language Model [Morin+, AISTAT2005].

__call__
(x, t)[source]¶ Computes the loss value for given input and ground truth labels.
Parameters:  x (Variable) – Input vectors.
 t (Variable) – Ground truth labels.
Returns: Loss value.
Return type: Variable

static
create_huffman_tree
(word_counts)[source]¶ Makes a Huffman tree from a dictionary containing word counts.
This method creates a binary Huffman tree that is required for BinaryHierarchicalSoftmax. For example, {0: 8, 1: 5, 2: 6, 3: 4} is converted to ((3, 1), (2, 0)).
Parameters: word_counts (dict of int key and int or float values) – Dictionary representing counts of words.
Returns: Binary Huffman tree with tuples and keys of word_counts.
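The tree construction can be sketched with a heap. This is a sketch of the standard Huffman algorithm under a hypothetical name (huffman_tree); tie-breaking may differ from Chainer's exact implementation, but the worked example above comes out the same.

```python
import heapq

def huffman_tree(word_counts):
    # Repeatedly merge the two least-frequent nodes into a tuple.
    # A unique tie-break index keeps heap comparisons well defined.
    heap = [(count, i, word)
            for i, (word, count) in enumerate(sorted(word_counts.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, tiebreak, (left, right)))
        tiebreak += 1
    return heap[0][2]

print(huffman_tree({0: 8, 1: 5, 2: 6, 3: 4}))  # ((3, 1), (2, 0))
```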
CRF1d¶

class
chainer.links.
CRF1d
(n_label)[source]¶ Linearchain conditional random field loss layer.
This link wraps the crf1d() function. It holds a transition cost matrix as a parameter.
Parameters: n_label (int) – Number of labels.
See also
See crf1d() for more detail.
Variables: cost (Variable) – Transition cost parameter.
PReLU¶

class
chainer.links.
PReLU
(shape=(), init=0.25)[source]¶ Parametric ReLU function as a link.
Parameters:  shape (tuple of ints) – Shape of the parameter array.
 init (float) – Initial parameter value.
See the paper for details: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.
See also
Variables: W (Variable) – Coefficient of parametric ReLU.
Maxout¶

class
chainer.links.
Maxout
(in_size, out_size, pool_size, wscale=1, initialW=None, initial_bias=0)[source]¶ Fully-connected maxout layer.
Let M, P, and N be the input dimension, the pool size, and the output dimension, respectively. For an input vector \(x\) of size M, it computes
\[Y_{i} = \mathrm{max}_{j} (W_{ij\cdot}x + b_{ij}).\]
Here \(W\) is a weight tensor of shape (M, P, N), \(b\) an optional bias vector of shape (M, P), and \(W_{ij\cdot}\) is a sub-vector extracted from \(W\) by fixing the first and second dimensions to \(i\) and \(j\), respectively. The minibatch dimension is omitted in the above equation.
As for the actual implementation, this chain has a Linear link with a (M * P, N) weight matrix and an optional M * P dimensional bias vector.
dimensional bias vector.Parameters:  in_size (int) – Dimension of input vectors.
 out_size (int) – Dimension of output vectors.
 pool_size (int) – Number of channels.
 wscale (float) – Scaling factor of the weight matrix.
 initialW (3D array or None) – Initial weight value. If None, then this function uses wscale to initialize.
 initial_bias (2D array, float or None) – Initial bias value. If it is a float, the initial bias is filled with this value. If it is None, the bias is omitted.
Variables: linear (Link) – The Linear link that performs affine transformation.
See also
Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., & Bengio, Y. (2013). Maxout Networks. In Proceedings of the 30th International Conference on Machine Learning (ICML-13) (pp. 1319-1327).
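The implementation note above (a Linear link followed by a max over each pool) can be sketched in NumPy. Illustrative only; the grouping of consecutive linear outputs into pools is an assumption of this sketch.

```python
import numpy as np

in_size, pool_size, out_size = 4, 3, 2
rng = np.random.RandomState(0)
W = rng.randn(out_size * pool_size, in_size)   # the underlying Linear weight
b = rng.randn(out_size * pool_size)

def maxout(x):
    # Affine map to out_size * pool_size units, then max over each pool.
    y = x @ W.T + b                            # (batch, out_size * pool_size)
    return y.reshape(len(x), out_size, pool_size).max(axis=2)

x = rng.randn(5, in_size)
print(maxout(x).shape)  # (5, 2)
```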
NegativeSampling¶

class
chainer.links.
NegativeSampling
(in_size, counts, sample_size, power=0.75)[source]¶ Negative sampling loss layer.
This link wraps the negative_sampling() function. It holds the weight matrix as a parameter. It also builds a sampler internally, given a list of word counts.
Parameters:  in_size (int) – Dimension of input vectors.
 counts (int list) – Number of each identifier.
 sample_size (int) – Number of negative samples.
 power (float) – Power factor of the sampling distribution.
See also
See negative_sampling() for more detail.
Variables: W (Variable) – Weight parameter matrix.
Machine learning models¶
Classifier¶

class
chainer.links.
Classifier
(predictor, lossfun=<function softmax_cross_entropy>, accfun=<function accuracy>)[source]¶ A simple classifier model.
This is an example of chain that wraps another chain. It computes the loss and accuracy based on a given input/label pair.
Parameters:  predictor (Link) – Predictor network.
 lossfun (function) – Loss function.
 accfun (function) – Function that computes accuracy.
Variables:  predictor (Link) – Predictor network.
 lossfun (function) – Loss function.
 accfun (function) – Function that computes accuracy.
 y (Variable) – Prediction for the last minibatch.
 loss (Variable) – Loss value for the last minibatch.
 accuracy (Variable) – Accuracy for the last minibatch.
 compute_accuracy (bool) – If True, compute accuracy on the forward computation. The default value is True.

__call__
(*args)[source]¶ Computes the loss value for an input and label pair.
It also computes accuracy and stores it to the attribute.
Parameters: args (list of ~chainer.Variable) – Input minibatch. All the elements of args but the last one are features, and the last element corresponds to the ground truth labels. It feeds features to the predictor and compares the result with the ground truth labels.
Returns: Loss value.
Return type: Variable