chainer.datasets.get_ptb_words
chainer.datasets.get_ptb_words()
Gets the Penn Tree Bank dataset as long word sequences.
Penn Tree Bank is originally a corpus of English sentences with linguistic structure annotations. This function uses a variant distributed at https://github.com/wojzaremba/lstm, which omits the annotation and splits the dataset into three parts: training, validation, and test.
This function returns the training, validation, and test sets, each of which is represented as a long array of word IDs. All sentences in each set are concatenated into a single sequence, separated by the end-of-sentence mark '<eos>', which is treated as an ordinary word in the vocabulary.
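To make the layout concrete, the following is a minimal sketch that builds the same kind of sequence from two toy sentences. The sentences and the ID assignment (IDs handed out in order of first appearance) are assumptions for illustration only; this is not the library's loader and the real dataset's IDs will differ.

```python
import numpy as np

# Toy corpus (hypothetical data, not taken from Penn Tree Bank).
sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]

vocab = {}  # word -> word ID
ids = []
for sentence in sentences:
    # Each sentence is followed by '<eos>', which gets an ID like any other word.
    for word in sentence + ["<eos>"]:
        vocab.setdefault(word, len(vocab))
        ids.append(vocab[word])

# One long int32 vector covering all sentences.
sequence = np.asarray(ids, dtype=np.int32)
print(sequence)  # [0 1 2 3 0 4 5 3]
```

Here ID 3 stands for '<eos>', so sentence boundaries can be recovered by looking for that ID in the sequence.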
Returns: Int32 vectors of word IDs.
Return type: tuple of numpy.ndarray
See also: Use get_ptb_words_vocabulary() to get the mapping between the words and word IDs.
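A typical usage pattern might look like the sketch below. It assumes that get_ptb_words_vocabulary() returns a dict mapping each word to its ID, and that the returned tuple is ordered training, validation, test as described above. The helper names ids_to_words and preview_ptb are hypothetical and not part of Chainer.

```python
import numpy as np


def ids_to_words(word_ids, vocab):
    """Map an array of word IDs back to words, given a word -> ID dict."""
    inverse = {idx: word for word, idx in vocab.items()}
    return [inverse[int(i)] for i in word_ids]


def preview_ptb(n=20):
    """Return the first n training words (hypothetical helper).

    Calling this needs Chainer installed and fetches the dataset
    on first use.
    """
    import chainer  # imported lazily so the helper above works without it

    train, val, test = chainer.datasets.get_ptb_words()
    vocab = chainer.datasets.get_ptb_words_vocabulary()
    return ids_to_words(train[:n], vocab)
```

Calling preview_ptb() should give the opening words of the training set, with '<eos>' appearing wherever one sentence ends and the next begins.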