lm1b

A benchmark corpus for measuring progress in statistical language modeling. The training data contains almost one billion words.

lm1b is configured with tfds.text.lm1b.Lm1bConfig and has the following predefined configurations (the first one is the default):

lm1b/plain_text

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string),
})

lm1b/bytes

FeaturesDict({
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<ByteTextEncoder vocab_size=257>),
})
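
The vocabulary size of 257 above is consistent with one id per possible byte value plus one reserved padding id. The following plain-Python sketch illustrates that scheme under the assumption that id 0 is padding and every byte value is shifted up by one; it is not the library's code, and the names `encode_bytes`, `decode_bytes`, `PAD_ID`, and `VOCAB_SIZE` are hypothetical.

```python
NUM_BYTES = 256  # number of possible byte values
PAD_ID = 0       # assumption: id 0 is reserved for padding


def encode_bytes(text):
    """Map text to integer ids: each UTF-8 byte value shifted up by one."""
    return [b + 1 for b in text.encode("utf-8")]


def decode_bytes(ids):
    """Invert encode_bytes, skipping any padding ids."""
    return bytes(i - 1 for i in ids if i != PAD_ID).decode("utf-8")


# 256 byte ids plus the padding id.
VOCAB_SIZE = NUM_BYTES + 1
```

Because the encoding works on bytes rather than characters, a non-ASCII character such as "é" becomes two ids, and the sequence length (the `None` dimension in the shape) varies per example.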

lm1b/subwords8k

FeaturesDict({
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8189>),
})

lm1b/subwords32k

FeaturesDict({
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=32711>),
})
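
As a rough sketch of how one of these configurations might be selected, the config name can be appended to the dataset name and passed to `tfds.load`. The helper names `config_name` and `load_lm1b` are hypothetical, and the snippet assumes the `tensorflow_datasets` package is installed. The first call downloads and prepares the corpus, which can take a while.

```python
# Config names listed on this page; the first one is the default.
LM1B_CONFIGS = ("plain_text", "bytes", "subwords8k", "subwords32k")


def config_name(config=LM1B_CONFIGS[0]):
    """Build the "dataset/config" string, e.g. "lm1b/subwords8k"."""
    if config not in LM1B_CONFIGS:
        raise ValueError(f"unknown lm1b config: {config!r}")
    return f"lm1b/{config}"


def load_lm1b(config=LM1B_CONFIGS[0], split="train"):
    """Load one split of lm1b with the chosen config (hypothetical helper)."""
    # Imported lazily so the helpers above work without TFDS installed.
    import tensorflow_datasets as tfds

    # with_info=True also returns the DatasetInfo describing the features.
    return tfds.load(config_name(config), split=split, with_info=True)


# Example (downloads the data on first use):
#   ds, info = load_lm1b("subwords8k")
#   print(info.features)
```

Keeping the config names in one tuple makes a typo fail early with a clear error instead of partway through dataset preparation.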

Statistics

Split    Examples
ALL      30,607,716
TRAIN    30,301,028
TEST     306,688

Urls

Supervised keys (for as_supervised=True)

(u'text', u'text')
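
With as_supervised=True, examples are returned as (input, label) pairs rather than feature dictionaries. Since both supervised keys are 'text' here, the input and the label are the same sequence. The plain-Python sketch below only illustrates that selection; it is not TFDS code, and `to_supervised` is a hypothetical helper.

```python
# (input key, label key) as listed on this page.
SUPERVISED_KEYS = ("text", "text")


def to_supervised(example, keys=SUPERVISED_KEYS):
    """Turn a feature dict into an (input, label) pair."""
    input_key, label_key = keys
    return example[input_key], example[label_key]


# Both elements of the pair come from the same 'text' feature.
pair = to_supervised({"text": "the quick brown fox"})
```

For next-token language modeling, a training pipeline would typically still shift the label by one position; that step is outside what the supervised keys provide.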

Citation

@article{DBLP:journals/corr/ChelbaMSGBK13,
  author    = {Ciprian Chelba and
               Tomas Mikolov and
               Mike Schuster and
               Qi Ge and
               Thorsten Brants and
               Phillipp Koehn},
  title     = {One Billion Word Benchmark for Measuring Progress in Statistical Language
               Modeling},
  journal   = {CoRR},
  volume    = {abs/1312.3005},
  year      = {2013},
  url       = {http://arxiv.org/abs/1312.3005},
  archivePrefix = {arXiv},
  eprint    = {1312.3005},
  timestamp = {Mon, 13 Aug 2018 16:46:16 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/ChelbaMSGBK13},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}