
lm1b

A benchmark corpus for measuring progress in statistical language modeling, with almost one billion words of training data.

lm1b is configured with tfds.text.lm1b.Lm1bConfig and has the following configurations predefined (defaults to the first one):

  • plain_text (v0.0.1) (Size: 1.67 GiB): Plain text

  • bytes (v0.0.1) (Size: 1.67 GiB): Uses byte-level text encoding with tfds.features.text.ByteTextEncoder

  • subwords8k (v0.0.2) (Size: 1.67 GiB): Uses tfds.features.text.SubwordTextEncoder with 8k vocab size

  • subwords32k (v0.0.2) (Size: 1.67 GiB): Uses tfds.features.text.SubwordTextEncoder with 32k vocab size
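Any of these configurations can be selected by appending the config name to the dataset name when loading. A minimal sketch, assuming tensorflow_datasets is installed (the helper names below are illustrative, not part of tfds; first use downloads ~1.67 GiB):

```python
def config_name(config: str) -> str:
    """Build the 'dataset/config' string that tfds.load expects."""
    return f"lm1b/{config}"

def load_lm1b(config: str = "plain_text"):
    # Imported lazily so config_name() is usable without the package.
    import tensorflow_datasets as tfds  # downloads ~1.67 GiB on first use
    # as_supervised=True yields (input, label) pairs; for lm1b both
    # elements are the same 'text' feature (see "Supervised keys" below).
    return tfds.load(config_name(config), split="train", as_supervised=True)
```

For example, `load_lm1b("subwords8k")` would fetch the 8k-vocabulary subword variant.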

lm1b/plain_text

Plain text

Versions:

  • 0.0.1 (default)
  • 1.0.0: New split API (https://tensorflow.org/datasets/splits)

Statistics

Split Examples
ALL 30,607,716
TRAIN 30,301,028
TEST 306,688

Features

FeaturesDict({
    'text': Text(shape=(), dtype=tf.string),
})

Supervised keys (for as_supervised=True)

(u'text', u'text')
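The supervised pair is (text, text): input and label are the same sequence, as is conventional for language modeling. For next-token prediction, a model typically shifts the sequence by one position; a sketch of that split (an illustrative helper, not part of tfds):

```python
def shift_for_lm(token_ids):
    """Split one sequence into (inputs, targets) for next-token prediction."""
    # Each target token is the token that follows the corresponding input.
    return token_ids[:-1], token_ids[1:]

inputs, targets = shift_for_lm([5, 7, 9, 2])
# inputs  -> [5, 7, 9]
# targets -> [7, 9, 2]
```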

lm1b/bytes

Uses byte-level text encoding with tfds.features.text.ByteTextEncoder

Versions:

  • 0.0.1 (default)
  • 1.0.0: New split API (https://tensorflow.org/datasets/splits)

Statistics

Split Examples
ALL 30,607,716
TRAIN 30,301,028
TEST 306,688

Features

FeaturesDict({
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<ByteTextEncoder vocab_size=257>),
})

Supervised keys (for as_supervised=True)

(u'text', u'text')
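The vocab_size of 257 suggests the 256 possible byte values plus one reserved id (0, typically kept free for padding). A toy sketch of byte-level encoding under that assumption (not the actual ByteTextEncoder implementation):

```python
def byte_encode(text: str) -> list:
    # Shift every UTF-8 byte up by 1 so id 0 stays free for padding.
    return [b + 1 for b in text.encode("utf-8")]

def byte_decode(ids: list) -> str:
    # Drop padding ids, undo the shift, and decode the raw bytes.
    return bytes(i - 1 for i in ids if i > 0).decode("utf-8")

byte_encode("ab")  # -> [98, 99], since ord('a') == 97
```

Byte-level encoding needs no learned vocabulary and can represent any string, at the cost of much longer sequences than subword encodings.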

lm1b/subwords8k

Uses tfds.features.text.SubwordTextEncoder with 8k vocab size

Versions:

  • 0.0.2 (default)
  • 1.0.0: New split API (https://tensorflow.org/datasets/splits)

Statistics

Split Examples
ALL 30,607,716
TRAIN 30,301,028
TEST 306,688

Features

FeaturesDict({
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8189>),
})

Supervised keys (for as_supervised=True)

(u'text', u'text')
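A subword encoder maps text to integer ids by splitting words into frequent fragments, so common words become one token while rare words decompose into several. A toy greedy longest-match sketch of the idea (the real SubwordTextEncoder is trained on the corpus and also falls back to bytes for out-of-vocabulary characters):

```python
def subword_encode(text: str, vocab: dict) -> list:
    """Greedy longest-match-first tokenization against a tiny vocab."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no subword covers {text[i]!r}")
    return ids

vocab = {"un": 0, "believ": 1, "able": 2}
subword_encode("unbelievable", vocab)  # -> [0, 1, 2]
```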

lm1b/subwords32k

Uses tfds.features.text.SubwordTextEncoder with 32k vocab size

Versions:

  • 0.0.2 (default)
  • 1.0.0: New split API (https://tensorflow.org/datasets/splits)

Statistics

Split Examples
ALL 30,607,716
TRAIN 30,301,028
TEST 306,688

Features

FeaturesDict({
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=32711>),
})

Supervised keys (for as_supervised=True)

(u'text', u'text')

Citation

@article{DBLP:journals/corr/ChelbaMSGBK13,
  author    = {Ciprian Chelba and
               Tomas Mikolov and
               Mike Schuster and
               Qi Ge and
               Thorsten Brants and
               Phillipp Koehn},
  title     = {One Billion Word Benchmark for Measuring Progress in Statistical Language
               Modeling},
  journal   = {CoRR},
  volume    = {abs/1312.3005},
  year      = {2013},
  url       = {http://arxiv.org/abs/1312.3005},
  archivePrefix = {arXiv},
  eprint    = {1312.3005},
  timestamp = {Mon, 13 Aug 2018 16:46:16 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/ChelbaMSGBK13},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}