imdb_reviews

Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, plus 50,000 additional unlabeled reviews.

imdb_reviews is configured with tfds.text.imdb.IMDBReviewsConfig and predefines the following configurations (the first one is the default):

  • plain_text (v0.1.0) (Size: 80.23 MiB): Plain text

  • bytes (v0.1.0) (Size: 80.23 MiB): Uses byte-level text encoding with tfds.features.text.ByteTextEncoder

  • subwords8k (v0.1.0) (Size: 80.23 MiB): Uses tfds.features.text.SubwordTextEncoder with 8k vocab size

  • subwords32k (v0.1.0) (Size: 80.23 MiB): Uses tfds.features.text.SubwordTextEncoder with 32k vocab size
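A configuration is selected by appending its name to the dataset name, as the section titles on this page do (for example imdb_reviews/bytes). The sketch below shows one way to do that with tfds.load. The helper names qualified_name and load_train are hypothetical, and the sketch assumes an installed tensorflow_datasets release that still ships the configurations listed above.

```python
DATASET = "imdb_reviews"
# Configuration names as listed on this page.
CONFIGS = ("plain_text", "bytes", "subwords8k", "subwords32k")


def qualified_name(config=None):
    """Return 'imdb_reviews/<config>', or the bare name for the default config."""
    if config is None:
        return DATASET
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    return f"{DATASET}/{config}"


def load_train(config=None):
    """Return (dataset, info) for the TRAIN split of the chosen configuration.

    Hypothetical helper: tensorflow_datasets is imported lazily, and the
    first call downloads and prepares the data.
    """
    import tensorflow_datasets as tfds

    return tfds.load(qualified_name(config), split="train", with_info=True)
```

Calling load_train("plain_text") would return a tf.data.Dataset together with the dataset's metadata object; passing no argument falls back to the default configuration.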

imdb_reviews/plain_text

Plain text

Versions:

  • 0.1.0 (default)
  • 1.0.0: New split API (https://tensorflow.org/datasets/splits)

Statistics

Split         Examples
ALL           100,000
UNSUPERVISED  50,000
TEST          25,000
TRAIN         25,000

Features

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'text': Text(shape=(), dtype=tf.string),
})

Supervised keys (for as_supervised=True)

('text', 'label')
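The supervised keys name which feature is the input and which is the target when a dataset is loaded with as_supervised=True: each example is then a (text, label) pair instead of a dictionary. The plain-Python sketch below only illustrates that mapping. It is not the TFDS implementation, and the function name to_supervised and the sample record are made up for the example.

```python
SUPERVISED_KEYS = ("text", "label")  # (input key, target key) from this page


def to_supervised(example, keys=SUPERVISED_KEYS):
    """Turn a feature dictionary into an (input, target) tuple."""
    input_key, target_key = keys
    return example[input_key], example[target_key]


# A made-up record shaped like the plain_text features.
record = {"label": 1, "text": b"A wonderful film."}
text, label = to_supervised(record)
print(text, label)
```

With TFDS itself, the equivalent is to pass as_supervised=True to tfds.load and unpack each element as (text, label).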

imdb_reviews/bytes

Uses byte-level text encoding with tfds.features.text.ByteTextEncoder

Versions:

  • 0.1.0 (default)
  • 1.0.0: New split API (https://tensorflow.org/datasets/splits)

Statistics

Split         Examples
ALL           100,000
UNSUPERVISED  50,000
TEST          25,000
TRAIN         25,000

Features

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<ByteTextEncoder vocab_size=257>),
})
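In this configuration the text feature is a variable-length sequence of integer ids rather than a string. A vocabulary size of 257 is consistent with 256 possible byte values plus one id reserved for padding. The sketch below mimics such a scheme in plain Python, assuming each UTF-8 byte is shifted up by one so that id 0 stays free for padding. It is an illustration under that assumption, not the ByteTextEncoder source, and encode_bytes and decode_bytes are hypothetical names.

```python
PAD_ID = 0
VOCAB_SIZE = 257  # 256 byte values + 1 padding id (assumed layout)


def encode_bytes(text):
    """Encode a string as UTF-8 bytes, shifting each byte by one."""
    return [b + 1 for b in text.encode("utf-8")]


def decode_bytes(ids):
    """Invert encode_bytes, skipping any padding ids."""
    return bytes(i - 1 for i in ids if i != PAD_ID).decode("utf-8")


ids = encode_bytes("Hi!")
print(ids)                # [73, 106, 34]
print(decode_bytes(ids))  # Hi!
```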

Supervised keys (for as_supervised=True)

('text', 'label')

imdb_reviews/subwords8k

Uses tfds.features.text.SubwordTextEncoder with 8k vocab size

Versions:

  • 0.1.0 (default)
  • 1.0.0: New split API (https://tensorflow.org/datasets/splits)

Statistics

Split         Examples
ALL           100,000
UNSUPERVISED  50,000
TEST          25,000
TRAIN         25,000

Features

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8185>),
})
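Because the encoded text has shape (None,), reviews differ in length and must be padded before they can be stacked into a batch. The sketch below pads a list of id sequences to a common length in plain Python. It assumes id 0 is reserved for padding, and pad_batch is a hypothetical helper rather than part of TFDS.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id to the length of the longest one."""
    width = max((len(s) for s in sequences), default=0)
    return [list(s) + [pad_id] * (width - len(s)) for s in sequences]


# Made-up subword ids for two reviews of different lengths.
batch = pad_batch([[5, 6, 7], [8]])
print(batch)  # [[5, 6, 7], [8, 0, 0]]
```

In a tf.data input pipeline, the padded_batch method of tf.data.Dataset plays the same role.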

Supervised keys (for as_supervised=True)

('text', 'label')

imdb_reviews/subwords32k

Uses tfds.features.text.SubwordTextEncoder with 32k vocab size

Versions:

  • 0.1.0 (default)
  • 1.0.0: New split API (https://tensorflow.org/datasets/splits)

Statistics

Split         Examples
ALL           100,000
UNSUPERVISED  50,000
TEST          25,000
TRAIN         25,000

Features

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=32650>),
})

Supervised keys (for as_supervised=True)

('text', 'label')

Citation

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}