- Description:
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
Source code:
tfds.text.IMDBReviews
Versions:
1.0.0 (default): New split API (https://tensorflow.org/datasets/splits)
Download size:
80.23 MiB
Dataset size:
Unknown size
Auto-cached (documentation): Unknown
Splits:
Split | Examples |
---|---|
'test' | 25,000 |
'train' | 25,000 |
'unsupervised' | 50,000 |
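The split names in the table can be passed to `tfds.load`, and the split API linked under Versions also accepts percent slices such as `train[:80%]`. A minimal sketch follows, assuming `tensorflow_datasets` is installed and the dataset can be downloaded. `split_spec` and `load_imdb_splits` are hypothetical helper names used for illustration, not part of TFDS.

```python
def split_spec(name, start=None, stop=None):
    """Build a split string such as 'train[:80%]' from optional percent bounds."""
    if start is None and stop is None:
        return name
    lo = "" if start is None else f"{start}%"
    hi = "" if stop is None else f"{stop}%"
    return f"{name}[{lo}:{hi}]"


def load_imdb_splits():
    """Load an 80/20 train/validation cut of 'train' plus the full 'test' split."""
    # Deferred import (assumption: TFDS is installed), so split_spec works without it.
    import tensorflow_datasets as tfds

    return tfds.load(
        "imdb_reviews",
        split=[
            split_spec("train", stop=80),   # 'train[:80%]'
            split_spec("train", start=80),  # 'train[80%:]'
            "test",
        ],
    )
```

Calling `load_imdb_splits()` should return three `tf.data.Dataset` objects in the order the splits were listed. The 80/20 cut is an arbitrary choice for illustration.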
Supervised keys (see the as_supervised doc): ('text', 'label')
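With `as_supervised=True`, `tfds.load` yields `(input, label)` tuples built from the supervised keys instead of feature dictionaries. The sketch below mimics that mapping in plain Python and then shows the corresponding load call. `to_supervised` and `load_supervised` are hypothetical names, and the load assumes `tensorflow_datasets` is installed.

```python
SUPERVISED_KEYS = ("text", "label")  # as listed in this catalog entry


def to_supervised(example, keys=SUPERVISED_KEYS):
    """Mimic as_supervised=True: map a feature dict to an (input, label) tuple."""
    input_key, label_key = keys
    return example[input_key], example[label_key]


def load_supervised(split="train"):
    """Load a split as (text, label) tuples rather than feature dicts."""
    # Deferred import (assumption: TFDS is installed and can download the data).
    import tensorflow_datasets as tfds

    return tfds.load("imdb_reviews", split=split, as_supervised=True)
```

For example, `to_supervised({"text": b"great film", "label": 1})` gives `(b"great film", 1)`, the same shape of element that `load_supervised()` is expected to produce.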
Citation:
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
title = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
year = {2011},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {142--150},
url = {http://www.aclweb.org/anthology/P11-1015}
}
- Figure (tfds.show_examples): Not supported.
imdb_reviews/plain_text (default config)
Config description: Plain text
Features:
FeaturesDict({
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
'text': Text(shape=(), dtype=tf.string),
})
imdb_reviews/bytes
Config description: Uses byte-level text encoding with tfds.deprecated.text.ByteTextEncoder
Features:
FeaturesDict({
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
'text': Text(shape=(None,), dtype=tf.int64, encoder=<ByteTextEncoder vocab_size=257>),
})
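The `vocab_size=257` shown above is consistent with 256 possible byte values plus one reserved id. A plain-Python sketch of such a scheme follows, assuming id 0 is reserved for padding and each byte maps to its value plus one. It illustrates the idea only and is not the `ByteTextEncoder` implementation; `encode_bytes` and `decode_bytes` are hypothetical names.

```python
PAD_ID = 0                  # assumption: id 0 is reserved for padding
NUM_BYTE_VALUES = 256
VOCAB_SIZE = NUM_BYTE_VALUES + 1  # 257, matching the feature spec


def encode_bytes(text):
    """Encode a string as UTF-8 byte ids, each shifted by one to skip PAD_ID."""
    return [b + 1 for b in text.encode("utf-8")]


def decode_bytes(ids):
    """Invert encode_bytes, dropping any padding ids first."""
    raw = bytes(i - 1 for i in ids if i != PAD_ID)
    return raw.decode("utf-8", errors="replace")
```

Because every id stays below `VOCAB_SIZE`, a review of any length becomes a variable-length integer sequence, which matches the `shape=(None,)` and `dtype=tf.int64` of the `'text'` feature in this config.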
imdb_reviews/subwords8k
Config description: Uses tfds.deprecated.text.SubwordTextEncoder with 8k vocab size
Features:
FeaturesDict({
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8185>),
})
imdb_reviews/subwords32k
Config description: Uses tfds.deprecated.text.SubwordTextEncoder with 32k vocab size
Features:
FeaturesDict({
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=32650>),
})