- Description:
This dataset contains roughly 3M messages from Reddit, each labeled with metadata. The task is to predict, for every message, the id of its parent message in the corresponding thread. Each record contains the list of messages from one thread. Duplicated and broken records have been removed from the dataset.
Features are:
- id - message id
- text - message text
- author - message author
- created_utc - message UTC timestamp
- link_id - id of the post that the comment relates to
Target:
- parent_id - id of the parent message in the current thread
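As a minimal sketch of what the target encodes, the snippet below groups the messages of one thread under the message they reply to, using only the id and parent_id fields listed above. The sample thread, its id values, and the helper name `children_by_parent` are invented for illustration; real ids come from Reddit and their exact format is not specified here.

```python
from collections import defaultdict

# Hypothetical thread: all ids, authors, timestamps and texts are made up.
thread = [
    {"id": "c1", "parent_id": "p0", "link_id": "p0", "author": "alice",
     "created_utc": "100", "text": "First comment"},
    {"id": "c2", "parent_id": "c1", "link_id": "p0", "author": "bob",
     "created_utc": "110", "text": "Reply to the first comment"},
    {"id": "c3", "parent_id": "c1", "link_id": "p0", "author": "carol",
     "created_utc": "120", "text": "Another reply to the first comment"},
    {"id": "c4", "parent_id": "c3", "link_id": "p0", "author": "alice",
     "created_utc": "130", "text": "Reply to carol"},
]


def children_by_parent(messages):
    """Map each parent id to the ids of the messages that reply to it."""
    children = defaultdict(list)
    for msg in messages:
        children[msg["parent_id"]].append(msg["id"])
    return dict(children)


replies = children_by_parent(thread)
for parent, kids in replies.items():
    print(parent, "<-", kids)
```

A model for this task would receive the messages without parent_id and be scored on how well it recovers this mapping.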
Homepage: https://github.com/henghuiz/MaskedHierarchicalTransformer
Source code: tfds.datasets.reddit_disentanglement.Builder
Versions:
- 2.0.0 (default): No release notes.
Download size: Unknown size
Dataset size: Unknown size
Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
Download https://github.com/henghuiz/MaskedHierarchicalTransformer, decompress raw_data.zip, and run generate_dataset.py with your Reddit API credentials. Then put train.csv, val.csv, and test.csv from the output directory into the manual folder.
Auto-cached (documentation): Unknown
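Once the manual download steps are done, a small check like the one below can confirm that the three CSV files are in place before the dataset is built. This is a sketch, not part of the dataset's own tooling: the helper names `missing_manual_files` and `load_reddit_disentanglement` are hypothetical, and the loader assumes that `tensorflow_datasets` is installed and that the dataset is registered under the name `reddit_disentanglement`, as the builder path suggests.

```python
from pathlib import Path

# File names taken from the manual download instructions.
REQUIRED_FILES = ("train.csv", "val.csv", "test.csv")
DEFAULT_MANUAL_DIR = "~/tensorflow_datasets/downloads/manual/"


def missing_manual_files(manual_dir=DEFAULT_MANUAL_DIR):
    """Return the required CSV files that are not present in manual_dir."""
    root = Path(manual_dir).expanduser()
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def load_reddit_disentanglement(manual_dir=DEFAULT_MANUAL_DIR):
    """Hypothetical wrapper: verify the manual files, then call tfds.load."""
    missing = missing_manual_files(manual_dir)
    if missing:
        raise FileNotFoundError(f"Missing in {manual_dir}: {', '.join(missing)}")

    # Imported lazily so the file check works without TensorFlow installed.
    import tensorflow_datasets as tfds

    config = tfds.download.DownloadConfig(
        manual_dir=str(Path(manual_dir).expanduser())
    )
    return tfds.load(
        "reddit_disentanglement",
        download_and_prepare_kwargs={"download_config": config},
    )
```

Calling `missing_manual_files()` with no arguments checks the default manual folder; an empty list means all three files were found.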
Splits:
Split | Examples |
---|---|
- Feature structure:
FeaturesDict({
    'thread': Sequence({
        'author': Text(shape=(), dtype=string),
        'created_utc': Text(shape=(), dtype=string),
        'id': Text(shape=(), dtype=string),
        'link_id': Text(shape=(), dtype=string),
        'parent_id': Text(shape=(), dtype=string),
        'text': Text(shape=(), dtype=string),
    }),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
thread | Sequence | | | |
thread/author | Text | | string | |
thread/created_utc | Text | | string | |
thread/id | Text | | string | |
thread/link_id | Text | | string | |
thread/parent_id | Text | | string | |
thread/text | Text | | string | |
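Because thread is a Sequence of a feature dictionary, a decoded example exposes it as a dictionary of parallel per-field sequences rather than a list of per-message dictionaries, and string values typically arrive as bytes when read through NumPy. The sketch below converts that columnar layout into one dictionary per message. The helper name `thread_to_messages` and the sample values are assumptions made for illustration.

```python
def _to_str(value):
    """Decode bytes to str; leave other values unchanged."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def thread_to_messages(thread):
    """Turn {'field': [v0, v1, ...], ...} into [{'field': v0, ...}, ...]."""
    keys = list(thread)
    lengths = {len(thread[key]) for key in keys}
    if len(lengths) > 1:
        raise ValueError("All thread fields must have the same length")
    columns = [thread[key] for key in keys]
    return [
        {key: _to_str(value) for key, value in zip(keys, row)}
        for row in zip(*columns)
    ]


# Hypothetical columnar example with made-up values.
example_thread = {
    "author": [b"alice", b"bob"],
    "created_utc": [b"100", b"110"],
    "id": [b"c1", b"c2"],
    "link_id": [b"p0", b"p0"],
    "parent_id": [b"p0", b"c1"],
    "text": [b"First comment", b"Reply to the first comment"],
}

messages = thread_to_messages(example_thread)
```

Note that created_utc is stored as a string, so it needs an explicit conversion before any numeric or chronological comparison.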
Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Examples (tfds.as_dataframe): Missing.
Citation:
@article{zhu2019did,
title={Who did They Respond to? Conversation Structure Modeling using Masked Hierarchical Transformer},
author={Zhu, Henghui and Nan, Feng and Wang, Zhiguo and Nallapati, Ramesh and Xiang, Bing},
journal={arXiv preprint arXiv:1911.10666},
year={2019}
}