reddit_disentanglement

  • Description:

This dataset contains ~3M messages from reddit. Every message is labeled with metadata. The task is to predict the id of its parent message in the corresponding thread. Each record contains a list of messages from one thread. Duplicated and broken records are removed from the dataset.

Features are: - id - message id - text - message text - author - message author - created_utc - message UTC timestamp - link_id - id of the post that the comment relates to Target: - parent_id - id of the parent message in the current thread

Split Examples
  • Features:
FeaturesDict({
    'thread': Sequence({
        'author': Text(shape=(), dtype=tf.string),
        'created_utc': Text(shape=(), dtype=tf.string),
        'id': Text(shape=(), dtype=tf.string),
        'link_id': Text(shape=(), dtype=tf.string),
        'parent_id': Text(shape=(), dtype=tf.string),
        'text': Text(shape=(), dtype=tf.string),
    }),
})
@article{zhu2019did,
  title={Who did They Respond to? Conversation Structure Modeling using Masked Hierarchical Transformer},
  author={Zhu, Henghui and Nan, Feng and Wang, Zhiguo and Nallapati, Ramesh and Xiang, Bing},
  journal={arXiv preprint arXiv:1911.10666},
  year={2019}
}