reddit_disentanglement

  • Description:

This dataset contains approximately 3 million messages from Reddit. Every message is labeled with metadata. The task is to predict, for each message, the id of its parent message in the same thread. Each record contains the list of messages from one thread. Duplicated and broken records have been removed from the dataset.

Features are:
  - id - message id
  - text - message text
  - author - message author
  - created_utc - message UTC timestamp
  - link_id - id of the post that the comment relates to

Target:
  - parent_id - id of the parent message in the current thread
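The relationship between `id` and the `parent_id` target can be illustrated with a minimal sketch. It assumes each thread is available as a list of per-message dictionaries, and that `parent_id` values are directly comparable to `id` values (raw Reddit ids can carry type prefixes, and whether the dataset keeps them is not stated here). The helper name `parent_indices` and the toy thread are hypothetical, not part of the dataset.

```python
from typing import Dict, List, Optional


def parent_indices(thread: List[Dict[str, str]]) -> List[Optional[int]]:
    """Return, for each message, the position of its parent within the thread.

    A message whose parent is not among the thread's messages (for example,
    a top-level comment replying to the post itself) maps to None.
    """
    # Map every message id to its position in the thread.
    index_by_id = {msg["id"]: i for i, msg in enumerate(thread)}
    # Look up each message's parent; dict.get returns None when it is absent.
    return [index_by_id.get(msg["parent_id"]) for msg in thread]


# Hypothetical toy thread; all values are invented for illustration.
toy_thread = [
    {"id": "c1", "parent_id": "p0", "link_id": "p0", "author": "alice",
     "created_utc": "1500000000", "text": "First comment"},
    {"id": "c2", "parent_id": "c1", "link_id": "p0", "author": "bob",
     "created_utc": "1500000060", "text": "Reply to the first comment"},
    {"id": "c3", "parent_id": "c1", "link_id": "p0", "author": "carol",
     "created_utc": "1500000120", "text": "Another reply to the first comment"},
    {"id": "c4", "parent_id": "c3", "link_id": "p0", "author": "alice",
     "created_utc": "1500000180", "text": "Reply to carol"},
]

print(parent_indices(toy_thread))  # [None, 0, 0, 2]
```

Expressing the target as a position within the thread, rather than as a raw id string, is one common way to frame parent prediction as a classification over the messages of a thread; other formulations are possible.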

  • Feature structure:
FeaturesDict({
    'thread': Sequence({
        'author': Text(shape=(), dtype=tf.string),
        'created_utc': Text(shape=(), dtype=tf.string),
        'id': Text(shape=(), dtype=tf.string),
        'link_id': Text(shape=(), dtype=tf.string),
        'parent_id': Text(shape=(), dtype=tf.string),
        'text': Text(shape=(), dtype=tf.string),
    }),
})
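Because `thread` is a `Sequence` of a feature dictionary, a record is typically yielded in columnar form: one array of strings per feature, all of the same length. The sketch below converts such a record into a list of per-message dictionaries. It assumes values arrive as byte strings (as they do when examples are converted with `tfds.as_numpy`); the helper name `thread_to_messages` and the toy record are hypothetical.

```python
from typing import Dict, List, Sequence, Union

StrOrBytes = Union[str, bytes]


def _to_str(value: StrOrBytes) -> str:
    """Decode byte strings as UTF-8; pass regular strings through."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def thread_to_messages(thread: Dict[str, Sequence[StrOrBytes]]) -> List[Dict[str, str]]:
    """Turn a columnar thread (feature -> values) into one dict per message."""
    keys = sorted(thread)
    lengths = {len(thread[key]) for key in keys}
    if len(lengths) > 1:
        raise ValueError("all feature columns must have the same length")
    n_messages = lengths.pop() if lengths else 0
    return [{key: _to_str(thread[key][i]) for key in keys} for i in range(n_messages)]


# Hypothetical record shaped like the feature structure above.
toy_record = {
    "thread": {
        "author": [b"alice", b"bob"],
        "created_utc": [b"1500000000", b"1500000060"],
        "id": [b"c1", b"c2"],
        "link_id": [b"p0", b"p0"],
        "parent_id": [b"p0", b"c1"],
        "text": [b"First comment", b"Reply to the first comment"],
    }
}

messages = thread_to_messages(toy_record["thread"])
print(messages[1]["author"], "replied to", messages[1]["parent_id"])
```

Note that `created_utc` is stored as a string, so it would need an explicit conversion before any numeric comparison or sorting by time.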
  • Feature documentation:
Feature             Class         Shape  Dtype      Description
                    FeaturesDict
thread              Sequence
thread/author       Text                 tf.string
thread/created_utc  Text                 tf.string
thread/id           Text                 tf.string
thread/link_id      Text                 tf.string
thread/parent_id    Text                 tf.string
thread/text         Text                 tf.string
  • Citation:
@article{zhu2019did,
  title={Who did They Respond to? Conversation Structure Modeling using Masked Hierarchical Transformer},
  author={Zhu, Henghui and Nan, Feng and Wang, Zhiguo and Nallapati, Ramesh and Xiang, Bing},
  journal={arXiv preprint arXiv:1911.10666},
  year={2019}
}