• Description:

DocNLI is a large-scale dataset for document-level natural language inference (NLI). DocNLI is transformed from a broad range of NLP problems and covers multiple genres of text. The premises always stay in the document granularity, whereas the hypotheses vary in length from single sentences to passages with hundreds of words. In contrast to some existing sentence-level NLI datasets, DocNLI has pretty limited artifacts.

Split Examples
'test' 267,086
'train' 942,314
'validation' 234,258
  • Features:
    'hypothesis': Text(shape=(), dtype=tf.string),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'premise': Text(shape=(), dtype=tf.string),
  • Citation:
    title={DocNLI: A Large-scale Dataset for Document-level Natural Language Inference},
    author={Wenpeng Yin and Dragomir Radev and Caiming Xiong},
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",