• Description:

The Scientific Papers dataset contains two sets of long, structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.

Both "arxiv" and "pubmed" have three features:

    'abstract': Text(shape=(), dtype=string),
    'article': Text(shape=(), dtype=string),
    'section_names': Text(shape=(), dtype=string),
  • Feature documentation:

Feature        Class  Shape  Dtype   Description
abstract       Text          string
article        Text          string
section_names  Text          string
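
The three string features above can be read with the TensorFlow Datasets API. The following is a minimal sketch, not an official recipe: the helper names (`dataset_name`, `split_lines`, `load_examples`) are hypothetical, and `split_lines` assumes that multi-part values such as `section_names` are joined with newlines, which this page does not state. Calling `load_examples` downloads and prepares the full config on first use.

```python
CONFIGS = ("arxiv", "pubmed")


def dataset_name(config="arxiv"):
    """Build the TFDS name for one of the two configs listed on this page."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    return f"scientific_papers/{config}"


def split_lines(text):
    """Split a feature string into non-empty lines.

    Assumption: parts are newline-separated; adjust if the data differs.
    """
    return [line for line in text.split("\n") if line.strip()]


def load_examples(config="arxiv", split="validation"):
    """Yield (abstract, article, section_names) as Python strings."""
    # Imported lazily so the helpers above work without TFDS installed.
    import tensorflow_datasets as tfds

    ds = tfds.load(dataset_name(config), split=split)
    for example in tfds.as_numpy(ds):
        # Scalar string features arrive as bytes and need decoding.
        yield (
            example["abstract"].decode("utf-8"),
            example["article"].decode("utf-8"),
            example["section_names"].decode("utf-8"),
        )
```

For example, `next(load_examples("pubmed", "test"))` would return the first PubMed test document as a tuple of three strings, and `split_lines` can then be applied to the third element.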
  • Citation:

   title={A Discourse-Aware Attention Model for Abstractive Summarization of
            Long Documents},
   journal={Proceedings of the 2018 Conference of the North American Chapter of
          the Association for Computational Linguistics: Human Language
          Technologies, Volume 2 (Short Papers)},
   publisher={Association for Computational Linguistics},
   author={Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli},

scientific_papers/arxiv (default config)

  • Config description: Documents from ArXiv repository.

  • Dataset size: 7.07 GiB

  • Splits:

Split         Examples
'test'           6,440
'train'        203,037
'validation'     6,436


scientific_papers/pubmed

  • Config description: Documents from PubMed repository.

  • Dataset size: 2.34 GiB

  • Splits:

Split         Examples
'test'           6,658
'train'        119,924
'validation'     6,633
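
The split tables can be turned into totals and proportions with a few lines of arithmetic. This is only an illustrative sketch: the counts are copied from the tables on this page, and the names `SPLITS` and `split_fractions` are hypothetical.

```python
# Example counts per split, copied from the tables on this page.
SPLITS = {
    "arxiv": {"test": 6440, "train": 203037, "validation": 6436},
    "pubmed": {"test": 6658, "train": 119924, "validation": 6633},
}


def split_fractions(counts):
    """Return each split's share of the config's total example count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


for config, counts in SPLITS.items():
    total = sum(counts.values())
    shares = ", ".join(
        f"{name}: {frac:.1%}" for name, frac in split_fractions(counts).items()
    )
    print(f"{config}: {total:,} examples ({shares})")
```

The totals come to 215,913 examples for arxiv and 133,215 for pubmed, with the train split making up roughly 94% and 90% of each config respectively.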