scientific_papers

The scientific_papers dataset contains two sets of long, structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.

Both "arxiv" and "pubmed" have three features:

  • article: the body of the document, paragraphs separated by "\n".

  • abstract: the abstract of the document, paragraphs separated by "\n".

  • section_names: titles of sections, separated by "\n".
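Because each feature is a single string, downstream code usually has to split it back into paragraphs or section titles. The following is a minimal sketch, assuming the separator is the newline character and that values may arrive as bytes (as they do when a tf.string is converted to NumPy). The helper name split_record is hypothetical and not part of TFDS.

```python
def split_record(example):
    """Split each newline-joined field of one example into a list of strings.

    `example` is assumed to be a dict with the three feature keys described
    above; this helper is illustrative and not part of TFDS.
    """
    fields = ("article", "abstract", "section_names")
    result = {}
    for name in fields:
        value = example[name]
        # tf.string values become bytes once converted to NumPy.
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        # Drop empty pieces produced by leading or trailing separators.
        result[name] = [part for part in value.split("\n") if part]
    return result
```

Empty pieces are dropped here for convenience; keep them if paragraph positions matter for your use case.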

scientific_papers is configured with tfds.summarization.scientific_papers.ScientificPapersConfig and has the following configurations predefined (defaults to the first one):

  • arxiv (v1.1.0) (Size: 4.20 GiB): Documents from ArXiv repository.

  • pubmed (v1.1.0) (Size: 4.20 GiB): Documents from PubMed repository.
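A configuration is typically selected by appending its name to the dataset name when calling tfds.load. The sketch below assumes tensorflow_datasets is installed and that the split names are "train", "validation", and "test". The helpers dataset_name and load_split are hypothetical wrappers written for illustration. Note that the first call to load_split downloads and prepares several GiB of data.

```python
CONFIGS = ("arxiv", "pubmed")


def dataset_name(config="arxiv"):
    """Build the 'scientific_papers/<config>' name passed to tfds.load."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    return f"scientific_papers/{config}"


def load_split(config="arxiv", split="train"):
    """Load one split of the chosen configuration as a tf.data.Dataset."""
    # Imported lazily so the name helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(dataset_name(config), split=split)
```

For example, load_split("pubmed", "validation") would return the PubMed validation split, with each element a dict holding the 'article', 'abstract', and 'section_names' features.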

scientific_papers/arxiv

Documents from ArXiv repository.

Versions:

  • 1.1.0 (default):

Statistics

Split       Examples
ALL         215,913
TRAIN       203,037
TEST        6,440
VALIDATION  6,436

Features

FeaturesDict({
    'abstract': Text(shape=(), dtype=tf.string),
    'article': Text(shape=(), dtype=tf.string),
    'section_names': Text(shape=(), dtype=tf.string),
})

Homepage

Supervised keys (for as_supervised=True)

(u'article', u'abstract')
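With as_supervised=True, each element is an (input, target) tuple built from the supervised keys rather than a feature dict, so the article is the input and the abstract is the target. A minimal sketch of that mapping follows. The names to_supervised and load_supervised are hypothetical, and the loader assumes tensorflow_datasets is installed and will download the data on first use.

```python
SUPERVISED_KEYS = ("article", "abstract")


def to_supervised(example, keys=SUPERVISED_KEYS):
    """Mirror the (input, target) pairing on a plain feature dict."""
    input_key, target_key = keys
    return example[input_key], example[target_key]


def load_supervised(config="arxiv", split="validation"):
    """Load a split whose elements are (article, abstract) tuples."""
    # Imported lazily so to_supervised works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(
        f"scientific_papers/{config}", split=split, as_supervised=True
    )
```

The 'section_names' feature is not part of the supervised pair; load without as_supervised if you need it.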

scientific_papers/pubmed

Documents from PubMed repository.

Versions:

  • 1.1.0 (default):

Statistics

Split       Examples
ALL         133,215
TRAIN       119,924
TEST        6,658
VALIDATION  6,633

Features

FeaturesDict({
    'abstract': Text(shape=(), dtype=tf.string),
    'article': Text(shape=(), dtype=tf.string),
    'section_names': Text(shape=(), dtype=tf.string),
})

Homepage

Supervised keys (for as_supervised=True)

(u'article', u'abstract')

Citation

@article{Cohan_2018,
   title={A Discourse-Aware Attention Model for Abstractive Summarization of
            Long Documents},
   url={http://dx.doi.org/10.18653/v1/n18-2097},
   DOI={10.18653/v1/n18-2097},
   journal={Proceedings of the 2018 Conference of the North American Chapter of
          the Association for Computational Linguistics: Human Language
          Technologies, Volume 2 (Short Papers)},
   publisher={Association for Computational Linguistics},
   author={Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli},
   year={2018}
}