scientific_papers

Description:

The scientific_papers dataset contains two sets of long, structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.

Both "arxiv" and "pubmed" have three features:

article: the body of the document, paragraphs separated by "\n".
abstract: the abstract of the document, paragraphs separated by "\n".
section_names: titles of sections, separated by "\n".
Additional Documentation: Explore on Papers With Code
Homepage: https://github.com/armancohan/long-summarization
Source code: tfds.datasets.scientific_papers.Builder
Versions:
- 1.1.0: No release notes.
- 1.1.1 (default): No release notes.
Download size: 4.20 GiB
Auto-cached (documentation): No
Feature structure:

FeaturesDict({
    'abstract': Text(shape=(), dtype=string),
    'article': Text(shape=(), dtype=string),
    'section_names': Text(shape=(), dtype=string),
})
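
Because all three features are plain strings, splitting on the separator recovers the individual paragraphs and section titles. The following is a minimal sketch, assuming the separator is the newline character and that the example has already been converted to Python values (for instance with `tfds.as_numpy`, which yields `bytes`). The helper names `_to_str` and `parse_example` are hypothetical, and the toy record is made up for illustration.

```python
from typing import Dict, List, Union

TextLike = Union[str, bytes]


def _to_str(value: TextLike) -> str:
    """Decode UTF-8 bytes (as yielded by tfds.as_numpy) to str."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def parse_example(example: Dict[str, TextLike]) -> Dict[str, List[str]]:
    """Split each newline-delimited feature into a list of non-empty strings."""
    parsed = {}
    for key in ("article", "abstract", "section_names"):
        text = _to_str(example[key])
        parsed[key] = [part.strip() for part in text.split("\n") if part.strip()]
    return parsed


# Toy record with the same keys as the FeaturesDict (not real data).
toy = {
    "article": b"first paragraph\nsecond paragraph",
    "abstract": b"one-paragraph abstract",
    "section_names": b"introduction\nmethods\nconclusion",
}
parsed = parse_example(toy)
print(parsed["section_names"])
```

Dropping empty parts is a design choice of this sketch; keep them if blank paragraphs matter for your use case.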

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
abstract	Text	string
article	Text	string
section_names	Text	string

Supervised keys (See as_supervised doc): ('article', 'abstract')
Figure (tfds.show_examples): Not supported.
Citation:

@article{Cohan_2018,
   title={A Discourse-Aware Attention Model for Abstractive Summarization of
            Long Documents},
   url={http://dx.doi.org/10.18653/v1/n18-2097},
   DOI={10.18653/v1/n18-2097},
   journal={Proceedings of the 2018 Conference of the North American Chapter of
          the Association for Computational Linguistics: Human Language
          Technologies, Volume 2 (Short Papers)},
   publisher={Association for Computational Linguistics},
   author={Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli},
   year={2018}
}
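
The supervised keys ('article', 'abstract') listed above mean that loading with `as_supervised=True` yields (input, target) pairs instead of feature dictionaries, which suits summarization training. The sketch below illustrates that mapping on a toy record; `to_supervised` and `load_supervised` are hypothetical helper names, and the TFDS import is deferred so the rest of the snippet runs without the library installed.

```python
from typing import Dict, Tuple


def to_supervised(example: Dict[str, str]) -> Tuple[str, str]:
    """Mimic as_supervised=True: return (input, target) = (article, abstract)."""
    return example["article"], example["abstract"]


def load_supervised(config: str = "arxiv", split: str = "validation"):
    """Load (article, abstract) pairs; the first call triggers a multi-GiB download."""
    # Imported lazily so the toy demo below works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(f"scientific_papers/{config}", split=split, as_supervised=True)


# Toy record (not real data) showing which fields end up in the pair.
toy = {"article": "body text", "abstract": "summary", "section_names": "intro"}
source, target = to_supervised(toy)
print(source, "->", target)
```

Note that `section_names` is not part of the supervised pair; load without `as_supervised` if you need it.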

scientific_papers/arxiv (default config)

Config description: Documents from ArXiv repository.
Dataset size: 7.07 GiB
Splits:

Split	Examples
`'test'`	6,440
`'train'`	203,037
`'validation'`	6,436

scientific_papers/pubmed

Config description: Documents from PubMed repository.
Dataset size: 2.34 GiB
Splits:

Split	Examples
`'test'`	6,658
`'train'`	119,924
`'validation'`	6,633
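
As a quick sanity check on the split tables, the sketch below totals the example counts for each config and reports the share of each split. The counts are copied from the tables on this page; the helper name `split_fractions` is hypothetical.

```python
# Example counts copied from the split tables for both configs.
SPLITS = {
    "arxiv": {"train": 203_037, "validation": 6_436, "test": 6_440},
    "pubmed": {"train": 119_924, "validation": 6_633, "test": 6_658},
}


def split_fractions(counts):
    """Return (total, {split: fraction of total}) for one config."""
    total = sum(counts.values())
    return total, {name: n / total for name, n in counts.items()}


totals = {}
for config, counts in SPLITS.items():
    total, fractions = split_fractions(counts)
    totals[config] = total
    summary = ", ".join(f"{name}={frac:.1%}" for name, frac in fractions.items())
    print(f"{config}: {total:,} examples ({summary})")
```

By this arithmetic, roughly 94% of the arxiv examples and roughly 90% of the pubmed examples are in the training split.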
