- Description:
The scientific papers dataset contains two sets of long, structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.
Both "arxiv" and "pubmed" have three features:
- article: the body of the document, paragraphs separated by "\n".
- abstract: the abstract of the document, paragraphs separated by "\n".
- section_names: titles of sections, separated by "\n".
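A minimal sketch of how the three features might be split back into paragraphs and section titles. It assumes the separator is the newline character and that values arrive as bytes (as with NumPy iteration) or str. `split_fields` and the toy record are hypothetical names for illustration, not part of TFDS.

```python
from typing import Dict, List, Union


def split_fields(example: Dict[str, Union[str, bytes]]) -> Dict[str, List[str]]:
    """Split each newline-joined feature into a list of non-empty strings."""
    parsed = {}
    for name, value in example.items():
        # Text features may arrive as bytes; decode before splitting.
        text = value.decode("utf-8") if isinstance(value, bytes) else value
        # Assumption: paragraphs and section titles are joined with "\n".
        parsed[name] = [part for part in text.split("\n") if part.strip()]
    return parsed


# Hypothetical toy record standing in for one dataset example.
toy_example = {
    "article": b"first paragraph\nsecond paragraph",
    "abstract": b"short summary",
    "section_names": b"Introduction\nMethods",
}
parsed = split_fields(toy_example)
print(parsed["section_names"])
```

Empty fragments are dropped, so a trailing or doubled separator does not produce blank paragraphs.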
Additional Documentation: Explore on Papers With Code
Source code:
tfds.datasets.scientific_papers.Builder
Versions:
1.1.0: No release notes.
1.1.1 (default): No release notes.
Download size:
4.20 GiB
Auto-cached (documentation): No
Feature structure:
FeaturesDict({
'abstract': Text(shape=(), dtype=string),
'article': Text(shape=(), dtype=string),
'section_names': Text(shape=(), dtype=string),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| abstract | Text | | string | |
| article | Text | | string | |
| section_names | Text | | string | |
Supervised keys (See as_supervised doc): ('article', 'abstract')
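The supervised keys mean that loading with `as_supervised=True` yields (article, abstract) tuples instead of feature dictionaries. The sketch below is illustrative only: `to_supervised_pair` and `load_supervised` are hypothetical helpers, and the TFDS import is deferred so that the pure helper runs without the library or the download.

```python
from typing import Dict, Tuple


def to_supervised_pair(example: Dict[str, bytes]) -> Tuple[bytes, bytes]:
    """Mirror the ('article', 'abstract') supervised keys on a feature dict."""
    return example["article"], example["abstract"]


def load_supervised(config: str = "arxiv", split: str = "validation"):
    """Load (article, abstract) tuples; the first call triggers the full download."""
    # Imported lazily so the helper above has no dependency on TFDS.
    import tensorflow_datasets as tfds

    return tfds.load(f"scientific_papers/{config}", split=split, as_supervised=True)


# Hypothetical toy record standing in for one dataset example.
toy = {
    "article": b"body text",
    "abstract": b"summary text",
    "section_names": b"Introduction",
}
source, target = to_supervised_pair(toy)
```

For summarization, `source` is the model input and `target` the reference summary; `section_names` is not part of the pair.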
Figure (tfds.show_examples): Not supported.
Citation:
@article{Cohan_2018,
title={A Discourse-Aware Attention Model for Abstractive Summarization of
Long Documents},
url={http://dx.doi.org/10.18653/v1/n18-2097},
DOI={10.18653/v1/n18-2097},
journal={Proceedings of the 2018 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language
Technologies, Volume 2 (Short Papers)},
publisher={Association for Computational Linguistics},
author={Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli},
year={2018}
}
scientific_papers/arxiv (default config)
Config description: Documents from ArXiv repository.
Dataset size:
7.07 GiB
Splits:
Split | Examples |
---|---|
'test' | 6,440 |
'train' | 203,037 |
'validation' | 6,436 |
scientific_papers/pubmed
Config description: Documents from PubMed repository.
Dataset size:
2.34 GiB
Splits:
Split | Examples |
---|---|
'test' | 6,658 |
'train' | 119,924 |
'validation' | 6,633 |
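The split sizes listed for both configs can serve as a sanity check after preparing the data. This is a rough sketch: `EXPECTED_SPLITS`, `total_examples`, and `check_prepared_splits` are hypothetical names, and the counts are copied from the split tables on this page. `check_prepared_splits` downloads and prepares the full config, so it is defined here but not called.

```python
# Split sizes copied from the tables on this page.
EXPECTED_SPLITS = {
    "arxiv": {"train": 203_037, "validation": 6_436, "test": 6_440},
    "pubmed": {"train": 119_924, "validation": 6_633, "test": 6_658},
}


def total_examples(config: str) -> int:
    """Sum the documented example counts across all splits of a config."""
    return sum(EXPECTED_SPLITS[config].values())


def check_prepared_splits(config: str) -> bool:
    """Compare prepared split sizes with the documented ones (downloads data)."""
    # Imported lazily so the constants above can be used without TFDS.
    import tensorflow_datasets as tfds

    builder = tfds.builder(f"scientific_papers/{config}")
    builder.download_and_prepare()
    actual = {name: info.num_examples for name, info in builder.info.splits.items()}
    return actual == EXPECTED_SPLITS[config]


print(total_examples("arxiv"), total_examples("pubmed"))
```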