- Description:
ProteinNet is a standardized data set for machine learning of protein structure. It provides protein sequences, structures (secondary and tertiary), multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and standardized training/validation/test splits. ProteinNet builds on the biennial CASP assessments, which carry out blind predictions of recently solved but publicly unavailable protein structures, to provide test sets that push the frontiers of computational methodology. It is organized as a series of data sets, spanning CASP 7 through 12 (covering a ten-year period), to provide a range of data set sizes that enable assessment of new methods in relatively data-poor and data-rich regimes.
Source code: tfds.datasets.protein_net.Builder
Versions:
1.0.0 (default): Initial release.
Auto-cached (documentation): No
Feature structure:
FeaturesDict({
    'evolutionary': Tensor(shape=(None, 21), dtype=float32),
    'id': Text(shape=(), dtype=string),
    'length': int32,
    'mask': Tensor(shape=(None,), dtype=bool),
    'primary': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=20)),
    'tertiary': Tensor(shape=(None, 3), dtype=float32),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
FeaturesDict | | | | |
evolutionary | Tensor | (None, 21) | float32 | |
id | Text | | string | |
length | Tensor | | int32 | |
mask | Tensor | (None,) | bool | |
primary | Sequence(ClassLabel) | (None,) | int64 | |
tertiary | Tensor | (None, 3) | float32 | |
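
The shapes and dtypes in the table above can be turned into a small sanity check. The following is a minimal sketch, not part of TFDS: `spec_violations` and `make_synthetic_example` are hypothetical helper names, and the example data is random rather than real ProteinNet content. The page does not state how the leading (`None`) dimensions of `tertiary`, `mask`, and `primary` relate to each other, so the check deliberately leaves that relationship untested.

```python
import numpy as np


def spec_violations(ex: dict) -> list:
    """Return a list of human-readable mismatches against the documented features."""
    problems = []

    def check_array(key, ndim, dtype, last_dim=None):
        arr = np.asarray(ex[key])
        if arr.ndim != ndim:
            problems.append(f"{key}: expected {ndim} dims, got {arr.ndim}")
        elif last_dim is not None and arr.shape[-1] != last_dim:
            problems.append(f"{key}: expected last dim {last_dim}, got {arr.shape[-1]}")
        if arr.dtype != dtype:
            problems.append(f"{key}: expected dtype {np.dtype(dtype)}, got {arr.dtype}")
        return arr

    check_array("evolutionary", 2, np.float32, last_dim=21)  # (None, 21) float32
    check_array("tertiary", 2, np.float32, last_dim=3)       # (None, 3) float32
    check_array("mask", 1, np.bool_)                          # (None,) bool
    check_array("length", 0, np.int32)                        # scalar int32
    primary = check_array("primary", 1, np.int64)             # (None,) int64 labels
    # ClassLabel with num_classes=20 means labels lie in [0, 20).
    if primary.size and (primary.min() < 0 or primary.max() >= 20):
        problems.append("primary: class label outside [0, 20)")
    if not isinstance(ex["id"], (str, bytes)):
        problems.append("id: expected a string")
    return problems


def make_synthetic_example(n_residues: int = 5, n_coord_rows: int = 5, seed: int = 0) -> dict:
    """Build a random example with the documented shapes (assumption: sizes are arbitrary)."""
    rng = np.random.default_rng(seed)
    return {
        "evolutionary": rng.random((n_residues, 21), dtype=np.float32),
        "id": "synthetic_0",
        "length": np.int32(n_residues),
        "mask": np.ones(n_residues, dtype=bool),
        "primary": rng.integers(0, 20, size=n_residues, dtype=np.int64),
        "tertiary": rng.random((n_coord_rows, 3), dtype=np.float32),
    }
```

A check like this could be run on a single decoded example (for instance one taken from `tfds.as_numpy`) before writing a model input pipeline.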
Supervised keys (See as_supervised doc): ('primary', 'tertiary')
Figure (tfds.show_examples): Not supported.
Citation:
@article{ProteinNet19,
  title = {{ProteinNet}: a standardized data set for machine learning of protein structure},
  author = {AlQuraishi, Mohammed},
  journal = {BMC bioinformatics},
  volume = {20},
  number = {1},
  pages = {1--10},
  year = {2019},
  publisher = {BioMed Central}
}
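
As a usage illustration, the configs and splits listed on this page can be addressed by name through `tfds.load`. The sketch below is an assumption-laden convenience wrapper, not part of the dataset builder: `config_name`, `train_split`, and `load_protein_net` are hypothetical helpers. `tensorflow_datasets` is imported lazily so that the name helpers work without it, and calling the loader triggers a multi-GiB download on first use.

```python
# Configs and train-split suffixes as listed in this catalog entry.
CASP_VERSIONS = (7, 8, 9, 10, 11, 12)
TRAIN_SUFFIXES = (30, 50, 70, 90, 95, 100)


def config_name(casp: int) -> str:
    """Return the TFDS name of one config, e.g. 'protein_net/casp7'."""
    if casp not in CASP_VERSIONS:
        raise ValueError(f"unknown CASP version: {casp}")
    return f"protein_net/casp{casp}"


def train_split(suffix: int) -> str:
    """Return the name of one training split, e.g. 'train_30'."""
    if suffix not in TRAIN_SUFFIXES:
        raise ValueError(f"unknown training split suffix: {suffix}")
    return f"train_{suffix}"


def load_protein_net(casp: int = 7, suffix: int = 30, as_supervised: bool = False):
    """Load one training split as a tf.data.Dataset (downloads data on first use)."""
    import tensorflow_datasets as tfds  # lazy import: only needed when actually loading

    return tfds.load(
        config_name(casp),
        split=train_split(suffix),
        as_supervised=as_supervised,
    )
```

With `as_supervised=True`, each element should be a `(primary, tertiary)` pair, following the supervised keys listed above; otherwise each element is a dictionary with the six documented features.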
protein_net/casp7 (default config)
Download size: 3.18 GiB
Dataset size: 2.53 GiB
Splits:
Split | Examples |
---|---|
'test' | 93 |
'train_100' | 34,557 |
'train_30' | 10,333 |
'train_50' | 13,024 |
'train_70' | 15,207 |
'train_90' | 17,611 |
'train_95' | 17,938 |
'validation' | 224 |
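
To get a feel for how the training splits relate in size, the counts from the casp7 table can be expressed as fractions of `train_100`. This is a small arithmetic sketch using only the numbers listed above; `fraction_of_full` is a hypothetical helper name.

```python
# Example counts copied from the casp7 splits table.
CASP7_TRAIN = {
    "train_30": 10_333,
    "train_50": 13_024,
    "train_70": 15_207,
    "train_90": 17_611,
    "train_95": 17_938,
    "train_100": 34_557,
}


def fraction_of_full(splits: dict, full: str = "train_100") -> dict:
    """Return each split's size as a fraction of the largest training split."""
    total = splits[full]
    return {name: count / total for name, count in splits.items()}


fractions = fraction_of_full(CASP7_TRAIN)
for name, frac in fractions.items():
    print(f"{name:>9}: {CASP7_TRAIN[name]:>6,} examples ({frac:.1%} of train_100)")
```

The same helper can be applied to the other configs by swapping in their counts.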
protein_net/casp8
Download size: 4.96 GiB
Dataset size: 3.55 GiB
Splits:
Split | Examples |
---|---|
'test' | 120 |
'train_100' | 48,087 |
'train_30' | 13,881 |
'train_50' | 17,970 |
'train_70' | 21,191 |
'train_90' | 24,556 |
'train_95' | 25,035 |
'validation' | 224 |
protein_net/casp9
Download size: 6.65 GiB
Dataset size: 4.54 GiB
Splits:
Split | Examples |
---|---|
'test' | 116 |
'train_100' | 60,350 |
'train_30' | 16,973 |
'train_50' | 22,172 |
'train_70' | 26,263 |
'train_90' | 30,513 |
'train_95' | 31,128 |
'validation' | 224 |
protein_net/casp10
Download size: 8.65 GiB
Dataset size: 5.57 GiB
Splits:
Split | Examples |
---|---|
'test' | 95 |
'train_100' | 73,116 |
'train_30' | 19,495 |
'train_50' | 25,897 |
'train_70' | 31,001 |
'train_90' | 36,258 |
'train_95' | 37,033 |
'validation' | 224 |
protein_net/casp11
Download size: 10.81 GiB
Dataset size: 6.72 GiB
Splits:
Split | Examples |
---|---|
'test' | 81 |
'train_100' | 87,573 |
'train_30' | 22,344 |
'train_50' | 29,936 |
'train_70' | 36,005 |
'train_90' | 42,507 |
'train_95' | 43,544 |
'validation' | 224 |
protein_net/casp12
Download size: 13.18 GiB
Dataset size: 8.05 GiB
Splits:
Split | Examples |
---|---|
'test' | 40 |
'train_100' | 104,059 |
'train_30' | 25,299 |
'train_50' | 34,039 |
'train_70' | 41,522 |
'train_90' | 49,600 |
'train_95' | 50,914 |
'validation' | 224 |
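
The description mentions that the series spans data-poor to data-rich regimes. One way to see this is to compare split sizes across configs. The sketch below uses only counts and sizes copied from the tables on this page; `growth` is a hypothetical helper name.

```python
# Per-config figures copied from the tables above, keyed by CASP version.
TRAIN_100 = {7: 34_557, 8: 48_087, 9: 60_350, 10: 73_116, 11: 87_573, 12: 104_059}
TRAIN_30 = {7: 10_333, 8: 13_881, 9: 16_973, 10: 19_495, 11: 22_344, 12: 25_299}
TEST = {7: 93, 8: 120, 9: 116, 10: 95, 11: 81, 12: 40}
DOWNLOAD_GIB = {7: 3.18, 8: 4.96, 9: 6.65, 10: 8.65, 11: 10.81, 12: 13.18}


def growth(counts: dict, first: int = 7, last: int = 12) -> float:
    """Return the ratio between the last and first config for one quantity."""
    return counts[last] / counts[first]


print(f"train_100 grows {growth(TRAIN_100):.2f}x from casp7 to casp12")
print(f"train_30 grows {growth(TRAIN_30):.2f}x from casp7 to casp12")
print(f"download size grows {growth(DOWNLOAD_GIB):.2f}x from casp7 to casp12")
print(f"test examples across all configs: {sum(TEST.values())}")
```

The validation split has 224 examples in every config, so it is omitted from the comparison.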