• Description:

'ogbg-molpcba' is a molecular dataset sampled from PubChem BioAssay. It is a graph prediction dataset from the Open Graph Benchmark (OGB).

This dataset is experimental, and the API is subject to change in future releases.

The below description of the dataset is adapted from the OGB paper:

Input Format

All the molecules are pre-processed using RDKit ([1]).

  • Each graph represents a molecule, where nodes are atoms, and edges are chemical bonds.
  • Input node features are 9-dimensional, containing atomic number and chirality, as well as other additional atom features such as formal charge and whether the atom is in the ring.
  • Input edge features are 3-dimensional, containing bond type, bond stereochemistry, as well as an additional bond feature indicating whether the bond is conjugated.

The exact description of all features is available at https://github.com/snap-stanford/ogb/blob/master/ogb/utils/features.py


The task is to predict 128 different biological activities (inactive/active). See [2] and [3] for more description about these targets. Not all targets apply to each molecule: missing targets are indicated by NaNs.


[1]: Greg Landrum, et al. 'RDKit: Open-source cheminformatics'. URL: https://github.com/rdkit/rdkit

[2]: Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding and Vijay Pande. 'Massively Multitask Networks for Drug Discovery'. URL: https://arxiv.org/pdf/1502.02072.pdf

[3]: Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513-530, 2018.

Split Examples
'test' 43,793
'train' 350,343
'validation' 43,793
  • Feature structure:
    'edge_feat': Tensor(shape=(None, 3), dtype=float32),
    'edge_index': Tensor(shape=(None, 2), dtype=int64),
    'labels': Tensor(shape=(128,), dtype=float32),
    'node_feat': Tensor(shape=(None, 9), dtype=float32),
    'num_edges': Tensor(shape=(None,), dtype=int64),
    'num_nodes': Tensor(shape=(None,), dtype=int64),
  • Feature documentation:
Feature Class Shape Dtype Description
edge_feat Tensor (None, 3) float32
edge_index Tensor (None, 2) int64
labels Tensor (128,) float32
node_feat Tensor (None, 9) float32
num_edges Tensor (None,) int64
num_nodes Tensor (None,) int64


  • Citation:
  author    = {Weihua Hu and
               Matthias Fey and
               Marinka Zitnik and
               Yuxiao Dong and
               Hongyu Ren and
               Bowen Liu and
               Michele Catasta and
               Jure Leskovec},
  editor    = {Hugo Larochelle and
               Marc Aurelio Ranzato and
               Raia Hadsell and
               Maria{-}Florina Balcan and
               Hsuan{-}Tien Lin},
  title     = {Open Graph Benchmark: Datasets for Machine Learning on Graphs},
  booktitle = {Advances in Neural Information Processing Systems 33: Annual Conference
               on Neural Information Processing Systems 2020, NeurIPS 2020, December
               6-12, 2020, virtual},
  year      = {2020},
  url       = {https://proceedings.neurips.cc/paper/2020/hash/fb60d411a5c5b72b2e7d3527cfc84fd0-Abstract.html},
  timestamp = {Tue, 19 Jan 2021 15:57:06 +0100},
  biburl    = {https://dblp.org/rec/conf/nips/HuFZDRLCL20.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}