• Description:


Causal inference is one of the hallmarks of human intelligence.

Corr2cause is a large-scale dataset of more than 400K samples, on which seventeen existing LLMs are evaluated in the related paper.

Overall, Corr2cause contains 415,944 samples, 18.57% of which are valid. The average premise length is 424.11 tokens, and the average hypothesis length is 10.83 tokens. The data is split into 411,452 training samples and 2,246 samples each for the development and test sets. Since the main purpose of the dataset is to benchmark the performance of LLMs, the test and development sets were prioritized to provide comprehensive coverage of all graph sizes.

Split     Examples
'dev'     2,246
'test'    2,246
'train'   411,452
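
As a quick sanity check on the figures above, the following sketch confirms that the three split sizes add up to the stated total and derives each split's share. The names `SPLITS`, `TOTAL`, `VALID_PCT`, `shares`, and `approx_valid` are illustrative only and are not part of the dataset. The valid-sample count is an estimate, since the 18.57% figure is itself rounded.

```python
# Split sizes and totals copied from the description and split table above.
SPLITS = {"train": 411_452, "dev": 2_246, "test": 2_246}
TOTAL = 415_944      # stated overall sample count
VALID_PCT = 18.57    # stated percentage of valid samples (rounded)

# The splits should partition the full dataset.
total = sum(SPLITS.values())
assert total == TOTAL, f"splits sum to {total}, expected {TOTAL}"

# Share of each split, in percent of the full dataset.
shares = {name: 100 * count / total for name, count in SPLITS.items()}

# Approximate number of valid samples implied by the rounded percentage.
approx_valid = round(TOTAL * VALID_PCT / 100)

for name, pct in shares.items():
    print(f"{name:>5}: {SPLITS[name]:>7,} examples ({pct:.2f}%)")
print(f"approx. valid samples: {approx_valid:,}")
```

The development and test sets together make up roughly 1% of the data, which is consistent with their role as benchmark sets rather than training material.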
  • Feature structure:
    FeaturesDict({
        'input': Text(shape=(), dtype=string),
        'label': int64,
    })
  • Feature documentation:
Feature   Class    Shape   Dtype    Description
input     Text             string
label     Tensor           int64
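
The two documented features can be mirrored in plain Python as a minimal sketch of what one record looks like. The class `Example`, the helper `positive_fraction`, and the toy records are hypothetical and do not come from the dataset. The sketch also assumes that label `1` marks a valid sample, which the feature documentation above does not state.

```python
from dataclasses import dataclass

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


@dataclass(frozen=True)
class Example:
    """One record with the two documented features (hypothetical wrapper)."""
    input: str  # 'input': Text(shape=(), dtype=string)
    label: int  # 'label': int64

    def __post_init__(self):
        # Reject values that could not be stored in the documented dtypes.
        if not isinstance(self.input, str):
            raise TypeError("input must be a string")
        if isinstance(self.label, bool) or not isinstance(self.label, int):
            raise TypeError("label must be an integer")
        if not INT64_MIN <= self.label <= INT64_MAX:
            raise ValueError("label does not fit in int64")


def positive_fraction(examples, positive_label=1):
    """Fraction of records whose label equals positive_label.

    Assumption: label 1 marks a valid sample.
    """
    examples = list(examples)
    if not examples:
        return 0.0
    return sum(e.label == positive_label for e in examples) / len(examples)


# Toy records with placeholder text; these are NOT real dataset samples.
toy = [
    Example("placeholder text A", 0),
    Example("placeholder text B", 1),
    Example("placeholder text C", 0),
    Example("placeholder text D", 0),
]
print(positive_fraction(toy))
```

Run over a real split, a helper like `positive_fraction` would give a direct way to compare the label balance against the 18.57% figure quoted in the description.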
  • Citation:
      title={Can Large Language Models Infer Causation from Correlation?},
      author={Zhijing Jin and Jiarui Liu and Zhiheng Lyu and Spencer Poff and Mrinmaya Sachan and Rada Mihalcea and Mona Diab and Bernhard Schölkopf},