c4_wsrs

  • Description:

A medical abbreviation expansion dataset built by applying web-scale reverse substitution (wsrs) to C4, a colossal, cleaned version of Common Crawl's web crawl corpus.

The original source is the Common Crawl dataset: https://commoncrawl.org

  • Splits:

Split          Examples
'train'        9,575,852
'validation'   991,422
  • Feature structure:
FeaturesDict({
    'abbreviated_snippet': Text(shape=(), dtype=string),
    'original_snippet': Text(shape=(), dtype=string),
})
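
The two text features form a pair: one snippet in its original form and the same snippet with abbreviations substituted in. A minimal sketch of how such a pair could be produced by reverse substitution is given here. It is a toy illustration, not the dataset's actual generation pipeline: the `ABBREVIATIONS` table and the `reverse_substitute` helper are hypothetical, and the sketch assumes that reverse substitution means replacing known expansions with their abbreviations.

```python
import re

# Hypothetical expansion -> abbreviation table, for illustration only.
# The dictionary used to build the real dataset is not shown in this document.
ABBREVIATIONS = {
    "blood pressure": "bp",
    "shortness of breath": "sob",
    "patient": "pt",
}


def reverse_substitute(snippet, table=ABBREVIATIONS):
    """Return a record pairing `snippet` with an abbreviated version of it."""
    # Put longer expansions first so multi-word phrases take priority.
    phrases = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    # Replace every matched expansion with its abbreviation.
    abbreviated = pattern.sub(lambda m: table[m.group(0).lower()], snippet)
    # Keys mirror the feature names listed in the feature structure.
    return {
        "abbreviated_snippet": abbreviated,
        "original_snippet": snippet,
    }


example = reverse_substitute("The patient reported shortness of breath.")
print(example["abbreviated_snippet"])  # The pt reported sob.
```

A model for abbreviation expansion would then be trained in the opposite direction, taking `abbreviated_snippet` as input and `original_snippet` as the target.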
  • Feature documentation:
Feature              Class         Shape  Dtype   Description
                     FeaturesDict
abbreviated_snippet  Text                 string
original_snippet     Text                 string
  • Citation:

c4_wsrs/default (default config)