c4_wsrs

  • Description:

A medical abbreviation expansion dataset built by applying web-scale reverse substitution (wsrs) to the C4 dataset, a colossal, cleaned version of Common Crawl's web crawl corpus.

The original source is the Common Crawl dataset: https://commoncrawl.org
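
To make the idea of reverse substitution concrete, here is a minimal sketch, not the dataset's actual pipeline. It assumes reverse substitution means replacing known expansions in ordinary text with their abbreviations, so that each snippet yields an (abbreviated, original) pair. The `REVERSE_MAP` dictionary and the `reverse_substitute` helper are hypothetical names introduced only for illustration; the real dictionary and any sampling rules are not described on this page.

```python
import re

# Hypothetical expansion -> abbreviation pairs, for illustration only.
# The dictionary used to build c4_wsrs is not shown on this page.
REVERSE_MAP = {
    "blood pressure": "bp",
    "patient": "pt",
    "history": "hx",
}


def reverse_substitute(original_snippet, reverse_map=REVERSE_MAP):
    """Replace every known expansion with its abbreviation.

    Returns a dict with the same two keys as the dataset's features.
    """
    # Try longer phrases first so multi-word expansions are matched whole.
    phrases = sorted(reverse_map, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    abbreviated = pattern.sub(
        lambda m: reverse_map[m.group(0).lower()], original_snippet
    )
    return {
        "abbreviated_snippet": abbreviated,
        "original_snippet": original_snippet,
    }


pair = reverse_substitute("The patient has a history of high blood pressure.")
print(pair["abbreviated_snippet"])  # The pt has a hx of high bp.
```

A model trained on such pairs learns the opposite direction: given `abbreviated_snippet`, recover `original_snippet`.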

  • Splits:
Split          Examples
'train'        9,575,852
'validation'   991,422
  • Feature structure:
FeaturesDict({
    'abbreviated_snippet': Text(shape=(), dtype=object),
    'original_snippet': Text(shape=(), dtype=object),
})
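
The two text features can be read as in the sketch below. It assumes the dataset has already been prepared locally, which may require substantial storage and processing; that step is not covered here. `decode_example` and `iter_pairs` are hypothetical helper names, while `tfds.load` and `tfds.as_numpy` are the standard TensorFlow Datasets entry points.

```python
def decode_example(example):
    """Return both snippets as Python strings.

    Text features typically arrive as bytes when iterating with
    tfds.as_numpy, so decode them as UTF-8.
    """
    def to_str(value):
        return value.decode("utf-8") if isinstance(value, bytes) else str(value)

    keys = ("abbreviated_snippet", "original_snippet")
    return {key: to_str(example[key]) for key in keys}


def iter_pairs(split="validation"):
    """Yield decoded examples from one split of c4_wsrs.

    Assumes the dataset is already prepared in the local TFDS data directory.
    """
    # Imported lazily so decode_example can be used without TFDS installed.
    import tensorflow_datasets as tfds

    dataset = tfds.load("c4_wsrs", split=split)
    for example in tfds.as_numpy(dataset):
        yield decode_example(example)


# A hand-written record shaped like the FeaturesDict above (not real data).
sample = {
    "abbreviated_snippet": b"The pt has a hx of high bp.",
    "original_snippet": b"The patient has a history of high blood pressure.",
}
decoded = decode_example(sample)
```

Calling `next(iter_pairs("validation"))` would return the first decoded pair once the dataset is available locally.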
  • Feature documentation:
Feature              Class         Shape  Dtype   Description
                     FeaturesDict
abbreviated_snippet  Text                 object
original_snippet     Text                 object
  • Citation:

c4_wsrs/default (default config)