  • Description:

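A medical abbreviation expansion dataset built by applying web-scale reverse substitution (WSRS) to C4, a colossal, cleaned version of Common Crawl's web crawl corpus. Each example pairs an original text snippet with a version of that snippet in which terms have been replaced by their abbreviations.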
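To make the idea of reverse substitution concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the procedure used to build this dataset: the `ABBREVIATIONS` dictionary and the `reverse_substitute` helper are hypothetical, and the real pipeline may choose substitutions very differently. Only the two output field names come from the feature structure of this dataset.

```python
import re

# Hypothetical long-form -> abbreviation dictionary (illustrative only;
# not the dictionary used to build the dataset).
ABBREVIATIONS = {
    "blood pressure": "BP",
    "shortness of breath": "SOB",
    "patient": "pt",
}


def reverse_substitute(original_snippet, abbreviations=ABBREVIATIONS):
    """Replace known long forms with abbreviations and return a training pair."""
    # Try longer phrases first so multi-word long forms are matched whole.
    phrases = sorted(abbreviations, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        flags=re.IGNORECASE,
    )
    abbreviated = pattern.sub(
        lambda m: abbreviations[m.group(0).lower()], original_snippet
    )
    return {
        "abbreviated_snippet": abbreviated,
        "original_snippet": original_snippet,
    }


pair = reverse_substitute(
    "The patient reported shortness of breath and high blood pressure."
)
print(pair["abbreviated_snippet"])  # The pt reported SOB and high BP.
```

A model trained on such pairs learns the inverse mapping: given the abbreviated snippet, recover the original.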

The original source is the Common Crawl dataset:

  • Splits:
Split         Examples
'train'       9,575,852
'validation'  991,422
  • Feature structure:
    'abbreviated_snippet': Text(shape=(), dtype=object),
    'original_snippet': Text(shape=(), dtype=object),
  • Feature documentation:
Feature              Class  Shape  Dtype   Description
abbreviated_snippet  Text          object
original_snippet     Text          object
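Both features listed above are scalar text fields. When such fields are read as NumPy values, they typically arrive as UTF-8 encoded `bytes` rather than `str`. That is an assumption about the loading path, not something stated here. The sketch below uses a hypothetical `decode_example` helper and a made-up record (not real data from the dataset) to show one way to normalize an example before use.

```python
def decode_example(example):
    """Convert bytes-valued fields to str; leave other values untouched."""
    return {
        key: value.decode("utf-8") if isinstance(value, bytes) else value
        for key, value in example.items()
    }


# Hypothetical record shaped like the feature structure (not real data).
raw = {
    "abbreviated_snippet": b"pt c/o SOB",
    "original_snippet": b"patient complains of shortness of breath",
}

record = decode_example(raw)
print(record["abbreviated_snippet"], "->", record["original_snippet"])
```

The helper is deliberately tolerant: values that are already `str` pass through unchanged, so it can be applied regardless of how the examples were loaded.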
c4_wsrs/default (default config)