- Description:
This version of the CivilComments dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and the other tags are each a value between 0 and 1 indicating the fraction of annotators that assigned that attribute to the comment text.
The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but it consists only of the subset of the data that has them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.
The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 to 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text and some associated metadata such as article IDs, publication IDs, timestamps, and commenter-generated "civility" labels, but does not include user IDs. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, and covert offensiveness. This dataset is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. It is released under CC0, as is the underlying comment text.
For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
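As a quick orientation, the sketch below loads the default config with TFDS and prints the annotator-fraction labels for a single comment; it assumes `tensorflow_datasets` is installed and uses only the feature names documented further down.

```python
import tensorflow_datasets as tfds

# Load the default CivilComments config (basic seven labels only).
ds = tfds.load('civil_comments', split='train')

# Each example carries the comment text plus per-label annotator fractions in [0, 1].
for example in ds.take(1):
    text = example['text'].numpy().decode('utf-8')
    labels = {k: float(example[k]) for k in
              ('toxicity', 'severe_toxicity', 'obscene', 'threat',
               'insult', 'identity_attack', 'sexual_explicit')}
    print(labels)
    print(text[:120])
```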
Homepage: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data
Source code:
tfds.text.CivilComments
Versions:
- 1.0.0: Initial full release.
- 1.0.1: Added a unique id for each comment.
- 1.1.0: Added CivilCommentsCovert config.
- 1.1.1: Added CivilCommentsCovert config with correct checksum.
- 1.1.2: Added separate citation for CivilCommentsCovert dataset.
- 1.1.3: Corrected id types from float to string.
- 1.2.0: Add toxic spans, context, and parent comment text features.
- 1.2.1: Fix incorrect formatting in context splits.
- 1.2.2: Update to reflect context only having a train split.
- 1.2.3: Add warning to CivilCommentsCovert as we fix a data issue.
- 1.2.4 (default): Add publication IDs and comment timestamps.
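If reproducibility matters, the version can be pinned explicitly in the name passed to the loader; a minimal sketch requesting the default 1.2.4 release:

```python
import tensorflow_datasets as tfds

# Pin the config and version so later releases do not silently change the data.
ds = tfds.load('civil_comments/CivilComments:1.2.4', split='train')
```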
Download size:
427.41 MiB
Figure (tfds.show_examples): Not supported.
civil_comments/CivilComments (default config)
Config description: The CivilComments set here includes all the data, but only the basic seven labels (toxicity, severe_toxicity, obscene, threat, insult, identity_attack, and sexual_explicit).
Dataset size:
1.54 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 97,320 |
'train' | 1,804,874 |
'validation' | 97,320 |
- Feature structure:
FeaturesDict({
'article_id': int32,
'created_date': string,
'id': string,
'identity_attack': float32,
'insult': float32,
'obscene': float32,
'parent_id': int32,
'parent_text': Text(shape=(), dtype=string),
'publication_id': string,
'severe_toxicity': float32,
'sexual_explicit': float32,
'text': Text(shape=(), dtype=string),
'threat': float32,
'toxicity': float32,
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
article_id | Tensor | | int32 | |
created_date | Tensor | | string | |
id | Tensor | | string | |
identity_attack | Tensor | | float32 | |
insult | Tensor | | float32 | |
obscene | Tensor | | float32 | |
parent_id | Tensor | | int32 | |
parent_text | Text | | string | |
publication_id | Tensor | | string | |
severe_toxicity | Tensor | | float32 | |
sexual_explicit | Tensor | | float32 | |
text | Text | | string | |
threat | Tensor | | float32 | |
toxicity | Tensor | | float32 | |
Supervised keys (See as_supervised doc): ('text', 'toxicity')
Examples (tfds.as_dataframe):
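For instance, with `as_supervised=True` the loader yields `(text, toxicity)` pairs directly, which is convenient for training; the sketch below binarizes the annotator fraction at 0.5, a common convention rather than anything mandated by the dataset.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# as_supervised=True returns (text, toxicity) tuples per the supervised keys above.
train = tfds.load('civil_comments', split='train', as_supervised=True)

# Binarize the annotator fraction at 0.5 (illustrative threshold only).
train = train.map(lambda text, toxicity: (text, tf.cast(toxicity >= 0.5, tf.int32)))

for text, label in train.take(3):
    print(int(label), text.numpy()[:60])
```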
- Citation:
@article{DBLP:journals/corr/abs-1903-04561,
author = {Daniel Borkan and
Lucas Dixon and
Jeffrey Sorensen and
Nithum Thain and
Lucy Vasserman},
title = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text
Classification},
journal = {CoRR},
volume = {abs/1903.04561},
year = {2019},
url = {http://arxiv.org/abs/1903.04561},
archivePrefix = {arXiv},
eprint = {1903.04561},
timestamp = {Sun, 31 Mar 2019 19:01:24 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
civil_comments/CivilCommentsIdentities
Config description: The CivilCommentsIdentities set here includes an extended set of identity labels in addition to the basic seven labels. However, it only includes the subset (roughly a quarter) of the data with all these features.
Dataset size:
654.97 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 21,577 |
'train' | 405,130 |
'validation' | 21,293 |
- Feature structure:
FeaturesDict({
'article_id': int32,
'asian': float32,
'atheist': float32,
'bisexual': float32,
'black': float32,
'buddhist': float32,
'christian': float32,
'created_date': string,
'female': float32,
'heterosexual': float32,
'hindu': float32,
'homosexual_gay_or_lesbian': float32,
'id': string,
'identity_attack': float32,
'insult': float32,
'intellectual_or_learning_disability': float32,
'jewish': float32,
'latino': float32,
'male': float32,
'muslim': float32,
'obscene': float32,
'other_disability': float32,
'other_gender': float32,
'other_race_or_ethnicity': float32,
'other_religion': float32,
'other_sexual_orientation': float32,
'parent_id': int32,
'parent_text': Text(shape=(), dtype=string),
'physical_disability': float32,
'psychiatric_or_mental_illness': float32,
'publication_id': string,
'severe_toxicity': float32,
'sexual_explicit': float32,
'text': Text(shape=(), dtype=string),
'threat': float32,
'toxicity': float32,
'transgender': float32,
'white': float32,
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
article_id | Tensor | | int32 | |
asian | Tensor | | float32 | |
atheist | Tensor | | float32 | |
bisexual | Tensor | | float32 | |
black | Tensor | | float32 | |
buddhist | Tensor | | float32 | |
christian | Tensor | | float32 | |
created_date | Tensor | | string | |
female | Tensor | | float32 | |
heterosexual | Tensor | | float32 | |
hindu | Tensor | | float32 | |
homosexual_gay_or_lesbian | Tensor | | float32 | |
id | Tensor | | string | |
identity_attack | Tensor | | float32 | |
insult | Tensor | | float32 | |
intellectual_or_learning_disability | Tensor | | float32 | |
jewish | Tensor | | float32 | |
latino | Tensor | | float32 | |
male | Tensor | | float32 | |
muslim | Tensor | | float32 | |
obscene | Tensor | | float32 | |
other_disability | Tensor | | float32 | |
other_gender | Tensor | | float32 | |
other_race_or_ethnicity | Tensor | | float32 | |
other_religion | Tensor | | float32 | |
other_sexual_orientation | Tensor | | float32 | |
parent_id | Tensor | | int32 | |
parent_text | Text | | string | |
physical_disability | Tensor | | float32 | |
psychiatric_or_mental_illness | Tensor | | float32 | |
publication_id | Tensor | | string | |
severe_toxicity | Tensor | | float32 | |
sexual_explicit | Tensor | | float32 | |
text | Text | | string | |
threat | Tensor | | float32 | |
toxicity | Tensor | | float32 | |
transgender | Tensor | | float32 | |
white | Tensor | | float32 | |
Supervised keys (See as_supervised doc): ('text', 'toxicity')
Examples (tfds.as_dataframe):
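Because the identity tags are also annotator fractions, a common use of this config is slicing the data by identity mention to measure subgroup performance; the sketch below keeps test comments where at least half the annotators tagged a 'muslim' identity mention (identity and threshold chosen purely for illustration).

```python
import tensorflow_datasets as tfds

ds = tfds.load('civil_comments/CivilCommentsIdentities', split='test')

# Keep comments where >= 50% of annotators marked a 'muslim' identity mention.
subgroup = ds.filter(lambda ex: ex['muslim'] >= 0.5)

for ex in subgroup.take(2):
    print(float(ex['toxicity']), ex['text'].numpy()[:60])
```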
- Citation:
@article{DBLP:journals/corr/abs-1903-04561,
author = {Daniel Borkan and
Lucas Dixon and
Jeffrey Sorensen and
Nithum Thain and
Lucy Vasserman},
title = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text
Classification},
journal = {CoRR},
volume = {abs/1903.04561},
year = {2019},
url = {http://arxiv.org/abs/1903.04561},
archivePrefix = {arXiv},
eprint = {1903.04561},
timestamp = {Sun, 31 Mar 2019 19:01:24 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
civil_comments/CivilCommentsCovert
- Config description: WARNING: there's a potential data quality issue with CivilCommentsCovert that we're actively working on fixing (06/28/22); the underlying data may change!
The CivilCommentsCovert set is a subset of CivilCommentsIdentities with ~20% of the train and test splits further annotated for covert offensiveness, in addition to the toxicity and identity labels. Raters were asked to categorize comments as explicitly offensive, implicitly offensive, not offensive, or not sure, and to indicate whether they contained different types of covert offensiveness. The full annotation procedure is detailed in a forthcoming paper at https://sites.google.com/corp/view/hciandnlp/accepted-papers
Dataset size:
97.83 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 2,455 |
'train' | 48,074 |
- Feature structure:
FeaturesDict({
'article_id': int32,
'asian': float32,
'atheist': float32,
'bisexual': float32,
'black': float32,
'buddhist': float32,
'christian': float32,
'covert_emoticons_emojis': float32,
'covert_humor': float32,
'covert_masked_harm': float32,
'covert_microaggression': float32,
'covert_obfuscation': float32,
'covert_political': float32,
'covert_sarcasm': float32,
'created_date': string,
'explicitly_offensive': float32,
'female': float32,
'heterosexual': float32,
'hindu': float32,
'homosexual_gay_or_lesbian': float32,
'id': string,
'identity_attack': float32,
'implicitly_offensive': float32,
'insult': float32,
'intellectual_or_learning_disability': float32,
'jewish': float32,
'latino': float32,
'male': float32,
'muslim': float32,
'not_offensive': float32,
'not_sure_offensive': float32,
'obscene': float32,
'other_disability': float32,
'other_gender': float32,
'other_race_or_ethnicity': float32,
'other_religion': float32,
'other_sexual_orientation': float32,
'parent_id': int32,
'parent_text': Text(shape=(), dtype=string),
'physical_disability': float32,
'psychiatric_or_mental_illness': float32,
'publication_id': string,
'severe_toxicity': float32,
'sexual_explicit': float32,
'text': Text(shape=(), dtype=string),
'threat': float32,
'toxicity': float32,
'transgender': float32,
'white': float32,
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
article_id | Tensor | | int32 | |
asian | Tensor | | float32 | |
atheist | Tensor | | float32 | |
bisexual | Tensor | | float32 | |
black | Tensor | | float32 | |
buddhist | Tensor | | float32 | |
christian | Tensor | | float32 | |
covert_emoticons_emojis | Tensor | | float32 | |
covert_humor | Tensor | | float32 | |
covert_masked_harm | Tensor | | float32 | |
covert_microaggression | Tensor | | float32 | |
covert_obfuscation | Tensor | | float32 | |
covert_political | Tensor | | float32 | |
covert_sarcasm | Tensor | | float32 | |
created_date | Tensor | | string | |
explicitly_offensive | Tensor | | float32 | |
female | Tensor | | float32 | |
heterosexual | Tensor | | float32 | |
hindu | Tensor | | float32 | |
homosexual_gay_or_lesbian | Tensor | | float32 | |
id | Tensor | | string | |
identity_attack | Tensor | | float32 | |
implicitly_offensive | Tensor | | float32 | |
insult | Tensor | | float32 | |
intellectual_or_learning_disability | Tensor | | float32 | |
jewish | Tensor | | float32 | |
latino | Tensor | | float32 | |
male | Tensor | | float32 | |
muslim | Tensor | | float32 | |
not_offensive | Tensor | | float32 | |
not_sure_offensive | Tensor | | float32 | |
obscene | Tensor | | float32 | |
other_disability | Tensor | | float32 | |
other_gender | Tensor | | float32 | |
other_race_or_ethnicity | Tensor | | float32 | |
other_religion | Tensor | | float32 | |
other_sexual_orientation | Tensor | | float32 | |
parent_id | Tensor | | int32 | |
parent_text | Text | | string | |
physical_disability | Tensor | | float32 | |
psychiatric_or_mental_illness | Tensor | | float32 | |
publication_id | Tensor | | string | |
severe_toxicity | Tensor | | float32 | |
sexual_explicit | Tensor | | float32 | |
text | Text | | string | |
threat | Tensor | | float32 | |
toxicity | Tensor | | float32 | |
transgender | Tensor | | float32 | |
white | Tensor | | float32 | |
Supervised keys (See as_supervised doc): ('text', 'toxicity')
Examples (tfds.as_dataframe):
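The covert labels are likewise fractions of raters, so each example can be inspected for both the overall offensiveness judgments and the specific covert signals; a small sketch:

```python
import tensorflow_datasets as tfds

ds = tfds.load('civil_comments/CivilCommentsCovert', split='train')

covert_keys = ('covert_emoticons_emojis', 'covert_humor', 'covert_masked_harm',
               'covert_microaggression', 'covert_obfuscation', 'covert_political',
               'covert_sarcasm')

for ex in ds.take(1):
    print('implicitly_offensive:', float(ex['implicitly_offensive']))
    for key in covert_keys:
        print(key, float(ex[key]))
```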
- Citation:
@inproceedings{lees-etal-2021-capturing,
title = "Capturing Covertly Toxic Speech via Crowdsourcing",
author = "Lees, Alyssa and
Borkan, Daniel and
Kivlichan, Ian and
Nario, Jorge and
Goyal, Tesh",
booktitle = "Proceedings of the First Workshop on Bridging Human{--}Computer Interaction and Natural Language Processing",
month = apr,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.hcinlp-1.3",
pages = "14--20"
}
civil_comments/CivilCommentsToxicSpans
Config description: The CivilComments Toxic Spans set is a subset of CivilComments that is labeled at the span level: the indices of all character (Unicode code point) boundaries that were tagged as toxic by a majority of the annotators are returned in a 'spans' feature.
Dataset size:
5.81 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 2,000 |
'train' | 7,939 |
'validation' | 682 |
- Feature structure:
FeaturesDict({
'article_id': int32,
'created_date': string,
'id': string,
'parent_id': int32,
'parent_text': Text(shape=(), dtype=string),
'publication_id': string,
'spans': Tensor(shape=(None,), dtype=int32),
'text': Text(shape=(), dtype=string),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
article_id | Tensor | | int32 | |
created_date | Tensor | | string | |
id | Tensor | | string | |
parent_id | Tensor | | int32 | |
parent_text | Text | | string | |
publication_id | Tensor | | string | |
spans | Tensor | (None,) | int32 | |
text | Text | | string | |
Supervised keys (See as_supervised doc): ('text', 'spans')
Examples (tfds.as_dataframe):
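Since 'spans' holds the character offsets (Unicode code points) judged toxic by a majority of annotators, those offsets index directly into the decoded comment text; a hedged sketch of recovering the flagged characters:

```python
import tensorflow_datasets as tfds

ds = tfds.load('civil_comments/CivilCommentsToxicSpans', split='train')

for ex in ds.take(1):
    text = ex['text'].numpy().decode('utf-8')
    toxic_offsets = set(int(i) for i in ex['spans'].numpy())
    # Uppercase the characters whose offsets were tagged as toxic.
    marked = ''.join(c.upper() if i in toxic_offsets else c for i, c in enumerate(text))
    print(marked[:200])
```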
- Citation:
@inproceedings{pavlopoulos-etal-2021-semeval,
title = "{S}em{E}val-2021 Task 5: Toxic Spans Detection",
author = "Pavlopoulos, John and Sorensen, Jeffrey and Laugier, L{'e}o and Androutsopoulos, Ion",
booktitle = "Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.semeval-1.6",
doi = "10.18653/v1/2021.semeval-1.6",
pages = "59--69",
}
civil_comments/CivilCommentsInContext
Config description: The CivilComments in Context set is a subset of CivilComments that was labeled with the parent_text made available to the annotators. It includes a contextual_toxicity feature.
Dataset size:
9.63 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'train' | 9,969 |
- Feature structure:
FeaturesDict({
'article_id': int32,
'contextual_toxicity': float32,
'created_date': string,
'id': string,
'identity_attack': float32,
'insult': float32,
'obscene': float32,
'parent_id': int32,
'parent_text': Text(shape=(), dtype=string),
'publication_id': string,
'severe_toxicity': float32,
'sexual_explicit': float32,
'text': Text(shape=(), dtype=string),
'threat': float32,
'toxicity': float32,
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
article_id | Tensor | | int32 | |
contextual_toxicity | Tensor | | float32 | |
created_date | Tensor | | string | |
id | Tensor | | string | |
identity_attack | Tensor | | float32 | |
insult | Tensor | | float32 | |
obscene | Tensor | | float32 | |
parent_id | Tensor | | int32 | |
parent_text | Text | | string | |
publication_id | Tensor | | string | |
severe_toxicity | Tensor | | float32 | |
sexual_explicit | Tensor | | float32 | |
text | Text | | string | |
threat | Tensor | | float32 | |
toxicity | Tensor | | float32 | |
Supervised keys (See as_supervised doc): ('text', 'toxicity')
Examples (tfds.as_dataframe):
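Each example here carries both a toxicity value and the contextual_toxicity value described above (the latter collected with the parent comment visible to annotators), so comparing the two per comment is straightforward; a minimal sketch:

```python
import tensorflow_datasets as tfds

ds = tfds.load('civil_comments/CivilCommentsInContext', split='train')

for ex in ds.take(3):
    # contextual_toxicity was rated with parent_text available to annotators.
    print(f"toxicity={float(ex['toxicity']):.2f}  "
          f"contextual_toxicity={float(ex['contextual_toxicity']):.2f}")
```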
- Citation:
@misc{pavlopoulos2020toxicity,
title={Toxicity Detection: Does Context Really Matter?},
author={John Pavlopoulos and Jeffrey Sorensen and Lucas Dixon and Nithum Thain and Ion Androutsopoulos},
year={2020}, eprint={2006.00998}, archivePrefix={arXiv}, primaryClass={cs.CL}
}