- Description:
This version of the CivilComments dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and the other tags are each a value between 0 and 1 indicating the fraction of annotators that assigned that attribute to the comment text.
The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but it consists only of the subset of the data that has them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.
The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 to 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text and some associated metadata such as article IDs, publication IDs, timestamps, and commenter-generated "civility" labels, but does not include user IDs. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, and covert offensiveness. This dataset is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. It is released under CC0, as is the underlying comment text.
For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.
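As a quick orientation, the sketch below loads the default config with TFDS and prints the annotator-fraction labels for a single comment; it assumes `tensorflow_datasets` is installed and uses only the feature names documented further down.

```python
import tensorflow_datasets as tfds

# Load the default CivilComments config (basic seven labels only).
ds = tfds.load('civil_comments', split='train')

# Each example carries the comment text plus per-label annotator fractions in [0, 1].
for example in ds.take(1):
    text = example['text'].numpy().decode('utf-8')
    labels = {k: float(example[k]) for k in
              ('toxicity', 'severe_toxicity', 'obscene', 'threat',
               'insult', 'identity_attack', 'sexual_explicit')}
    print(labels)
    print(text[:120])
```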
Homepage: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data
Source code:
tfds.text.CivilComments
Versions:
- 1.0.0: Initial full release.
- 1.0.1: Added a unique id for each comment.
- 1.1.0: Added CivilCommentsCovert config.
- 1.1.1: Added CivilCommentsCovert config with correct checksum.
- 1.1.2: Added separate citation for CivilCommentsCovert dataset.
- 1.1.3: Corrected id types from float to string.
- 1.2.0: Add toxic spans, context, and parent comment text features.
- 1.2.1: Fix incorrect formatting in context splits.
- 1.2.2: Update to reflect context only having a train split.
- 1.2.3: Add warning to CivilCommentsCovert as we fix a data issue.
- 1.2.4 (default): Add publication IDs and comment timestamps.
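If reproducibility matters, the version can be pinned explicitly in the name passed to the loader; a minimal sketch requesting the default 1.2.4 release:

```python
import tensorflow_datasets as tfds

# Pin the config and version so later releases do not silently change the data.
ds = tfds.load('civil_comments/CivilComments:1.2.4', split='train')
```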
Download size:
427.41 MiB
Figure (tfds.show_examples): Not supported.
civil_comments/CivilComments (default config)
Config description: The CivilComments set here includes all the data, but only the basic seven labels (toxicity, severe_toxicity, obscene, threat, insult, identity_attack, and sexual_explicit).
Dataset size:
1.54 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 97,320 |
'train' | 1,804,874 |
'validation' | 97,320 |
- Feature structure:
FeaturesDict({
'article_id': int32,
'created_date': string,
'id': string,
'identity_attack': float32,
'insult': float32,
'obscene': float32,
'parent_id': int32,
'parent_text': Text(shape=(), dtype=string),
'publication_id': string,
'severe_toxicity': float32,
'sexual_explicit': float32,
'text': Text(shape=(), dtype=string),
'threat': float32,
'toxicity': float32,
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
article_id | Tensor | | int32 | |
created_date | Tensor | | string | |
id | Tensor | | string | |
identity_attack | Tensor | | float32 | |
insult | Tensor | | float32 | |
obscene | Tensor | | float32 | |
parent_id | Tensor | | int32 | |
parent_text | Text | | string | |
publication_id | Tensor | | string | |
severe_toxicity | Tensor | | float32 | |
sexual_explicit | Tensor | | float32 | |
text | Text | | string | |
threat | Tensor | | float32 | |
toxicity | Tensor | | float32 | |
Supervised keys (See as_supervised doc): ('text', 'toxicity')
Examples (tfds.as_dataframe):
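For instance, with `as_supervised=True` the loader yields `(text, toxicity)` pairs directly, which is convenient for training; the sketch below binarizes the annotator fraction at 0.5, a common convention rather than anything mandated by the dataset.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# as_supervised=True returns (text, toxicity) tuples per the supervised keys above.
train = tfds.load('civil_comments', split='train', as_supervised=True)

# Binarize the annotator fraction at 0.5 (illustrative threshold only).
train = train.map(lambda text, toxicity: (text, tf.cast(toxicity >= 0.5, tf.int32)))

for text, label in train.take(3):
    print(int(label), text.numpy()[:60])
```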
- Citation:
@article{DBLP:journals/corr/abs-1903-04561,
author = {Daniel Borkan and
Lucas Dixon and
Jeffrey Sorensen and
Nithum Thain and
Lucy Vasserman},
title = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text
Classification},
journal = {CoRR},
volume = {abs/1903.04561},
year = {2019},
url = {http://arxiv.org/abs/1903.04561},
archivePrefix = {arXiv},
eprint = {1903.04561},
timestamp = {Sun, 31 Mar 2019 19:01:24 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
civil_comments/CivilCommentsIdentities
Config description: The CivilCommentsIdentities set here includes an extended set of identity labels in addition to the basic seven labels. However, it only includes the subset (roughly a quarter) of the data with all these features.
Dataset size:
654.97 MiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 21,577 |
'train' | 405,130 |
'validation' | 21,293 |
- Feature structure:
FeaturesDict({
'article_id': int32,
'asian': float32,
'atheist': float32,
'bisexual': float32,
'black': float32,
'buddhist': float32,
'christian': float32,
'created_date': string,
'female': float32,
'heterosexual': float32,
'hindu': float32,
'homosexual_gay_or_lesbian': float32,
'id': string,
'identity_attack': float32,
'insult': float32,
'intellectual_or_learning_disability': float32,
'jewish': float32,
'latino': float32,
'male': float32,
'muslim': float32,
'obscene': float32,
'other_disability': float32,
'other_gender': float32,
'other_race_or_ethnicity': float32,
'other_religion': float32,
'other_sexual_orientation': float32,
'parent_id': int32,
'parent_text': Text(shape=(), dtype=string),
'physical_disability': float32,
'psychiatric_or_mental_illness': float32,
'publication_id': string,
'severe_toxicity': float32,
'sexual_explicit': float32,
'text': Text(shape=(), dtype=string),
'threat': float32,
'toxicity': float32,
'transgender': float32,
'white': float32,
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
article_id | Tensor | | int32 | |
asian | Tensor | | float32 | |
atheist | Tensor | | float32 | |
bisexual | Tensor | | float32 | |
black | Tensor | | float32 | |
buddhist | Tensor | | float32 | |
christian | Tensor | | float32 | |
created_date | Tensor | | string | |
female | Tensor | | float32 | |
heterosexual | Tensor | | float32 | |
hindu | Tensor | | float32 | |
homosexual_gay_or_lesbian | Tensor | | float32 | |
id | Tensor | | string | |
identity_attack | Tensor | | float32 | |
insult | Tensor | | float32 | |
intellectual_or_learning_disability | Tensor | | float32 | |
jewish | Tensor | | float32 | |
latino | Tensor | | float32 | |
male | Tensor | | float32 | |
muslim | Tensor | | float32 | |
obscene | Tensor | | float32 | |
other_disability | Tensor | | float32 | |
other_gender | Tensor | | float32 | |
other_race_or_ethnicity | Tensor | | float32 | |
other_religion | Tensor | | float32 | |
other_sexual_orientation | Tensor | | float32 | |
parent_id | Tensor | | int32 | |
parent_text | Text | | string | |
physical_disability | Tensor | | float32 | |
psychiatric_or_mental_illness | Tensor | | float32 | |
publication_id | Tensor | | string | |
severe_toxicity | Tensor | | float32 | |
sexual_explicit | Tensor | | float32 | |
text | Text | | string | |
threat | Tensor | | float32 | |
toxicity | Tensor | | float32 | |
transgender | Tensor | | float32 | |
white | Tensor | | float32 | |
Supervised keys (See as_supervised doc): ('text', 'toxicity')
Examples (tfds.as_dataframe):
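Because the identity tags are also annotator fractions, a common use of this config is slicing the data by identity mention to measure subgroup performance; the sketch below keeps test comments where at least half the annotators tagged a 'muslim' identity mention (identity and threshold chosen purely for illustration).

```python
import tensorflow_datasets as tfds

ds = tfds.load('civil_comments/CivilCommentsIdentities', split='test')

# Keep comments where >= 50% of annotators marked a 'muslim' identity mention.
subgroup = ds.filter(lambda ex: ex['muslim'] >= 0.5)

for ex in subgroup.take(2):
    print(float(ex['toxicity']), ex['text'].numpy()[:60])
```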
- Citation:
@article{DBLP:journals/corr/abs-1903-04561,
author = {Daniel Borkan and
Lucas Dixon and
Jeffrey Sorensen and
Nithum Thain and
Lucy Vasserman},
title = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text
Classification},
journal = {CoRR},
volume = {abs/1903.04561},
year = {2019},
url = {http://arxiv.org/abs/1903.04561},
archivePrefix = {arXiv},
eprint = {1903.04561},
timestamp = {Sun, 31 Mar 2019 19:01:24 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
civil_comments/CivilCommentsCovert
- Config description: WARNING: there's a potential data quality issue with CivilCommentsCovert that we're actively working on fixing (06/28/22); the underlying data may change!
The CivilCommentsCovert set is a subset of CivilCommentsIdentities with ~20% of the train and test splits further annotated for covert offensiveness, in addition to the toxicity and identity labels. Raters were asked to categorize comments as explicitly offensive, implicitly offensive, not offensive, or not sure, and to indicate whether they contained different types of covert offensiveness. The full annotation procedure is detailed in a forthcoming paper at https://sites.google.com/corp/view/hciandnlp/accepted-papers
Dataset size:
97.83 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 2,455 |
'train' | 48,074 |
- Feature structure:
FeaturesDict({
'article_id': int32,
'asian': float32,
'atheist': float32,
'bisexual': float32,
'black': float32,
'buddhist': float32,
'christian': float32,
'covert_emoticons_emojis': float32,
'covert_humor': float32,
'covert_masked_harm': float32,
'covert_microaggression': float32,
'covert_obfuscation': float32,
'covert_political': float32,
'covert_sarcasm': float32,
'created_date': string,
'explicitly_offensive': float32,
'female': float32,
'heterosexual': float32,
'hindu': float32,
'homosexual_gay_or_lesbian': float32,
'id': string,
'identity_attack': float32,
'implicitly_offensive': float32,
'insult': float32,
'intellectual_or_learning_disability': float32,
'jewish': float32,
'latino': float32,
'male': float32,
'muslim': float32,
'not_offensive': float32,
'not_sure_offensive': float32,
'obscene': float32,
'other_disability': float32,
'other_gender': float32,
'other_race_or_ethnicity': float32,
'other_religion': float32,
'other_sexual_orientation': float32,
'parent_id': int32,
'parent_text': Text(shape=(), dtype=string),
'physical_disability': float32,
'psychiatric_or_mental_illness': float32,
'publication_id': string,
'severe_toxicity': float32,
'sexual_explicit': float32,
'text': Text(shape=(), dtype=string),
'threat': float32,
'toxicity': float32,
'transgender': float32,
'white': float32,
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
article_id | Tensor | | int32 | |
asian | Tensor | | float32 | |
atheist | Tensor | | float32 | |
bisexual | Tensor | | float32 | |
black | Tensor | | float32 | |
buddhist | Tensor | | float32 | |
christian | Tensor | | float32 | |
covert_emoticons_emojis | Tensor | | float32 | |
covert_humor | Tensor | | float32 | |
covert_masked_harm | Tensor | | float32 | |
covert_microaggression | Tensor | | float32 | |
covert_obfuscation | Tensor | | float32 | |
covert_political | Tensor | | float32 | |
covert_sarcasm | Tensor | | float32 | |
created_date | Tensor | | string | |
explicitly_offensive | Tensor | | float32 | |
female | Tensor | | float32 | |
heterosexual | Tensor | | float32 | |
hindu | Tensor | | float32 | |
homosexual_gay_or_lesbian | Tensor | | float32 | |
id | Tensor | | string | |
identity_attack | Tensor | | float32 | |
implicitly_offensive | Tensor | | float32 | |
insult | Tensor | | float32 | |
intellectual_or_learning_disability | Tensor | | float32 | |
jewish | Tensor | | float32 | |
latino | Tensor | | float32 | |
male | Tensor | | float32 | |
muslim | Tensor | | float32 | |
not_offensive | Tensor | | float32 | |
not_sure_offensive | Tensor | | float32 | |
obscene | Tensor | | float32 | |
other_disability | Tensor | | float32 | |
other_gender | Tensor | | float32 | |
other_race_or_ethnicity | Tensor | | float32 | |
other_religion | Tensor | | float32 | |
other_sexual_orientation | Tensor | | float32 | |
parent_id | Tensor | | int32 | |
parent_text | Text | | string | |
physical_disability | Tensor | | float32 | |
psychiatric_or_mental_illness | Tensor | | float32 | |
publication_id | Tensor | | string | |
severe_toxicity | Tensor | | float32 | |
sexual_explicit | Tensor | | float32 | |
text | Text | | string | |
threat | Tensor | | float32 | |
toxicity | Tensor | | float32 | |
transgender | Tensor | | float32 | |
white | Tensor | | float32 | |
Supervised keys (See as_supervised doc): ('text', 'toxicity')
Examples (tfds.as_dataframe):
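The covert labels are likewise fractions of raters, so each example can be inspected for both the overall offensiveness judgments and the specific covert signals; a small sketch:

```python
import tensorflow_datasets as tfds

ds = tfds.load('civil_comments/CivilCommentsCovert', split='train')

covert_keys = ('covert_emoticons_emojis', 'covert_humor', 'covert_masked_harm',
               'covert_microaggression', 'covert_obfuscation', 'covert_political',
               'covert_sarcasm')

for ex in ds.take(1):
    print('implicitly_offensive:', float(ex['implicitly_offensive']))
    for key in covert_keys:
        print(key, float(ex[key]))
```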
- Citation:
@inproceedings{lees-etal-2021-capturing,
title = "Capturing Covertly Toxic Speech via Crowdsourcing",
author = "Lees, Alyssa and
Borkan, Daniel and
Kivlichan, Ian and
Nario, Jorge and
Goyal, Tesh",
booktitle = "Proceedings of the First Workshop on Bridging Human{--}Computer Interaction and Natural Language Processing",
month = apr,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.hcinlp-1.3",
pages = "14--20"
}
civil_comments/CivilCommentsToxicSpans
Config description: The CivilComments Toxic Spans set is a subset of CivilComments that is labeled at the span level: the indices of all character (Unicode code point) boundaries that were tagged as toxic by a majority of the annotators are returned in a 'spans' feature.
Dataset size:
5.81 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 2,000 |
'train' | 7,939 |
'validation' | 682 |
- Feature structure:
FeaturesDict({
'article_id': int32,
'created_date': string,
'id': string,
'parent_id': int32,
'parent_text': Text(shape=(), dtype=string),
'publication_id': string,
'spans': Tensor(shape=(None,), dtype=int32),
'text': Text(shape=(), dtype=string),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
article_id | Tensor | | int32 | |
created_date | Tensor | | string | |
id | Tensor | | string | |
parent_id | Tensor | | int32 | |
parent_text | Text | | string | |
publication_id | Tensor | | string | |
spans | Tensor | (None,) | int32 | |
text | Text | | string | |
Supervised keys (See as_supervised doc): ('text', 'spans')
Examples (tfds.as_dataframe):
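Since 'spans' holds the character offsets (Unicode code points) judged toxic by a majority of annotators, those offsets index directly into the decoded comment text; a hedged sketch of recovering the flagged characters:

```python
import tensorflow_datasets as tfds

ds = tfds.load('civil_comments/CivilCommentsToxicSpans', split='train')

for ex in ds.take(1):
    text = ex['text'].numpy().decode('utf-8')
    toxic_offsets = set(int(i) for i in ex['spans'].numpy())
    # Uppercase the characters whose offsets were tagged as toxic.
    marked = ''.join(c.upper() if i in toxic_offsets else c for i, c in enumerate(text))
    print(marked[:200])
```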
- Citation:
@inproceedings{pavlopoulos-etal-2021-semeval,
title = "{S}em{E}val-2021 Task 5: Toxic Spans Detection",
author = "Pavlopoulos, John and Sorensen, Jeffrey and Laugier, L{'e}o and Androutsopoulos, Ion",
booktitle = "Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.semeval-1.6",
doi = "10.18653/v1/2021.semeval-1.6",
pages = "59--69",
}
civil_comments/CivilCommentsInContext
Config description: The CivilComments in Context set is a subset of CivilComments that was labeled with the parent_text made available to the annotators. It includes a contextual_toxicity feature.
Dataset size:
9.63 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'train' | 9,969 |
- Feature structure:
FeaturesDict({
'article_id': int32,
'contextual_toxicity': float32,
'created_date': string,
'id': string,
'identity_attack': float32,
'insult': float32,
'obscene': float32,
'parent_id': int32,
'parent_text': Text(shape=(), dtype=string),
'publication_id': string,
'severe_toxicity': float32,
'sexual_explicit': float32,
'text': Text(shape=(), dtype=string),
'threat': float32,
'toxicity': float32,
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
article_id | Tensor | | int32 | |
contextual_toxicity | Tensor | | float32 | |
created_date | Tensor | | string | |
id | Tensor | | string | |
identity_attack | Tensor | | float32 | |
insult | Tensor | | float32 | |
obscene | Tensor | | float32 | |
parent_id | Tensor | | int32 | |
parent_text | Text | | string | |
publication_id | Tensor | | string | |
severe_toxicity | Tensor | | float32 | |
sexual_explicit | Tensor | | float32 | |
text | Text | | string | |
threat | Tensor | | float32 | |
toxicity | Tensor | | float32 | |
Supervised keys (See as_supervised doc): ('text', 'toxicity')
Examples (tfds.as_dataframe):
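Each example here carries both a toxicity value and the contextual_toxicity value described above (the latter collected with the parent comment visible to annotators), so comparing the two per comment is straightforward; a minimal sketch:

```python
import tensorflow_datasets as tfds

ds = tfds.load('civil_comments/CivilCommentsInContext', split='train')

for ex in ds.take(3):
    # contextual_toxicity was rated with parent_text available to annotators.
    print(f"toxicity={float(ex['toxicity']):.2f}  "
          f"contextual_toxicity={float(ex['contextual_toxicity']):.2f}")
```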
- Citation:
@misc{pavlopoulos2020toxicity,
title={Toxicity Detection: Does Context Really Matter?},
author={John Pavlopoulos and Jeffrey Sorensen and Lucas Dixon and Nithum Thain and Ion Androutsopoulos},
year={2020}, eprint={2006.00998}, archivePrefix={arXiv}, primaryClass={cs.CL}
}