TFDS now supports the Croissant 🥐 format! Read the documentation to know more.

assin2

Description:

Contextualization

ASSIN 2 is the second edition of the Avaliação de Similaridade Semântica e Inferência Textual (Evaluating Semantic Similarity and Textual Entailment), and was a workshop collocated with STIL 2019. It follows the first edition of ASSIN, proposing a new shared task with new data.

The workshop evaluated systems that assess two types of relations between two sentences: Semantic Textual Similarity and Textual Entailment.

Semantic Textual Similarity consists of quantifying the level of semantic equivalence between sentences, while Textual Entailment Recognition consists of classifying whether the first sentence entails the second.

Data

The corpus used in ASSIN 2 is composed of rather simple sentences. Following the procedures of SemEval 2014 Task 1, we tried to remove from the corpus named entities and indirect speech, and tried to have all verbs in the present tense. The annotation instructions given to annotators are available (in Portuguese).

The training and validation data are composed, respectively, of 6,500 and 500 sentence pairs in Brazilian Portuguese, annotated for entailment and semantic similarity. Semantic similarity values range from 1 to 5, and text entailment classes are either entailment or none. The test data are composed of approximately 3,000 sentence pairs with the same annotation. All data were manually annotated.

Evaluation

Evaluation The evaluation of submissions to ASSIN 2 was with the same metrics as the first ASSIN, with the F1 of precision and recall as the main metric for text entailment and Pearson correlation for semantic similarity. The evaluation scripts are the same as in the last edition.

PS.: Description is extracted from official homepage.

Additional Documentation: Explore on Papers With Code
Homepage: https://sites.google.com/view/assin2/english
Source code: tfds.datasets.assin2.Builder
Versions:
- 1.0.0 (default): Initial release.
Download size: 2.02 MiB
Dataset size: 1.82 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'test'`	2,448
`'train'`	6,500
`'validation'`	500

Feature structure:

FeaturesDict({
    'entailment': ClassLabel(shape=(), dtype=int64, num_classes=2),
    'hypothesis': Text(shape=(), dtype=string),
    'id': int32,
    'similarity': float32,
    'text': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
entailment	ClassLabel	int64
hypothesis	Text	string
id	Tensor	int32
similarity	Tensor	float32
text	Text	string

Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Examples (tfds.as_dataframe):

Citation:

@inproceedings{DBLP:conf/propor/RealFO20,
  author    = {Livy Real and
               Erick Fonseca and
               Hugo Gon{\c{c} }alo Oliveira},
  editor    = {Paulo Quaresma and
               Renata Vieira and
               Sandra M. Alu{\'{\i} }sio and
               Helena Moniz and
               Fernando Batista and
               Teresa Gon{\c{c} }alves},
  title     = {The {ASSIN} 2 Shared Task: {A} Quick Overview},
  booktitle = {Computational Processing of the Portuguese Language - 14th International
               Conference, {PROPOR} 2020, Evora, Portugal, March 2-4, 2020, Proceedings},
  series    = {Lecture Notes in Computer Science},
  volume    = {12037},
  pages     = {406--412},
  publisher = {Springer},
  year      = {2020},
  url       = {https://doi.org/10.1007/978-3-030-41505-1_39},
  doi       = {10.1007/978-3-030-41505-1_39},
  timestamp = {Tue, 03 Mar 2020 09:40:18 +0100},
  biburl    = {https://dblp.org/rec/conf/propor/RealFO20.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}