References:
all_languages
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/all_languages')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1926192 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
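The feature schema above suggests that records sharing a `paraphrase_set_id` belong to the same paraphrase set. A minimal sketch of regrouping flat records into sets and sentence pairs follows. It assumes that interpretation of the field, and it uses made-up in-memory records (not corpus data) so that it runs without downloading anything; `group_paraphrases` and `paraphrase_pairs` are hypothetical helper names.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records shaped like the feature schema above.
# The values are invented for illustration and are not taken from the corpus.
records = [
    {"paraphrase_set_id": "1", "sentence_id": "10", "paraphrase": "Hello.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "1", "sentence_id": "11", "paraphrase": "Hi.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "1", "sentence_id": "12", "paraphrase": "Hey there.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "2", "sentence_id": "20", "paraphrase": "Thanks.",
     "lists": [], "tags": [], "language": "en"},
]


def group_paraphrases(rows):
    """Collect sentences by paraphrase_set_id, preserving input order."""
    sets = defaultdict(list)
    for row in rows:
        sets[row["paraphrase_set_id"]].append(row["paraphrase"])
    return dict(sets)


def paraphrase_pairs(sets):
    """Return every unordered sentence pair within each paraphrase set."""
    pairs = []
    for sentences in sets.values():
        pairs.extend(combinations(sentences, 2))
    return pairs


sets = group_paraphrases(records)
pairs = paraphrase_pairs(sets)
```

The same helpers should work on records obtained from the loaded dataset once they are converted to plain Python dictionaries with decoded strings; that conversion step is not shown here.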
af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/af')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 307 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ar')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6446 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/az')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 624 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/be')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1512 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ber
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ber')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 67484 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/bg')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6324 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/bn')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1440 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/br')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2536 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ca')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 518 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/cbk')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 262 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
cmn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/cmn')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 12549 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/cs')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6659 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/da')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 11220 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/de')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 125091 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/el')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 10072 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
en
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/en')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 158053 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/eo')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 207105 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
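The feature dump above can be read mechanically: entries with `"_type": "Value"` are scalar strings, while `lists` and `tags` are `Sequence` entries whose `"length": -1` is assumed here to mean variable length. The sketch below walks an excerpt of the schema and prints a compact summary; the function name `summarize_features` is hypothetical.

```python
import json


def summarize_features(features):
    """Map each field name to a short type description."""
    summary = {}
    for name, spec in features.items():
        if spec.get("_type") == "Sequence":
            # Sequence entries nest the element type under "feature".
            summary[name] = "Sequence[" + spec["feature"]["dtype"] + "]"
        else:
            summary[name] = spec["dtype"]
    return summary


# Excerpt of the feature schema shown above (two of the six fields).
schema = json.loads("""
{
  "paraphrase": {"dtype": "string", "id": null, "_type": "Value"},
  "tags": {
    "feature": {"dtype": "string", "id": null, "_type": "Value"},
    "length": -1,
    "id": null,
    "_type": "Sequence"
  }
}
""")

summary = summarize_features(schema)
```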
es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/es')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 85064 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/et')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 241 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
eu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/eu')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 573 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/fi')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 31753 |
- Features :
{
"paraphrase_set_id":