snow_t15
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:snow_simplified_japanese_corpus/snow_t15')
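As a minimal sketch of how the command above might be used in practice, the helper below builds the dataset name for a given config and loads its `train` split. The names `dataset_name` and `load_snow` are illustrative and not part of TFDS. The import is deferred so that the name builder works even without TensorFlow Datasets installed; loading a `huggingface:` dataset likely also requires the Hugging Face `datasets` package and network access.

```python
HF_NAMESPACE = "huggingface:snow_simplified_japanese_corpus"
CONFIGS = ("snow_t15", "snow_t23")  # the two configs listed on this page


def dataset_name(config: str) -> str:
    """Build the TFDS name for one of the two configs (hypothetical helper)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    return f"{HF_NAMESPACE}/{config}"


def load_snow(config: str = "snow_t15", split: str = "train"):
    """Load one config as a tf.data.Dataset (hypothetical helper).

    Needs tensorflow_datasets installed and network access on first use.
    """
    import tensorflow_datasets as tfds  # deferred: only needed when loading

    return tfds.load(dataset_name(config), split=split)
```

Calling `load_snow("snow_t23")` would then request the second config in the same way.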
- Description:
About SNOW T15: A simplified corpus for the Japanese language. It contains 50,000 manually simplified and aligned sentences, each with the original sentence, its simplified version, and an English translation of the original. It can be used for automatic text simplification as well as for translating simple Japanese into English and vice versa. The core vocabulary is restricted to 2,000 words, selected by accounting for several factors such as meaning preservation, variation, simplicity, and the UniDic word segmentation criterion.
For details, refer to the explanation page of Japanese simplification (http://www.jnlp.org/research/Japanese_simplification). The original texts are from "small_parallel_enja: 50k En/Ja Parallel Corpus for Testing SMT Methods", a bilingual corpus for machine translation.
About SNOW T23: An expansion corpus of 35,000 sentences rewritten in easy Japanese (simple Japanese vocabulary) based on SNOW T15. The original texts are from the "Tanaka Corpus" (http://www.edrdg.org/wiki/index.php/Tanaka_Corpus).
- License: CC BY 4.0
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 50000 |
- Features:
{
"ID": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"original_ja": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"simplified_ja": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"original_en": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
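The schema above declares every field as a string. When a TFDS dataset is iterated through `tfds.as_numpy`, string features generally arrive as `bytes`, so a small decoding step is often useful. The sketch below assumes UTF-8 encoding; `SnowT15Example` and `decode_example` are hypothetical names, and the sample record is a made-up placeholder, not a row from the corpus.

```python
from dataclasses import dataclass
from typing import Mapping, Union

Text = Union[str, bytes]


@dataclass(frozen=True)
class SnowT15Example:
    """Mirror of the four string features in the snow_t15 schema."""
    ID: str
    original_ja: str
    simplified_ja: str
    original_en: str


def _to_str(value: Text) -> str:
    # Assumption: byte strings are UTF-8 encoded.
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def decode_example(raw: Mapping[str, Text]) -> SnowT15Example:
    """Convert one raw record (bytes or str values) into typed Python strings."""
    return SnowT15Example(
        ID=_to_str(raw["ID"]),
        original_ja=_to_str(raw["original_ja"]),
        simplified_ja=_to_str(raw["simplified_ja"]),
        original_en=_to_str(raw["original_en"]),
    )


# Placeholder record for illustration only; not taken from the corpus.
raw = {
    "ID": b"0",
    "original_ja": "これは例です。".encode("utf-8"),
    "simplified_ja": "これは例です。".encode("utf-8"),
    "original_en": b"This is an example.",
}
example = decode_example(raw)
```

A missing field raises `KeyError`, which makes schema mismatches visible early instead of silently producing empty strings.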
snow_t23
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:snow_simplified_japanese_corpus/snow_t23')
- Description:
About SNOW T15: A simplified corpus for the Japanese language. It contains 50,000 manually simplified and aligned sentences, each with the original sentence, its simplified version, and an English translation of the original. It can be used for automatic text simplification as well as for translating simple Japanese into English and vice versa. The core vocabulary is restricted to 2,000 words, selected by accounting for several factors such as meaning preservation, variation, simplicity, and the UniDic word segmentation criterion.
For details, refer to the explanation page of Japanese simplification (http://www.jnlp.org/research/Japanese_simplification). The original texts are from "small_parallel_enja: 50k En/Ja Parallel Corpus for Testing SMT Methods", a bilingual corpus for machine translation.
About SNOW T23: An expansion corpus of 35,000 sentences rewritten in easy Japanese (simple Japanese vocabulary) based on SNOW T15. The original texts are from the "Tanaka Corpus" (http://www.edrdg.org/wiki/index.php/Tanaka_Corpus).
- License: CC BY 4.0
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 34300 |
- Features:
{
"ID": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"original_ja": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"simplified_ja": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"original_en": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"proper_noun": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
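The only schema difference between the two configs is the extra `proper_noun` field in `snow_t23`. The sketch below shows one possible way to build (original, simplified) sentence pairs for text simplification from either config by treating `proper_noun` as optional. `Pair` and `simplification_pairs` are hypothetical names, the records are assumed to be already decoded to `str`, and the sample rows are placeholders rather than corpus data.

```python
from typing import Iterable, Iterator, Mapping, NamedTuple, Optional


class Pair(NamedTuple):
    source: str                 # original_ja
    target: str                 # simplified_ja
    proper_noun: Optional[str]  # present only in snow_t23 records


def simplification_pairs(
    records: Iterable[Mapping[str, str]],
    skip_identical: bool = False,
) -> Iterator[Pair]:
    """Yield (source, target) pairs; optionally drop rows left unchanged."""
    for rec in records:
        source, target = rec["original_ja"], rec["simplified_ja"]
        if skip_identical and source == target:
            continue
        # .get() returns None for snow_t15 records, which lack proper_noun.
        yield Pair(source, target, rec.get("proper_noun"))


# Placeholder rows for illustration only; not taken from the corpus.
rows = [
    {"ID": "0", "original_ja": "A", "simplified_ja": "A", "original_en": "a"},
    {"ID": "1", "original_ja": "B", "simplified_ja": "C", "original_en": "b",
     "proper_noun": "X"},
]
all_pairs = list(simplification_pairs(rows))
changed_only = list(simplification_pairs(rows, skip_identical=True))
```

Using `rec.get` rather than indexing keeps a single code path for both configs; whether identical pairs should be dropped depends on the downstream task.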