snow_simplified_japanese_corpus

References:

snow_t15

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:snow_simplified_japanese_corpus/snow_t15')
  • Description:
About SNOW T15: A simplified corpus for the Japanese language. The corpus has 50,000 manually simplified and aligned sentences. It contains the original sentences, the simplified sentences, and English translations of the original sentences. It can be used for automatic text simplification as well as for translating simple Japanese into English and vice versa. The core vocabulary is restricted to 2,000 words, selected by accounting for several factors such as meaning preservation, variation, simplicity, and the UniDic word segmentation criterion.
For details, refer to the explanation page of Japanese simplification (http://www.jnlp.org/research/Japanese_simplification). The original texts are from "small_parallel_enja: 50k En/Ja Parallel Corpus for Testing SMT Methods", which is a bilingual corpus for machine translation.
About SNOW T23: An expansion corpus of 35,000 sentences rewritten in easy Japanese (simple Japanese vocabulary) based on SNOW T15. The original texts are from "Tanaka Corpus" (http://www.edrdg.org/wiki/index.php/Tanaka_Corpus).
  • License: CC BY 4.0
  • Version: 1.1.0
  • Splits:
Split Examples
'train' 50000
  • Features:
{
    "ID": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "original_ja": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "simplified_ja": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "original_en": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
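
The schema above suggests how a record might be turned into training pairs for the two uses named in the description. This is a minimal sketch, assuming each example arrives as a plain dict of Python strings keyed by the feature names (examples yielded by TFDS may carry bytes or tensors that need decoding first). The helper names `simplification_pair` and `translation_pair` and the sample record are hypothetical and are not part of TFDS or the corpus.

```python
from typing import Dict, Tuple

# Feature names taken from the snow_t15 schema above.
SNOW_T15_FIELDS = ("ID", "original_ja", "simplified_ja", "original_en")


def simplification_pair(example: Dict[str, str]) -> Tuple[str, str]:
    """Return (source, target) for Japanese text simplification."""
    return example["original_ja"], example["simplified_ja"]


def translation_pair(example: Dict[str, str]) -> Tuple[str, str]:
    """Return (source, target) for simple-Japanese-to-English translation.

    Assumption: the English side is the translation of the original
    sentence, as stated in the dataset description.
    """
    return example["simplified_ja"], example["original_en"]


# Hypothetical record for illustration only; not taken from the corpus.
sample = {
    "ID": "0",
    "original_ja": "original sentence",
    "simplified_ja": "simplified sentence",
    "original_en": "English translation",
}

print(simplification_pair(sample))
print(translation_pair(sample))
```

Swapping the tuple order in either helper gives the reverse direction (for example, English to simple Japanese).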

snow_t23

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:snow_simplified_japanese_corpus/snow_t23')
  • Description:
About SNOW T15: A simplified corpus for the Japanese language. The corpus has 50,000 manually simplified and aligned sentences. It contains the original sentences, the simplified sentences, and English translations of the original sentences. It can be used for automatic text simplification as well as for translating simple Japanese into English and vice versa. The core vocabulary is restricted to 2,000 words, selected by accounting for several factors such as meaning preservation, variation, simplicity, and the UniDic word segmentation criterion.
For details, refer to the explanation page of Japanese simplification (http://www.jnlp.org/research/Japanese_simplification). The original texts are from "small_parallel_enja: 50k En/Ja Parallel Corpus for Testing SMT Methods", which is a bilingual corpus for machine translation.
About SNOW T23: An expansion corpus of 35,000 sentences rewritten in easy Japanese (simple Japanese vocabulary) based on SNOW T15. The original texts are from "Tanaka Corpus" (http://www.edrdg.org/wiki/index.php/Tanaka_Corpus).
  • License: CC BY 4.0
  • Version: 1.1.0
  • Splits:
Split Examples
'train' 34300
  • Features:
{
    "ID": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "original_ja": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "simplified_ja": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "original_en": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "proper_noun": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
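
The two configurations differ only in the extra `proper_noun` feature of snow_t23, so code that handles both may want to check which fields a record carries. This is a minimal sketch, assuming records are plain dicts of Python strings. The `FIELDS` table is copied from the schemas listed on this page, while the helper name `missing_fields` and the sample record are hypothetical.

```python
from typing import Dict, List

# Feature names per configuration, copied from the schemas on this page.
FIELDS = {
    "snow_t15": ["ID", "original_ja", "simplified_ja", "original_en"],
    "snow_t23": ["ID", "original_ja", "simplified_ja", "original_en", "proper_noun"],
}


def missing_fields(example: Dict[str, str], config: str) -> List[str]:
    """List the expected feature names that are absent or not strings."""
    return [
        name for name in FIELDS[config]
        if not isinstance(example.get(name), str)
    ]


# Hypothetical record for illustration only; not taken from the corpus.
record = {"ID": "0", "original_ja": "a", "simplified_ja": "b", "original_en": "c"}

print(missing_fields(record, "snow_t15"))  # []
print(missing_fields(record, "snow_t23"))  # ['proper_noun']
```

An empty list means the record matches the configuration's schema.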