setimes

bg-bs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-bs')
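
The configuration names listed on this page all appear to follow a `<lang1>-<lang2>` pattern with the two language codes in alphabetical order. The following is a minimal sketch under that assumption; `setimes_tfds_name` and `load_setimes` are hypothetical helpers for illustration and are not part of TFDS.

```python
def setimes_tfds_name(lang_a: str, lang_b: str) -> str:
    """Build a TFDS dataset name for a SETimes language pair.

    Assumption: configs are named '<lang1>-<lang2>' with the two codes
    sorted alphabetically, as in the configs listed on this page.
    """
    first, second = sorted((lang_a.lower(), lang_b.lower()))
    if first == second:
        raise ValueError("a language pair needs two different codes")
    return f"huggingface:setimes/{first}-{second}"


def load_setimes(lang_a: str, lang_b: str, split: str = "train"):
    """Load one language pair; TFDS is imported lazily so the name helper works without it."""
    import tensorflow_datasets as tfds

    return tfds.load(setimes_tfds_name(lang_a, lang_b), split=split)
```

For example, `setimes_tfds_name("bs", "bg")` yields the same string used in the command above, regardless of the order in which the codes are passed.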

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	136009
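
Only a `'train'` split is listed, so any held-out set has to be carved out by the user. One option is the TFDS split-slicing syntax. The sketch below only builds the split strings; `holdout_splits` is a hypothetical helper and the 90/10 ratio is an arbitrary assumption.

```python
from typing import Tuple


def holdout_splits(train_percent: int = 90) -> Tuple[str, str]:
    """Return (train, validation) split strings in TFDS slicing syntax."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be strictly between 0 and 100")
    return f"train[:{train_percent}%]", f"train[{train_percent}%:]"


# Possible usage (requires tensorflow_datasets and network access):
#   import tensorflow_datasets as tfds
#   train_ds, val_ds = tfds.load(
#       'huggingface:setimes/bg-bs', split=list(holdout_splits()))
```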

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "bs"
        ],
        "id": null,
        "_type": "Translation"
    }
}
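
The schema above suggests that each example carries a string `id` and a `translation` entry keyed by language code. The following is a minimal sketch of reading one pair, assuming the example arrives as a plain mapping (for instance after converting to NumPy, where strings may be `bytes`). `extract_pair` is a hypothetical helper, and the record in the usage comment is a made-up placeholder, not real corpus data.

```python
def extract_pair(example, src: str, tgt: str):
    """Return (source_text, target_text) from one example.

    Assumption: example["translation"] maps language codes to text,
    either as str or as UTF-8 encoded bytes.
    """
    def as_text(value):
        return value.decode("utf-8") if isinstance(value, bytes) else str(value)

    translation = example["translation"]
    return as_text(translation[src]), as_text(translation[tgt])


# Placeholder record shaped like the schema, for illustration only:
#   sample = {"id": "0", "translation": {"bg": b"source text", "bs": b"target text"}}
#   extract_pair(sample, "bg", "bs")
```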

bg-el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-el')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	212437

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "el"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-el')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	137602

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "el"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-en')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	213160

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-en')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	138387

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-en')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	227168

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-hr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	203465

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "hr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-hr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	138402

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "hr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-hr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	205008

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "hr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-hr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	205910

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "hr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-mk')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	207169

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-mk')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	132779

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-mk')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	207262

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-mk')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	207777

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

hr-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-mk')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	198876

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-ro')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	210842

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-ro')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	137365

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-ro')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	212359

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-ro')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	213047

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

hr-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-ro')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	203777

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

mk-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/mk-ro')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	206168

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "mk",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-sq')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	211518

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-sq')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	137953

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-sq')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	226577

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-sq')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	227516

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

hr-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-sq')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	205044

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

mk-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/mk-sq')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	206601

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "mk",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

ro-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/ro-sq')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	212320

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ro",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-sr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	211172

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-sr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	135945

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-sr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	224311

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-sr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	225169

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

hr-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-sr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	203989

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

mk-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/mk-sr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	207295

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "mk",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

ro-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/ro-sr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	210612

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ro",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

sq-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/sq-sr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	224595

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "sq",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-tr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	206071

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-tr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	133958

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-tr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	207029

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-tr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	207678

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

hr-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-tr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	199260

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

mk-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/mk-tr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	203231

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "mk",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

ro-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/ro-tr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	206104

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ro",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

sq-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/sq-tr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	207107

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "sq",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

sr-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/sr-tr')

Description:

SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.

License: No known license
Version: 1.0.0
Splits:

Split	Examples
`'train'`	205993

Features:

{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "sr",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}
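
The config names on this page all follow one pattern: two language codes joined by a hyphen, in alphabetical order (for example `bg-sq`, never `sq-bg`). The sketch below builds the name passed to `tfds.load` from any two codes. The function name `setimes_config` and the `LANGS` set are illustrative and not part of TFDS, and the alphabetical ordering is inferred from the configs listed here rather than from documented behavior.

```python
# Language codes that appear in the config names on this page.
LANGS = {"bg", "bs", "el", "en", "hr", "mk", "ro", "sq", "sr", "tr"}


def setimes_config(lang_a: str, lang_b: str) -> str:
    """Build the dataset name for a language pair, in either argument order."""
    if lang_a == lang_b:
        raise ValueError("a pair needs two different languages")
    for code in (lang_a, lang_b):
        if code not in LANGS:
            raise ValueError(f"unknown language code: {code!r}")
    # Assumption: configs are named with the two codes sorted alphabetically.
    first, second = sorted((lang_a, lang_b))
    return f"huggingface:setimes/{first}-{second}"


print(setimes_config("tr", "sr"))
```

The returned string can then be passed to `tfds.load` in the same way as the commands shown for each config.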