setimes

References:

bg-bs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-bs')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  136009
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "bs"
        ],
        "id": null,
        "_type": "Translation"
    }
}
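
As a rough illustration of the feature schema above, the sketch below reads one record shaped like it. It assumes each record is a dictionary whose `translation` entry maps the two language codes to the aligned sentences. The sample record and the helper name `split_pair` are hypothetical, and values loaded through TFDS may arrive as tensors or bytes rather than plain strings.

```python
def split_pair(example, src="bg", tgt="bs"):
    """Return (id, source sentence, target sentence) from one record."""
    translation = example["translation"]
    return example["id"], translation[src], translation[tgt]


# Hypothetical placeholder record, not real corpus text.
sample = {
    "id": "0",
    "translation": {"bg": "<Bulgarian sentence>", "bs": "<Bosnian sentence>"},
}

record_id, source, target = split_pair(sample)
print(record_id, source, target)
```

Swapping the `src` and `tgt` arguments gives the opposite translation direction from the same record.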

bg-el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-el')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  212437
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "el"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-el')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  137602
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "el"
        ],
        "id": null,
        "_type": "Translation"
    }
}
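
The load commands on this page all put the two language codes in alphabetical order (for example `setimes/bg-bs`, `setimes/bg-el`, `setimes/bs-el`). Assuming that pattern holds for every pair, a small helper can build the name to pass to `tfds.load`. `config_name` is a hypothetical helper written for this sketch, not part of TFDS, and it does not check that the pair actually exists.

```python
def config_name(lang_a, lang_b):
    """Build a TFDS name such as 'huggingface:setimes/bg-el'."""
    if lang_a == lang_b:
        raise ValueError("a language pair needs two different codes")
    # Assumption: the two codes appear in alphabetical order.
    first, second = sorted((lang_a, lang_b))
    return f"huggingface:setimes/{first}-{second}"


print(config_name("el", "bg"))  # prints huggingface:setimes/bg-el
```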

bg-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-en')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  213160
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-en')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  138387
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-en')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  227168
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-hr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  203465
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "hr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-hr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  138402
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "hr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-hr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  205008
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "hr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-hr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  205910
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "hr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-mk')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  207169
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-mk')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  132779
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-mk')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  207262
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-mk')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  207777
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

hr-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-mk')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  198876
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-ro')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  210842
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-ro')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  137365
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-ro')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  212359
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-ro')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  213047
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

hr-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-ro')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  203777
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

mk-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/mk-ro')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  206168
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "mk",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-sq')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  211518
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-sq')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  137953
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-sq')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   226577
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-sq')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   227516
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}
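
Only a 'train' split is listed for this configuration, so any held-out set has to be carved out by the user. The sketch below builds split strings in the TFDS slicing syntax (such as `train[:90%]`). The helper `holdout_splits` is hypothetical, and whether the `huggingface:` wrapper accepts percent slicing for this dataset is an assumption that has not been checked here.

```python
from typing import Tuple


def holdout_splits(holdout_percent: int = 10) -> Tuple[str, str]:
    """Return (train_split, holdout_split) strings in TFDS slicing syntax."""
    if not 0 < holdout_percent < 100:
        raise ValueError("holdout_percent must be between 1 and 99")
    cut = 100 - holdout_percent
    return f"train[:{cut}%]", f"train[{cut}%:]"


train_split, holdout_split = holdout_splits(10)
print(train_split, holdout_split)

# Intended use (assumption, requires tensorflow_datasets and network access):
# ds_train = tfds.load('huggingface:setimes/en-sq', split=train_split)
# ds_valid = tfds.load('huggingface:setimes/en-sq', split=holdout_split)
```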

hr-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-sq')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   205044
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

mk-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/mk-sq')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   206601
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "mk",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

ro-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/ro-sq')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   212320
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ro",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-sr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   211172
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-sr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   135945
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-sr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   224311
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-sr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   225169
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

hr-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-sr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   203989
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

mk-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/mk-sr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   207295
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "mk",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

ro-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/ro-sr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   210612
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ro",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

sq-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/sq-sr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   224595
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "sq",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-tr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   206071
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-tr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   133958
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-tr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   207029
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-tr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   207678
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

hr-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-tr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   199260
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

mk-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/mk-tr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   203231
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "mk",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

ro-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/ro-tr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   206104
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ro",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

sq-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/sq-tr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   207107
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "sq",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

sr-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/sr-tr')
  • Description:
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split     Examples
'train'   205993
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "sr",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}
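
The configuration names in this listing all join two language codes in alphabetical order with a hyphen (for example `bs-sq`, `sq-sr`, `sr-tr`). Assuming that pattern holds for every pair, a small hypothetical helper (`setimes_config`, not part of TFDS) can build the name to pass to `tfds.load` regardless of the order in which the two languages are given:

```python
def setimes_config(lang_a: str, lang_b: str) -> str:
    """Build a 'huggingface:setimes/<pair>' name with the codes sorted.

    The sorted-codes rule is inferred from the names listed in this
    document; it is an assumption, not a documented guarantee.
    """
    first, second = sorted((lang_a.lower(), lang_b.lower()))
    if first == second:
        raise ValueError("a language pair needs two different codes")
    return f"huggingface:setimes/{first}-{second}"


print(setimes_config("tr", "sr"))
print(setimes_config("sq", "bs"))

# Intended use (requires tensorflow_datasets and network access):
# ds = tfds.load(setimes_config("tr", "sr"))
```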