open_subtitles

References:

bs-eo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:open_subtitles/bs-eo')
  • Description:
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

Important: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

62 languages, 1,782 bitexts
total number of files: 3,735,070
total number of tokens: 22.10G
total number of sentence fragments: 3.35G
  • License: No known license
  • Version: 2018.0.0
  • Splits:
Split      Examples
'train'    10989
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "meta": {
        "year": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "imdbId": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "subtitleId": {
            "bs": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            },
            "eo": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            }
        },
        "sentenceIds": {
            "bs": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            },
            "eo": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            }
        }
    },
    "translation": {
        "languages": [
            "bs",
            "eo"
        ],
        "id": null,
        "_type": "Translation"
    }
}

fr-hy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:open_subtitles/fr-hy')
  • Description:
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

Important: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

62 languages, 1,782 bitexts
total number of files: 3,735,070
total number of tokens: 22.10G
total number of sentence fragments: 3.35G
  • License: No known license
  • Version: 2018.0.0
  • Splits:
Split      Examples
'train'    668
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "meta": {
        "year": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "imdbId": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "subtitleId": {
            "fr": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            },
            "hy": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            }
        },
        "sentenceIds": {
            "fr": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            },
            "hy": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            }
        }
    },
    "translation": {
        "languages": [
            "fr",
            "hy"
        ],
        "id": null,
        "_type": "Translation"
    }
}

da-ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:open_subtitles/da-ru')
  • Description:
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

Important: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

62 languages, 1,782 bitexts
total number of files: 3,735,070
total number of tokens: 22.10G
total number of sentence fragments: 3.35G
  • License: No known license
  • Version: 2018.0.0
  • Splits:
Split      Examples
'train'    7543012
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "meta": {
        "year": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "imdbId": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "subtitleId": {
            "da": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            },
            "ru": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            }
        },
        "sentenceIds": {
            "da": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            },
            "ru": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            }
        }
    },
    "translation": {
        "languages": [
            "da",
            "ru"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-hi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:open_subtitles/en-hi')
  • Description:
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

Important: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

62 languages, 1,782 bitexts
total number of files: 3,735,070
total number of tokens: 22.10G
total number of sentence fragments: 3.35G
  • License: No known license
  • Version: 2018.0.0
  • Splits:
Split      Examples
'train'    93016
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "meta": {
        "year": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "imdbId": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "subtitleId": {
            "en": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            },
            "hi": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            }
        },
        "sentenceIds": {
            "en": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            },
            "hi": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            }
        }
    },
    "translation": {
        "languages": [
            "en",
            "hi"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bn-is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:open_subtitles/bn-is')
  • Description:
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

Important: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

62 languages, 1,782 bitexts
total number of files: 3,735,070
total number of tokens: 22.10G
total number of sentence fragments: 3.35G
  • License: No known license
  • Version: 2018.0.0
  • Splits:
Split      Examples
'train'    38272
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "meta": {
        "year": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "imdbId": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "subtitleId": {
            "bn": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            },
            "is": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            }
        },
        "sentenceIds": {
            "bn": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            },
            "is": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            }
        }
    },
    "translation": {
        "languages": [
            "bn",
            "is"
        ],
        "id": null,
        "_type": "Translation"
    }
}
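
Every configuration listed on this page is named with its two language codes in alphabetical order (`bs-eo`, `fr-hy`, `da-ru`, `en-hi`, `bn-is`). As a small sketch, the dataset name passed to `tfds.load` could be built from a language pair as shown below. The helper `config_name` is hypothetical, and it assumes the alphabetical ordering seen here also holds for the other language pairs, which this page does not confirm.

```python
def config_name(lang_a, lang_b):
    """Build the TFDS dataset name for a language pair.

    Assumption: language codes are joined in alphabetical order,
    as in the configurations listed on this page.
    """
    first, second = sorted([lang_a.lower(), lang_b.lower()])
    if first == second:
        raise ValueError("a language pair needs two different codes")
    return f"huggingface:open_subtitles/{first}-{second}"


name = config_name("ru", "da")
```

The resulting string can then be passed to `tfds.load` in the same way as the commands shown for each configuration.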