open_subtitles

References:

bs-eo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:open_subtitles/bs-eo')
  • Description:
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

Important: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

62 languages, 1,782 bitexts
total number of files: 3,735,070
total number of tokens: 22.10G
total number of sentence fragments: 3.35G
  • License: No known license
  • Version: 2018.0.0
  • Splits:
Split Examples
'train' 10989
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "meta": {
        "year": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "imdbId": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "subtitleId": {
            "bs": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            },
            "eo": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            }
        },
        "sentenceIds": {
            "bs": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            },
            "eo": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            }
        }
    },
    "translation": {
        "languages": [
            "bs",
            "eo"
        ],
        "id": null,
        "_type": "Translation"
    }
}
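
To make the feature specification above more concrete, here is a minimal sketch of what one bs-eo record might look like as a plain Python dictionary, with two small helpers for reading it. All values are invented placeholders, and the helper names `sentence_pair` and `alignment_shape` are hypothetical. The sketch assumes the `Translation` feature is exposed as a mapping from language code to text, as in Hugging Face `datasets`; records loaded through TFDS may arrive as tensors or bytes instead.

```python
from typing import Any, Dict, Tuple

# Hypothetical record shaped like the bs-eo feature spec.
# Every value here is a made-up placeholder, not real dataset content.
example: Dict[str, Any] = {
    "id": "0",
    "meta": {
        "year": 2001,
        "imdbId": 123456,
        "subtitleId": {"bs": 111, "eo": 222},
        # Variable-length lists: one aligned unit may span several sentences.
        "sentenceIds": {"bs": [1], "eo": [1, 2]},
    },
    # Assumption: Translation is a mapping from language code to text.
    "translation": {"bs": "placeholder bs text", "eo": "placeholder eo text"},
}


def sentence_pair(record: Dict[str, Any], src: str = "bs", tgt: str = "eo") -> Tuple[str, str]:
    """Return the (source, target) texts of one aligned record."""
    translation = record["translation"]
    return translation[src], translation[tgt]


def alignment_shape(record: Dict[str, Any], src: str = "bs", tgt: str = "eo") -> Tuple[int, int]:
    """Return how many sentence ids each side contributes to the alignment."""
    ids = record["meta"]["sentenceIds"]
    return len(ids[src]), len(ids[tgt])


if __name__ == "__main__":
    print(sentence_pair(example))
    print(alignment_shape(example))
```

The same helpers should work for the other language pairs by passing their codes as `src` and `tgt`.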

fr-hy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:open_subtitles/fr-hy')
  • Description:
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

Important: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

62 languages, 1,782 bitexts
total number of files: 3,735,070
total number of tokens: 22.10G
total number of sentence fragments: 3.35G
  • License: No known license
  • Version: 2018.0.0
  • Splits:
Split Examples
'train' 668
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "meta": {
        "year": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "imdbId": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "subtitleId": {
            "fr": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            },
            "hy": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            }
        },
        "sentenceIds": {
            "fr": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            },
            "hy": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            }
        }
    },
    "translation": {
        "languages": [
            "fr",
            "hy"
        ],
        "id": null,
        "_type": "Translation"
    }
}
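
The schema above declares `year`, `imdbId`, and each `subtitleId` as `uint32`, and each `sentenceIds` entry as a variable-length sequence (`length: -1`) of `uint32`. As a rough illustration of what those constraints mean, the following sketch checks a `meta` dictionary held as plain Python ints and lists. The function names `is_uint32` and `check_meta` are hypothetical and are not part of TFDS or Hugging Face `datasets`.

```python
from typing import Any, Dict, List

UINT32_MAX = 2**32 - 1


def is_uint32(value: Any) -> bool:
    """True if value is a plain int that fits in an unsigned 32-bit field."""
    return isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= UINT32_MAX


def check_meta(meta: Dict[str, Any], languages: List[str]) -> List[str]:
    """Return the names of meta fields that violate the declared uint32 types."""
    problems = []
    # Scalar fields shared by both sides of the pair.
    for key in ("year", "imdbId"):
        if not is_uint32(meta.get(key)):
            problems.append(key)
    # Per-language fields: one subtitle id and a list of sentence ids each.
    for lang in languages:
        if not is_uint32(meta.get("subtitleId", {}).get(lang)):
            problems.append(f"subtitleId.{lang}")
        ids = meta.get("sentenceIds", {}).get(lang)
        if not isinstance(ids, list) or not all(is_uint32(i) for i in ids):
            problems.append(f"sentenceIds.{lang}")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration only.
    meta = {
        "year": 1999,
        "imdbId": 1,
        "subtitleId": {"fr": 10, "hy": 20},
        "sentenceIds": {"fr": [1, 2], "hy": [1]},
    }
    print(check_meta(meta, ["fr", "hy"]))
```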

da-ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:open_subtitles/da-ru')
  • Description:
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

Important: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

62 languages, 1,782 bitexts
total number of files: 3,735,070
total number of tokens: 22.10G
total number of sentence fragments: 3.35G
  • License: No known license
  • Version: 2018.0.0
  • Splits:
Split Examples
'train' 7543012
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "meta": {
        "year": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "imdbId": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "subtitleId": {
            "da": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            },
            "ru": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            }
        },
        "sentenceIds": {
            "da": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            },
            "ru": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            }
        }
    },
    "translation": {
        "languages": [
            "da",
            "ru"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-hi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:open_subtitles/en-hi')
  • Description:
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

Important: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

62 languages, 1,782 bitexts
total number of files: 3,735,070
total number of tokens: 22.10G
total number of sentence fragments: 3.35G
  • License: No known license
  • Version: 2018.0.0
  • Splits:
Split Examples
'train' 93016
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "meta": {
        "year": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "imdbId": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "subtitleId": {
            "en": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            },
            "hi": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            }
        },
        "sentenceIds": {
            "en": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            },
            "hi": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            }
        }
    },
    "translation": {
        "languages": [
            "en",
            "hi"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bn-is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:open_subtitles/bn-is')
  • Description:
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

Important: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

62 languages, 1,782 bitexts
total number of files: 3,735,070
total number of tokens: 22.10G
total number of sentence fragments: 3.35G
  • License: No known license
  • Version: 2018.0.0
  • Splits:
Split Examples
'train' 38272
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "meta": {
        "year": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "imdbId": {
            "dtype": "uint32",
            "id": null,
            "_type": "Value"
        },
        "subtitleId": {
            "bn": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            },
            "is": {
                "dtype": "uint32",
                "id": null,
                "_type": "Value"
            }
        },
        "sentenceIds": {
            "bn": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            },
            "is": {
                "feature": {
                    "dtype": "uint32",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            }
        }
    },
    "translation": {
        "languages": [
            "bn",
            "is"
        ],
        "id": null,
        "_type": "Translation"
    }
}
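
The configurations on this page differ widely in size, from 668 to 7,543,012 'train' examples, so it can help to see them side by side before choosing one to load. The sketch below collects the config names used in the load commands together with their 'train' counts and rebuilds the TFDS dataset name for each. The names `TRAIN_EXAMPLES`, `tfds_name`, and `configs_by_size` are hypothetical helpers. The actual `tfds.load` call is left as a comment because it needs `tensorflow_datasets` and network access.

```python
from typing import Dict, List

# 'train' split sizes as listed on this page, keyed by config name.
TRAIN_EXAMPLES: Dict[str, int] = {
    "bs-eo": 10989,
    "fr-hy": 668,
    "da-ru": 7543012,
    "en-hi": 93016,
    "bn-is": 38272,
}


def tfds_name(config: str) -> str:
    """Build the dataset name string used in the load commands on this page."""
    return f"huggingface:open_subtitles/{config}"


def configs_by_size() -> List[str]:
    """Return config names ordered from smallest to largest 'train' split."""
    return sorted(TRAIN_EXAMPLES, key=TRAIN_EXAMPLES.get)


if __name__ == "__main__":
    for config in configs_by_size():
        print(f"{tfds_name(config)}: {TRAIN_EXAMPLES[config]} examples")
    # To load the smallest config (requires tensorflow_datasets):
    # import tensorflow_datasets as tfds
    # ds = tfds.load(tfds_name(configs_by_size()[0]))
```

Starting with a small config such as fr-hy is a cheap way to check a pipeline before moving to da-ru.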