para_pat

References:

el-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/el-en')
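Every config on this page is loaded with the same naming pattern, so the command above can be wrapped in a small helper. The following is a minimal sketch, not part of TFDS: the helper names `para_pat_name` and `load_para_pat` are hypothetical, and it assumes that `tensorflow-datasets` and the Hugging Face `datasets` package are installed when the loader is actually called.

```python
def para_pat_name(src: str, tgt: str) -> str:
    """Build the TFDS name for one para_pat language-pair config (hypothetical helper)."""
    return f"huggingface:para_pat/{src}-{tgt}"


def load_para_pat(src: str, tgt: str, split: str = "train"):
    """Load one config; the import is deferred so the name helper works without TFDS."""
    import tensorflow_datasets as tfds  # assumed to be installed at call time

    return tfds.load(para_pat_name(src, tgt), split=split)


# Building the name does not trigger any download.
print(para_pat_name("el", "en"))
```

Calling `load_para_pat("el", "en")` would then request the only split listed on this page, `'train'`.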

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	10855

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}
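As an illustration of the schema above, the sketch below pulls the source and target text out of one record. The record layout (a `translation` dict keyed by language code, with text possibly arriving as bytes) is an assumption about how the features are materialized, and the sample record and the helpers `decode` and `to_pair` are hypothetical.

```python
from typing import Any, Dict, Tuple


def decode(value: Any) -> str:
    """Return text whether the value arrives as bytes or as str."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def to_pair(example: Dict[str, Any], src: str, tgt: str) -> Tuple[str, str]:
    """Extract (source, target) text from a record shaped like the schema above."""
    translation = example["translation"]
    return decode(translation[src]), decode(translation[tgt])


# Hypothetical record for illustration only; it is not taken from the dataset.
example = {
    "index": 0,
    "family_id": 12345,
    "translation": {"el": "παράδειγμα".encode("utf-8"), "en": b"example"},
}

source_text, target_text = to_pair(example, "el", "en")
print(source_text, "|", target_text)
```

With a real dataset, the records would typically be converted to NumPy values first (for example with `tfds.as_numpy`) before being passed to a helper like this.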

cs-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/cs-en')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	78977

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "cs",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-hu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-hu')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	42629

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "hu"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-ro')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	48789

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-sk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-sk')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	23410

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "sk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-uk')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	89226

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "uk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

es-fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/es-fr')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	32553

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "es",
            "fr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

fr-ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/fr-ru')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	10889

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "fr",
            "ru"
        ],
        "id": null,
        "_type": "Translation"
    }
}

de-fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/de-fr')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	1167988

Features:

{
    "translation": {
        "languages": [
            "de",
            "fr"
        ],
        "id": null,
        "_type": "Translation"
    }
}
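Unlike the configs listed earlier on this page, this schema carries only the `translation` feature, with no `index` or `family_id`. Code that handles several configs may want to check which fields are present rather than assume them. A minimal sketch, using only the standard library and the schema text exactly as printed above (the function name `describe` is hypothetical):

```python
import json

# Schema text copied from the de-fr listing above.
SCHEMA_DE_FR = """
{
    "translation": {
        "languages": ["de", "fr"],
        "id": null,
        "_type": "Translation"
    }
}
"""


def describe(schema_json: str):
    """Return (languages, extra_fields) for a feature schema printed on this page."""
    schema = json.loads(schema_json)
    languages = schema["translation"]["languages"]
    # Anything besides "translation" (e.g. "index", "family_id") is optional metadata.
    extras = sorted(key for key in schema if key != "translation")
    return languages, extras


languages, extras = describe(SCHEMA_DE_FR)
print("languages:", languages)
print("extra fields:", extras)
```

The same function applied to a schema that includes `index` and `family_id` would report both names in `extras`.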

en-ja

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-ja')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	6170339

Features:

{
    "translation": {
        "languages": [
            "en",
            "ja"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-es

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-es')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	649396

Features:

{
    "translation": {
        "languages": [
            "en",
            "es"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-fr')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	12223525

Features:

{
    "translation": {
        "languages": [
            "en",
            "fr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

de-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/de-en')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	2165054

Features:

{
    "translation": {
        "languages": [
            "de",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-ko')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	2324357

Features:

{
    "translation": {
        "languages": [
            "en",
            "ko"
        ],
        "id": null,
        "_type": "Translation"
    }
}

fr-ja

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/fr-ja')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	313422

Features:

{
    "translation": {
        "languages": [
            "fr",
            "ja"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-zh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-zh')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	4897841

Features:

{
    "translation": {
        "languages": [
            "en",
            "zh"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-ru')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	4296399

Features:

{
    "translation": {
        "languages": [
            "en",
            "ru"
        ],
        "id": null,
        "_type": "Translation"
    }
}

fr-ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/fr-ko')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	120607

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "fr",
            "ko"
        ],
        "id": null,
        "_type": "Translation"
    }
}

ru-uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/ru-uk')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	85963

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ru",
            "uk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-pt')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0.
Version: 1.1.0
Splits:

Split	Examples
`'train'`	23121

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "pt"
        ],
        "id": null,
        "_type": "Translation"
    }
}
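The `'train'` sizes listed on this page differ by three orders of magnitude, which matters when choosing a config to experiment with. The sketch below simply tabulates the counts copied from the split tables above and reports the total and the extremes. It covers only the configs listed on this page, not all 74 language pairs mentioned in the description, and the names `TRAIN_EXAMPLES` and `summarize` are hypothetical.

```python
# 'train' example counts copied from the split tables on this page.
TRAIN_EXAMPLES = {
    "el-en": 10855, "cs-en": 78977, "en-hu": 42629, "en-ro": 48789,
    "en-sk": 23410, "en-uk": 89226, "es-fr": 32553, "fr-ru": 10889,
    "de-fr": 1167988, "en-ja": 6170339, "en-es": 649396, "en-fr": 12223525,
    "de-en": 2165054, "en-ko": 2324357, "fr-ja": 313422, "en-zh": 4897841,
    "en-ru": 4296399, "fr-ko": 120607, "ru-uk": 85963, "en-pt": 23121,
}


def summarize(counts):
    """Return (total, largest_config, smallest_config) for a name -> count mapping."""
    total = sum(counts.values())
    largest = max(counts, key=counts.get)
    smallest = min(counts, key=counts.get)
    return total, largest, smallest


total, largest, smallest = summarize(TRAIN_EXAMPLES)
print(f"{len(TRAIN_EXAMPLES)} configs, {total:,} train examples in total")
print(f"largest: {largest} ({TRAIN_EXAMPLES[largest]:,})")
print(f"smallest: {smallest} ({TRAIN_EXAMPLES[smallest]:,})")
```

A small config such as `el-en` is a reasonable choice for a quick smoke test before downloading one of the multi-million-example pairs.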