oscar

unshuffled_deduplicated_af

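Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    130640
  • Features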
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
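
When several language configs are needed, the load command above can be wrapped in a small helper. The following is a minimal sketch rather than part of the TFDS catalog: `oscar_tfds_name` and `load_oscar` are hypothetical names, `split='train'` is passed because only a train split is listed here, and the sketch assumes the `tensorflow_datasets` package is installed with support for `huggingface:` community datasets.

```python
def oscar_tfds_name(lang: str) -> str:
    """Build the TFDS name of one OSCAR config from its language code, e.g. 'af'."""
    if not lang.isalnum():
        raise ValueError(f"unexpected language code: {lang!r}")
    return f"huggingface:oscar/unshuffled_deduplicated_{lang}"


def load_oscar(lang: str, split: str = "train"):
    """Load one config. The import is deferred so the name helper works without TFDS."""
    import tensorflow_datasets as tfds  # assumption: installed in the environment

    return tfds.load(oscar_tfds_name(lang), split=split)


if __name__ == "__main__":
    print(oscar_tfds_name("af"))
```

Calling `load_oscar("af")` would then be equivalent to the command shown above with an explicit `split` argument.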

unshuffled_deduplicated_als

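Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    4518
  • Features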
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
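
The feature description above says that every example carries an int64 `id` and a string `text`. As an illustration, the sketch below checks a plain Python dictionary against that schema. The names `PY_TYPES` and `check_example` are hypothetical, and the type mapping is a simplifying assumption: string features are accepted as either `str` or `bytes`, since TensorFlow typically returns strings as bytes.

```python
import json

# The schema as printed on this page, parsed from JSON.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the dtype names above to acceptable Python types.
PY_TYPES = {"int64": (int,), "string": (str, bytes)}


def check_example(example: dict, features: dict) -> list:
    """Return a list of problems; an empty list means the example fits the schema."""
    problems = []
    for name, spec in features.items():
        if name not in example:
            problems.append(f"missing field: {name}")
            continue
        value = example[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PY_TYPES[spec["dtype"]]):
            problems.append(
                f"{name}: expected {spec['dtype']}, got {type(value).__name__}"
            )
    for name in example:
        if name not in features:
            problems.append(f"unexpected field: {name}")
    return problems


features = json.loads(FEATURES_JSON)
print(check_example({"id": 0, "text": "example text"}, features))  # prints []
print(check_example({"id": "0"}, features))
```

The second call reports both the wrongly typed `id` and the missing `text` field, which can help when post-processing examples outside TensorFlow.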

unshuffled_deduplicated_arz

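Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    79928
  • Features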
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_an

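Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    2025
  • Features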
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ast

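Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    5343
  • Features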
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ba

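Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    27050
  • Features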
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_am

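Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    43102
  • Features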
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_as

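Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    9212
  • Features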
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_azb

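Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    9985
  • Features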
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_be

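Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    307405
  • Features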
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bo

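Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    15762
  • Features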
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bxr

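Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    36
  • Features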
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ceb

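Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    26145
  • Features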
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_az

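Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    626796
  • Features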
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bcl

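Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    1
  • Features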
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
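
The train splits listed so far range from a single example (bcl) to several hundred thousand (az), so it can be useful to filter configs by size before loading any of them. The sketch below is one possible way to do that. The counts are copied from the split tables above, while `TRAIN_EXAMPLES` and `configs_with_at_least` are hypothetical names introduced only for illustration.

```python
# Example counts copied from the 'train' rows above (a subset of the configs).
TRAIN_EXAMPLES = {
    "af": 130640,
    "als": 4518,
    "arz": 79928,
    "be": 307405,
    "bxr": 36,
    "az": 626796,
    "bcl": 1,
}


def configs_with_at_least(counts: dict, minimum: int) -> list:
    """Return language codes with at least `minimum` examples, largest first."""
    selected = [lang for lang, n in counts.items() if n >= minimum]
    return sorted(selected, key=lambda lang: counts[lang], reverse=True)


print(configs_with_at_least(TRAIN_EXAMPLES, 10_000))  # ['az', 'be', 'af', 'arz']
```

The threshold of 10,000 is arbitrary; an appropriate cutoff depends on what the data will be used for.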

unshuffled_deduplicated_cy

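Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    98225
  • Features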
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_dsb

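Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    37
  • Features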
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bn

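Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    1114481
  • Features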
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bs

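Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    702
  • Features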
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ce

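Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    2984
  • Features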
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cv

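Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    10130
  • Features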
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_diq

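Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    1
  • Features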
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eml

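Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    80
  • Features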
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_et

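Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    1172041
  • Features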
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bg

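Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    3398679
  • Features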
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bpy

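Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    1770
  • Features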
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ca

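Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    2458067
  • Features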
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ckb

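Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    68210
  • Features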
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ar

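Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    9006977
  • Features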
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
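
Every configuration listed here shares the same two features: an `int64` `id` and a string `text`. As a rough illustration, the sketch below checks a plain Python record against that schema. `FEATURES` is transcribed from the listing above, while `matches_features` is a hypothetical helper, not a TFDS or Hugging Face API.

```python
# Feature schema transcribed from the listing above.
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_features(record: dict) -> bool:
    """Return True if `record` has exactly the listed keys with fitting values."""
    if set(record) != set(FEATURES):
        return False
    for key, spec in FEATURES.items():
        value = record[key]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, str):
                return False
    return True
```

A check like this can be useful when writing fixtures or mock records for code that consumes the dataset, since it does not require downloading any data.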

unshuffled_deduplicated_av

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
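  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 360
  • Features: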
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
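  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4
  • Features: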
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
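
Split sizes on this page range from a handful of examples to millions, so it can help to cap how much is read when first inspecting a configuration. The sketch below builds a TFDS split-slicing string such as `train[:100]`. The counts are copied from the listings above, `preview_split` is a hypothetical helper, and whether slicing avoids a full download for `huggingface:` datasets is an assumption to verify.

```python
# Train split sizes copied from the listings above.
TRAIN_EXAMPLES = {"ckb": 68210, "ar": 9006977, "av": 360, "bar": 4}


def preview_split(lang: str, limit: int = 100) -> str:
    """Return a split string covering at most `limit` training examples."""
    n = min(TRAIN_EXAMPLES[lang], limit)
    return f"train[:{n}]"


print(preview_split("ar"))
print(preview_split("bar"))
```

The resulting string can be passed as the `split` argument of `tfds.load`.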

unshuffled_deduplicated_bh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
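  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 82
  • Features: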
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_br

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
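  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 14724
  • Features: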
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cbk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
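  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1
  • Features: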
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_da

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
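  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4771098
  • Features: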
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_dv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
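  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 17024
  • Features: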
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
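  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 84752
  • Features: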
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
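  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 8203495
  • Features: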
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
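  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 20661
  • Features: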
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
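  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 68
  • Features: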
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
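  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 12308039
  • Features: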
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
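  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1909387
  • Features: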
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
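  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 6582908
  • Features: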
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ie

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
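  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 11
  • Features: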
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
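  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 59448891
  • Features: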
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
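  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3883
  • Features: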
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
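  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 169834
  • Features: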
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
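  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3084
  • Features: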
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ia

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
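  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 529
  • Features: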
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_io

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
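  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 617
  • Features: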
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_jbo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
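  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 617
  • Features: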
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_km

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
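  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 108346
  • Features: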
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ku

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
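  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 29054
  • Features: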
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_la

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
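  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 18808
  • Features: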
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lmo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
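  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1374
  • Features: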
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
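  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 843195
  • Features: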
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_min

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
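  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 166
  • Features: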
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
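  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.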

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 212556
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mwl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
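  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.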

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 7
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
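  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.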

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 58
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_new

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2126
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_oc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 6485
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pam

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ps

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 67921
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_it

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 28522082
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ka

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 372158
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 5044757
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_scn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 17
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3675420
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 68
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lez

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1381
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lrc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 72
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 13343
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 453904
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 183443
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_myv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 5
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nds

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 8714
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 109118
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_os

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  2559
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
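
The feature specification above gives each record an `int64` `id` and a string `text`. A minimal sketch of a sanity check on decoded records follows. `matches_features` is a hypothetical helper, and it assumes records have already been converted to plain Python values (TFDS itself usually yields tensors or NumPy values, with strings as bytes).

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_features(record: dict) -> bool:
    """Check a decoded record against the {"id": int64, "text": string} schema."""
    if set(record) != {"id", "text"}:
        return False
    rid, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rid, bool) or not isinstance(rid, int):
        return False
    if not INT64_MIN <= rid <= INT64_MAX:
        return False
    # Accept both decoded str and raw UTF-8 bytes.
    return isinstance(text, (str, bytes))


print(matches_features({"id": 0, "text": "example"}))    # True
print(matches_features({"id": "0", "text": "example"}))  # False
```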

unshuffled_deduplicated_pms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  2859
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_qu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  411
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  7121
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  2820821
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  17610
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_so

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  42
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  645747
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ta

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  833101
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  4694
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tyv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  24
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
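
The subsets listed so far vary enormously in size, so a tiny one is convenient for a quick smoke test before committing to a large download. The sketch below copies the 'train' counts from the split tables above into a dictionary. `subsets_under` is a hypothetical helper introduced only for illustration.

```python
# 'train' example counts copied from the split tables on this page.
TRAIN_COUNTS = {
    "nn": 109118, "os": 2559, "pms": 2859, "qu": 411, "sa": 7121,
    "sk": 2820821, "sh": 17610, "so": 42, "sr": 645747, "ta": 833101,
    "tk": 4694, "tyv": 24,
}


def subsets_under(limit: int, counts: dict = TRAIN_COUNTS) -> list:
    """Return language codes with fewer than `limit` examples, smallest first."""
    small = [code for code, n in counts.items() if n < limit]
    return sorted(small, key=counts.get)


print(subsets_under(1000))  # ['tyv', 'so', 'qu']
```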

unshuffled_deduplicated_uz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  15074
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  677
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xmf

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  2418
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  11014487
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  56259
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_de

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  62398034
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
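
With 62,398,034 'train' examples, the German subset is large enough that it is often practical to start with a slice. TFDS split strings accept slicing such as `train[:10000]`. The sketch below only builds such a string, and `head_split` is a hypothetical helper. Whether a slice avoids preparing the full subset depends on the TFDS setup, which is not covered here.

```python
DE_TRAIN_EXAMPLES = 62_398_034  # 'train' count from the split table above


def head_split(n_examples: int, total: int, split: str = "train") -> str:
    """Build a TFDS split string selecting the first `n_examples` records."""
    if not 0 < n_examples <= total:
        raise ValueError(f"n_examples must be in 1..{total}")
    return f"{split}[:{n_examples}]"


split = head_split(10_000, DE_TRAIN_EXAMPLES)
print(split)  # train[:10000]
# Assumed usage (requires tensorflow_datasets):
# ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de', split=split)
```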

unshuffled_deduplicated_tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  11596446
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  6521169
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  7782375
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 9897709
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
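
Each example in these configs carries the two features listed above: an int64 id and a string text. When a dataset is iterated through tfds.as_numpy, string features typically arrive as bytes, so a small conversion step is often useful. The sketch below is an illustration only: to_python_example is a hypothetical helper, and the sample record is made up rather than taken from the corpus.

```python
from typing import Any, Dict


def to_python_example(example: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one {'id', 'text'} example to plain Python types."""
    missing = {"id", "text"} - set(example)
    if missing:
        raise KeyError(f"missing features: {sorted(missing)}")
    text = example["text"]
    # Decode bytes to str; leave values that are already str untouched.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    return {"id": int(example["id"]), "text": text}


# Made-up record shaped like the feature spec (not real corpus data).
sample = {"id": 0, "text": b"example text"}
print(to_python_example(sample))
```

In practice the helper would be applied to each element yielded by tfds.as_numpy for the 'train' split.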

unshuffled_deduplicated_wuu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 64
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 49
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_als

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_als')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 7324
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_arz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 158113
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_az

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_az')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 912330
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bcl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 1
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 1675515
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 2143
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ce

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 4042
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 20281
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_diq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 1
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 84
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_et

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_et')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 2093621
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_zh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 41708901
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
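
With 41,708,901 examples in 'train', loading the whole split is rarely needed for a first look at the data. TFDS accepts slice expressions such as train[:1%] in the split argument; the sketch below only builds such an expression and estimates how many examples it would cover. percent_slice and approx_examples are hypothetical helpers, the estimate uses plain floor division and may differ slightly from the boundaries TFDS actually picks, and slicing support for huggingface: datasets is assumed here rather than verified.

```python
TRAIN_EXAMPLES = 41_708_901  # 'train' size listed above for unshuffled_deduplicated_zh


def percent_slice(split: str, percent: int) -> str:
    """Build a TFDS-style slice expression for the first `percent` percent of a split."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"{split}[:{percent}%]"


def approx_examples(total: int, percent: int) -> int:
    """Rough size of a percent slice (floor division; TFDS rounding may differ)."""
    return total * percent // 100


spec = percent_slice("train", 1)
print(spec, approx_examples(TRAIN_EXAMPLES, 1))
```

The resulting string would be passed as split=spec to tfds.load alongside the config name shown for this dataset.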

unshuffled_original_an

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_an')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 2449
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ast

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 6999
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ba

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 42551
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 5869686
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bpy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split  Examples
'train' 6046
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ca

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
  • 説明
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  4390754
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ckb

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  103639
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_es

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  56326016
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
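
With 56,326,016 training examples, `unshuffled_deduplicated_es` is one of the larger configurations on this page, so it can be useful to request only a slice of the split. TFDS accepts percent slices such as `train[:1%]` as the `split` argument of `tfds.load`. The sketch below only builds that string and estimates the slice size; the helper names are hypothetical, the estimate uses simple floor division (TFDS may round slice boundaries differently), and it assumes slicing works for this community dataset as it does for other TFDS datasets.

```python
def percent_slice(split: str, percent: int) -> str:
    """Return a TFDS split string selecting the first `percent` percent."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"{split}[:{percent}%]"


def estimated_examples(total: int, percent: int) -> int:
    """Rough size of the slice; TFDS may round boundaries differently."""
    return total * percent // 100


TOTAL_ES = 56326016  # 'train' size listed above for unshuffled_deduplicated_es

spec = percent_slice("train", 1)
print(spec, estimated_examples(TOTAL_ES, 1))
# The string could then be passed as:
#   tfds.load('huggingface:oscar/unshuffled_deduplicated_es', split=spec)
```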

unshuffled_original_da

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_original_da')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  7664010
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dv

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  21018
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eo

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  121168
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fi

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  5326443
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ga

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  46493
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gom

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  484
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hr

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  321484
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hy

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  396093
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ilo

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1578
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fa

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  13704702
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fy

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  33053
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gn

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  106
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hi

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  3264660
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hu

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  11197780
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ie

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  101
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ja

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  39496439
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kk

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  338073
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_krc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1377
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ky

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 86561
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_li

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 118
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1737411
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mhr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2515
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 197878
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 16383
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mzn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 917
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ne

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 219334
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_no

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3229940
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 87235
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pnb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3463
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_rm

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 34
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 8555
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_si

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 120684
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 461598
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 24803
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_th

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3749826
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 82738
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ur

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 428674
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
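
The feature specification above is plain JSON, so it can be read programmatically. The following is a minimal sketch, assuming the JSON is available as a string. `column_dtypes` and `matches_schema` are hypothetical helper names (not part of TFDS or the Hugging Face datasets library), and the mapping from dtype names to Python types is an assumption that covers only the two dtypes listed here.

```python
import json

# Feature specification copied from the listing above.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Python types accepted for each dtype (an assumption made for this sketch).
PYTHON_TYPES = {"int64": int, "string": str}


def column_dtypes(features_json: str) -> dict:
    """Map each column name to its declared dtype."""
    spec = json.loads(features_json)
    return {name: field["dtype"] for name, field in spec.items()}


def matches_schema(record: dict, dtypes: dict) -> bool:
    """Return True if the record has exactly the declared columns and types."""
    if set(record) != set(dtypes):
        return False
    return all(
        isinstance(record[name], PYTHON_TYPES[dtype])
        for name, dtype in dtypes.items()
    )


dtypes = column_dtypes(FEATURES_JSON)
print(dtypes)
print(matches_schema({"id": 0, "text": "example"}, dtypes))
```

Every configuration in this listing declares the same two columns, so one such check could be reused across all of them.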

unshuffled_deduplicated_vo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  3317
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xal

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  36
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yue

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  7
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
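
The configuration names in this listing follow a visible pattern: `unshuffled_`, then `deduplicated` or `original`, then a language code, all prefixed with `huggingface:oscar/` when passed to `tfds.load`. The sketch below builds and parses such names. `tfds_name` and `parse_config` are hypothetical helpers, and the regular expression assumes language codes consist only of lowercase letters, which holds for the codes shown here but is not confirmed for the whole corpus.

```python
import re

# Prefix used by the tfds.load calls in this listing.
PREFIX = "huggingface:oscar/"

# Assumption: the variant is one of two values and the language code is lowercase letters.
CONFIG_PATTERN = re.compile(r"^unshuffled_(deduplicated|original)_([a-z]+)$")


def tfds_name(variant: str, lang: str) -> str:
    """Build the name string passed to tfds.load for a variant and language code."""
    return f"{PREFIX}unshuffled_{variant}_{lang}"


def parse_config(config: str) -> tuple:
    """Split a configuration name into (variant, language code)."""
    match = CONFIG_PATTERN.match(config)
    if match is None:
        raise ValueError(f"unrecognized configuration name: {config!r}")
    return match.group(1), match.group(2)


print(tfds_name("deduplicated", "yue"))
print(parse_config("unshuffled_original_am"))
```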

unshuffled_original_am

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_am')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  83663
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_as

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_as')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  14985
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_azb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  15446
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_be

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_be')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  586031
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  26795
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bxr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  42
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ceb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  56248
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  157698
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  65
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  96742378
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
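
Some configurations are very large (the 'train' split of unshuffled_original_fr above lists 96742378 examples), so it may be useful to load only a prefix of the split. The sketch below builds a split string in the TFDS slicing syntax (`train[:N]`), which would then be passed as the `split` argument of `tfds.load`. `capped_split` is a hypothetical helper, the cap of 10000 is arbitrary, and it is an assumption that slicing behaves for this community dataset the same way as for built-in TFDS datasets.

```python
def capped_split(num_examples: int, cap: int, split: str = "train") -> str:
    """Return a split string that selects at most `cap` examples.

    If the split already has `cap` examples or fewer, the plain split
    name is returned; otherwise a prefix slice such as 'train[:10000]'.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    if num_examples <= cap:
        return split
    return f"{split}[:{cap}]"


# Example counts taken from the listing in this document.
print(capped_split(96742378, 10000))  # unshuffled_original_fr
print(capped_split(65, 10000))        # unshuffled_original_dsb
```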

unshuffled_original_gd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  5799
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  240691
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  7959
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ia

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1040
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_io

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_io')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  694
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jbo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  832
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_km

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_km')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  159363
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
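
The feature specification above describes each record as an int64 `id` plus a string `text`. As a minimal sketch, assuming decoded records are available as plain Python dicts, a record can be checked against that specification as follows. The helper `matches_features` is hypothetical and is not part of TFDS.

```python
import json

# Feature specification copied from the listing above.
FEATURES = json.loads("""
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Range of a signed 64-bit integer.
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def matches_features(record, features=FEATURES):
    """Return True if `record` has exactly the listed fields with matching types.

    Hypothetical helper: assumes records are plain Python dicts, which is an
    assumption about how the data is consumed, not something TFDS guarantees.
    """
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, str):
                return False
        else:
            # Unknown dtype: this sketch only covers the two listed above.
            return False
    return True
```

For example, `matches_features({"id": 0, "text": "sample"})` returns `True`, while a record with a missing field or a non-integer `id` returns `False`.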

unshuffled_original_ku

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  46535
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
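
Every configuration on this page is loaded with a name of the form `huggingface:oscar/<config>`. The sketch below builds that name from a configuration name and wraps the `tfds.load` call. The helpers `parse_config`, `tfds_name`, and `load_config` are hypothetical, and the naming pattern in the regular expression is inferred from the configuration names listed here rather than taken from an official grammar.

```python
import re

TFDS_PREFIX = "huggingface:oscar/"

# Pattern inferred from the configuration names on this page (an assumption):
# unshuffled_<original|deduplicated>_<lowercase language code>.
_CONFIG_RE = re.compile(r"^unshuffled_(original|deduplicated)_([a-z]+)$")


def parse_config(config):
    """Split a configuration name into (variant, language_code)."""
    match = _CONFIG_RE.match(config)
    if match is None:
        raise ValueError(f"unrecognized OSCAR configuration: {config!r}")
    return match.group(1), match.group(2)


def tfds_name(config):
    """Build the name string passed to tfds.load for a configuration."""
    parse_config(config)  # validate before building the name
    return TFDS_PREFIX + config


def load_config(config, split="train"):
    """Load one configuration; needs tensorflow_datasets and network access."""
    # Imported lazily so the helpers above work without TFDS installed.
    import tensorflow_datasets as tfds
    return tfds.load(tfds_name(config), split=split)
```

Calling `tfds_name("unshuffled_original_ku")` returns `'huggingface:oscar/unshuffled_original_ku'`, the same string used in the load command above.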

unshuffled_original_la

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_la')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  94588
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lmo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1401
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1593820
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_min

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_min')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  220
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  326804
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mwl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  8
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  61
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_new

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_new')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  4696
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_oc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  10709
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pam

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  3
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ps

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  98216
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  9387265
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
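
With 9387265 training examples, this configuration is far larger than most on this page, so it can be useful to load only a prefix of the split while experimenting. The sketch below builds a split string using the TFDS slicing syntax (`'train[:N]'`). Whether that syntax is honored for this community dataset is an assumption that has not been verified here, and `capped_split` is a hypothetical helper.

```python
def capped_split(num_examples, limit, split="train"):
    """Return a split string that selects at most `limit` examples.

    Hypothetical helper: `num_examples` is the count listed in the splits
    table, and the returned string is meant to be passed as `split=` to
    tfds.load (assumed to accept absolute slicing for this dataset).
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if num_examples <= limit:
        # The whole split already fits under the cap.
        return split
    return f"{split}[:{limit}]"
```

For the count listed above, `capped_split(9387265, 100000)` returns `'train[:100000]'`; for a small configuration the split name is returned unchanged.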

unshuffled_original_scn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  21
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  5492194
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1013619
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ta

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1263280
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  6456
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tyv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  34
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  27537
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1001
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xmf

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  3783
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_it

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_it')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  46981781
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
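With 46,981,781 examples in 'train', iterating over the whole split may be impractical while experimenting. The sketch below builds a TFDS split-slicing string such as `'train[:1%]'`. The helper name `percent_slice` is hypothetical, and whether slicing behaves as expected for this Hugging Face-backed dataset is an assumption, not something this page confirms.

```python
def percent_slice(split: str, percent: int) -> str:
    """Build a TFDS split string selecting the first `percent` percent of `split`."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"{split}[:{percent}%]"
```

The resulting string could be passed as the `split` argument of `tfds.load`, for example `tfds.load('huggingface:oscar/unshuffled_original_it', split=percent_slice('train', 1))`, which would select roughly 470,000 examples.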

unshuffled_original_ka

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  563916
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  7345075
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  203
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lez

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1485
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lrc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  88
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  17957
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  603937
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  534016
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_myv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  6
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nds

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  18174
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  185884
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_os

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_os')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  5213
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  3225
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_qu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  452
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  14291
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  36700
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_so

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_so')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  156
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 17395625
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
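
Every load command on this page follows the same naming pattern: `huggingface:oscar/unshuffled_<variant>_<language code>`, where the variant is `original` or `deduplicated`. The sketch below assembles that name. The helper `oscar_tfds_name` is hypothetical, and the pattern is inferred from the commands shown here rather than from a documented API.

```python
# The two variants that appear in the configuration names on this page.
VARIANTS = ("original", "deduplicated")


def oscar_tfds_name(lang, variant="original"):
    """Build the name string used in the `tfds.load(...)` commands above."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not isinstance(lang, str) or not lang:
        raise ValueError("lang must be a non-empty language code")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"
```

With this helper, `tfds.load(oscar_tfds_name("sv"))` would be equivalent to the command listed for this configuration. The helper does not check that the language code actually exists in OSCAR.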

unshuffled_original_tg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 89002
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 18535253
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 12973467
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 14898250
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wuu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 214
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 214
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_zh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 60137667
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 304230423
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
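
With a 'train' split of this size, a quick inspection usually only needs the first few records. The sketch below is a generic pattern using a hypothetical helper `peek`; it works on any Python iterable of records and is not specific to TFDS. It assumes the loaded split can be iterated lazily.

```python
from itertools import islice


def peek(examples, n=5):
    """Return the first `n` items of an iterable; later items are never pulled."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return list(islice(examples, n))
```

Because `islice` stops after `n` items, the rest of the iterable is left unread, which matters when the source has hundreds of millions of records.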

unshuffled_deduplicated_eu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 256513
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_frr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 7
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 284320
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_he

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2375030
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ht

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 9
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_id

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 9948521
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 389515
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_jv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1163
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 251064
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 924
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 21735
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
  • 説明
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    32652

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
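
The configuration names listed here follow the pattern `unshuffled_<variant>_<language code>`, and the TFDS name prefixes that with `huggingface:oscar/`. A small sketch of assembling the string is given below. `oscar_tfds_name` is a hypothetical helper, and the `tfds.load` call is left commented out because it needs `tensorflow_datasets` installed and triggers a download.

```python
VARIANTS = ("deduplicated", "original")  # the two variants that appear in this listing


def oscar_tfds_name(lang: str, variant: str = "deduplicated") -> str:
    """Build the TFDS name for an OSCAR configuration (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


name = oscar_tfds_name("lo")
print(name)  # huggingface:oscar/unshuffled_deduplicated_lo

# With tensorflow_datasets installed, the name is passed straight to tfds.load:
# import tensorflow_datasets as tfds
# ds = tfds.load(name)
```

The helper does not check that a language code actually exists in the catalog; an unknown code would only fail later, when the dataset is loaded.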

unshuffled_deduplicated_mai

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    25

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    299457

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mrj

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    669

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_my

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    136639

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nap

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    55

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    20812149

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_or

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    44230

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    20682611

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    26920397

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    115954598

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    33925

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    886223

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_su

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    511

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_te

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    312644

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    294132

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ug

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    15503

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vec

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    64

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_war

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    9161

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    32919

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_af

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_af')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  201117
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
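The `unshuffled_deduplicated_af` listing earlier on this page gives 130640 'train' examples, against 201117 for `unshuffled_original_af` here. As a rough sketch of how the two counts compare (the helper name `removed_fraction` is hypothetical, and the calculation says nothing about how the deduplication itself was performed):

```python
# Example counts quoted from this page's 'train' split listings.
ORIGINAL_AF = 201117       # unshuffled_original_af
DEDUPLICATED_AF = 130640   # unshuffled_deduplicated_af


def removed_fraction(original, deduplicated):
    """Fraction of examples absent from the deduplicated config."""
    if original <= 0:
        raise ValueError("original count must be positive")
    return (original - deduplicated) / original


fraction = removed_fraction(ORIGINAL_AF, DEDUPLICATED_AF)
print(f"{fraction:.1%} fewer examples after deduplication")
```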

unshuffled_original_ar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  16365602
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
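The load commands on this page all follow one naming pattern: `huggingface:oscar/` followed by `unshuffled_original_` or `unshuffled_deduplicated_` and a language code. A minimal sketch of building those names programmatically follows. The helper names are hypothetical, the pattern is inferred from the headings on this page rather than from any documented rule, and `load_oscar_train` assumes `tensorflow_datasets` is installed and the data can be downloaded.

```python
def oscar_config_name(language, deduplicated=False):
    """Build a config name following the pattern seen in the headings."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{language}"


def oscar_tfds_name(language, deduplicated=False):
    """Prefix the config name the way the load commands on this page do."""
    return f"huggingface:oscar/{oscar_config_name(language, deduplicated)}"


def load_oscar_train(language, deduplicated=False):
    """Load the 'train' split; needs tensorflow_datasets and network access."""
    # Imported lazily so the name helpers above work without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split="train")


print(oscar_tfds_name("ar"))
```

Small configs such as `unshuffled_original_cbk` (1 example) or `unshuffled_original_bar` (4 examples) are probably the quickest ones to try first.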

unshuffled_original_av

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_av')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  456
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  4
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  336
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_br

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_br')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  37085
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cbk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  21001388
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_de

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_de')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  104913504
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_el')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  10425596
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_es

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_es')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  88199221
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  8557453
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ga

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  83223
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gom

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  640
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  582219
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  659430
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ilo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  2638
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ja

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  62721527
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  524591
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_krc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1581
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ky

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  146993
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_li

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_li')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  137
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  2977757
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mhr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  3212
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  395605
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  26598
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mzn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1055
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ne

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  299938
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_no

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_no')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  5546211
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  127467
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pnb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  4599
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_rm

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  41
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  22301
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_si

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_si')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  203082
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  672077
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  41986
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_th

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_th')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  6064129
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  135923
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ur

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  638596
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  3366
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xal

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    39
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yue

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    11
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_en')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    455994980
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
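
With roughly 456 million training examples, this configuration is one where reading only a slice may be preferable. The sketch below uses the standard TFDS split-slicing syntax (`train[:N]`). The helper names `train_slice` and `load_sample` are hypothetical, and it is an assumption that slicing behaves the same for `huggingface:` datasets as for native TFDS ones. Slicing limits what is read, not necessarily what is downloaded and prepared.

```python
def train_slice(num_examples):
    """Build a TFDS split string selecting the first num_examples of 'train'."""
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    return f"train[:{num_examples}]"


def load_sample(config, num_examples=1000):
    """Load only the first num_examples records of one OSCAR configuration."""
    # Imported lazily so train_slice can be used without TFDS installed.
    import tensorflow_datasets as tfds

    name = f"huggingface:oscar/{config}"
    return tfds.load(name, split=train_slice(num_examples))


print(train_slice(1000))  # train[:1000]
```

For example, `load_sample('unshuffled_original_en', 1000)` would request the first 1,000 records of the English configuration.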

unshuffled_original_eu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    506883
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_frr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    7
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    544388
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_he

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_he')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    3808397
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ht

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    13
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_id

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_id')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    16236463
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_is')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    625673
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    1445
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    350363
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    1549
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    34807
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    52910
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mai

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    123
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    437871
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mrj

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    757
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_my

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_my')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    232329
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nap

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    73
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 34682142
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
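
The configuration names on this page follow a visible pattern: `unshuffled_original_<language>` or `unshuffled_deduplicated_<language>`, prefixed with `huggingface:oscar/` when passed to `tfds.load`. As a rough sketch under that assumption, the name can be built from a language code instead of being typed by hand. The helpers `oscar_tfds_name` and `load_oscar` are hypothetical names introduced here; only `tfds.load` and its `split` argument come from TensorFlow Datasets.

```python
def oscar_tfds_name(language, variant="original"):
    """Build the TFDS name for an OSCAR configuration.

    Hypothetical helper: it only mirrors the naming pattern seen on this
    page and does not check that the configuration actually exists.
    """
    if variant not in {"original", "deduplicated"}:
        raise ValueError(f"unknown variant: {variant!r}")
    if not language:
        raise ValueError("language code must be non-empty")
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def load_oscar(language, variant="original", split="train"):
    """Load one OSCAR configuration through TFDS (downloads data on first use)."""
    # Imported lazily so the name helper works without TensorFlow Datasets installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, variant), split=split)
```

For example, `oscar_tfds_name("nl")` yields the same string as the command shown for `unshuffled_original_nl`. Large configurations hold tens of millions of examples, so a small one is a safer first test of `load_oscar`.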

unshuffled_original_or

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_or')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 59463
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 35440972
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 42114520
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 161836003
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 44280
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1746604
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_su

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_su')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 805
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_te

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_te')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 475703
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 458206
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ug

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 22255
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vec

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 73
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_war

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_war')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 9760
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 59364
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}