


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 130640
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

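Every entry on this page follows the same naming pattern, so the load command can be generated from the language code. The sketch below is illustrative only: `oscar_tfds_name` and `load_oscar_train` are hypothetical helper names, not part of TFDS, and the name format is simply the one shown in this catalog. The TFDS import is deferred so the name helper can be used without TensorFlow installed.

```python
def oscar_tfds_name(lang_code: str) -> str:
    """Build the TFDS name of an unshuffled, deduplicated OSCAR config.

    Hypothetical helper: it only reproduces the pattern listed on this page.
    """
    code = lang_code.strip()
    if not code:
        raise ValueError("lang_code must be a non-empty string such as 'af'")
    return f"huggingface:oscar/unshuffled_deduplicated_{code}"


def load_oscar_train(lang_code: str):
    """Load the 'train' split of one config (downloads data on first use)."""
    # Imported lazily so oscar_tfds_name works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang_code), split="train")


print(oscar_tfds_name("af"))
```

Calling `load_oscar_train("bxr")` would fetch one of the smallest configurations listed here (36 examples), which is a convenient way to check a setup before loading a multi-million-example config such as `ar`.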

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 4518
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

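The feature specification is the same for every configuration on this page: an `int64` field named `id` and a `string` field named `text`. As a rough sketch, a plain Python record can be checked against that specification as follows. The `DTYPE_TO_PYTHON` mapping and the `matches_features` helper are assumptions made for illustration, and the sample record is invented; records loaded through TFDS arrive as tensors, so they would need converting first.

```python
import json

# Feature specification as listed for each configuration on this page.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Hypothetical mapping from the listed dtype strings to Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], DTYPE_TO_PYTHON[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "sample text"}, features))
print(matches_features({"id": "0", "text": "sample text"}, features))
```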

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 79928
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 2025
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 5343
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 27050
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 43102
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 9212
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 9985
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 307405
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 15762
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 36
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 26145
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 626796
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 1
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 98225
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 37
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 1114481
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 702
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 2984
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 10130
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 1
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 80
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 1172041
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 3398679
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 1770
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 2458067
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 68210
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 9006977
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 360
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 4
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 82
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 14724
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material claimed to be infringing, with information reasonably sufficient to locate that material.

  • Version: 1.0.0

  • Splits:

'train' 1
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

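Every config on this page lists the same two features, an int64 `id` and a string `text`. As a rough illustration, and assuming records have already been converted to plain Python values (TFDS itself may yield tensors or byte strings instead), a check against that schema could look like the sketch below. `matches_oscar_features` is a hypothetical helper, not a TFDS function.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_oscar_features(example: dict) -> bool:
    """Return True if `example` has exactly an int64 `id` and a str `text`."""
    if set(example) != {"id", "text"}:
        return False
    doc_id, text = example["id"], example["text"]
    # bool is a subclass of int in Python, so it is excluded explicitly.
    id_ok = (
        isinstance(doc_id, int)
        and not isinstance(doc_id, bool)
        and INT64_MIN <= doc_id <= INT64_MAX
    )
    return id_ok and isinstance(text, str)


print(matches_oscar_features({"id": 0, "text": "example document"}))
```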

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
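  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 4771098
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}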

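The 'train' split sizes differ by several orders of magnitude between configs, which matters when deciding what to download. The sketch below copies the counts of the three configs listed just above (br, cbk, da) into a dictionary and filters them by size. `configs_at_most` and the 100,000 threshold are illustrative assumptions only.

```python
# 'train' example counts as listed on this page for three configs.
TRAIN_EXAMPLES = {"br": 14724, "cbk": 1, "da": 4771098}


def configs_at_most(counts: dict, max_examples: int) -> list:
    """Return the language codes whose 'train' split has at most max_examples."""
    return sorted(lang for lang, n in counts.items() if n <= max_examples)


print(configs_at_most(TRAIN_EXAMPLES, 100_000))  # ['br', 'cbk']
```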

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
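  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 17024
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}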


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
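  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 84752
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}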


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
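  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 8203495
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}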


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
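  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 20661
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}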


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
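  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 68
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}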


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
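  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 12308039
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}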


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
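  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1909387
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}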


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
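  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 6582908
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}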


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
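  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 11
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}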


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
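  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 59448891
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}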


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
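  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3883
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}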


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
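  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 169834
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}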


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
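  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3084
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}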


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
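  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 529
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}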


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
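  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 617
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}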


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
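  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 617
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}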


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
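  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 108346
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}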


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
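  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 29054
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}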


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
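  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 18808
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}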


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
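  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1374
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}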


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
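  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 843195
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}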


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
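  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 166
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}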


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
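  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 212556
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}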


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
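  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 7
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}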


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
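  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.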

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 58
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 2126
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 6485
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 67921
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 28522082 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 372158 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 5044757 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
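
The load command above can be wrapped in a few small helpers. The following is a minimal sketch, not part of TFDS: the names `oscar_tfds_name`, `matches_features`, and `load_train_split` are hypothetical, and `load_train_split` assumes `tensorflow_datasets` is installed and the data can be downloaded. The check relies on every configuration in this list carrying the same two features, an int64 `id` and a string `text`.

```python
import numbers
from typing import Any, Mapping

# Field names taken from the feature listings in this catalog.
OSCAR_FIELDS = ("id", "text")


def oscar_tfds_name(lang: str) -> str:
    """Build the TFDS name of an unshuffled, deduplicated OSCAR configuration."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang}"


def matches_features(example: Mapping[str, Any]) -> bool:
    """Return True if a decoded example has exactly the listed fields and types."""
    if set(example) != set(OSCAR_FIELDS):
        return False
    # "id" is int64: accept Python ints and NumPy integer scalars, reject bools.
    id_value = example["id"]
    id_ok = isinstance(id_value, numbers.Integral) and not isinstance(id_value, bool)
    # "text" is a string; decoded TFDS examples may expose it as bytes.
    text_ok = isinstance(example["text"], (str, bytes))
    return id_ok and text_ok


def load_train_split(lang: str):
    """Load the 'train' split (assumes tensorflow_datasets and network access)."""
    # Imported lazily so the helpers above work without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang), split="train")
```

For a first try, a small configuration such as `scn` (17 train examples) keeps the download short: `load_train_split("scn")`.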
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 17 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 3675420 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 68 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 1381 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 72 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 13343 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 453904 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 183443 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 5 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 8714 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 109118 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 2559 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 2859 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 411 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 7121 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 2820821 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 17610 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 42 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 645747
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

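The feature listing above describes each example as an integer `id` plus a string `text`. A minimal sketch of checking a record against that shape follows. It assumes the record has already been decoded to plain Python values, and the names `EXPECTED_TYPES` and `check_record` are hypothetical.

```python
# Assumed mapping from the listed dtypes (int64, string) to Python types.
EXPECTED_TYPES = {"id": int, "text": str}


def check_record(record: dict) -> list:
    """Return a list of problems found in one decoded example (empty if none)."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing feature: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{name}: expected {expected.__name__}, got {type(value).__name__}")
    return problems
```

An empty list means the record matches the listed features; any other result names the fields that do not.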

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 833101
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 4694
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 24
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 15074
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 677
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 2418
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 11014487
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 56259
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 62398034
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

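With 62,398,034 examples in its `train` split, this config is far larger than most entries in this list. For a quick look, one option is to request only the first few examples with the TFDS split-slicing syntax (`train[:N]`). The sketch below is illustrative only: the names `head_split` and `load_head` are hypothetical, and it is an assumption that slicing limits what is read rather than what is downloaded and prepared.

```python
def head_split(num_examples: int, split: str = "train") -> str:
    """Return a TFDS split string selecting the first `num_examples` examples."""
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    return f"{split}[:{num_examples}]"


def load_head(name: str, num_examples: int):
    """Load only the first examples of a dataset (needs tensorflow_datasets)."""
    # Deferred import: only required when a dataset is actually loaded.
    import tensorflow_datasets as tfds

    return tfds.load(name, split=head_split(num_examples))
```

For instance, `load_head('huggingface:oscar/unshuffled_deduplicated_de', 1000)` would pass `split='train[:1000]'` to `tfds.load`.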

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 11596446
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 6521169
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 7782375
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 9897709
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 64
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 49
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_als')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 7324
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

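The `unshuffled_original_als` split above lists 7324 examples, while the `unshuffled_deduplicated_als` entry earlier in this catalog lists 4518. A small sketch of comparing the two counts follows. The function name `dedup_summary` is hypothetical, and the result describes only the example counts shown on this page, not how the deduplication was performed.

```python
def dedup_summary(original: int, deduplicated: int) -> dict:
    """Compare example counts of an original config and its deduplicated twin."""
    if original <= 0 or not 0 <= deduplicated <= original:
        raise ValueError("expected 0 <= deduplicated <= original and original > 0")
    return {
        "removed": original - deduplicated,        # examples no longer present
        "kept_fraction": deduplicated / original,  # share of examples retained
    }


# Counts for the `als` configs as listed in this catalog.
summary = dedup_summary(original=7324, deduplicated=4518)
print(summary["removed"], round(summary["kept_fraction"], 3))
```

The same helper applies to any language that appears under both variants.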

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 158113
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_az')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 912330
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1675515
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

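The configuration names in this catalog appear to follow a regular pattern: `unshuffled_original_<lang>` or `unshuffled_deduplicated_<lang>`, prefixed with `huggingface:oscar/`. As a minimal sketch under that assumption, the helper below builds the name string passed to `tfds.load`. The names `oscar_tfds_name` and `load_oscar` are hypothetical and used for illustration only; they are not part of TFDS.

```python
VARIANTS = ("original", "deduplicated")


def oscar_tfds_name(lang: str, variant: str = "original") -> str:
    """Build the TFDS name for an OSCAR config (assumed naming pattern)."""
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    if not lang:
        raise ValueError("lang must be a non-empty language code")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang: str, variant: str = "original"):
    """Load one config via tfds.load (needs tensorflow_datasets and network access)."""
    # Imported lazily so that oscar_tfds_name works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, variant))


print(oscar_tfds_name("bn"))
```

Calling `load_oscar("bn")` would then be equivalent to the `tfds.load` command listed for the `bn` entry, provided the pattern holds for the language code in question.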

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 2143
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

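The feature specification above describes each record as an integer `id` and a string `text`. The following sketch assumes the spec is available as a JSON string in the form shown in this catalog; it maps each `dtype` to a Python type and checks a record against it. `FEATURES_JSON`, `DTYPE_TO_PYTHON`, and `matches_spec` are illustrative names, and the sample records are made up.

```python
import json

FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping for the two dtypes that appear in this catalog.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_spec(record: dict, spec: dict) -> bool:
    """Return True if record has exactly the spec's keys with matching types."""
    if set(record) != set(spec):
        return False
    return all(
        isinstance(record[name], DTYPE_TO_PYTHON[field["dtype"]])
        for name, field in spec.items()
    )


spec = json.loads(FEATURES_JSON)
print(matches_spec({"id": 0, "text": "example line"}, spec))
print(matches_spec({"id": "0", "text": "example line"}, spec))
```

The first call passes because both fields have the expected types; the second fails because `id` is a string rather than an integer.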

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 4042
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 20281
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 84
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_et')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 2093621
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 41708901
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_an')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 2449
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 6999
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 42551
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 5869686
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 6046
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 4390754
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 103639
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 56326016
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_da')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 7664010
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 21018
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 121168
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 5326443
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 46493
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 484
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
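
The load commands in this listing appear to share one naming pattern: `huggingface:oscar/unshuffled_` followed by `deduplicated` or `original`, then a language code. As a rough sketch under that assumption, the name can be built from its parts. The helpers `oscar_config_name` and `load_oscar_train` are hypothetical and not part of TFDS. Whether a given config loads depends on the installed `tensorflow_datasets` version and network access, neither of which is verified here.

```python
def oscar_config_name(lang: str, deduplicated: bool = True) -> str:
    """Build the TFDS name for an OSCAR config from a language code such as 'hr'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar_train(lang: str, deduplicated: bool = True):
    """Load the 'train' split; requires tensorflow_datasets and network access."""
    # Imported lazily so that oscar_config_name works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang, deduplicated), split="train")


print(oscar_config_name("hr"))                      # huggingface:oscar/unshuffled_deduplicated_hr
print(oscar_config_name("fa", deduplicated=False))  # huggingface:oscar/unshuffled_original_fa
```

Calling `load_oscar_train("hr")` would then request the only split listed for these configs, `'train'`.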
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 321484
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 396093
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1578
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 13704702
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 33053
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 106
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3264660
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 11197780
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 101
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 39496439
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 338073
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1377
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 86561
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 118
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1737411
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 2515
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 197878
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 16383
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 917
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 219334
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3229940
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 87235
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3463
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 34
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 8555
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 120684
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 461598
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 24803
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3749826
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 82738
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 428674
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3317
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 36
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 7
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
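
The `train` split sizes of the deduplicated configurations listed above range from 7 to several million examples, so it can help to pick small ones for a quick smoke test. The sketch below copies the counts from the entries above; `TRAIN_SIZES` and `configs_up_to` are hypothetical names used only for this illustration.

```python
# 'train' example counts copied from the unshuffled_deduplicated entries above.
TRAIN_SIZES = {
    "ne": 219334, "no": 3229940, "pa": 87235, "pnb": 3463, "rm": 34,
    "sah": 8555, "si": 120684, "sq": 461598, "sw": 24803, "th": 3749826,
    "tt": 82738, "ur": 428674, "vo": 3317, "xal": 36, "yue": 7,
}


def configs_up_to(max_examples: int) -> list[str]:
    """Return language codes with at most `max_examples` train examples, smallest first."""
    small = [lang for lang, n in TRAIN_SIZES.items() if n <= max_examples]
    return sorted(small, key=TRAIN_SIZES.get)
```

Calling `configs_up_to(100)` selects the three smallest configurations listed here (`yue`, `rm`, `xal`).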


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_am')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 83663
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
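
Every configuration in this catalog shares the same two-field feature specification. As a rough illustration, the sketch below parses that specification and checks a record that has already been converted to plain Python values (TFDS itself yields tensors, so a conversion step is assumed). `PY_TYPES` and `matches_schema` are hypothetical names, and the dtype mapping covers only the two dtypes shown here.

```python
import json

# Feature specification as listed for each OSCAR configuration.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumption: only the two dtypes that appear in this catalog are mapped.
PY_TYPES = {"int64": int, "string": str}


def matches_schema(record: dict, features: dict) -> bool:
    """Check that a plain-Python record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PY_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
```

A record such as `{"id": 0, "text": "..."}` passes the check, while one with a string `id` or a missing `text` field does not.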


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_as')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 14985
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 15446
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_be')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 586031
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 26795
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 42
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Load this dataset in TFDS with the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 56248
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
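
The feature spec above declares two fields per record: an int64 `id` and a string `text`. As a minimal sketch, assuming records are available as plain Python dicts, a record could be checked against that spec as follows. The `validate_record` helper and the `FEATURES` constant are hypothetical names introduced for illustration; they are not part of TFDS or Hugging Face `datasets`.

```python
# Hypothetical helper for illustration; not part of TFDS or Hugging Face datasets.
# FEATURES mirrors the feature spec listed in this catalog (JSON null -> None).
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

# Assumed mapping from the spec's dtype strings to Python types.
_PYTHON_TYPES = {"int64": int, "string": str}
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def validate_record(record, features=FEATURES):
    """Return True if `record` has exactly the declared fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _PYTHON_TYPES[spec["dtype"]]):
            return False
        # An int64 field must fit in a signed 64-bit integer.
        if spec["dtype"] == "int64" and not INT64_MIN <= value <= INT64_MAX:
            return False
    return True
```

For example, `validate_record({"id": 0, "text": "Maayong buntag"})` passes, while a record with a string `id` or a missing `text` field is rejected.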


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 157698
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
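
Every entry in this catalog uses the same naming pattern, `huggingface:oscar/unshuffled_<variant>_<lang>`, where the variant is `original` or `deduplicated`. The following sketch builds such a name from a language code. `oscar_tfds_name` is a hypothetical helper introduced for illustration, and the actual load call is left commented out because it assumes `tensorflow_datasets` is installed and the data can be downloaded.

```python
# Hypothetical helper for illustration; not part of TFDS.
def oscar_tfds_name(lang: str, deduplicated: bool = False) -> str:
    """Build the TFDS name for an OSCAR config from its language code."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


name = oscar_tfds_name("cy")

# Loading requires tensorflow_datasets and network access, so it is not run here:
# import tensorflow_datasets as tfds
# ds = tfds.load(name)
```

Keeping the name construction in one place avoids typos when looping over many language codes.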


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 65
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 96742378
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 5799
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 240691
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 7959
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 1040
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_io')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 694
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 832
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_km')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 159363
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 46535
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_la')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 94588
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 1401
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 1593820
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_min')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 220
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 326804
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 8
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 61
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_new')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 4696
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits (split name and number of examples):

'train' 10709
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 3 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

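The load commands on this page differ only in the configuration suffix. As a sketch, the helper below assembles the `huggingface:oscar/<config>` name from a language code. The names `oscar_tfds_name` and `load_oscar` are hypothetical, the `unshuffled_original_` and `unshuffled_deduplicated_` prefixes are taken from the commands on this page, and the `tfds.load` call assumes `tensorflow_datasets` is installed and the data can be downloaded.

```python
VARIANTS = ("original", "deduplicated")


def oscar_tfds_name(lang, variant="original"):
    """Build the TFDS name for an OSCAR config from a language code."""
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    # Assumption: language codes are lowercase alphabetic, as in the commands on this page.
    if not lang or not lang.isalpha() or not lang.islower():
        raise ValueError(f"unexpected language code: {lang!r}")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang, variant="original", split="train"):
    """Load one OSCAR config through TFDS (downloads data on first use)."""
    # Imported lazily so the name helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, variant), split=split)
```

For example, `load_oscar('pam')` would request the three-example `train` split listed above.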

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 98216 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 9387265 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 21 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 5492194 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 1013619 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 1263280 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 6456 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 34 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 27537 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 1001 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 3783 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_it')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 46981781 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 563916 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 7345075 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 203 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 1485 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 88 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 17957 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 603937 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 534016 examples
  • Features
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
  • Description
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 6
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
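
According to the feature listing, each example has two fields: an int64 `id` and a string `text`. A small sketch that checks a plain Python dict against that layout; `validate_oscar_record` is a hypothetical helper, and it assumes the example has already been converted to Python types (for instance, bytes decoded to `str`):

```python
# Mirrors the feature listing: "id" is int64, "text" is string.
EXPECTED_FIELDS = {"id": int, "text": str}


def validate_oscar_record(record: dict) -> bool:
    """Return True if the record has exactly the listed fields and types."""
    if set(record) != set(EXPECTED_FIELDS):
        return False
    # bool is a subclass of int in Python, so reject it explicitly for "id".
    if isinstance(record["id"], bool):
        return False
    return all(isinstance(record[k], t) for k, t in EXPECTED_FIELDS.items())


if __name__ == "__main__":
    print(validate_oscar_record({"id": 0, "text": "example"}))
```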


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 18174
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 185884
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_os')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 5213
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3225
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 452
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 14291
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 36700
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_so')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 156
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 17395625
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 89002
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 18535253
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 12973467
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 14898250
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 214
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 214
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 60137667
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 304230423
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 256513
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 7
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 284320 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
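
The load commands in this catalog all follow one naming pattern, with only the language code changing. A minimal sketch of building that name programmatically, assuming `tensorflow_datasets` is installed with Hugging Face dataset support; the helper names `oscar_config_name` and `load_oscar` are hypothetical and not part of TFDS.

```python
def oscar_config_name(lang: str) -> str:
    """Build the TFDS name of an OSCAR unshuffled, deduplicated subset."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang}"


def load_oscar(lang: str, split: str = "train"):
    """Load one language subset (hypothetical convenience wrapper)."""
    # Imported lazily so the name helper can be used without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang), split=split)


print(oscar_config_name("gl"))
```

Calling `load_oscar("gl")` should be equivalent to the `tfds.load` command shown for the `gl` entry, restricted to the `train` split.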


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 2375030 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
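
Each example carries an integer `id` and a string `text`. When examples are read through `tfds.as_numpy`, string features arrive as `bytes`, so a small conversion step is often useful. The sketch below assumes that access path; `decode_example` is an illustrative helper and the sample record is made up.

```python
def decode_example(raw: dict) -> dict:
    """Convert one raw example to plain Python types (int id, str text)."""
    text = raw["text"]
    if isinstance(text, bytes):  # tf.string values arrive as bytes via tfds.as_numpy
        text = text.decode("utf-8")
    return {"id": int(raw["id"]), "text": text}


# Made-up record shaped like the feature listing above.
sample = {"id": 0, "text": "Ola, mundo".encode("utf-8")}
print(decode_example(sample))
```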


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 9 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 9948521 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 389515 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 1163 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 251064 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 924 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 21735 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 32652 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 25 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 299457 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 669 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 136639 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 55 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 20812149 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
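
The `train` split sizes in these entries span several orders of magnitude, from single digits to tens of millions of examples. A minimal sketch of capping how much of a large subset is read, using the TFDS split slicing syntax (`train[:N]`); the `TRAIN_SIZES` table copies a few counts from this catalog, and `capped_split` is a hypothetical helper.

```python
# Example counts copied from the 'train' split sizes listed in this catalog.
TRAIN_SIZES = {"gl": 284320, "ht": 9, "nl": 20812149}


def capped_split(lang: str, cap: int) -> str:
    """Return a TFDS split string that reads at most `cap` training examples."""
    if TRAIN_SIZES[lang] <= cap:
        return "train"
    return f"train[:{cap}]"  # slice: first `cap` examples of the split


for lang in TRAIN_SIZES:
    print(lang, capped_split(lang, 100_000))
```

The returned string can be passed as the `split` argument of `tfds.load`.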


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 44230 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 20682611 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 26920397 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 115954598 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 33925 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 886223 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

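The entries in this catalog follow one naming pattern, `huggingface:oscar/unshuffled_<variant>_<language>`, and all expose the same two features, `id` and `text`. A minimal sketch of how the load command and the feature list fit together is given below. The helper names `oscar_tfds_name` and `preview` are hypothetical and exist only for illustration; the sketch assumes `tensorflow_datasets` is installed with community (Hugging Face) dataset support available.

```python
def oscar_tfds_name(language: str, deduplicated: bool = True) -> str:
    """Build a TFDS name such as 'huggingface:oscar/unshuffled_deduplicated_sl'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def preview(language: str, n: int = 3, deduplicated: bool = True) -> None:
    """Print the first n records of the 'train' split (this downloads data)."""
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    ds = tfds.load(oscar_tfds_name(language, deduplicated), split="train")
    for record in tfds.as_numpy(ds.take(n)):
        # Each record carries the two listed features: "id" and "text".
        # "text" arrives as bytes, so only a short prefix is printed.
        print(record["id"], record["text"][:80])
```

Calling `preview("sl")` would fetch the Slovenian deduplicated configuration; the name helper alone can be used without any download.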

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 511 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 312644 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 294132 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 15503 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 64 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 9161 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 32919 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_af')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 201117 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

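The 'train' counts for the two `af` configurations (201117 for `unshuffled_original_af` here, 130640 for `unshuffled_deduplicated_af` earlier in this catalog) can be compared to see how much of the original configuration survives deduplication. The following is a small sketch of that arithmetic; `kept_fraction` is a hypothetical helper name, not part of TFDS or OSCAR.

```python
def kept_fraction(original: int, deduplicated: int) -> float:
    """Fraction of the original examples that remain after deduplication."""
    if original <= 0:
        raise ValueError("original must be positive")
    return deduplicated / original


# 'train' counts copied from the two "af" entries in this catalog.
ORIGINAL_AF = 201117
DEDUPLICATED_AF = 130640

fraction = kept_fraction(ORIGINAL_AF, DEDUPLICATED_AF)
print(f"af: {fraction:.1%} of examples remain after deduplication")
```

The same comparison applies to any language that appears in both the original and the deduplicated listings.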

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 16365602 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_av')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 456 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 4 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 336 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_br')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 37085 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 1 example

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 21001388 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_de')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 104913504 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

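With 104913504 examples in its 'train' split, the `de` configuration is expensive to iterate in full. One possible approach, sketched below, is to request only a leading slice using the TFDS split-slicing syntax (`'train[:N]'`). The helpers `head_split` and `load_head` are hypothetical names. Whether slicing avoids downloading and preparing the whole configuration for `huggingface:` community datasets is an assumption not confirmed by this catalog, so the sketch should be treated as illustrative only.

```python
def head_split(n_examples: int, split: str = "train") -> str:
    """Return a TFDS split spec for the first n_examples, e.g. 'train[:1000]'."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


def load_head(name: str, n_examples: int):
    """Load only the first n_examples of the 'train' split (downloads data)."""
    # Imported lazily so head_split can be used without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(name, split=head_split(n_examples))
```

For example, `load_head('huggingface:oscar/unshuffled_original_de', 1000)` would request the first 1000 records rather than the full split.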

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_el')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 10425596 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_es')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 88199221 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 8557453 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    'train': 83223 examples

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 640
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
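
The feature listing describes each record as an int64 `id` plus a string `text`. As a rough illustration, the check below tests a plain Python dict against that schema. It assumes records have already been converted to native Python values (TFDS itself yields tensors or bytes), and `matches_oscar_features` is a hypothetical name, not a library function.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_oscar_features(example: dict) -> bool:
    """Return True if `example` has exactly an int64-range `id` and a str `text`."""
    if set(example) != {"id", "text"}:
        return False
    record_id = example["id"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    id_ok = (
        isinstance(record_id, int)
        and not isinstance(record_id, bool)
        and INT64_MIN <= record_id <= INT64_MAX
    )
    return id_ok and isinstance(example["text"], str)
```

A record such as `{"id": 0, "text": "..."}` passes, while one with a missing field or a non-integer `id` does not.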


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 582219
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 659430
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 2638
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 62721527
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 524591
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1581
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 146993
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_li')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 137
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 2977757
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3212
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 395605
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 26598
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1055
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 299938
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_no')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 5546211
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 127467
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 4599
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 41
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 22301
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_si')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 203082
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 672077
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 41986
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_th')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 6064129
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 135923
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 638596
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3366
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 39
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 11
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_en')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 455994980
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 506883
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 7
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 544388
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_he')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3808397
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 13
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_id')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 16236463
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_is')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 625673
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1445
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 350363
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1549
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 34807
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 52910
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
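
The load commands in this catalog all follow the same naming pattern, so the name can be built programmatically. The following is a minimal sketch, not part of the catalog itself: the helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and it assumes that `tensorflow_datasets` is installed and that the `huggingface:` namespace is still served by your TFDS version.

```python
def oscar_tfds_name(language: str, deduplicated: bool = False) -> str:
    """Build the TFDS name for an OSCAR config (hypothetical helper).

    Follows the pattern visible in this catalog, e.g.
    'huggingface:oscar/unshuffled_original_lo'.
    """
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def load_oscar(language: str, deduplicated: bool = False, split: str = "train"):
    """Load one OSCAR config through tfds.load (hypothetical wrapper).

    The import is deferred so that building names does not require
    TensorFlow Datasets to be installed.
    """
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split=split)


if __name__ == "__main__":
    # Only prints the name; call load_oscar("lo") to download the data.
    print(oscar_tfds_name("lo"))
```

Each record is expected to carry the two features listed for every config: an int64 `id` and a string `text`. Large configs such as `ru` or `pt` contain tens of millions of examples, so loading them may take considerable time and disk space.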


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 123
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 437871
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 757
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_my')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 232329
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 73
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 34682142
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_or')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 59463
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 35440972
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 42114520
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 161836003
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 44280
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1746604
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_su')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 805
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_te')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 475703
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 458206
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 22255
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 73
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_war')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 9760
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 59364
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }