References:
unshuffled_deduplicated_af
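Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 130640 |
- Features: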
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
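The load command above follows a fixed naming pattern, so the config name can be built from a language code. The following is a minimal sketch, assuming `tensorflow_datasets` and the Hugging Face `datasets` package are installed; the helper names `oscar_config_name` and `load_oscar` are hypothetical and not part of TFDS.

```python
def oscar_config_name(lang_code: str) -> str:
    """Build the TFDS name of a deduplicated OSCAR config, e.g. for 'af'."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code}"


def load_oscar(lang_code: str, split: str = "train"):
    """Load one OSCAR config through TFDS (data is downloaded on first use)."""
    # Imported lazily so the name helper also works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang_code), split=split)
```

Under these assumptions, `load_oscar("af")` should be equivalent to the `tfds.load` call shown for this config, restricted to the `train` split.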
unshuffled_deduplicated_als
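Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4518 |
- Features: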
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
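The feature spec above is the same for every config on this page: an integer `id` and a string `text`. Here is a small sketch of how a decoded record could be checked against that spec in plain Python. The `DTYPE_TO_PYTHON` mapping and the `matches_spec` helper are illustrative assumptions, not part of TFDS or `datasets`.

```python
import json

# The feature spec as listed for each config.
FEATURE_SPEC = json.loads("""
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Assumed mapping from the dtype names in the spec to Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_spec(record: dict, spec: dict = FEATURE_SPEC) -> bool:
    """Return True if the record has exactly the spec's fields with matching types."""
    if set(record) != set(spec):
        return False
    return all(
        isinstance(record[name], DTYPE_TO_PYTHON[field["dtype"]])
        for name, field in spec.items()
    )
```

A check like this is only meaningful after TensorFlow tensors have been converted to Python values (for example, bytes decoded to `str`).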
unshuffled_deduplicated_arz
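Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 79928 |
- Features: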
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_an
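Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2025 |
- Features: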
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ast
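Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5343 |
- Features: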
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ba
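Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 27050 |
- Features: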
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_am
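Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 43102 |
- Features: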
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_as
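Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9212 |
- Features: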
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_azb
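Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9985 |
- Features: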
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_be
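Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 307405 |
- Features: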
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bo
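Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 15762 |
- Features: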
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bxr
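Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 36 |
- Features: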
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ceb
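Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 26145 |
- Features: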
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_az
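Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 626796 |
- Features: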
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bcl
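Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features: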
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cy
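Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 98225 |
- Features: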
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dsb
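Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 37 |
- Features: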
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bn
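Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1114481 |
- Features: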
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bs
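Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 702 |
- Features: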
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ce
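Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2984 |
- Features: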
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cv
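Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 10130 |
- Features: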
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_diq
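Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features: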
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
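Several configs on this page are tiny (`bcl` and `diq` each list a single training example), which can make them convenient for quick pipeline checks before loading a large config. The sketch below picks the smallest configs from a few of the split sizes listed above. The dictionary is hand-copied from this page, and the helper name `smallest_configs` is hypothetical.

```python
# Train split sizes copied from the split tables on this page (subset only).
TRAIN_EXAMPLES = {
    "af": 130640, "als": 4518, "arz": 79928, "bxr": 36,
    "bcl": 1, "dsb": 37, "bs": 702, "diq": 1,
}


def smallest_configs(sizes: dict, max_examples: int) -> list:
    """Return config codes with at most max_examples, smallest first."""
    picked = [code for code, n in sizes.items() if n <= max_examples]
    # Ties on size are broken alphabetically so the order is deterministic.
    return sorted(picked, key=lambda code: (sizes[code], code))
```

For instance, `smallest_configs(TRAIN_EXAMPLES, 100)` selects the four configs in this subset with at most 100 examples.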
unshuffled_deduplicated_eml
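Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 80 |
- Features: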
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_et
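Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1172041 |
- Features: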
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bg
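Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3398679 |
- Features: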
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bpy
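Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1770 |
- Features: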
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ca
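Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2458067 |
- Features: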
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ckb
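Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 68210 |
- Features: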
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ar
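Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9006977 |
- Features: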
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_av
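Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 360 |
- Features: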
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bar
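Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4 |
- Features: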
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bh
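Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 82 |
- Features: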
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_br
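Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 14724 |
- Features: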
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cbk
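Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features: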
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_da
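Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4771098 |
- Features: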
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
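The train split sizes in this listing vary by several orders of magnitude, from a single example for `cbk` to millions for `da`. As a rough sketch, the snippet below records a few of the counts listed above and filters out configs that are too small for a given use. The function name `configs_with_at_least` and the threshold are illustrative assumptions.

```python
# Train split sizes copied from the listing above (language code -> example count).
TRAIN_EXAMPLES = {
    "bar": 4,
    "bh": 82,
    "br": 14724,
    "cbk": 1,
    "da": 4771098,
}


def configs_with_at_least(min_examples: int, sizes: dict = TRAIN_EXAMPLES) -> list:
    """Return language codes whose train split has at least `min_examples`, largest first."""
    kept = [lang for lang, n in sizes.items() if n >= min_examples]
    return sorted(kept, key=lambda lang: sizes[lang], reverse=True)


print(configs_with_at_least(1000))  # ['da', 'br']
```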
unshuffled_deduplicated_dv
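Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17024 |
- Features: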
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eo
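Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 84752 |
- Features: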
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fa
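Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8203495 |
- Features: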
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fy
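Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 20661 |
- Features: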
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gn
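Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 68 |
- Features: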
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cs
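Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 12308039 |
- Features: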
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hi
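Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1909387 |
- Features: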
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hu
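Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6582908 |
- Features: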
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ie
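Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11 |
- Features: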
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fr
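Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 59448891 |
- Features: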
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gd
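Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3883 |
- Features: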
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gu
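Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 169834 |
- Features: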
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hsb
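Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3084 |
- Features: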
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ia
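Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 529 |
- Features: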
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_io
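Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 617 |
- Features: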
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jbo
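Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 617 |
- Features: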
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_km
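Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 108346 |
- Features: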
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ku
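Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 29054 |
- Features: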
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_la
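Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 18808 |
- Features: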
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lmo
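Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1374 |
- Features: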
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lv
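Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 843195 |
- Features: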
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_min
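Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 166 |
- Features: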
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mr
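Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 212556 |
- Features: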
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mwl
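Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7 |
- Features: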
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 58 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_new
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2126 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_oc
次のコマンドを使用して、このデータセットを TFDS にロードします。
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
- 説明:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6485 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
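Every config on this page lists the same two features, an int64 `id` and a string `text`. The following is a minimal sketch of checking an already-decoded Python record against that spec. The `PY_TYPES` mapping and the `matches_features` helper are hypothetical names introduced for illustration and are not part of TFDS or the OSCAR tooling.

```python
# Feature spec copied from the listing above: an int64 "id" and a string "text".
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

# Hypothetical mapping from the dtype names above to Python types.
PY_TYPES = {"int64": int, "string": str}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_features(record, features=FEATURES):
    """Return True if `record` has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        expected = PY_TYPES[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
        # An int64 field must fit in a signed 64-bit range.
        if spec["dtype"] == "int64" and not (INT64_MIN <= value <= INT64_MAX):
            return False
    return True
```

A check like this assumes the record has already been converted to plain Python values; records read directly from a `tf.data.Dataset` would need decoding first.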
unshuffled_deduplicated_pam
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
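The load commands on this page differ only in the language code at the end of the config name. As a rough sketch, the name can be built from the code and passed to `tfds.load`. The helper names `oscar_tfds_name` and `load_oscar_train` are hypothetical, the input check is an assumption about what a language code looks like, and the loader assumes that `tensorflow-datasets` and its Hugging Face dependencies are installed.

```python
def oscar_tfds_name(lang_code):
    """Build the TFDS name for a deduplicated OSCAR config, e.g. 'pam'."""
    # Assumption: codes are short lowercase ASCII strings such as 'oc' or 'pam'.
    ok = (
        isinstance(lang_code, str)
        and lang_code.isascii()
        and lang_code.isalpha()
        and lang_code.islower()
    )
    if not ok:
        raise ValueError(f"unexpected language code: {lang_code!r}")
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code}"


def load_oscar_train(lang_code):
    """Load the 'train' split; imports TFDS lazily and may trigger a download."""
    import tensorflow_datasets as tfds  # requires tensorflow-datasets

    return tfds.load(oscar_tfds_name(lang_code), split="train")
```

Keeping the import inside `load_oscar_train` lets the name helper be used and tested without TensorFlow installed.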
unshuffled_deduplicated_ps
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 67921 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 28522082 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ka
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 372158 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5044757 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_scn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3675420 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 68 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lez
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1381 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lrc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 72 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 13343 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 453904 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 183443 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_myv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8714 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 109118 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_os
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2559 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2859 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_qu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 411 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7121 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2820821 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
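The config names in this catalog follow the pattern `unshuffled_<variant>_<language code>`, so the `tfds.load` name can be built programmatically. The sketch below is a hedged illustration: `oscar_tfds_name` and `load_oscar` are hypothetical helpers, and actually loading is assumed to require the `tensorflow-datasets` and Hugging Face `datasets` packages plus network access.

```python
def oscar_tfds_name(variant, lang):
    """Build the TFDS name for an OSCAR config, e.g. variant='deduplicated', lang='sk'."""
    if variant not in ("deduplicated", "original"):
        raise ValueError(f"unknown variant: {variant!r}")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(variant, lang, split="train"):
    """Load one config; the import is deferred so the name helper works without TFDS."""
    import tensorflow_datasets as tfds  # assumed to be installed

    return tfds.load(oscar_tfds_name(variant, lang), split=split)


print(oscar_tfds_name("deduplicated", "sk"))
```

Only the `train` split is listed for each config here, which is why it is used as the default.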
unshuffled_deduplicated_sh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17610 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_so
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 42 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 645747 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ta
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 833101 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4694 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tyv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 24 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 15074 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 677 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xmf
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2418 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11014487 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 56259 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 62398034 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11596446 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6521169 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7782375 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9897709 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wuu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 64 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 49 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_als
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_als')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7324 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
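The `unshuffled_deduplicated_als` entry earlier in this catalog lists 4518 `train` examples, against 7324 for this `unshuffled_original_als` config. As a small worked example, and assuming both counts describe the same underlying crawl, the share of examples removed by deduplication can be computed as follows.

```python
# Counts taken from the 'train' split rows of the two als configs in this catalog.
original = 7324      # unshuffled_original_als
deduplicated = 4518  # unshuffled_deduplicated_als

removed = original - deduplicated
fraction_removed = removed / original

print(f"removed {removed} examples ({fraction_removed:.1%} of the original)")
```

The same comparison applies to any language that appears in both the `original` and `deduplicated` variants.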
unshuffled_original_arz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 158113 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_az')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 912330 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
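The config names on this page follow a visible pattern: `unshuffled_<variant>_<language code>`, prefixed with `huggingface:oscar/` in the TFDS name. The following is a minimal sketch of building that name and passing it to `tfds.load`. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and the lazy import assumes `tensorflow_datasets` is installed.

```python
# Variants that appear in the config names listed on this page.
VARIANTS = ("original", "deduplicated")


def oscar_tfds_name(variant: str, lang: str) -> str:
    """Build the TFDS name for one OSCAR config, e.g. ('original', 'az')."""
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    if not lang:
        raise ValueError("lang must be a non-empty language code")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(variant: str, lang: str, split: str = "train"):
    """Load one config; tensorflow_datasets is imported lazily (assumed installed)."""
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(variant, lang), split=split)


print(oscar_tfds_name("original", "az"))
```

Only the name-building step runs without TFDS installed; `load_oscar` is a thin wrapper around the same `tfds.load` call shown for each config.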
unshuffled_original_bcl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
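Every config on this page lists the same two features, an `int64` `id` and a `string` `text`. As a rough sketch, a plain Python record can be checked against that listing as shown below. The function name `matches_features` and the dtype-to-Python mapping are assumptions made for illustration. Records read through TFDS arrive as tensors, so they would need converting to Python values first.

```python
import json

# Feature spec copied from the listing above.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the listed dtype names to Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        expected = DTYPE_TO_PYTHON[spec["dtype"]]
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "example"}, features))
print(matches_features({"id": "0", "text": "example"}, features))
```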
unshuffled_original_bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1675515 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2143 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ce
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4042 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 20281 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_diq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 84 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_et')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2093621 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_zh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 41708901 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_an
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_an')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2449 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ast
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6999 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ba
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 42551 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5869686 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bpy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6046 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4390754 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ckb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 103639 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 56326016 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
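With tens of millions of examples in the `train` split of a config this size, it can be convenient to read only a prefix while experimenting. TFDS accepts slice syntax in split strings such as `train[:1000]`. The sketch below builds such a string. The helper names `head_split` and `load_head` are hypothetical, and the lazy import assumes `tensorflow_datasets` is installed. Whether slicing avoids downloading or preparing the full config is not stated on this page, so treat it only as a way to limit what is read.

```python
def head_split(n_examples: int, split: str = "train") -> str:
    """Return a TFDS split string selecting the first n_examples of split."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


def load_head(name: str, n_examples: int):
    """Load only the first n_examples of the 'train' split of a TFDS dataset."""
    import tensorflow_datasets as tfds

    return tfds.load(name, split=head_split(n_examples))


print(head_split(1000))
```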
unshuffled_original_da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_da')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7664010 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 21018 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 121168 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5326443 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ga
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 46493 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
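Every configuration in this list shares the same feature specification: an `int64` field `id` and a `string` field `text`. The sketch below shows one way to check a record against that specification, assuming records are available as plain Python dicts. The names `PYTHON_TYPES` and `conforms` are hypothetical, and the mapping from dtype names to Python types is an assumption made for illustration.

```python
import json

# Feature specification as shown in this list.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the dtype names in the spec to Python types.
PYTHON_TYPES = {"int64": int, "string": str}
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def conforms(example: dict, features: dict) -> bool:
    """Return True if the example has exactly the listed fields and types."""
    if set(example) != set(features):
        return False
    for name, spec in features.items():
        value = example[name]
        expected = PYTHON_TYPES[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
        # Integers must also fit in a signed 64-bit range.
        if spec["dtype"] == "int64" and not INT64_MIN <= value <= INT64_MAX:
            return False
    return True


features = json.loads(FEATURES_JSON)
```

For instance, `conforms({"id": 0, "text": "Dia duit"}, features)` accepts a well-formed record, while a record with a missing field or a string `id` is rejected.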
unshuffled_deduplicated_gom
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 484 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 321484 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 396093 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ilo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1578 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 13704702 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 33053 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 106 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3264660 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11197780 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 101 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ja
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 39496439 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 338073 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_krc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1377 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ky
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 86561 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_li
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 118 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1737411 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mhr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2515 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 197878 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 16383 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mzn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 917 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ne
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 219334 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
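The feature specification above describes each example as an `int64` `id` plus a `string` `text`. Below is a minimal sketch, using only the standard library, of checking plain Python records against that specification. `matches_schema` is a hypothetical helper, not part of TFDS, and it assumes the text has already been decoded to `str`.

```python
import json

# Feature specification copied from the listing above.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""
features = json.loads(FEATURES_JSON)

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_schema(record: dict) -> bool:
    """Return True if `record` has exactly the listed fields with fitting types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, str):
                return False
        else:
            return False
    return True


print(matches_schema({"id": 0, "text": "example"}))
```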
unshuffled_deduplicated_no
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3229940 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 87235 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pnb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3463 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_rm
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 34 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8555 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_si
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 120684 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 461598 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 24803 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_th
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3749826 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 82738 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ur
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 428674 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3317 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xal
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 36 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yue
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
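The config names on this page appear to follow the pattern `unshuffled_<variant>_<language code>`, where the variant is `deduplicated` or `original`. The sketch below splits a name into those two parts. `parse_config` is a hypothetical helper, and the pattern is inferred only from the names listed here, not from an official specification.

```python
from typing import Tuple

KNOWN_VARIANTS = {"deduplicated", "original"}


def parse_config(config: str) -> Tuple[str, str]:
    """Split an OSCAR config name into (variant, language code)."""
    prefix = "unshuffled_"
    if not config.startswith(prefix):
        raise ValueError(f"unexpected config name: {config!r}")
    # Everything after the prefix is '<variant>_<language code>'.
    variant, _, lang = config[len(prefix):].partition("_")
    if variant not in KNOWN_VARIANTS or not lang:
        raise ValueError(f"unexpected config name: {config!r}")
    return variant, lang


print(parse_config("unshuffled_deduplicated_yue"))
print(parse_config("unshuffled_original_am"))
```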
unshuffled_original_am
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_am')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 83663 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_as
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_as')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 14985 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_azb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 15446 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_be')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 586031 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 26795 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bxr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 42 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
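Several configs listed on this part of the page contain only a handful of examples, which matters if you plan to carve out evaluation subsets. The sketch below collects a few of the `'train'` counts from the split tables above and flags the very small ones. The `tiny_configs` helper and the threshold of 100 are arbitrary assumptions for illustration.

```python
# 'train' example counts copied from the split tables above.
TRAIN_EXAMPLES = {
    "unshuffled_deduplicated_mzn": 917,
    "unshuffled_deduplicated_pnb": 3463,
    "unshuffled_deduplicated_rm": 34,
    "unshuffled_deduplicated_vo": 3317,
    "unshuffled_deduplicated_xal": 36,
    "unshuffled_deduplicated_yue": 7,
    "unshuffled_original_bxr": 42,
}


def tiny_configs(counts, threshold=100):
    """Return config names with fewer than `threshold` examples, smallest first."""
    small = [name for name, n in counts.items() if n < threshold]
    return sorted(small, key=counts.__getitem__)


for name in tiny_configs(TRAIN_EXAMPLES):
    print(name, TRAIN_EXAMPLES[name])
```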
unshuffled_original_ceb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 56248 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
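The config names on this page follow a visible pattern: `unshuffled_original_<lang>` or `unshuffled_deduplicated_<lang>`, prefixed with `huggingface:oscar/` in the `tfds.load` call. The sketch below builds that name from a language code. The helper names `oscar_config_name` and `load_oscar` are hypothetical and not part of TFDS; the loader assumes the `tensorflow-datasets` package is installed and imports it only when called.

```python
def oscar_config_name(lang: str, deduplicated: bool = False) -> str:
    """Build the TFDS name for one OSCAR config (hypothetical helper)."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang: str, deduplicated: bool = False, split: str = "train"):
    """Load one config via tfds.load (hypothetical helper).

    TFDS is imported lazily so that oscar_config_name has no dependencies.
    """
    import tensorflow_datasets as tfds  # assumes tensorflow-datasets is installed

    return tfds.load(oscar_config_name(lang, deduplicated), split=split)


print(oscar_config_name("ceb"))
# prints: huggingface:oscar/unshuffled_original_ceb
```

Calling `load_oscar("ceb")` would then request the only split listed for this config, `'train'`.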
unshuffled_original_cy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 157698 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
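Every config on this page lists the same two features: an `int64` `id` and a `string` `text`. As a rough illustration of what that spec implies, the sketch below checks a plain Python record against it. The function `matches_features` and the sample records are assumptions made for this example; records returned by TFDS are tensors or bytes and would need decoding before a check like this applies.

```python
import json

# Feature spec copied from this page.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_features(record: dict, features: dict) -> bool:
    """Return True if record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, str):
                return False
        else:
            return False  # dtype not covered by this sketch
    return True


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "sample text"}, features))    # True
print(matches_features({"id": "0", "text": "sample text"}, features))  # False
```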
unshuffled_original_dsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 65 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 96742378 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5799 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 240691 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7959 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1040 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_io')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 694 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 832 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_km
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_km')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 159363 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ku
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 46535 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_la')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 94588 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lmo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1401 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1593820 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_min
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_min')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 220 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 326804 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mwl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 61 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_new
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_new')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4696 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_oc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 10709 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pam
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
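Each configuration on this page is loaded with the same `tfds.load` call, and only the configuration name changes. A minimal sketch of that pattern follows. The helper names `tfds_name` and `load_oscar` are hypothetical, and the sketch assumes `tensorflow_datasets` is installed with Hugging Face dataset support and that network access is available when loading.

```python
def tfds_name(config: str) -> str:
    """Build the TFDS name string used on this page for an OSCAR configuration."""
    return f"huggingface:oscar/{config}"


def load_oscar(config: str, split: str = "train"):
    """Load one OSCAR configuration through TFDS.

    Assumption: tensorflow_datasets is installed and can reach the
    Hugging Face hub. The import is lazy so tfds_name works without it.
    """
    import tensorflow_datasets as tfds

    return tfds.load(tfds_name(config), split=split)


# Example usage (not executed here because it downloads data):
#   ds = load_oscar("unshuffled_original_pam")   # 'train' has only 3 examples
#   for example in ds.take(1):
#       print(example["id"].numpy(), example["text"].numpy().decode("utf-8"))
```

Small configurations such as `unshuffled_original_pam` are convenient for a quick check before downloading one of the multi-million-example configurations.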
unshuffled_original_ps
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 98216 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
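The feature specification above is the same for every configuration: an `int64` field named `id` and a `string` field named `text`. As an illustration only, the sketch below checks a plain Python record against that specification. The `PYTHON_TYPES` mapping and the `matches_features` helper are hypothetical and are not part of TFDS or the Hugging Face `datasets` library.

```python
import json

# Feature specification copied from this page.
FEATURES = json.loads("""
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Assumed mapping from the dtype names above to plain Python types.
PYTHON_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict = FEATURES) -> bool:
    """Return True if the record has exactly the declared fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PYTHON_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )
```

A record such as `{"id": 0, "text": "some crawled line"}` passes the check, while a record with a missing field or a string-valued `id` does not.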
unshuffled_original_ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9387265 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_scn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 21 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5492194 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1013619 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ta
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1263280 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6456 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tyv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 34 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 27537 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1001 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xmf
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3783 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_it')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 46981781 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ka
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 563916 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7345075 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 203 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lez
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1485 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lrc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 88 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17957 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 603937 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 534016 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_myv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
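The load commands on this page differ only in the configuration name. A minimal sketch of a wrapper, assuming `tensorflow_datasets` is installed when `load_oscar` is called: `oscar_tfds_name` and `load_oscar` are hypothetical helper names, and the `split="train"` default reflects the only split listed in the tables here.

```python
def oscar_tfds_name(config):
    """Build the TFDS name in the form used by the load commands on this page."""
    return f"huggingface:oscar/{config}"


def load_oscar(config, split="train"):
    """Load one OSCAR configuration; needs tensorflow_datasets at call time."""
    # Imported lazily so that oscar_tfds_name works without the package.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(config), split=split)


print(oscar_tfds_name("unshuffled_original_myv"))
```

For small configurations such as `unshuffled_original_myv` (6 examples), `load_oscar("unshuffled_original_myv")` is a quick way to try the pipeline before moving to the larger languages.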
unshuffled_original_nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 18174 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 185884 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_os
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_os')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5213 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3225 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_qu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 452 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 14291 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 36700 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_so
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_so')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 156 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17395625 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 89002 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 18535253 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 12973467 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 14898250 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wuu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 214 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 214 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_zh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 60137667 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
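The configuration names on this page appear to follow a three-part pattern: a shuffling flag, a corpus variant (`original` or `deduplicated`), and a language code. The sketch below splits a name on that assumption; `OscarConfig` and `parse_config` are hypothetical names used only for illustration, and the pattern is inferred from the names listed here rather than taken from an official specification.

```python
from typing import NamedTuple


class OscarConfig(NamedTuple):
    shuffling: str  # e.g. "unshuffled"
    variant: str    # "original" or "deduplicated" on this page
    language: str   # language code, e.g. "zh" or "frr"


def parse_config(name):
    """Split a configuration name such as 'unshuffled_original_zh' into its parts."""
    # maxsplit=2 keeps any further underscores inside the language part.
    parts = name.split("_", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected configuration name: {name!r}")
    return OscarConfig(*parts)


print(parse_config("unshuffled_original_zh"))
print(parse_config("unshuffled_deduplicated_en").variant)
```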
unshuffled_deduplicated_en
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 304230423 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 256513 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_frr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 284320 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_he
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2375030 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ht
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
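The `train` split sizes in this catalog vary by several orders of magnitude, so it can help to check the size before choosing a config. The sketch below copies a few of the counts listed on this page and flags the very small configs; the cutoff of 1000 examples is an arbitrary assumption for illustration, and `tiny_configs` is a hypothetical helper name.

```python
# Example counts for the 'train' split, copied from the tables on this page.
TRAIN_EXAMPLES = {
    "he": 2_375_030,
    "ht": 9,
    "id": 9_948_521,
    "is": 389_515,
    "jv": 1_163,
    "mai": 25,
    "nap": 55,
    "ru": 115_954_598,
}


def tiny_configs(counts, min_examples=1000):
    """Return language codes whose train split has fewer than min_examples."""
    return sorted(lang for lang, n in counts.items() if n < min_examples)


print(tiny_configs(TRAIN_EXAMPLES))  # ['ht', 'mai', 'nap']
```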
unshuffled_deduplicated_id
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9948521 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
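Every config on this page shares the same two-field feature schema. As a rough sketch, the schema can be parsed and used to sanity-check records, assuming the records have already been converted to plain Python values (TFDS itself yields tensors). The names `matches_features` and `_PY_TYPES` are hypothetical.

```python
import json

# The feature schema as listed above for each config.
FEATURES = json.loads("""
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Assumed mapping from schema dtypes to plain Python types.
_PY_TYPES = {"int64": int, "string": str}


def matches_features(record, features=FEATURES):
    """Check that a dict has exactly the schema's fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], _PY_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


print(matches_features({"id": 0, "text": "contoh kalimat"}))
```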
unshuffled_deduplicated_is
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 389515 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1163 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 251064 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 924 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 21735 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 32652 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mai
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 25 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 299457 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mrj
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 669 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_my
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 136639 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nap
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 55 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 20812149 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_or
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 44230 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 20682611 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 26920397 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ru
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 115954598 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 33925 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 886223 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_su
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 511 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_te
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 312644 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
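Every config on this page declares the same two features, an `int64` `id` and a `string` `text`. As a rough illustration, the sketch below parses that feature description and checks a plain Python record against it. The `PY_TYPES` mapping and the `check_example` helper are assumptions made for this example, and the check applies to plain dicts, not to the tensors that TFDS yields.

```python
import json

# Feature description as listed for each config on this page.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from declared dtype strings to Python types.
PY_TYPES = {"int64": int, "string": str}


def check_example(example: dict, features: dict) -> bool:
    """Return True if `example` has exactly the declared fields with matching types."""
    if set(example) != set(features):
        return False
    return all(
        isinstance(example[name], PY_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
```

A record such as `{"id": 0, "text": "abc"}` passes the check, while a record with a missing field or a string `id` does not.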
unshuffled_deduplicated_tl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 294132 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ug
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 15503 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vec
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 64 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_war
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9161 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 32919 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_af')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 201117 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
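The split tables make it possible to compare the two variants of a language. As a small worked example, the sketch below computes the share of 'train' examples that the deduplicated Afrikaans config lacks relative to the original one, using the counts listed on this page (130640 for `unshuffled_deduplicated_af`, 201117 for `unshuffled_original_af`). The function name `removed_fraction` is hypothetical.

```python
def removed_fraction(original: int, deduplicated: int) -> float:
    """Fraction of examples present in the original config but not in the deduplicated one."""
    if original <= 0:
        raise ValueError("original count must be positive")
    return (original - deduplicated) / original


# 'train' example counts as listed on this page for the Afrikaans configs.
AF_ORIGINAL = 201117
AF_DEDUPLICATED = 130640

frac = removed_fraction(AF_ORIGINAL, AF_DEDUPLICATED)
print(f"Examples removed by deduplication: {frac:.1%}")
```

The same calculation applies to any language for which both an `unshuffled_original_*` and an `unshuffled_deduplicated_*` count are listed.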
unshuffled_original_ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 16365602 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_av
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_av')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 456 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 336 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_br')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 37085 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 21001388 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_de')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 104913504 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_el')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 10425596 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_es')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 88199221 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8557453 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ga
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 83223 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gom
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 640 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 582219 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 659430 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
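The configuration names on this page follow one pattern: `unshuffled_`, then `original` or `deduplicated`, then a language code. The sketch below builds the TFDS name from those parts. The helpers `oscar_config_name` and `load_oscar` are hypothetical names introduced for illustration, and the example assumes `tensorflow_datasets` is installed with Hugging Face dataset support; it is not part of the catalog itself.

```python
def oscar_config_name(language: str, deduplicated: bool = False) -> str:
    """Build a TFDS name such as 'huggingface:oscar/unshuffled_original_hy'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def load_oscar(language: str, deduplicated: bool = False, split: str = "train"):
    """Load one OSCAR configuration (hypothetical wrapper around tfds.load)."""
    # Imported lazily so that oscar_config_name works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(language, deduplicated), split=split)


print(oscar_config_name("hy"))  # huggingface:oscar/unshuffled_original_hy
```

Calling `load_oscar("hy")` would trigger a download on first use, so the name builder is kept separate and can be checked on its own.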
unshuffled_original_ilo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2638 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ja
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 62721527 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
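Every configuration listed here declares the same two features, an `int64` `id` and a `string` `text`. As a rough illustration, the sketch below parses that feature description and checks a plain Python record against it. The helper `matches_features` and the dtype-to-type mapping are assumptions made for this example; values returned by TFDS may be tensors or bytes and would need converting before such a check.

```python
import json

# The feature description as printed on this page, in compact form.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the dtype strings above to plain Python types.
PYTHON_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the declared fields and types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PYTHON_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "example"}, features))    # True
print(matches_features({"id": "0", "text": "example"}, features))  # False
```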
unshuffled_original_kk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 524591 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_krc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1581 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ky
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 146993 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_li
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_li')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 137 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2977757 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mhr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3212 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 395605 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 26598 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mzn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1055 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ne
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 299938 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_no
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_no')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5546211 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 127467 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pnb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4599 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_rm
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 41 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 22301 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_si
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_si')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 203082 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 672077 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 41986 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_th
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_th')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6064129 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
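The feature specification above describes each example as a record with an int64 `id` and a string `text`. As a minimal sketch of what that implies for a single record, the check below uses plain Python only. The `validate_record` helper, the `DTYPE_TO_PYTHON` mapping, and the sample record are hypothetical illustrations, not part of TFDS or OSCAR.

```python
import json

# Feature specification as shown in the listing above.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Hypothetical mapping from the declared dtype names to Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def validate_record(record, features):
    """Return True if `record` has exactly the declared fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        expected = DTYPE_TO_PYTHON[spec["dtype"]]
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
        # An int64 field must also fit in the signed 64-bit range.
        if spec["dtype"] == "int64" and not INT64_MIN <= value <= INT64_MAX:
            return False
    return True


features = json.loads(FEATURES_JSON)
sample = {"id": 0, "text": "placeholder text"}  # made-up record, not corpus data
print(validate_record(sample, features))
```

Every configuration in this listing declares the same two fields, so the same check would apply to each of them under these assumptions.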
unshuffled_original_tt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 135923 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
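The load commands in this listing all follow one pattern: the prefix `huggingface:oscar/` followed by the configuration name. The configuration names in turn appear to be composed of three underscore-separated parts. A small sketch of that pattern follows. The helper names `parse_config` and `tfds_name` and the field labels `order`, `variant`, and `language` are hypothetical, and the three-part reading of the name is an assumption based only on the names shown here.

```python
def parse_config(config):
    """Split a name such as 'unshuffled_original_tt' into its three parts."""
    # maxsplit=2 keeps everything after the second underscore together.
    order, variant, language = config.split("_", 2)
    return {"order": order, "variant": variant, "language": language}


def tfds_name(config):
    """Build the dataset name string passed to tfds.load in this listing."""
    return f"huggingface:oscar/{config}"


config = "unshuffled_original_tt"
print(parse_config(config))
print(tfds_name(config))
```

The resulting string is what the listing passes to `tfds.load`. Actually loading a configuration downloads data, so the sketch stops at building the name.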
unshuffled_original_ur
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 638596 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3366 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xal
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 39 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yue
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_en
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_en')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 455994980 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 506883 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_frr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 544388 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_he
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_he')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3808397 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ht
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 13 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_id
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_id')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 16236463 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_is
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_is')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 625673 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1445 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 350363 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1549 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 34807 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 52910 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mai
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 123 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 437871 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mrj
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 757 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
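The config names on this page follow a visible pattern: `unshuffled_original_` or `unshuffled_deduplicated_` followed by a language code, prefixed with `huggingface:oscar/` when passed to `tfds.load`. A small sketch of building that name programmatically is shown below. The helper names `tfds_name` and `load_train` are hypothetical, and actually loading a config assumes `tensorflow_datasets` is installed (likely together with the Hugging Face `datasets` package) and that network access is available.

```python
def tfds_name(language, deduplicated=False):
    """Build the TFDS name for an OSCAR config from a language code."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def load_train(language, deduplicated=False):
    """Load the 'train' split of one config (downloads data on first use)."""
    # Imported lazily so that tfds_name() works without the dependency.
    import tensorflow_datasets as tfds
    return tfds.load(tfds_name(language, deduplicated), split="train")


print(tfds_name("mrj"))  # huggingface:oscar/unshuffled_original_mrj
```

Small configs such as `mrj` (757 examples) are a reasonable first target for `load_train`; the largest configs listed here contain tens of millions of examples or more.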
unshuffled_original_my
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_my')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 232329 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nap
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 73 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 34682142 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_or
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_or')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 59463 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 35440972 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 42114520 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ru
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 161836003 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 44280 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1746604 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_su
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_su')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 805 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_te
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_te')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 475703 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 458206 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ug
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 22255 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vec
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 73 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_war
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_war')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9760 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 59364 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}